Anthropic's AI model Claude 3.5 Sonnet launched. How does it compare to ChatGPT-4o and Gemini?


OpenAI’s top rival, Anthropic, recently launched its newest AI model, Claude 3.5 Sonnet. Anthropic claims the new model surpasses well-known competitors like OpenAI’s GPT-4o and Google’s Gemini-1.5 Pro. We explain how it has performed on benchmark tests and take a look at whether the results should be taken as an accurate indicator of practical usefulness.


Anthropic, a leading contender in the AI space, has just rolled out its latest innovation: Claude 3.5 Sonnet. This marks the inaugural release in the highly anticipated Claude 3.5 series. Anthropic claims this new model surpasses well-known competitors like OpenAI’s GPT-4o and Google’s Gemini-1.5 Pro.

We explain how well the new AI model has performed on benchmark tests. We also take a look at whether the test results should be taken as an accurate indicator of practical usefulness.

About Claude 3.5 Sonnet

Claude 3.5 Sonnet is a large language model (LLM) developed by Anthropic, part of its family of generative pre-trained transformers. These models excel at predicting the next word in a sequence based on extensive pre-training with vast amounts of text. Claude 3.5 Sonnet builds on the foundation laid by Claude 3 Sonnet, which made its debut in March this year.

Claude 3.5 Sonnet boasts a significant performance boost, operating at twice the speed of its predecessor, Claude 3 Opus. This leap, paired with a more budget-friendly pricing model, positions Claude 3.5 Sonnet as the go-to solution for intricate tasks, including context-aware customer support and managing multi-step workflows, according to Anthropic’s official statement.

Claude 3.5 Sonnet vs ChatGPT-4o vs Gemini 1.5 Pro

Anthropic revealed Claude 3.5 Sonnet’s scores (compared with those of its peers) on benchmark tests in a post on social media platform X.

Note that terms like “0-shot”, “5-shot”, and chain of thought (CoT) appear in the results. The “shot” count refers to how many worked examples are included in the prompt before the actual question (0-shot means none, 5-shot means five), while CoT means the model is prompted to reason step by step before giving its final answer.
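To make the distinction concrete, here is a minimal, illustrative sketch of how such prompts are typically assembled. The build_prompt function and the example questions are hypothetical and are not taken from any of the benchmarks or vendor SDKs.

```python
# Illustrative only: how 0-shot, few-shot (e.g. 5-shot), and chain-of-thought
# prompts differ for a single benchmark question. No real API is called here.

EXAMPLES = [
    ("If a train travels 60 km in 1.5 hours, what is its average speed?", "40 km/h"),
    # ... a real 5-shot setup would list five solved (question, answer) pairs
]

def build_prompt(question: str, shots: int = 0, chain_of_thought: bool = False) -> str:
    """Assemble a benchmark-style prompt.

    shots            -- how many solved examples to show the model first
                        (0-shot = none, 5-shot = five, and so on)
    chain_of_thought -- if True, ask the model to reason step by step
                        before stating its final answer (CoT)
    """
    parts = []
    for q, a in EXAMPLES[:shots]:
        parts.append(f"Q: {q}\nA: {a}")
    instruction = (
        "Think step by step, then give the final answer."
        if chain_of_thought
        else "Answer:"
    )
    parts.append(f"Q: {question}\n{instruction}")
    return "\n\n".join(parts)

# 0-shot CoT, the setting used for several of the scores below
print(build_prompt("A bakery sells 12 muffins per tray. How many muffins are on 7 trays?",
                   shots=0, chain_of_thought=True))
```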

Claude 3.5 Sonnet vs ChatGPT-4o vs Gemini 1.5 Pro benchmark scores. Image source: Anthropic.com

Here’s what each of the tests measures, and a comparison of how the three AIs performed.

Graduate-level reasoning
GPQA (Graduate-Level Google-Proof Q&A): This benchmark assesses an AI’s capability to answer difficult graduate-level science questions in biology, physics, and chemistry, written so that they are hard to answer correctly even with the help of a web search, testing its advanced reasoning and problem-solving skills. The test was introduced by Rein et al. in a paper titled “GPQA: A Graduate-Level Google-Proof Q&A Benchmark”.

Diamond: The scores below are on GPQA’s “Diamond” subset, the hardest slice of the benchmark, made up of questions that demand deep domain knowledge and careful, multi-step reasoning.

  • Claude 3.5 Sonnet: 59.4% (0-shot CoT)

  • GPT-4o: 53.6% (0-shot CoT)

  • Gemini 1.5 Pro: Not available

Undergraduate-level knowledge
MMLU (Massive Multitask Language Understanding): This benchmark evaluates a model’s understanding across a wide array of undergraduate-level subjects, spanning the humanities, sciences, and social sciences. It measures the model’s breadth of knowledge and its ability to handle diverse topics. The test was introduced by Hendrycks et al. in a paper titled “Measuring Massive Multitask Language Understanding”.

  • Claude 3.5 Sonnet: 88.7% (5-shot), 88.3% (0-shot CoT)

  • GPT-4o: 88.7% (0-shot CoT)

  • Gemini 1.5 Pro: 85.9% (5-shot)

Code
HumanEval: This benchmark assesses an AI model’s proficiency in generating correct and functional code snippets from natural language descriptions of programming tasks. It tests the model’s grasp of programming languages and its ability to solve problems in software development.

  • Claude 3.5 Sonnet: 92.0% (0-shot)

  • GPT-4o: 90.2% (0-shot)

  • Gemini 1.5 Pro: 84.1% (0-shot)
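
As a rough illustration of what a HumanEval-style task looks like, the sketch below shows a hypothetical problem in the same format: the model is given a function signature and docstring and must generate a body that passes unit tests. The problem is invented for illustration and is not taken from the actual HumanEval set.

```python
# A HumanEval-style task (illustrative, not from the real benchmark): the model
# receives the signature and docstring and must produce a working function body.

def is_palindrome(text: str) -> bool:
    """Return True if `text` reads the same forwards and backwards,
    ignoring case and spaces.

    >>> is_palindrome("Never odd or even")
    True
    >>> is_palindrome("hello")
    False
    """
    cleaned = text.replace(" ", "").lower()
    return cleaned == cleaned[::-1]

# Grading then runs checks like these against the model-generated body;
# a solution only counts as correct if every assertion passes.
assert is_palindrome("Never odd or even") is True
assert is_palindrome("hello") is False
```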

Multilingual math
MGSM (Multilingual Grade School Math): This benchmark evaluates a model’s capacity to solve grade school-level math problems presented in various languages. It tests both the model’s mathematical reasoning and its ability to understand and respond in different linguistic contexts. It was introduced by Shi et al. in a paper titled “Language Models are Multilingual Chain-of-Thought Reasoners”.

  • Claude 3.5 Sonnet: 91.6% (0-shot CoT)

  • GPT-4o: 90.5% (0-shot CoT)

  • Gemini 1.5 Pro: 87.5% (8-shot)

Reasoning over text
DROP (Discrete Reasoning Over Paragraphs): This benchmark measures an AI’s ability to perform discrete reasoning tasks, such as information extraction and arithmetic operations, on paragraphs of text. The F1 score, a common metric in machine learning for assessing classification tasks, is used to evaluate the model’s accuracy in these tasks, balancing precision and recall. It was introduced by Dua et al. in a paper titled “DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs”.

  • Claude 3.5 Sonnet: 87.1% (3-shot)

  • GPT-4o: 83.4% (3-shot)

  • Gemini 1.5 Pro: 74.9% (Variable shots)
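
For readers unfamiliar with the metric, the sketch below computes a simple token-overlap F1 between a predicted answer and a reference answer, which is the general idea behind this kind of scoring; the real DROP evaluator adds extra handling (for numbers and dates, for example), so treat this as an approximation.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer.

    Precision = overlapping tokens / predicted tokens
    Recall    = overlapping tokens / gold tokens
    F1        = 2 * P * R / (P + R)
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial credit: the prediction has one extra token, so F1 is about 0.86
print(token_f1("the Green Bay Packers", "Green Bay Packers"))
```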

Mixed evaluations
BIG-Bench-Hard: Part of the Beyond the Imitation Game Benchmark (BIG-Bench) suite, this benchmark focuses on exceptionally challenging tasks that demand deep reasoning, understanding, and creativity. These tasks push the limits of AI capabilities across diverse domains. This was introduced by Suzgun et al. in a paper titled “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them”.

  • Claude 3.5 Sonnet: 93.1% (3-shot CoT)

  • GPT-4o: Not available

  • Gemini 1.5 Pro: 89.2% (3-shot CoT)

Math problem-solving
MATH: This benchmark tests an AI model’s ability to solve challenging, competition-style math problems. It assesses the model’s understanding of mathematical concepts, its problem-solving skills, and its capacity to carry out multi-step calculations.

  • Claude 3.5 Sonnet: 71.1% (0-shot CoT)

  • GPT-4o: 76.6% (0-shot CoT)

  • Gemini 1.5 Pro: 67.7% (4-shot)

Grade school math
GSM8K (Grade School Math 8K): This benchmark evaluates an AI model’s proficiency in solving the kind of multi-step arithmetic word problems typically encountered in grade school. It measures the model’s grasp of basic arithmetic and its ability to reason through word problems step by step.

  • Claude 3.5 Sonnet: 96.4% (0-shot CoT)

  • GPT-4o: 90.8% (11-shot)

  • Gemini 1.5 Pro: Not available
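
To show how a grade-school math benchmark answer is typically scored, here is a rough, hypothetical sketch: the model’s step-by-step response is parsed for its final number, which is then compared with the reference answer by exact match. The parsing rule is an assumption made for illustration, not GSM8K’s official grading harness.

```python
import re

def extract_final_number(model_answer: str) -> str | None:
    """Pull the last number out of a model's step-by-step answer.

    Grading of this kind usually ignores the reasoning text and only checks
    whether the final numeric answer matches the reference exactly.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_answer.replace(",", ""))
    return numbers[-1] if numbers else None

model_answer = (
    "Each pack has 12 pencils and there are 4 packs, so 12 * 4 = 48 pencils. "
    "Giving away 5 leaves 48 - 5 = 43. The answer is 43."
)
reference = "43"
print(extract_final_number(model_answer) == reference)  # True -> counted as correct
```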

The results

Claude 3.5 Sonnet generally outperforms GPT-4o and Gemini 1.5 Pro across most benchmarks.

Claude 3.5 Sonnet has outperformed ChatGPT-4o on most of the reported benchmarks. Image used for representational purposes/Reuters

Claude 3.5 Sonnet shows superior performance in graduate-level reasoning, coding (HumanEval), multilingual math, reasoning over text, mixed evaluations, and grade school math.

GPT-4o matches Claude 3.5 Sonnet on undergraduate-level knowledge (MMLU), scores strongly on coding (HumanEval), and leads on math problem-solving (MATH).

Gemini 1.5 Pro posts a strong score on mixed evaluations (BIG-Bench-Hard) but trails both rivals on the other benchmarks where scores for all three models are reported, including reasoning over text (DROP).

Taking benchmark test scores with a pinch of salt

Most benchmarks are designed to push the limits of a model on a single task at a time, something that rarely happens in real-life scenarios. Real-life applications often involve complex, context-dependent tasks that benchmarks do not fully capture: benchmark tasks are typically simplified and controlled, whereas real-world scenarios can be much more intricate.

Benchmarks usually measure a model’s performance in isolation. However, real-life usefulness often involves interacting with humans, understanding context, and dynamically adapting responses, aspects that benchmarks might not fully capture.

Real-world environments are ever-changing, and benchmarks are static. A model’s ability to adapt and learn continuously in a dynamic setting is crucial but not typically measured by standard benchmarks.

While Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro perform differently across benchmarks, their practical effectiveness will ultimately depend on how they perform in real-world scenarios and how well they meet the specific needs of their intended applications. Who comes out on top will also depend on how much Anthropic, OpenAI, and Google spend on the compute needed to retrain their models and run inference.
