I. Introduction
AI is developing rapidly, and large language models are being iterated and updated constantly. In this article we evaluate five leading models in depth: ChatGPT o3-mini, Grok3 thinking, Claude3.7 thinking, Deepseek-r1, and Gemini-2.0-Pro, and compare how they perform across a range of scenarios.
II. In-depth evaluation and comparative analysis
We put the same question to each model separately through ShirtAI, which offers free, unlimited access to GPT Plus, Claude Pro, Grok Super, and the full version of Deepseek. Official website: www.lsshirtai.com
Question 1: Workers in a tea factory need to pack rectangular tea boxes, each 20 cm long, 20 cm wide, and 10 cm high, into a cubic carton with an edge length of 30 cm (measured from the inside). What is the maximum number of tea boxes that fit in one carton, and how should they be arranged?
Conclusion: The answer is 6 boxes. Claude3.7 thinking wins clearly, being both fast and accurate. Deepseek-r1 is the slowest but reaches the correct answer, while Grok3 thinking and o3-mini give wrong answers.
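The answer of 6 can be checked independently. The sketch below first computes the upper bound from volume alone, then verifies one candidate arrangement of six boxes. The `layout` coordinates are an assumption of this sketch (one possible arrangement, not taken from any model's output), as are the helper names.

```python
from itertools import combinations

CARTON = 30              # inside edge length of the cubic carton, in cm
BOX_DIMS = (10, 20, 20)  # sorted dimensions of one tea box, in cm

# Upper bound from volume alone: 27000 / 4000 = 6.75, so at most 6 boxes.
volume_bound = CARTON**3 // (BOX_DIMS[0] * BOX_DIMS[1] * BOX_DIMS[2])

# Hypothetical layout: each box is ((x0, x1), (y0, y1), (z0, z1)) in cm.
# Two boxes lie flat along each of the three axes.
layout = [
    ((10, 30), (0, 20), (0, 10)),
    ((0, 10), (10, 30), (0, 20)),
    ((0, 20), (0, 10), (10, 30)),
    ((10, 30), (20, 30), (0, 20)),
    ((0, 20), (10, 30), (20, 30)),
    ((20, 30), (0, 20), (10, 30)),
]

def has_box_shape(box):
    """True if the box measures 10 x 20 x 20 cm in some orientation."""
    return sorted(hi - lo for lo, hi in box) == list(BOX_DIMS)

def inside_carton(box):
    """True if the box lies entirely within the carton."""
    return all(0 <= lo < hi <= CARTON for lo, hi in box)

def overlaps(a, b):
    """True if two boxes share interior volume (overlap on all three axes)."""
    return all(a_lo < b_hi and b_lo < a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))

layout_is_valid = (
    all(has_box_shape(b) and inside_carton(b) for b in layout)
    and not any(overlaps(a, b) for a, b in combinations(layout, 2))
)

print(volume_bound, len(layout), layout_is_valid)
```

Because the volume bound and the verified layout both give 6, the maximum is 6. In this layout the unused space is three 10 cm cubes along one diagonal of the carton.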
Question 2: Given the function $f(x) = e^x + ax^2 - x$: (1) discuss the monotonicity of $f(x)$ when $a = 1$; (2) if $f(x) \geq \frac{1}{2}x^3 + 1$ for all $x \geq 0$, find the range of values of $a$.
Conclusion: All the models give the correct answer, but o3-mini is the fastest.
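The article does not print the answer itself, so here is a brief sketch of one standard approach for reference. The helper functions $g$ and $h$ are notation introduced for this sketch, not part of the original question.

For part (1), with $a = 1$:

$$f'(x) = e^x + 2x - 1, \qquad f''(x) = e^x + 2 > 0, \qquad f'(0) = 0,$$

so $f'$ is increasing and changes sign at $0$: $f$ is decreasing on $(-\infty, 0)$ and increasing on $(0, +\infty)$.

For part (2), the inequality holds with equality at $x = 0$. For $x > 0$ it is equivalent to $a \geq g(x)$, where

$$g(x) = \frac{\frac{1}{2}x^3 + x + 1 - e^x}{x^2}, \qquad g'(x) = \frac{(2 - x)\left(e^x - \frac{1}{2}x^2 - x - 1\right)}{x^3}.$$

Let $h(x) = e^x - \frac{1}{2}x^2 - x - 1$. Since $h(0) = 0$ and $h'(x) = e^x - x - 1 > 0$ for $x > 0$, we have $h(x) > 0$ for $x > 0$. Hence $g$ increases on $(0, 2)$ and decreases on $(2, +\infty)$, and

$$a \geq g(2) = \frac{7 - e^2}{4}.$$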
In addition, we conducted other tests with the following results:
Test scenario | ChatGPT o3-mini | Grok3 thinking | Claude3.7 thinking | Deepseek-r1 | Gemini-2.0-Pro |
---|---|---|---|---|---|
Complex mathematical problem (Bayes' theorem) | Basic explanation is clear but lacks depth and detail; examples are simple | Vivid explanation with intuitive visual analogies, but the derivation is slightly less rigorous | Most systematic proof, with in-depth explanation of concepts, a detailed medical screening case, and clear calculations | Most rigorous mathematical derivation with well-typeset formulas, but the case explanation is rather academic | Balances theory and practice, but falls behind Claude and Deepseek on specific details |
Coding skills (quicksort) | Basic functionality is implemented correctly, but efficiency and boundary handling are poor | Correct algorithm with a slightly redundant code structure; offers practical optimization suggestions | Clear, readable code with detailed comments, a step-by-step explanation of the idea, and a comprehensive complexity analysis | Most streamlined and efficient code, with the best boundary-condition handling and an in-depth complexity analysis | Provides multiple implementations, including in-place sorting and a functional style, but misses some boundary cases |
Creative writing (2050) | The story flows well but is rather bland, and the futuristic technology relies on common imagery | Good at building a grand world and bold in portraying technology, but slightly weak on character emotion | Rich, vivid plot with three-dimensional characters; technological details are both forward-looking and plausible, with emotional elements woven in | Technical details are accurate but slightly stereotypical; storytelling is weak | Complete narrative structure that integrates technology and social issues well, but slightly lacking in originality |
Logical reasoning (Prisoner's Dilemma) | Accurate explanation of the underlying concepts, but the analysis lacks depth | Most in-depth analysis, introducing an evolutionary game theory perspective to discuss equilibrium strategies in repeated games | Clearest theoretical explanation, with rigorous logical derivation and real-life examples from multiple fields | Most rigorously constructed mathematical model, but the examples are slightly academic | Balances theory and practical application, with a wide variety of case studies |
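For readers unfamiliar with the medical screening case used in the Bayes' theorem test, the following minimal sketch shows the kind of calculation the models were asked to explain. The function name and all three input values are hypothetical, chosen only for illustration; they are not the figures used in the test.

```python
def posterior_positive(prevalence, sensitivity, false_positive_rate):
    """Return P(disease | positive test) using Bayes' theorem."""
    true_positive = sensitivity * prevalence
    false_positive = false_positive_rate * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)

# Hypothetical numbers: 1% prevalence, 90% sensitivity, 5% false positive rate.
p = posterior_positive(prevalence=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(f"{p:.3f}")
```

With these illustrative inputs, a positive result still leaves the probability of disease at roughly 15%, because false positives from the large healthy group outnumber true positives. This counterintuitive result is what a good explanation of the case needs to convey.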
Overall, the strengths and weaknesses of the models compare as follows:
Model | Strengths | Weaknesses | Best-suited scenarios |
---|---|---|---|
ChatGPT o3-mini | Best performance among lightweight models; fast response time; accurate on basic questions | Limited capacity for complex reasoning; deep-thinking capability is weaker than in other models | Simple everyday questions and answers; basic content creation; lightweight application scenarios |
Grok3 thinking | Transparent thinking process; outstanding logical reasoning; explains concepts in a lively and interesting way | Slightly weak Chinese language skills; insufficient depth in certain specialized areas | Complex reasoning where the thought process needs to be visible; stimulating innovative thinking |
Claude3.7 thinking | Most balanced overall capabilities; precise instruction following; combines creativity and logic; fewest hallucinations | Slightly weaker than specialized models in specific vertical domains | Content creation that requires a balance of creativity and accuracy; complex instruction tasks |
Deepseek-r1 | Extremely strong code and math skills; best Chinese comprehension; rigorous academic reasoning | Creative writing is relatively stereotypical; general expression is less vivid than other models | Software development; mathematical research; Chinese academic content generation |
Gemini-2.0-Pro | Wide-ranging knowledge; strong multimodal understanding; abundant practical cases | Lacks depth in some complex reasoning scenarios | Multimodal interactions that combine text with images; knowledge-intensive questions and answers |
III. Basic model information
Model name | Developer | Release date | Model size | Pricing |
---|---|---|---|---|
ChatGPT o3-mini | OpenAI | January 2025 | Undisclosed | Free and Plus paid versions |
Grok3 thinking | xAI | February 2025 | Undisclosed | xAI membership |
Claude3.7 thinking | Anthropic | February 2025 | Undisclosed | Partially free, Claude Pro paid |
Deepseek-r1 | DeepSeek | January 2025 | 671 billion parameters (about 37 billion active per token) | Free |
Gemini-2.0-Pro | Google | February 2025 (experimental release) | Undisclosed | Partially free, premium version paid |
IV. Comparison of core capabilities
Capability | ChatGPT o3-mini | Grok3 thinking | Claude3.7 thinking | Deepseek-r1 | Gemini-2.0-Pro |
---|---|---|---|---|---|
General questions and answers | 4 | 5 | 5 | 4 | 4 |
Coding skills | 3 | 4 | 5 | 5 | 4 |
Mathematical reasoning | 3 | 4 | 4 | 5 | 4 |
Logical thinking | 3 | 5 | 5 | 4 | 4 |
Creative writing | 4 | 4 | 5 | 3 | 4 |
Instruction following | 4 | 4 | 5 | 4 | 4 |
Chinese language proficiency | 4 | 3 | 4 | 5 | 4 |
Depth of thought | 3 | 5 | 5 | 4 | 4 |
Hallucination control | 3 | 3 | 5 | 4 | 4 |
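To see how the per-capability scores add up, the sketch below totals each model's column. Equal weighting of all nine rows is an assumption of this sketch, not something the article specifies; the variable names are likewise illustrative.

```python
MODELS = ["ChatGPT o3-mini", "Grok3 thinking", "Claude3.7 thinking",
          "Deepseek-r1", "Gemini-2.0-Pro"]

# Scores copied from the table above, one list per capability row,
# in the same column order as MODELS.
SCORES = {
    "General questions and answers": [4, 5, 5, 4, 4],
    "Coding skills":                 [3, 4, 5, 5, 4],
    "Mathematical reasoning":        [3, 4, 4, 5, 4],
    "Logical thinking":              [3, 5, 5, 4, 4],
    "Creative writing":              [4, 4, 5, 3, 4],
    "Instruction following":         [4, 4, 5, 4, 4],
    "Chinese language proficiency":  [4, 3, 4, 5, 4],
    "Depth of thought":              [3, 5, 5, 4, 4],
    "Hallucination control":         [3, 3, 5, 4, 4],
}

# Sum each column (assumes every capability carries equal weight).
totals = {model: sum(row[i] for row in SCORES.values())
          for i, model in enumerate(MODELS)}
ranking = sorted(totals, key=totals.get, reverse=True)

for model in ranking:
    print(f"{model}: {totals[model]} / {5 * len(SCORES)}")
```

Under equal weighting, Claude3.7 thinking totals 43 out of 45, followed by Deepseek-r1 (38), Grok3 thinking (37), Gemini-2.0-Pro (36), and ChatGPT o3-mini (31). A different weighting, for example one that emphasizes coding and math, could change the order of the middle three.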
V. Overall conclusions
After a comprehensive evaluation, we reached the following conclusions:
- Best overall: Claude3.7 thinking excelled in most tests, especially in creative writing, instruction following, and hallucination control
- Best professional competence: Deepseek-r1 was the best in code, math, and Chinese professional content
- Best thinking process: Grok3 thinking and Claude3.7 thinking are the most transparent in showing their thinking process
- Best lightweight model: ChatGPT o3-mini has the best price/performance ratio among lightweight models
- Best multimodal: Gemini-2.0-Pro leads in handling multimodal content
Which model to choose should ultimately be based on your specific usage scenario. If you are looking for a fully balanced experience, Claude 3.7 is a good choice; for programming and math needs, Deepseek-r1 is worth considering; and if you need a lightweight daily assistant, ChatGPT o3-mini can also meet the basic needs.
We have also prepared additional resources to help you get more out of these models. To learn prompting techniques for large models and interact with them efficiently, see Large Model Prompt Word Tips, which collects practical strategies for unlocking each model's capabilities.