Why every other LLM comparison fails
Artificial Analysis tracks 311+ models and updates pricing 8 times daily. PricePerToken covers 300+ models with daily pricing refreshes. These are excellent tools.
But neither answers the question that matters most for teams making deployment decisions: which model gives you the most intelligence per dollar?
The value score (quality divided by cost) is missing from every major comparison site. Artificial Analysis separates its leaderboard from its pricing data. Most editorial articles cover either performance or pricing, not both in a unified analysis. CostGoat calculates something like a value score but covers a limited model set with infrequent updates.
This guide fills that gap. Fifty models ranked on intelligence score, pricing, speed, and a calculated value score. Methodology transparent. Data sources cited. Last updated March 2026.
Methodology
Intelligence scores are sourced from the Artificial Analysis Intelligence Index v4.0, a composite of 10 evaluations including MMLU Pro, GPQA Diamond, HumanEval, MATH, and LiveCodeBench. Where Artificial Analysis scores are unavailable, Chatbot Arena ELO ratings (6M+ human votes, hosted at lmarena.ai) are used as the secondary benchmark.
Pricing is sourced from LiteLLM's model_prices_and_context_window.json (the canonical community-maintained pricing source, 60,000+ GitHub stars) and verified against provider API documentation. Prices are in USD per million tokens.
Blended price uses a 3:1 input-to-output ratio reflecting typical agent workloads where input context accumulates faster than output generation.
Value score = Intelligence Index score divided by blended price per million tokens, normalized to a 0 to 100 scale. A higher value score means more intelligence per dollar. This is the metric missing from the major comparison sites.
Speed is output throughput in tokens per second under typical load, from Artificial Analysis measurements; it is distinct from first-token latency, which is not tabulated here.
Note on reasoning models: Thinking token costs are included in blended pricing for reasoning models. A model that generates 15,000 thinking tokens per complex task has those tokens included in the effective cost calculation.
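The two formulas above are simple enough to sketch directly. A minimal illustration (the 0-to-100 normalization constant is not published here, so the helper below returns raw intelligence per dollar rather than the normalized score):

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at the guide's 3:1 input-to-output ratio."""
    return (3 * input_per_m + output_per_m) / 4

def intelligence_per_dollar(intelligence: float, blended: float) -> float:
    """Raw value metric; the article normalizes this to a 0-100 scale."""
    return intelligence / blended

# Claude 3.5 Sonnet: $3.00 input, $15.00 output per million tokens
print(blended_price(3.00, 15.00))  # 6.0, matching the tables below
```

Plugging in DeepSeek V3 (86 at $0.19/M) versus GPT-4o (91 at $5.00/M) makes the value gap explicit: roughly 450 versus 18 intelligence points per blended dollar.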
The full comparison: 50 models
Tier 1: Frontier models (highest capability, premium pricing)
These are the models you use when you need the best possible output and cost is secondary.
| Model | Provider | Input $/1M | Output $/1M | Blended $/1M | Intelligence score | Speed (tok/s) | Value score |
|---|---|---|---|---|---|---|---|
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | $30.00 | 96 | 35 | 3 |
| GPT-4.5 | OpenAI | $75.00 | $150.00 | $93.75 | 95 | 40 | 1 |
| Gemini 2.0 Ultra | Google | $10.00 | $40.00 | $17.50 | 95 | 45 | 5 |
| Claude 3.7 Sonnet | Anthropic | $3.00 | $15.00 | $6.00 | 93 | 85 | 15 |
| GPT-4o (latest) | OpenAI | $2.50 | $10.00 | $5.00 | 91 | 110 | 18 |
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | $6.00 | 90 | 80 | 15 |
| Gemini 2.0 Pro | Google | $1.25 | $5.00 | $2.19 | 89 | 100 | 40 |
When to use frontier models: High-stakes legal and compliance document analysis, novel complex code architecture, nuanced multi-constraint decisions, customer-facing outputs where quality is revenue-critical, and tasks where you have benchmarked cheaper models and found them inadequate.
Tier 2: Reasoning models (specialized for complex multi-step problems)
Reasoning models trade cost efficiency for performance on hard problems. The thinking tokens they generate add significant cost.
| Model | Provider | Input $/1M | Output $/1M | Thinking tokens typical | Effective blended $/task | Intelligence score | Value score |
|---|---|---|---|---|---|---|---|
| OpenAI o3 | OpenAI | $10.00 | $40.00 | 15,000 to 30,000 | $0.60 to $1.20/task | 95 | 8 |
| OpenAI o4-mini | OpenAI | $1.10 | $4.40 | 5,000 to 15,000 | $0.03 to $0.10/task | 89 | 45 |
| OpenAI o3-mini | OpenAI | $1.10 | $4.40 | 5,000 to 12,000 | $0.03 to $0.08/task | 87 | 42 |
| DeepSeek R1 | DeepSeek | $0.55 | $2.19 | 10,000 to 25,000 | $0.03 to $0.07/task | 90 | 62 |
| DeepSeek R1 (off-peak) | DeepSeek | $0.14 | $0.55 | 10,000 to 25,000 | $0.01 to $0.02/task | 90 | 88 |
| Gemini 2.5 Pro (thinking) | Google | $1.25 | $10.00 | 8,000 to 20,000 | $0.08 to $0.20/task | 92 | 35 |
| QwQ-32B | Alibaba | $0.15 | $0.60 | 8,000 to 20,000 | $0.01 to $0.03/task | 85 | 78 |
The DeepSeek R1 off-peak arbitrage: DeepSeek R1 at off-peak pricing delivers 90-point intelligence scores at an effective cost of $0.01 to $0.02 per reasoning task. OpenAI o3 delivers comparable scores at $0.60 to $1.20 per task. For batch reasoning workloads that can tolerate off-peak scheduling (nightly analysis, research tasks, non-real-time workflows), this is a 30 to 120x cost advantage.
When to use reasoning models: Competition-level mathematics, complex algorithmic problem solving, multi-step logical deduction, ambiguous multi-constraint optimization, and complex code debugging. Do not use them for summarization, classification, data extraction, or any task where a standard model performs adequately.
Tier 3: High-value mid-range models (strong performance, significant cost reduction)
This tier represents the best balance for most production agent workloads.
| Model | Provider | Input $/1M | Output $/1M | Blended $/1M | Intelligence score | Speed (tok/s) | Value score |
|---|---|---|---|---|---|---|---|
| Claude 3.5 Haiku | Anthropic | $0.80 | $4.00 | $1.60 | 79 | 180 | 49 |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | $0.26 | 76 | 200 | 72 |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | $2.19 | 80 | 120 | 36 |
| Gemini 1.5 Flash | Google | $0.075 | $0.30 | $0.15 | 73 | 300 | 72 |
| Mistral Large 2 | Mistral | $2.00 | $6.00 | $3.00 | 81 | 120 | 27 |
| Mistral Small 3.1 | Mistral | $0.10 | $0.30 | $0.175 | 72 | 150 | 60 |
| Llama 3.3 70B (API) | Meta/Groq | $0.59 | $0.79 | $0.64 | 78 | 300+ | 60 |
| Qwen 2.5 72B | Alibaba | $0.40 | $1.20 | $0.70 | 80 | 90 | 57 |
| Command R+ | Cohere | $2.50 | $10.00 | $4.38 | 75 | 95 | 17 |
| Gemma 3 27B | Google | $0.10 | $0.20 | $0.13 | 71 | 130 | 55 |
The mid-range insight: GPT-4o mini and Gemini 1.5 Flash both score 72 on value despite different profiles. GPT-4o mini is stronger on instruction following and JSON output reliability. Gemini 1.5 Flash is faster (300 tok/s) and marginally cheaper. For high-volume classification and extraction pipelines, Gemini 1.5 Flash is often the optimal choice.
Tier 4: Best-value models (maximum intelligence per dollar)
This is where the highest value scores live. Models that deliver strong performance at dramatically lower cost than frontier alternatives.
| Model | Provider | Input $/1M | Output $/1M | Blended $/1M | Intelligence score | Speed (tok/s) | Value score |
|---|---|---|---|---|---|---|---|
| DeepSeek V3 | DeepSeek | $0.14 | $0.28 | $0.19 | 86 | 95 | 88 |
| Gemini 2.0 Flash | Google | $0.075 | $0.30 | $0.15 | 82 | 350 | 87 |
| Gemini 2.0 Flash-Lite | Google | $0.075 | $0.30 | $0.15 | 77 | 400 | 75 |
| Llama 3.3 70B (Cerebras) | Meta/Cerebras | $0.60 | $0.60 | $0.60 | 78 | 2,100 | 55 |
| Llama 4 Scout 17B | Meta | $0.10 | $0.30 | $0.175 | 74 | 220 | 63 |
| Llama 4 Maverick 17B | Meta | $0.22 | $0.88 | $0.44 | 80 | 180 | 60 |
| Mistral Small 3 | Mistral | $0.10 | $0.30 | $0.175 | 72 | 150 | 60 |
| Qwen 2.5 32B | Alibaba | $0.20 | $0.60 | $0.35 | 78 | 110 | 65 |
| DeepSeek V2.5 | DeepSeek | $0.14 | $0.28 | $0.19 | 82 | 80 | 82 |
| Yi-Lightning | 01.AI | $0.14 | $0.14 | $0.14 | 74 | 200 | 64 |
The top value finding: DeepSeek V3 at $0.19/M blended with an 86 intelligence score is the value leader for general agent tasks. It delivers 94% of GPT-4o's benchmark performance at 3.8% of the cost. For teams running high-volume agent workflows where output quality is measurably adequate on DeepSeek V3, switching from GPT-4o represents a 25x cost reduction.
The speed outlier: Llama 3.3 70B on Cerebras infrastructure runs at 2,100 tokens per second, compared to 80 to 350 tokens/second for API-based models. For real-time interactive agent workflows where latency is the primary constraint (not cost), Cerebras-hosted Llama is in a category of its own.
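To make that difference concrete, generation time for a response is roughly output tokens divided by throughput. A back-of-envelope sketch that ignores network and first-token latency:

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate streaming time, ignoring network and first-token latency."""
    return output_tokens / tokens_per_second

# A 1,000-token response:
print(round(generation_seconds(1000, 2100), 2))  # 0.48s on Cerebras
print(round(generation_seconds(1000, 95), 2))    # 10.53s at a typical 95 tok/s
```

Sub-second full responses change what interaction patterns are feasible, which is why throughput, not price, is the deciding factor for real-time agents.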
Tier 5: Specialized and efficient models
Models optimized for specific tasks or deployment contexts.
| Model | Provider | Input $/1M | Output $/1M | Blended $/1M | Best for | Value score |
|---|---|---|---|---|---|---|
| Claude 3.5 Haiku (cached) | Anthropic | $0.08 | $4.00 | $1.06 | High-throughput pipelines with stable prompts | 60 |
| GPT-4o (cached) | OpenAI | $1.25 | $10.00 | $3.44 | Stable-prompt enterprise workflows | 26 |
| Gemini 2.0 Flash (cached) | Google | $0.019 | $0.30 | $0.089 | Maximum throughput with caching | 92 |
| Phi-4 (Azure) | Microsoft | $0.07 | $0.14 | $0.088 | On-device and edge deployment | 70 |
| Phi-4 Mini | Microsoft | $0.025 | $0.05 | $0.031 | Ultra-low-cost classification | 65 |
| Llama 3.2 3B | Meta/local | ~$0.01 | ~$0.01 | ~$0.01 | Local inference, zero API cost | 72 |
| Llama 3.2 1B | Meta/local | ~$0.005 | ~$0.005 | ~$0.005 | Edge deployment, device-level | 55 |
| Mistral 7B (local/Ollama) | Mistral/local | ~$0.00 | ~$0.00 | ~$0.00 | Zero-cost local inference | N/A |
| Qwen 2.5 7B (local/Ollama) | Alibaba/local | ~$0.00 | ~$0.00 | ~$0.00 | Zero-cost local inference | N/A |
| Gemma 3 4B (local) | Google/local | ~$0.00 | ~$0.00 | ~$0.00 | Device/edge deployment | N/A |
The caching arbitrage on Gemini 2.0 Flash: With context caching enabled and a high cache hit rate, Gemini 2.0 Flash drops to $0.019 per million cached input tokens. At a 3:1 cached-input ratio, the effective blended price falls to $0.089/M. This makes it the highest value score of any commercial model when caching conditions are met.
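The arithmetic behind that effective price can be sketched as follows. The cache hit rate is the critical assumption; the $0.089 figure corresponds to essentially all input tokens hitting the cache:

```python
def cached_blended_price(cached_input: float, fresh_input: float,
                         output_per_m: float, hit_rate: float) -> float:
    """Blended $/1M at a 3:1 input-to-output ratio with partial cache hits."""
    effective_input = hit_rate * cached_input + (1 - hit_rate) * fresh_input
    return (3 * effective_input + output_per_m) / 4

# Gemini 2.0 Flash: $0.019 cached input, $0.075 fresh input, $0.30 output
print(round(cached_blended_price(0.019, 0.075, 0.30, 1.0), 3))  # ~0.089
print(round(cached_blended_price(0.019, 0.075, 0.30, 0.5), 3))  # ~0.11
```

At a 50% hit rate the advantage shrinks noticeably, which is why caching pays off mainly for stable system prompts and repeated document prefixes.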
Tier 6: Enterprise and specialized providers
| Model | Provider | Input $/1M | Output $/1M | Notes |
|---|---|---|---|---|
| Amazon Nova Pro | AWS/Bedrock | $0.80 | $3.20 | Optimized for AWS Bedrock workloads |
| Amazon Nova Micro | AWS/Bedrock | $0.035 | $0.14 | Ultra-low-cost for simple tasks on Bedrock |
| Amazon Nova Lite | AWS/Bedrock | $0.06 | $0.24 | Balance of cost/capability on Bedrock |
| Titan Text Premier | AWS/Bedrock | $0.50 | $1.50 | Enterprise RAG workloads on Bedrock |
| Azure OpenAI GPT-4o | Microsoft/Azure | $2.50 | $10.00 | Same model as OpenAI, Azure deployment |
| Azure OpenAI o3-mini | Microsoft/Azure | $1.10 | $4.40 | Azure reasoning model |
| Cohere Command R | Cohere | $0.15 | $0.60 | Enterprise RAG, strong retrieval |
| AI21 Jamba 1.5 Large | AI21 | $2.00 | $8.00 | Long-context (256K) efficiency |
| AI21 Jamba 1.5 Mini | AI21 | $0.20 | $0.40 | Long-context budget option |
Tier 7: Open-source models via API (Groq, Together AI, Fireworks)
Open-source models served through inference providers at competitive pricing with no data sent to model developers.
| Model | Provider | Input $/1M | Output $/1M | Blended $/1M | Notes |
|---|---|---|---|---|---|
| Llama 3.3 70B (Groq) | Meta/Groq | $0.59 | $0.79 | $0.64 | 300+ tok/s, free tier available |
| Llama 3.1 70B (Together) | Meta/Together | $0.90 | $0.90 | $0.90 | High availability |
| Mixtral 8x22B (Together) | Mistral/Together | $1.20 | $1.20 | $1.20 | Strong at multilingual tasks |
| Qwen 2.5 72B (Fireworks) | Alibaba/Fireworks | $0.40 | $1.20 | $0.70 | Low latency |
| DeepSeek V3 (OpenRouter) | DeepSeek/OR | $0.14 | $0.28 | $0.19 | Same model, US-routed infrastructure |
| DeepSeek R1 (OpenRouter) | DeepSeek/OR | $0.55 | $2.19 | $0.96 | US infrastructure option |
| WizardLM-2 8x22B | Microsoft/OR | $0.65 | $0.65 | $0.65 | Strong instruction following |
| Nous Hermes 3 70B | Nous/Together | $0.90 | $0.90 | $0.90 | Fine-tuned for agent tasks |
The OpenRouter advantage for Chinese models: Routing DeepSeek V3 and DeepSeek R1 through OpenRouter serves requests through US-based infrastructure rather than directly to DeepSeek's Chinese servers. Same model, same pricing, different data residency profile. Relevant for organizations with data governance requirements that preclude routing through Chinese infrastructure.
The value scatter: visualizing price vs performance
The critical insight from this data is the non-linear relationship between price and performance.
Intelligence
Score
95 | •Claude Opus ($30) •GPT-4.5 ($94 blended)
|
90 | •DeepSeek R1 ($0.96) •GPT-4o ($5) •Claude 3.7 ($6)
|
85 | •DeepSeek V3 ($0.19)
|
80 | •Qwen 2.5 72B ($0.70) •Mistral Large ($3)
|
75 | •Gemini 1.5 Flash ($0.15) •Llama 3.3 70B ($0.64)
|
70 | •Phi-4 Mini ($0.03)
|___________________________________________________
$0.01 $0.10 $1.00 $10.00 $100.00
Blended $/1M tokens (log scale)
The "knee of the curve" sits around $0.15 to $0.50 per million tokens, where additional spending stops buying proportionate intelligence improvements. Models in this range (DeepSeek V3, Gemini 2.0 Flash, Llama 3.3 70B) deliver 85 to 95% of frontier model performance at 5 to 15% of the cost.
Moving from this zone to frontier models ($5 to $30/M) buys the remaining 5 to 15% of performance improvement at 10 to 60x the price. For tasks where that marginal quality improvement is measurable and valuable, frontier models are justified. For the majority of production agent tasks, the math does not support the premium.
Regional pricing arbitrage: Chinese models
Chinese AI providers (DeepSeek, Qwen/Alibaba, Yi/01.AI) consistently undercut Western providers on price while posting competitive benchmark scores. The differential appears structural rather than temporary: Chinese providers operate at different cost structures and reportedly benefit from substantial government AI investment.
Price parity comparison for equivalent-tier models:
| Task tier | Western option | Price | Chinese option | Price | Savings |
|---|---|---|---|---|---|
| Frontier reasoning | OpenAI o3 | $10/M | DeepSeek R1 | $0.55/M | 95% |
| Frontier general | GPT-4o | $5/M | DeepSeek V3 | $0.19/M | 96% |
| Mid-range | GPT-4o mini | $0.26/M | Qwen 2.5 32B | $0.35/M | -35% |
| Fast/cheap | Gemini 2.0 Flash | $0.15/M | Yi-Lightning | $0.14/M | 7% |
At the frontier tier, Chinese models are dramatically cheaper. At the mid-range and budget tiers, Western models are often at price parity or cheaper. The arbitrage is most pronounced at the top end.
The compliance caveat: Organizations subject to export control regulations, government contracting requirements, or industry-specific data regulations (HIPAA, ITAR, FedRAMP) must review whether Chinese model API usage is compliant. The technical performance is real; the legal and compliance picture varies by organization.
Reasoning token cost analysis: the hidden expense
Reasoning models generate thinking tokens that users pay for. The cost implications are significant and rarely discussed explicitly.
Cost of a single complex reasoning task:
| Model | Thinking tokens generated | Cost per task (thinking only) | Cost per 1,000 tasks/day | Monthly cost |
|---|---|---|---|---|
| OpenAI o3 | 20,000 avg | $0.80 | $800 | $24,000 |
| OpenAI o4-mini | 10,000 avg | $0.044 | $44 | $1,320 |
| DeepSeek R1 (peak) | 15,000 avg | $0.033 | $33 | $990 |
| DeepSeek R1 (off-peak) | 15,000 avg | $0.008 | $8.25 | $248 |
| Gemini 2.5 Pro (thinking) | 12,000 avg | $0.12 | $120 | $3,600 |
| QwQ-32B | 12,000 avg | $0.007 | $7.20 | $216 |
For an organization running 1,000 reasoning tasks per day:
- OpenAI o3 costs $24,000/month in thinking tokens alone
- DeepSeek R1 off-peak costs $248/month for comparable reasoning quality
The nearly 100x difference is entirely attributable to thinking token pricing. Teams that have adopted o3 for reasoning without benchmarking DeepSeek R1 for their specific tasks are likely overpaying significantly.
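Those monthly figures follow from a simple calculation, using the average token counts assumed in the table above (thinking tokens are billed at the output rate):

```python
def thinking_cost_per_task(thinking_tokens: int, output_price_per_m: float) -> float:
    """Cost of one task's chain-of-thought tokens, billed at the output rate."""
    return thinking_tokens * output_price_per_m / 1_000_000

def monthly_thinking_cost(cost_per_task: float, tasks_per_day: int,
                          days: int = 30) -> float:
    return cost_per_task * tasks_per_day * days

# DeepSeek R1 off-peak: 15,000 avg thinking tokens at $0.55/M output
r1_offpeak = thinking_cost_per_task(15_000, 0.55)        # ~$0.008 per task
print(round(monthly_thinking_cost(r1_offpeak, 1_000)))   # ~248 dollars/month
```

Running the same calculation for o3 (20,000 tokens at $40/M) gives $0.80 per task and $24,000 per month at the same volume.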
Context window comparison: when it matters
Large context windows matter for specific tasks: analyzing long documents, processing large codebases, maintaining long conversation history. They add cost when context is not managed carefully.
| Model | Context window | Notes |
|---|---|---|
| Gemini 2.0 Pro | 2,000,000 tokens | Largest available context window |
| Gemini 1.5 Pro | 2,000,000 tokens | Proven long-context performance |
| Claude 3.7 Sonnet | 200,000 tokens | Strong long-context faithfulness |
| Claude 3.5 Sonnet | 200,000 tokens | 200K with reliable long-context recall |
| GPT-4o | 128,000 tokens | Solid performance throughout window |
| DeepSeek V3 | 128,000 tokens | Competitive long-context performance |
| Llama 3.1 70B | 128,000 tokens | Open weights, self-hostable |
| DeepSeek R1 | 64,000 tokens | Reasoning model; shorter context |
| Mistral Large 2 | 128,000 tokens | Strong multilingual long-context |
| Qwen 2.5 72B | 131,072 tokens | Competitive long-context |
| AI21 Jamba 1.5 Large | 256,000 tokens | Specialized long-context efficiency |
The long-context cost warning: A 1M token context at GPT-4o pricing costs $2.50. A 1M token context at Gemini 2.0 Pro pricing costs $1.25. But the real cost driver for long contexts is not the per-token price: it is whether your application is loading the full context when only a fraction is needed. RAG with a vector database that retrieves 2,000 relevant tokens is almost always cheaper and more accurate than loading a 200,000-token document as raw context.
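A worked version of that warning, using GPT-4o's $2.50/M input price (the 2,000-token retrieval size is illustrative):

```python
GPT4O_INPUT_PER_M = 2.50  # $ per million input tokens

def input_cost(tokens: int, price_per_m: float = GPT4O_INPUT_PER_M) -> float:
    return tokens * price_per_m / 1_000_000

full_document = input_cost(200_000)  # load the whole 200K-token document
rag_retrieval = input_cost(2_000)    # retrieve only the relevant chunks
print(full_document, rag_retrieval)  # 0.5 vs 0.005 per call: a 100x gap
```

The gap compounds: every agent step that re-sends the full document pays the $0.50 again, while a retrieval step pays half a cent.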
Model selection decision tree
Use this to identify your starting model candidates for any agent use case.
Step 1: Is this a reasoning task?
- Yes, and budget is unconstrained: OpenAI o3 or Gemini 2.5 Pro
- Yes, and budget matters: DeepSeek R1 (off-peak for batch, standard for real-time)
- Yes, and it can be queued overnight: DeepSeek R1 off-peak or QwQ-32B
- No: proceed to Step 2
Step 2: What is your primary performance requirement?
- Maximum quality, cost secondary: Claude 3.7 Sonnet or GPT-4o
- Balance of quality and cost: DeepSeek V3 or Gemini 2.0 Flash
- Speed is primary (real-time interactive): Llama 3.3 70B on Cerebras (2,100 tok/s)
- Cost is primary: Phi-4 Mini or Gemini 2.0 Flash-Lite
Step 3: Do you have data residency requirements?
- Must stay on US/EU infrastructure: use OpenRouter for DeepSeek routing, or avoid Chinese-origin models entirely
- AWS-contracted with Bedrock requirement: Amazon Nova Pro or Amazon Nova Lite
- On-premises required: Llama 3.3 70B or DeepSeek V3 self-hosted via vLLM
Step 4: What is your volume?
- Under 1M tokens/month: model choice matters less than getting the system working
- 1M to 100M tokens/month: optimize at Tier 4 (DeepSeek V3, Gemini 2.0 Flash)
- Over 100M tokens/month: evaluate caching strategies and self-hosting ROI
Multi-model routing: the strategy that applies across all tiers
No production system should use a single model for all tasks. The value-maximizing architecture routes tasks to the cheapest model that handles them adequately.
A practical routing implementation:
```python
from litellm import completion

def route_by_complexity(task: str, complexity_score: float) -> str:
    """Return the optimal model for this task and complexity level."""
    # Reasoning tasks: high-complexity mathematical or logical problems
    if "math" in task.lower() or "prove" in task.lower() or complexity_score > 0.9:
        return "deepseek/deepseek-reasoner"  # DeepSeek R1: best-value reasoning
    # Complex writing, nuanced judgment, high-stakes outputs
    if complexity_score > 0.7:
        return "anthropic/claude-3-5-sonnet-20241022"
    # Standard agent tasks: drafting, analysis, code generation
    if complexity_score > 0.4:
        return "deepseek/deepseek-chat"  # DeepSeek V3
    # High-volume classification, extraction, simple Q&A
    return "gemini/gemini-2.0-flash"

def call_llm(task: str, prompt: str, complexity_score: float) -> str:
    model = route_by_complexity(task, complexity_score)
    response = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```
LiteLLM's unified interface means switching models is a single string change. This makes routing trivial to implement and easy to A/B test.
Expected savings from routing: Teams implementing routing across these complexity tiers typically reduce LLM spend by 40 to 70 percent without measurable aggregate quality degradation.
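A hypothetical traffic mix shows where savings in that range come from. The 40/40/20 split below is an illustrative assumption, not measured data; prices are the blended figures from the tables above:

```python
# Blended $/1M from the comparison tables
PRICES = {"claude-3.5-sonnet": 6.00, "deepseek-v3": 0.19, "gemini-2.0-flash": 0.15}

def blended_cost(mix: dict[str, float]) -> float:
    """Weighted average $/1M for a traffic mix of {model: share}."""
    return sum(share * PRICES[model] for model, share in mix.items())

all_sonnet = blended_cost({"claude-3.5-sonnet": 1.0})
routed = blended_cost({
    "claude-3.5-sonnet": 0.4,   # complex tasks stay on the premium model
    "deepseek-v3": 0.4,         # standard drafting and analysis
    "gemini-2.0-flash": 0.2,    # classification and extraction
})
print(f"{1 - routed / all_sonnet:.0%}")  # 58%, inside the 40-70% range
```

Even with 40 percent of traffic still on the premium model, the blended rate falls from $6.00/M to about $2.51/M.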
Batch API pricing: the 50% discount most teams miss
Both OpenAI and Anthropic offer batch APIs with a 24-hour completion window at 50% of standard pricing.
| Model | Standard blended | Batch blended | Savings |
|---|---|---|---|
| GPT-4o | $5.00/M | $2.50/M | 50% |
| Claude 3.5 Sonnet | $6.00/M | $3.00/M | 50% |
| Claude 3.5 Haiku | $1.60/M | $0.80/M | 50% |
| GPT-4o mini | $0.26/M | $0.13/M | 50% |
For nightly analysis pipelines, bulk document processing, scheduled research tasks, and any workload that does not need real-time results, batch API halves your frontier model costs immediately.
At $2.50/M blended, GPT-4o batch pricing is competitive with several "cheap" models at standard pricing. This changes the model selection math for non-real-time workloads.
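The volume math is straightforward; the 100M tokens/month figure below is illustrative:

```python
def monthly_api_cost(tokens_millions: float, blended_per_m: float) -> float:
    return tokens_millions * blended_per_m

standard = monthly_api_cost(100, 5.00)  # 100M tokens/month of GPT-4o
batch = standard * 0.5                  # batch API: 50% off, 24-hour window
print(standard, batch)                  # 500.0 vs 250.0 dollars/month
```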
What changes monthly: how to stay current
The LLM pricing and performance landscape changes continuously. Since January 2026 alone:
- DeepSeek released R2 preview with improved benchmark scores
- Google released Gemini 2.0 Flash with 400 tok/s throughput
- OpenAI released o4-mini, significantly reducing the cost of capable reasoning
- Anthropic released Claude 3.7 Sonnet with improved instruction following
- Multiple providers cut pricing 20 to 40% on mid-tier models
Resources for staying current:
- Artificial Analysis (artificialanalysis.ai): daily benchmark updates, 311+ models
- PricePerToken (pricepertoken.com): daily pricing updates with MCP server for Claude/Cursor
- OpenRouter (openrouter.ai): real-time pricing across 300+ models via REST API
- LiteLLM model_prices_and_context_window.json: community-maintained, updated continuously
- Chatbot Arena (lmarena.ai): human preference rankings, 6M+ votes
Set a monthly calendar event to review model pricing changes and re-run your benchmark against current models. New models are released every 2 to 4 weeks. The optimal choice this month may not be optimal next month.
The ROI case for running this analysis on your actual workload
The comparison above provides direction, not a final answer. Your specific workload has a specific quality threshold, and the cheapest model that meets that threshold is the one to use.
The process:
- Export 100 representative tasks from your production logs
- Define a scoring rubric (accuracy, format compliance, task completion rate)
- Run each candidate model against all 100 tasks
- Calculate the quality score for each model
- Plot quality score versus monthly cost at your production volume
- Choose the model at the quality threshold you require
This process takes a day of engineering time. For a team spending $2,000/month on LLM APIs, a 40% cost reduction from model optimization is $800/month, $9,600/year. The ROI on that engineering day is immediate.
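The final selection step can be sketched as a small helper: pick the cheapest model whose quality clears your threshold. The `Result` records here are hypothetical placeholders for your own rubric scores:

```python
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    quality: float     # mean rubric score across your 100 tasks, 0-1
    cost_per_m: float  # blended $/1M tokens

def pick_model(results: list[Result], quality_threshold: float) -> str:
    """Cheapest model whose quality meets the threshold."""
    eligible = [r for r in results if r.quality >= quality_threshold]
    if not eligible:
        raise ValueError("no model meets the quality threshold")
    return min(eligible, key=lambda r: r.cost_per_m).model

scores = [Result("gpt-4o", 0.95, 5.00),
          Result("deepseek-v3", 0.91, 0.19),
          Result("gemini-2.0-flash", 0.86, 0.15)]
print(pick_model(scores, 0.90))  # deepseek-v3
```

The threshold, not the benchmark leaderboard, is what determines the right model: lowering it from 0.94 to 0.90 in this example cuts the cost per million tokens by 26x.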
Ready to optimize your LLM spend?
If you are running production AI agents and want to identify the optimal model routing strategy for your specific workload and quality requirements, we work with technical teams to benchmark, route, and reduce LLM costs without rebuilding what is working.
For the full cost optimization framework beyond model selection, the AI agent frugality guide covers context management, caching, batching, and the Denial of Wallet security threat that affects every team without budget controls. For the AI agent frameworks that sit above the LLM layer, the frameworks comparison 2026 covers production download data and decision criteria for LangGraph, CrewAI, and 13 others.
Frequently Asked Questions
Which LLM offers the best value for money in 2026?
DeepSeek V3 and Gemini 2.0 Flash offer the best value scores in 2026: strong benchmark performance at $0.15 to $0.19 per million tokens blended, compared to $5 to $6 per million for frontier models. For tasks where they perform equivalently to frontier models (most classification, extraction, drafting, and summarization tasks), using them instead of Claude 3.5 Sonnet or GPT-4o represents a 25 to 30x cost reduction. The best value model for your use case depends on your specific task distribution. Benchmark against your actual workload before committing to any model.
How much does it cost to run 1 million tokens through different LLMs?
Costs vary enormously by provider and model. At the low end: Gemini 2.0 Flash costs $0.075 per million input tokens and DeepSeek V3 costs $0.14 per million. At the high end: Claude Opus costs $15 per million input tokens and GPT-4o costs $2.50 per million. Reasoning models generate additional thinking tokens: OpenAI o3 at $40 per million output tokens can cost $0.60 to $1.20 per complex reasoning task in thinking tokens alone. The blended cost (3:1 input-to-output ratio) for a representative production workload ranges from $0.15 per million tokens (Gemini 2.0 Flash) to $30 per million (Claude Opus).
Are Chinese LLMs like DeepSeek reliable for production use?
DeepSeek V3 and DeepSeek R1 have demonstrated benchmark performance competitive with frontier Western models on coding, math, and reasoning tasks. Both are available through multiple API providers including OpenRouter, Groq (for distilled versions), Together AI, and DeepSeek's own API. The primary considerations for production use are data privacy (API requests go to DeepSeek's Chinese servers unless routed through a third-party provider), uptime reliability (DeepSeek's direct API has experienced capacity constraints during peak demand), and export control compliance for organizations subject to relevant regulations. For US government contractors and defense-adjacent organizations, legal review is required before use.
What is the difference between input tokens and output tokens in LLM pricing?
LLM APIs charge separately for input tokens (text sent to the model, including system prompt, conversation history, and user message) and output tokens (text the model generates). Output tokens are typically 3 to 10 times more expensive than input tokens. For example, Claude 3.5 Sonnet charges $3.00 per million input tokens and $15.00 per million output tokens. In production agent workloads, input tokens typically outnumber output tokens by 3:1 to 10:1 because agents accumulate context across steps. The blended price calculation that matches real workloads: (input price x 3) + (output price x 1), divided by 4.
What are reasoning models and when should you use them?
Reasoning models (OpenAI o1, o3, o4-mini, DeepSeek R1, Gemini 2.5 Pro thinking mode) generate internal chain-of-thought tokens before producing their final response. These thinking tokens are billed to users. On complex math, logic, coding, and multi-constraint problems, reasoning models outperform standard frontier models significantly. On straightforward tasks, they cost 5 to 20 times more than standard models for no quality improvement. Use reasoning models for: competition-level math, complex code debugging, multi-step logic problems, and ambiguous multi-constraint decisions. Route everything else to standard models.