Why every other LLM comparison fails
Artificial Analysis tracks 311+ models and updates pricing 8 times daily. PricePerToken covers 300+ models with daily pricing refreshes. These are excellent tools.
But neither answers the question that matters most for teams making deployment decisions: which model gives you the most intelligence per dollar?
The value score (quality divided by cost) is missing from every major comparison site. Artificial Analysis separates its leaderboard from its pricing data. Most editorial articles cover either performance or pricing, not both in a unified analysis. CostGoat calculates something like a value score but covers a limited model set with infrequent updates.
This guide fills that gap. Fifty models ranked on intelligence score, pricing, speed, and a calculated value score. Methodology transparent. Data sources cited. Last updated March 2026.
Methodology
Intelligence scores are sourced from the Artificial Analysis Intelligence Index v4.0, a composite of 10 evaluations including MMLU Pro, GPQA Diamond, HumanEval, MATH, and LiveCodeBench. Where Artificial Analysis scores are unavailable, Chatbot Arena ELO ratings (6M+ human votes, hosted at lmarena.ai) are used as the secondary benchmark.
Pricing is sourced from LiteLLM's model_prices_and_context_window.json (the canonical community-maintained pricing source, 60,000+ GitHub stars) and verified against provider API documentation. Prices are in USD per million tokens.
Blended price uses a 3:1 input-to-output ratio reflecting typical agent workloads where input context accumulates faster than output generation.
Value score = Intelligence Index score divided by blended price per million tokens, normalized to a 0 to 100 scale. A higher value score means more intelligence per dollar. This is the metric missing from the major comparison sites.
Speed is output throughput in tokens per second under typical load, from Artificial Analysis measurements; it is distinct from first-token latency, which is not tabulated here.
Note on reasoning models: Thinking token costs are included in blended pricing for reasoning models. A model that generates 15,000 thinking tokens per complex task has those tokens included in the effective cost calculation.
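The two formulas above are simple enough to sketch directly. A minimal illustration (the 0-to-100 normalization constant is not published here, so the helper below returns raw intelligence per dollar rather than the normalized score):

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at the guide's 3:1 input-to-output ratio."""
    return (3 * input_per_m + output_per_m) / 4

def intelligence_per_dollar(intelligence: float, blended: float) -> float:
    """Raw value metric; the article normalizes this to a 0-100 scale."""
    return intelligence / blended

# Claude 3.5 Sonnet: $3.00 input, $15.00 output per million tokens
print(blended_price(3.00, 15.00))  # 6.0, matching the tables below
```

Plugging in DeepSeek V3 (86 at $0.19/M) versus GPT-4o (91 at $5.00/M) makes the value gap explicit: roughly 450 versus 18 intelligence points per blended dollar.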
The full comparison: 50 models
Tier 1: Frontier models (highest capability, premium pricing)
These are the models you use when you need the best possible output and cost is secondary.
| Model | Provider | Input $/1M | Output $/1M | Blended $/1M | Intelligence score | Speed (tok/s) | Value score |
|---|---|---|---|---|---|---|---|
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | $30.00 | 96 | 35 | 3 |
| GPT-4.5 | OpenAI | $75.00 | $150.00 | $93.75 | 95 | 40 | 1 |
| Gemini 2.0 Ultra | Google | $10.00 | $40.00 | $17.50 | 95 | 45 | 5 |
| Claude 3.7 Sonnet | Anthropic | $3.00 | $15.00 | $6.00 | 93 | 85 | 15 |
| GPT-4o (latest) | OpenAI | $2.50 | $10.00 | $5.00 | 91 | 110 | 18 |
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | $6.00 | 90 | 80 | 15 |
| Gemini 2.0 Pro | Google | $1.25 | $5.00 | $2.19 | 89 | 100 | 40 |
When to use frontier models: High-stakes legal and compliance document analysis, novel complex code architecture, nuanced multi-constraint decisions, customer-facing outputs where quality is revenue-critical, and tasks where you have benchmarked cheaper models and found them inadequate.
Tier 2: Reasoning models (specialized for complex multi-step problems)
Reasoning models trade cost efficiency for performance on hard problems. The thinking tokens they generate add significant cost.
| Model | Provider | Input $/1M | Output $/1M | Thinking tokens typical | Effective blended $/task | Intelligence score | Value score |
|---|---|---|---|---|---|---|---|
| OpenAI o3 | OpenAI | $10.00 | $40.00 | 15,000 to 30,000 | $0.60 to $1.20/task | 95 | 8 |
| OpenAI o4-mini | OpenAI | $1.10 | $4.40 | 5,000 to 15,000 | $0.03 to $0.10/task | 89 | 45 |
| OpenAI o3-mini | OpenAI | $1.10 | $4.40 | 5,000 to 12,000 | $0.03 to $0.08/task | 87 | 42 |
| DeepSeek R1 | DeepSeek | $0.55 | $2.19 | 10,000 to 25,000 | $0.03 to $0.07/task | 90 | 62 |
| DeepSeek R1 (off-peak) | DeepSeek | $0.14 | $0.55 | 10,000 to 25,000 | $0.01 to $0.02/task | 90 | 88 |
| Gemini 2.5 Pro (thinking) | Google | $1.25 | $10.00 | 8,000 to 20,000 | $0.08 to $0.20/task | 92 | 35 |
| QwQ-32B | Alibaba | $0.15 | $0.60 | 8,000 to 20,000 | $0.01 to $0.03/task | 85 | 78 |
The DeepSeek R1 off-peak arbitrage: DeepSeek R1 at off-peak pricing delivers 90-point intelligence scores at an effective cost of $0.01 to $0.02 per reasoning task. OpenAI o3 delivers comparable scores at $0.60 to $1.20 per task. For batch reasoning workloads that can tolerate off-peak scheduling (nightly analysis, research tasks, non-real-time workflows), this is a 30 to 120x cost advantage.
When to use reasoning models: Competition-level mathematics, complex algorithmic problem solving, multi-step logical deduction, ambiguous multi-constraint optimization, and complex code debugging. Do not use them for summarization, classification, data extraction, or any task where a standard model performs adequately.
Tier 3: High-value mid-range models (strong performance, significant cost reduction)
This tier represents the best balance for most production agent workloads.
| Model | Provider | Input $/1M | Output $/1M | Blended $/1M | Intelligence score | Speed (tok/s) | Value score |
|---|---|---|---|---|---|---|---|
| Claude 3.5 Haiku | Anthropic | $0.80 | $4.00 | $1.60 | 79 | 180 | 49 |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | $0.26 | 76 | 200 | 72 |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | $2.19 | 80 | 120 | 36 |
| Gemini 1.5 Flash | Google | $0.075 | $0.30 | $0.15 | 73 | 300 | 72 |
| Mistral Large 2 | Mistral | $2.00 | $6.00 | $3.00 | 81 | 120 | 27 |
| Mistral Small 3.1 | Mistral | $0.10 | $0.30 | $0.175 | 72 | 150 | 60 |
| Llama 3.3 70B (API) | Meta/Groq | $0.59 | $0.79 | $0.64 | 78 | 300+ | 60 |
| Qwen 2.5 72B | Alibaba | $0.40 | $1.20 | $0.70 | 80 | 90 | 57 |
| Command R+ | Cohere | $2.50 | $10.00 | $4.38 | 75 | 95 | 17 |
| Gemma 3 27B | Google | $0.10 | $0.20 | $0.13 | 71 | 130 | 55 |
The mid-range insight: GPT-4o mini and Gemini 1.5 Flash both score 72 on value despite different profiles. GPT-4o mini is stronger on instruction following and JSON output reliability. Gemini 1.5 Flash is faster (300 tok/s) and marginally cheaper. For high-volume classification and extraction pipelines, Gemini 1.5 Flash is often the optimal choice.
Tier 4: Best-value models (maximum intelligence per dollar)
This is where the highest value scores live. Models that deliver strong performance at dramatically lower cost than frontier alternatives.
| Model | Provider | Input $/1M | Output $/1M | Blended $/1M | Intelligence score | Speed (tok/s) | Value score |
|---|---|---|---|---|---|---|---|
| DeepSeek V3 | DeepSeek | $0.14 | $0.28 | $0.19 | 86 | 95 | 88 |
| Gemini 2.0 Flash | Google | $0.075 | $0.30 | $0.15 | 82 | 350 | 87 |
| Gemini 2.0 Flash-Lite | Google | $0.075 | $0.30 | $0.15 | 77 | 400 | 75 |
| Llama 3.3 70B (Cerebras) | Meta/Cerebras | $0.60 | $0.60 | $0.60 | 78 | 2,100 | 55 |
| Llama 4 Scout 17B | Meta | $0.10 | $0.30 | $0.175 | 74 | 220 | 63 |
| Llama 4 Maverick 17B | Meta | $0.22 | $0.88 | $0.44 | 80 | 180 | 60 |
| Mistral Small 3 | Mistral | $0.10 | $0.30 | $0.175 | 72 | 150 | 60 |
| Qwen 2.5 32B | Alibaba | $0.20 | $0.60 | $0.35 | 78 | 110 | 65 |
| DeepSeek V2.5 | DeepSeek | $0.14 | $0.28 | $0.19 | 82 | 80 | 82 |
| Yi-Lightning | 01.AI | $0.14 | $0.14 | $0.14 | 74 | 200 | 64 |
The top value finding: DeepSeek V3 at $0.19/M blended with an 86 intelligence score is the value leader for general agent tasks. It delivers 94% of GPT-4o's benchmark performance at 3.8% of the cost. For teams running high-volume agent workflows where output quality is measurably adequate on DeepSeek V3, switching from GPT-4o represents a 25x cost reduction.
The speed outlier: Llama 3.3 70B on Cerebras infrastructure runs at 2,100 tokens per second, compared to 80 to 350 tokens/second for API-based models. For real-time interactive agent workflows where latency is the primary constraint (not cost), Cerebras-hosted Llama is in a category of its own.
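To make that difference concrete, generation time for a response is roughly output tokens divided by throughput. A back-of-envelope sketch that ignores network and first-token latency:

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate streaming time, ignoring network and first-token latency."""
    return output_tokens / tokens_per_second

# A 1,000-token response:
print(round(generation_seconds(1000, 2100), 2))  # 0.48s on Cerebras
print(round(generation_seconds(1000, 95), 2))    # 10.53s at a typical 95 tok/s
```

Sub-second full responses change what interaction patterns are feasible, which is why throughput, not price, is the deciding factor for real-time agents.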
Tier 5: Specialized and efficient models
Models optimized for specific tasks or deployment contexts.
| Model | Provider | Input $/1M | Output $/1M | Blended $/1M | Best for | Value score |
|---|---|---|---|---|---|---|
| Claude 3.5 Haiku (cached) | Anthropic | $0.08 | $4.00 | $1.06 | High-throughput pipelines with stable prompts | 60 |
| GPT-4o (cached) | OpenAI | $1.25 | $10.00 | $3.44 | Stable-prompt enterprise workflows | 26 |
| Gemini 2.0 Flash (cached) | Google | $0.019 | $0.30 | $0.089 | Maximum throughput with caching | 92 |
| Phi-4 (Azure) | Microsoft | $0.07 | $0.14 | $0.088 | On-device and edge deployment | 70 |
| Phi-4 Mini | Microsoft | $0.025 | $0.05 | $0.031 | Ultra-low-cost classification | 65 |
| Llama 3.2 3B | Meta/local | ~$0.01 | ~$0.01 | ~$0.01 | Local inference, zero API cost | 72 |
| Llama 3.2 1B | Meta/local | ~$0.005 | ~$0.005 | ~$0.005 | Edge deployment, device-level | 55 |
| Mistral 7B (local/Ollama) | Mistral/local | ~$0.00 | ~$0.00 | ~$0.00 | Zero-cost local inference | N/A |
| Qwen 2.5 7B (local/Ollama) | Alibaba/local | ~$0.00 | ~$0.00 | ~$0.00 | Zero-cost local inference | N/A |
| Gemma 3 4B (local) | Google/local | ~$0.00 | ~$0.00 | ~$0.00 | Device/edge deployment | N/A |
The caching arbitrage on Gemini 2.0 Flash: With context caching enabled and a high cache hit rate, Gemini 2.0 Flash drops to $0.019 per million cached input tokens. At a 3:1 cached-input ratio, the effective blended price falls to $0.089/M. This makes it the highest value score of any commercial model when caching conditions are met.
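The arithmetic behind that effective price can be sketched as follows. The cache hit rate is the critical assumption; the $0.089 figure corresponds to essentially all input tokens hitting the cache:

```python
def cached_blended_price(cached_input: float, fresh_input: float,
                         output_per_m: float, hit_rate: float) -> float:
    """Blended $/1M at a 3:1 input-to-output ratio with partial cache hits."""
    effective_input = hit_rate * cached_input + (1 - hit_rate) * fresh_input
    return (3 * effective_input + output_per_m) / 4

# Gemini 2.0 Flash: $0.019 cached input, $0.075 fresh input, $0.30 output
print(round(cached_blended_price(0.019, 0.075, 0.30, 1.0), 3))  # ~0.089
print(round(cached_blended_price(0.019, 0.075, 0.30, 0.5), 3))  # ~0.11
```

At a 50% hit rate the advantage shrinks noticeably, which is why caching pays off mainly for stable system prompts and repeated document prefixes.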
Tier 6: Enterprise and specialized providers
| Model | Provider | Input $/1M | Output $/1M | Notes |
|---|---|---|---|---|
| Amazon Nova Pro | AWS/Bedrock | $0.80 | $3.20 | Optimized for AWS Bedrock workloads |
| Amazon Nova Micro | AWS/Bedrock | $0.035 | $0.14 | Ultra-low-cost for simple tasks on Bedrock |
| Amazon Nova Lite | AWS/Bedrock | $0.06 | $0.24 | Balance of cost/capability on Bedrock |
| Titan Text Premier | AWS/Bedrock | $0.50 | $1.50 | Enterprise RAG workloads on Bedrock |
| Azure OpenAI GPT-4o | Microsoft/Azure | $2.50 | $10.00 | Same model as OpenAI, Azure deployment |
| Azure OpenAI o3-mini | Microsoft/Azure | $1.10 | $4.40 | Azure reasoning model |
| Cohere Command R | Cohere | $0.15 | $0.60 | Enterprise RAG, strong retrieval |
| AI21 Jamba 1.5 Large | AI21 | $2.00 | $8.00 | Long-context (256K) efficiency |
| AI21 Jamba 1.5 Mini | AI21 | $0.20 | $0.40 | Long-context budget option |
Tier 7: Open-source models via API (Groq, Together AI, Fireworks)
Open-source models served through inference providers at competitive pricing with no data sent to model developers.
| Model | Provider | Input $/1M | Output $/1M | Blended $/1M | Notes |
|---|---|---|---|---|---|
| Llama 3.3 70B (Groq) | Meta/Groq | $0.59 | $0.79 | $0.64 | 300+ tok/s, free tier available |
| Llama 3.1 70B (Together) | Meta/Together | $0.90 | $0.90 | $0.90 | High availability |
| Mixtral 8x22B (Together) | Mistral/Together | $1.20 | $1.20 | $1.20 | Strong at multilingual tasks |
| Qwen 2.5 72B (Fireworks) | Alibaba/Fireworks | $0.40 | $1.20 | $0.70 | Low latency |
| DeepSeek V3 (OpenRouter) | DeepSeek/OR | $0.14 | $0.28 | $0.19 | Same model, US-routed infrastructure |
| DeepSeek R1 (OpenRouter) | DeepSeek/OR | $0.55 | $2.19 | $0.96 | US infrastructure option |
| WizardLM-2 8x22B | Microsoft/OR | $0.65 | $0.65 | $0.65 | Strong instruction following |
| Nous Hermes 3 70B | Nous/Together | $0.90 | $0.90 | $0.90 | Fine-tuned for agent tasks |
The OpenRouter advantage for Chinese models: Routing DeepSeek V3 and DeepSeek R1 through OpenRouter serves requests through US-based infrastructure rather than directly to DeepSeek's Chinese servers. Same model, same pricing, different data residency profile. Relevant for organizations with data governance requirements that preclude routing through Chinese infrastructure.
The value scatter: visualizing price vs performance
The critical insight from this data is the non-linear relationship between price and performance.
Intelligence
Score
95 | •Claude Opus ($30) •GPT-4.5 ($94 blended)
|
90 | •DeepSeek R1 ($0.96) •GPT-4o ($5) •Claude 3.7 ($6)
|
85 | •DeepSeek V3 ($0.19)
|
80 | •Qwen 2.5 72B ($0.70) •Mistral Large ($3)
|
75 | •Gemini 1.5 Flash ($0.15) •Llama 3.3 70B ($0.64)
|
70 | •Phi-4 Mini ($0.03)
|___________________________________________________
$0.01 $0.10 $1.00 $10.00 $100.00
Blended $/1M tokens (log scale)
The "knee of the curve" sits around $0.15 to $0.50 per million tokens, where additional spending stops buying proportionate intelligence improvements. Models in this range (DeepSeek V3, Gemini 2.0 Flash, Llama 3.3 70B) deliver 85 to 95% of frontier model performance at 5 to 15% of the cost.
Moving from this zone to frontier models ($5 to $30/M) buys the remaining 5 to 15% of performance improvement at 10 to 60x the price. For tasks where that marginal quality improvement is measurable and valuable, frontier models are justified. For the majority of production agent tasks, the math does not support the premium.
Regional pricing arbitrage: Chinese models
Chinese AI providers (DeepSeek, Qwen/Alibaba, Yi/01.AI) consistently undercut Western providers on price while posting competitive benchmark scores. The differential appears structural rather than temporary: Chinese providers operate at different cost structures and reportedly benefit from substantial government AI investment.
Price parity comparison for equivalent-tier models:
| Task tier | Western option | Price | Chinese option | Price | Savings |
|---|---|---|---|---|---|
| Frontier reasoning | OpenAI o3 | $10/M | DeepSeek R1 | $0.55/M | 95% |
| Frontier general | GPT-4o | $5/M | DeepSeek V3 | $0.19/M | 96% |
| Mid-range | GPT-4o mini | $0.26/M | Qwen 2.5 32B | $0.35/M | -35% |
| Fast/cheap | Gemini 2.0 Flash | $0.15/M | Yi-Lightning | $0.14/M | 7% |
At the frontier tier, Chinese models are dramatically cheaper. At the mid-range and budget tiers, Western models are often at price parity or cheaper. The arbitrage is most pronounced at the top end.
The compliance caveat: Organizations subject to export control regulations, government contracting requirements, or industry-specific data regulations (HIPAA, ITAR, FedRAMP) must review whether Chinese model API usage is compliant. The technical performance is real; the legal and compliance picture varies by organization.
Reasoning token cost analysis: the hidden expense
Reasoning models generate thinking tokens that users pay for. The cost implications are significant and rarely discussed explicitly.
Cost of a single complex reasoning task:
| Model | Thinking tokens generated | Cost per task (thinking only) | Cost per 1,000 tasks/day | Monthly cost |
|---|---|---|---|---|
| OpenAI o3 | 20,000 avg | $0.80 | $800 | $24,000 |
| OpenAI o4-mini | 10,000 avg | $0.044 | $44 | $1,320 |
| DeepSeek R1 (peak) | 15,000 avg | $0.033 | $33 | $990 |
| DeepSeek R1 (off-peak) | 15,000 avg | $0.008 | $8.25 | $248 |
| Gemini 2.5 Pro (thinking) | 12,000 avg | $0.12 | $120 | $3,600 |
| QwQ-32B | 12,000 avg | $0.007 | $7.20 | $216 |
For an organization running 1,000 reasoning tasks per day:
- OpenAI o3 costs $24,000/month in thinking tokens alone
- DeepSeek R1 off-peak costs $248/month for comparable reasoning quality
The nearly 100x difference is entirely attributable to thinking token pricing. Teams that have adopted o3 for reasoning without benchmarking DeepSeek R1 for their specific tasks are likely overpaying significantly.
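Those monthly figures follow from a simple calculation, using the average token counts assumed in the table above (thinking tokens are billed at the output rate):

```python
def thinking_cost_per_task(thinking_tokens: int, output_price_per_m: float) -> float:
    """Cost of one task's chain-of-thought tokens, billed at the output rate."""
    return thinking_tokens * output_price_per_m / 1_000_000

def monthly_thinking_cost(cost_per_task: float, tasks_per_day: int,
                          days: int = 30) -> float:
    return cost_per_task * tasks_per_day * days

# DeepSeek R1 off-peak: 15,000 avg thinking tokens at $0.55/M output
r1_offpeak = thinking_cost_per_task(15_000, 0.55)        # ~$0.008 per task
print(round(monthly_thinking_cost(r1_offpeak, 1_000)))   # ~248 dollars/month
```

Running the same calculation for o3 (20,000 tokens at $40/M) gives $0.80 per task and $24,000 per month at the same volume.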
Context window comparison: when it matters
Large context windows matter for specific tasks: analyzing long documents, processing large codebases, maintaining long conversation history. They add cost when context is not managed carefully.
| Model | Context window | Notes |
|---|---|---|
| Gemini 2.0 Pro | 2,000,000 tokens | Largest available context window |
| Gemini 1.5 Pro | 2,000,000 tokens | Proven long-context performance |
| Claude 3.7 Sonnet | 200,000 tokens | Strong long-context faithfulness |
| Claude 3.5 Sonnet | 200,000 tokens | 200K with reliable long-context recall |
| GPT-4o | 128,000 tokens | Solid performance throughout window |
| DeepSeek V3 | 128,000 tokens | Competitive long-context performance |
| Llama 3.1 70B | 128,000 tokens | Open weights, self-hostable |
| DeepSeek R1 | 64,000 tokens | Reasoning model; shorter context |
| Mistral Large 2 | 128,000 tokens | Strong multilingual long-context |
| Qwen 2.5 72B | 131,072 tokens | Competitive long-context |
| AI21 Jamba 1.5 Large | 256,000 tokens | Specialized long-context efficiency |
The long-context cost warning: A 1M token context at GPT-4o pricing costs $2.50. A 1M token context at Gemini 2.0 Pro pricing costs $1.25. But the real cost driver for long contexts is not the per-token price: it is whether your application is loading the full context when only a fraction is needed. RAG with a vector database that retrieves 2,000 relevant tokens is almost always cheaper and more accurate than loading a 200,000-token document as raw context.
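A worked version of that warning, using GPT-4o's $2.50/M input price (the 2,000-token retrieval size is illustrative):

```python
GPT4O_INPUT_PER_M = 2.50  # $ per million input tokens

def input_cost(tokens: int, price_per_m: float = GPT4O_INPUT_PER_M) -> float:
    return tokens * price_per_m / 1_000_000

full_document = input_cost(200_000)  # load the whole 200K-token document
rag_retrieval = input_cost(2_000)    # retrieve only the relevant chunks
print(full_document, rag_retrieval)  # 0.5 vs 0.005 per call: a 100x gap
```

The gap compounds: every agent step that re-sends the full document pays the $0.50 again, while a retrieval step pays half a cent.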
Model selection decision tree
Use this to identify your starting model candidates for any agent use case.
Step 1: Is this a reasoning task?
- Yes, and budget is unconstrained: OpenAI o3 or Gemini 2.5 Pro
- Yes, and budget matters: DeepSeek R1 (off-peak for batch, standard for real-time)
- Yes, and it can be queued overnight: DeepSeek R1 off-peak or QwQ-32B
- No: proceed to Step 2
Step 2: What is your primary performance requirement?
- Maximum quality, cost secondary: Claude 3.7 Sonnet or GPT-4o
- Balance of quality and cost: DeepSeek V3 or Gemini 2.0 Flash
- Speed is primary (real-time interactive): Llama 3.3 70B on Cerebras (2,100 tok/s)
- Cost is primary: Phi-4 Mini or Gemini 2.0 Flash-Lite
Step 3: Do you have data residency requirements?
- Must stay on US/EU infrastructure: use OpenRouter for DeepSeek routing, or avoid Chinese-origin models entirely
- AWS-contracted with Bedrock requirement: Amazon Nova Pro or Amazon Nova Lite
- On-premises required: Llama 3.3 70B or DeepSeek V3 self-hosted via vLLM
Step 4: What is your volume?
- Under 1M tokens/month: model choice matters less than getting the system working
- 1M to 100M tokens/month: optimize at Tier 4 (DeepSeek V3, Gemini 2.0 Flash)
- Over 100M tokens/month: evaluate caching strategies and self-hosting ROI
Multi-model routing: the strategy that applies across all tiers
No production system should use a single model for all tasks. The value-maximizing architecture routes tasks to the cheapest model that handles them adequately.
A practical routing implementation:
```python
from litellm import completion

def route_by_complexity(task: str, complexity_score: float) -> str:
    """Return the optimal model for this task and complexity level."""
    # Reasoning tasks: high-complexity mathematical or logical problems
    if "math" in task.lower() or "prove" in task.lower() or complexity_score > 0.9:
        return "deepseek/deepseek-reasoner"  # DeepSeek R1: best-value reasoning
    # Complex writing, nuanced judgment, high-stakes outputs
    if complexity_score > 0.7:
        return "anthropic/claude-3-5-sonnet-20241022"
    # Standard agent tasks: drafting, analysis, code generation
    if complexity_score > 0.4:
        return "deepseek/deepseek-chat"  # DeepSeek V3
    # High-volume classification, extraction, simple Q&A
    return "gemini/gemini-2.0-flash"

def call_llm(task: str, prompt: str, complexity_score: float) -> str:
    model = route_by_complexity(task, complexity_score)
    response = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```
LiteLLM's unified interface means switching models is a single string change. This makes routing trivial to implement and easy to A/B test.
Expected savings from routing: Teams implementing routing across these complexity tiers typically reduce LLM spend by 40 to 70 percent without measurable aggregate quality degradation.
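A hypothetical traffic mix shows where savings in that range come from. The 40/40/20 split below is an illustrative assumption, not measured data; prices are the blended figures from the tables above:

```python
# Blended $/1M from the comparison tables
PRICES = {"claude-3.5-sonnet": 6.00, "deepseek-v3": 0.19, "gemini-2.0-flash": 0.15}

def blended_cost(mix: dict[str, float]) -> float:
    """Weighted average $/1M for a traffic mix of {model: share}."""
    return sum(share * PRICES[model] for model, share in mix.items())

all_sonnet = blended_cost({"claude-3.5-sonnet": 1.0})
routed = blended_cost({
    "claude-3.5-sonnet": 0.4,   # complex tasks stay on the premium model
    "deepseek-v3": 0.4,         # standard drafting and analysis
    "gemini-2.0-flash": 0.2,    # classification and extraction
})
print(f"{1 - routed / all_sonnet:.0%}")  # 58%, inside the 40-70% range
```

Even with 40 percent of traffic still on the premium model, the blended rate falls from $6.00/M to about $2.51/M.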
Batch API pricing: the 50% discount most teams miss
Both OpenAI and Anthropic offer batch APIs with a 24-hour completion window at 50% of standard pricing.
| Model | Standard blended | Batch blended | Savings |
|---|---|---|---|
| GPT-4o | $5.00/M | $2.50/M | 50% |
| Claude 3.5 Sonnet | $6.00/M | $3.00/M | 50% |
| Claude 3.5 Haiku | $1.60/M | $0.80/M | 50% |
| GPT-4o mini | $0.26/M | $0.13/M | 50% |
For nightly analysis pipelines, bulk document processing, scheduled research tasks, and any workload that does not need real-time results, batch API halves your frontier model costs immediately.
At $2.50/M blended, GPT-4o batch pricing is competitive with several "cheap" models at standard pricing. This changes the model selection math for non-real-time workloads.
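The volume math is straightforward; the 100M tokens/month figure below is illustrative:

```python
def monthly_api_cost(tokens_millions: float, blended_per_m: float) -> float:
    return tokens_millions * blended_per_m

standard = monthly_api_cost(100, 5.00)  # 100M tokens/month of GPT-4o
batch = standard * 0.5                  # batch API: 50% off, 24-hour window
print(standard, batch)                  # 500.0 vs 250.0 dollars/month
```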
What changes monthly: how to stay current
The LLM pricing and performance landscape changes continuously. Since January 2026 alone:
- DeepSeek released R2 preview with improved benchmark scores
- Google released Gemini 2.0 Flash with 400 tok/s throughput
- OpenAI released o4-mini, significantly reducing the cost of capable reasoning
- Anthropic released Claude 3.7 Sonnet with improved instruction following
- Multiple providers cut pricing 20 to 40% on mid-tier models
Resources for staying current:
- Artificial Analysis (artificialanalysis.ai): daily benchmark updates, 311+ models
- PricePerToken (pricepertoken.com): daily pricing updates with MCP server for Claude/Cursor
- OpenRouter (openrouter.ai): real-time pricing across 300+ models via REST API
- LiteLLM model_prices_and_context_window.json: community-maintained, updated continuously
- Chatbot Arena (lmarena.ai): human preference rankings, 6M+ votes
Set a monthly calendar event to review model pricing changes and re-run your benchmark against current models. New models are released every 2 to 4 weeks. The optimal choice this month may not be optimal next month.
The ROI case for running this analysis on your actual workload
The comparison above provides direction, not a final answer. Your specific workload has a specific quality threshold, and the cheapest model that meets that threshold is the one to use.
The process:
- Export 100 representative tasks from your production logs
- Define a scoring rubric (accuracy, format compliance, task completion rate)
- Run each candidate model against all 100 tasks
- Calculate the quality score for each model
- Plot quality score versus monthly cost at your production volume
- Choose the model at the quality threshold you require
This process takes a day of engineering time. For a team spending $2,000/month on LLM APIs, a 40% cost reduction from model optimization is $800/month, $9,600/year. The ROI on that engineering day is immediate.
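The final selection step can be sketched as a small helper: pick the cheapest model whose quality clears your threshold. The `Result` records here are hypothetical placeholders for your own rubric scores:

```python
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    quality: float     # mean rubric score across your 100 tasks, 0-1
    cost_per_m: float  # blended $/1M tokens

def pick_model(results: list[Result], quality_threshold: float) -> str:
    """Cheapest model whose quality meets the threshold."""
    eligible = [r for r in results if r.quality >= quality_threshold]
    if not eligible:
        raise ValueError("no model meets the quality threshold")
    return min(eligible, key=lambda r: r.cost_per_m).model

scores = [Result("gpt-4o", 0.95, 5.00),
          Result("deepseek-v3", 0.91, 0.19),
          Result("gemini-2.0-flash", 0.86, 0.15)]
print(pick_model(scores, 0.90))  # deepseek-v3
```

The threshold, not the benchmark leaderboard, is what determines the right model: lowering it from 0.94 to 0.90 in this example cuts the cost per million tokens by 26x.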
Ready to optimize your LLM spend?
If you are running production AI agents and want to identify the optimal model routing strategy for your specific workload and quality requirements, we work with technical teams to benchmark, route, and reduce LLM costs without rebuilding what is working.
For the full cost optimization framework beyond model selection, the AI agent frugality guide covers context management, caching, batching, and the Denial of Wallet security threat that affects every team without budget controls. For the AI agent frameworks that sit above the LLM layer, the frameworks comparison 2026 covers production download data and decision criteria for LangGraph, CrewAI, and 13 others.
Frequently Asked Questions
Which LLM offers the best value for money in 2026?
DeepSeek V3 and Gemini 2.0 Flash offer the best value scores in 2026: strong benchmark performance at $0.15 to $0.19 per million tokens blended, compared to $5 to $6 per million for frontier models. For tasks where they perform equivalently to frontier models (most classification, extraction, drafting, and summarization tasks), using them instead of Claude 3.5 Sonnet or GPT-4o represents a 25 to 30x cost reduction. The best value model for your use case depends on your specific task distribution. Benchmark against your actual workload before committing to any model.
How much does it cost to run 1 million tokens through different LLMs?
Costs vary enormously by provider and model. At the low end: Gemini 2.0 Flash costs $0.075 per million input tokens and DeepSeek V3 costs $0.14 per million. At the high end: Claude Opus costs $15 per million input tokens and GPT-4o costs $2.50 per million. Reasoning models generate additional thinking tokens: OpenAI o3 at $40 per million output tokens can cost $0.60 to $1.20 per complex reasoning task in thinking tokens alone. The blended cost (3:1 input-to-output ratio) for a representative production workload ranges from $0.15 per million tokens (Gemini 2.0 Flash) to $30 per million (Claude Opus).
Are Chinese LLMs like DeepSeek reliable for production use?
DeepSeek V3 and DeepSeek R1 have demonstrated benchmark performance competitive with frontier Western models on coding, math, and reasoning tasks. Both are available through multiple API providers including OpenRouter, Groq (for distilled versions), Together AI, and DeepSeek's own API. The primary considerations for production use are data privacy (API requests go to DeepSeek's Chinese servers unless routed through a third-party provider), uptime reliability (DeepSeek's direct API has experienced capacity constraints during peak demand), and export control compliance for organizations subject to relevant regulations. For US government contractors and defense-adjacent organizations, legal review is required before use.
What is the difference between input tokens and output tokens in LLM pricing?
LLM APIs charge separately for input tokens (text sent to the model, including system prompt, conversation history, and user message) and output tokens (text the model generates). Output tokens are typically 3 to 10 times more expensive than input tokens. For example, Claude 3.5 Sonnet charges $3.00 per million input tokens and $15.00 per million output tokens. In production agent workloads, input tokens typically outnumber output tokens by 3:1 to 10:1 because agents accumulate context across steps. The blended price calculation that matches real workloads: (input price x 3) + (output price x 1), divided by 4.
What are reasoning models and when should you use them?
Reasoning models (OpenAI o1, o3, o4-mini, DeepSeek R1, Gemini 2.5 Pro thinking mode) generate internal chain-of-thought tokens before producing their final response. These thinking tokens are billed to users. On complex math, logic, coding, and multi-constraint problems, reasoning models outperform standard frontier models significantly. On straightforward tasks, they cost 5 to 20 times more than standard models for no quality improvement. Use reasoning models for: competition-level math, complex code debugging, multi-step logic problems, and ambiguous multi-constraint decisions. Route everything else to standard models.