There is a common misconception in agentic AI: that performance is primarily a function of which model you use. In our experience running 40M+ agent calls, the router — the logic that decides which model handles which subtask — matters at least as much as the model itself.
The cost-quality tradeoff is non-linear
Frontier models (GPT-4o, Claude Opus, Gemini Ultra) are dramatically better than mid-tier models on complex reasoning tasks. On simple, well-specified tasks like "extract the invoice number from this PDF" or "classify this email as spam/not-spam", they are only marginally better, and sometimes worse due to overthinking.
Routing every call to a frontier model is 8–12x more expensive than routing appropriately, and on simple tasks it produces measurably worse results because of added latency and verbosity. Our router classifies tasks on two axes: complexity (simple / moderate / complex) and risk (low / medium / high). High-risk tasks always go to frontier models regardless of complexity; low-risk simple tasks go to the cheapest capable model. The full routing matrix:
complexity: simple + risk: low → haiku / flash
complexity: moderate + risk: low → sonnet / gpt-4o-mini
complexity: complex + risk: any → opus / gpt-4o
complexity: any + risk: high → opus / gpt-4o
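Here is a minimal sketch of that matrix as a routing function. The Task shape, the upstream classifier that fills in its labels, and the handling of medium-risk simple/moderate tasks (which the matrix above does not pin down) are all assumptions:

```python
from dataclasses import dataclass
from typing import Literal

Complexity = Literal["simple", "moderate", "complex"]
Risk = Literal["low", "medium", "high"]

@dataclass
class Task:
    prompt: str
    complexity: Complexity  # assumed to come from an upstream classifier
    risk: Risk

def route(task: Task) -> str:
    """Map (complexity, risk) to a model per the matrix above."""
    # Risk dominates: high-risk tasks always get a frontier model,
    # and complex tasks get one regardless of risk.
    if task.risk == "high" or task.complexity == "complex":
        return "opus"  # or gpt-4o
    # Assumption: the matrix leaves medium-risk simple/moderate tasks
    # unspecified; this sketch conservatively sends them to the mid tier.
    if task.complexity == "moderate" or task.risk == "medium":
        return "sonnet"  # or gpt-4o-mini
    # Low-risk simple tasks go to the cheapest capable model.
    return "haiku"  # or flash
```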
The hidden cost of latency
In multi-step agentic workflows, latency compounds. A chain of 8 subtasks where each takes 3 seconds on average adds up to 24 seconds of wall-clock time if executed sequentially. Our router parallelizes independent subtasks and routes latency-sensitive tasks to models with lower time-to-first-token, even if they're slightly less capable.
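A minimal sketch of that scheduling idea, assuming subtasks are async model calls already grouped into dependency stages; the run_subtask stub, the stage grouping, and the flat 3-second latency are illustrative assumptions, not our production scheduler:

```python
import asyncio

async def run_subtask(name: str) -> str:
    """Stand-in for a model call with ~3 seconds of latency."""
    await asyncio.sleep(3)
    return f"{name}: done"

async def run_chain(stages: list[list[str]]) -> list[str]:
    """Run stages in order, but run the independent subtasks within each
    stage concurrently. Eight 3-second subtasks grouped into three stages
    finish in ~9s of wall-clock time instead of the ~24s a fully
    sequential chain would take."""
    results: list[str] = []
    for stage in stages:
        # Subtasks in the same stage have no dependencies on each other.
        results += await asyncio.gather(*(run_subtask(s) for s in stage))
    return results

# Hypothetical 8-subtask chain grouped into 3 dependency stages.
stages = [
    ["parse_input", "fetch_context", "classify_intent"],
    ["extract_fields", "summarize", "score_risk"],
    ["merge_results", "draft_reply"],
]
print(asyncio.run(run_chain(stages)))
```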