There is a common misconception in agentic AI: that performance is primarily a function of which model you use. In our experience running 40M+ agent calls, the router — the logic that decides which model handles which subtask — matters at least as much as the model itself.
The cost-quality tradeoff is non-linear
Frontier models (GPT-4o, Claude Opus, Gemini Ultra) are dramatically better than mid-tier models on complex reasoning tasks. On simple, well-specified tasks like "extract the invoice number from this PDF" or "classify this email as spam/not-spam", they are only marginally better, and sometimes worse due to overthinking.
Routing every call to a frontier model is 8–12x more expensive than routing appropriately, and on simple tasks it produces measurably worse results because of added latency and verbosity. Our router classifies tasks on two axes: complexity (simple / moderate / complex) and risk (low / medium / high). High-risk tasks always go to frontier models regardless of complexity; low-risk simple tasks go to the cheapest capable model. The full routing matrix:
complexity: simple + risk: low → haiku / flash
complexity: moderate + risk: low → sonnet / gpt-4o-mini
complexity: complex + risk: any → opus / gpt-4o
complexity: any + risk: high → opus / gpt-4o
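Here is a minimal sketch of that matrix as a routing function. The Task shape, the upstream classifier that fills in its labels, and the handling of medium-risk simple/moderate tasks (which the matrix above does not pin down) are all assumptions:

```python
from dataclasses import dataclass
from typing import Literal

Complexity = Literal["simple", "moderate", "complex"]
Risk = Literal["low", "medium", "high"]

@dataclass
class Task:
    prompt: str
    complexity: Complexity  # assumed to come from an upstream classifier
    risk: Risk

def route(task: Task) -> str:
    """Map (complexity, risk) to a model per the matrix above."""
    # Risk dominates: high-risk tasks always get a frontier model,
    # and complex tasks get one regardless of risk.
    if task.risk == "high" or task.complexity == "complex":
        return "opus"  # or gpt-4o
    # Assumption: the matrix leaves medium-risk simple/moderate tasks
    # unspecified; this sketch conservatively sends them to the mid tier.
    if task.complexity == "moderate" or task.risk == "medium":
        return "sonnet"  # or gpt-4o-mini
    # Low-risk simple tasks go to the cheapest capable model.
    return "haiku"  # or flash
```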
The hidden cost of latency
In multi-step agentic workflows, latency compounds. A chain of 8 subtasks where each takes 3 seconds on average adds up to 24 seconds of wall-clock time if executed sequentially. Our router parallelizes independent subtasks and routes latency-sensitive tasks to models with lower time-to-first-token, even if they're slightly less capable.
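A minimal sketch of that scheduling idea, assuming subtasks are async model calls already grouped into dependency stages; the run_subtask stub, the stage grouping, and the flat 3-second latency are illustrative assumptions, not our production scheduler:

```python
import asyncio

async def run_subtask(name: str) -> str:
    """Stand-in for a model call with ~3 seconds of latency."""
    await asyncio.sleep(3)
    return f"{name}: done"

async def run_chain(stages: list[list[str]]) -> list[str]:
    """Run stages in order, but run the independent subtasks within each
    stage concurrently. Eight 3-second subtasks grouped into three stages
    finish in ~9s of wall-clock time instead of the ~24s a fully
    sequential chain would take."""
    results: list[str] = []
    for stage in stages:
        # Subtasks in the same stage have no dependencies on each other.
        results += await asyncio.gather(*(run_subtask(s) for s in stage))
    return results

# Hypothetical 8-subtask chain grouped into 3 dependency stages.
stages = [
    ["parse_input", "fetch_context", "classify_intent"],
    ["extract_fields", "summarize", "score_risk"],
    ["merge_results", "draft_reply"],
]
print(asyncio.run(run_chain(stages)))
```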