Inference
Definition
The process of generating predictions or outputs from a trained model; in the agent context, the act of sending a prompt to a language model and receiving a response. Inference is the primary cost and latency driver in agent systems: each agent loop iteration requires at least one inference call, and complex agents may make dozens per task. Inference can be served from cloud APIs (OpenAI, Anthropic, Google), self-hosted models (vLLM, TGI, Ollama), or edge deployments (on-device models). The key metrics are time-to-first-token (TTFT), tokens-per-second throughput, and cost-per-token.
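A minimal sketch of how these metrics fall out of a single streamed response. The token_stream argument is an assumption, standing in for whatever iterable of tokens your serving stack exposes; price_per_token is likewise a placeholder you would fill in from your provider's pricing.

```python
import time
from typing import Iterable

def measure_inference(token_stream: Iterable[str], price_per_token: float) -> dict:
    """Measure TTFT, throughput, and cost for one streamed inference call.

    token_stream is hypothetical: any iterable that yields tokens as the
    server produces them (e.g. a streaming API response body).
    """
    start = time.perf_counter()
    ttft = None
    tokens = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time-to-first-token
        tokens += 1
    total = time.perf_counter() - start
    return {
        "ttft_s": ttft,
        "tokens_per_s": tokens / total if total > 0 else 0.0,
        "cost_usd": tokens * price_per_token,
    }

# Usage with a fake stream standing in for a real streaming response:
fake_stream = iter(["The", " answer", " is", " 42", "."])
print(measure_inference(fake_stream, price_per_token=0.00001))
```

The same loop works unchanged against any real streaming endpoint, since it only depends on iteration order, not on a specific client library.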
Builder Context
Inference cost scales with agent complexity: an agent that makes 50 tool calls per task at $0.01 per call costs $0.50 per task before you count the tool execution costs. Optimize along four axes:
(1) Reduce unnecessary iterations: better prompts mean fewer loops.
(2) Use cheaper models for simple decisions (routing, classification) and expensive models for complex reasoning.
(3) Cache responses for repeated queries.
(4) Batch independent inference calls.
For latency, streaming reduces perceived latency even when total generation time is unchanged. For cost, track cost-per-task, not cost-per-token (see the sketch below).
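A sketch tying strategies (2) and (3) to cost-per-task accounting. The model names, the per-1K-token prices, and the routing heuristic are all assumptions for illustration, not recommendations; swap in your own models, prices, and routing logic.

```python
import functools
from dataclasses import dataclass, field

# Hypothetical model names and per-1K-token prices; substitute your provider's.
PRICES_USD_PER_1K = {"small-model": 0.0005, "large-model": 0.01}

def pick_model(prompt: str) -> str:
    # Illustrative routing heuristic (strategy 2): short, question-like
    # prompts are treated as simple decisions and sent to the cheap model.
    if len(prompt) < 200 and prompt.rstrip().endswith("?"):
        return "small-model"
    return "large-model"

@functools.lru_cache(maxsize=1024)
def cached_call(model: str, prompt: str) -> str:
    # Placeholder for a real API call; lru_cache gives exact-match response
    # caching for repeated queries (strategy 3).
    return f"[{model} response to: {prompt}]"

@dataclass
class TaskCostTracker:
    """Accumulate per-call costs so you can report cost-per-task."""
    calls: list = field(default_factory=list)

    def record(self, model: str, tokens: int) -> None:
        self.calls.append((model, tokens, tokens / 1000 * PRICES_USD_PER_1K[model]))

    @property
    def cost_per_task(self) -> float:
        return sum(cost for _, _, cost in self.calls)

# Usage: route, call (with caching), and attribute the cost to the task.
tracker = TaskCostTracker()
model = pick_model("Is this email spam?")
cached_call(model, "Is this email spam?")
tracker.record(model, tokens=40)
print(f"{model}: ${tracker.cost_per_task:.6f} for this task so far")
```

Keeping the tracker per task rather than per call is the point: a 10x cheaper model that triggers 20x more loop iterations still loses on cost-per-task.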