Model Serving
Definition
The infrastructure layer responsible for hosting trained models and serving inference requests at production scale. Model serving systems handle request routing, batching (combining multiple requests for GPU efficiency), auto-scaling (adjusting compute based on demand), model loading and unloading, version management, and health monitoring. For agent systems that self-host models, the serving layer determines throughput, latency, and cost efficiency. Key technologies: vLLM (optimized LLM serving), TGI (Hugging Face's Text Generation Inference), Triton Inference Server (NVIDIA), and managed platforms (Together, Replicate, Modal).
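To make batching concrete, here is a minimal sketch using vLLM's offline API, which batches a list of prompts onto the GPU automatically. It assumes vLLM is installed; the model name, sampling settings, and prompts are illustrative, not a recommendation.

```python
from vllm import LLM, SamplingParams

# Loading the model once up front is the serving layer's "model loading" step.
# Model name is an illustrative placeholder; use any model you have access to.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# Passing a list of prompts lets vLLM schedule them onto the GPU together
# (continuous batching), which is where the throughput gains come from.
prompts = [
    "Summarize the benefits of request batching.",
    "Explain auto-scaling in one sentence.",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```

The same batching happens transparently behind vLLM's HTTP server: concurrent requests from many clients are merged into GPU batches without the callers coordinating.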
Builder Context
Unless you have specific requirements (data sovereignty, custom models, cost optimization at scale), use managed inference APIs (OpenAI, Anthropic, Google) and focus your engineering on the agent logic rather than the serving infrastructure. Self-hosting makes sense when you're running open-source models fine-tuned on your data, when you need guaranteed latency SLAs, or when your inference volume justifies dedicated GPU capacity (typically above $5K/month in API costs). If you do self-host, vLLM is the current performance leader for LLM serving; see the sketch below.
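One reason the "start managed, self-host later" path is low-risk: vLLM exposes an OpenAI-compatible endpoint, so switching is mostly a configuration change. A sketch, assuming vLLM is installed and serving locally; the model name, port, and placeholder API key are illustrative assumptions.

```python
# First start the server in a shell (recent vLLM versions), e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

# Point the standard OpenAI client at the self-hosted endpoint. vLLM accepts
# any key unless --api-key is set; "EMPTY" is a conventional placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Health check: reply with OK."}],
)
print(resp.choices[0].message.content)
```

Because the agent code only sees an OpenAI-style client, you can evaluate self-hosting against a managed API by swapping `base_url` and `model`, keeping the agent logic untouched.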