- Get started: Deploy your first vLLM worker in minutes.
- Configuration: Configure your vLLM endpoint with environment variables.
- Send requests: Send requests using Runpod’s native API.
- OpenAI compatibility: Integrate vLLM with OpenAI-compatible tools.
What is vLLM?
vLLM is an open-source inference engine optimized for serving large language models. It maximizes throughput and minimizes latency through techniques like PagedAttention and continuous batching.
- PagedAttention: Breaks the KV cache into pages for efficient memory use, enabling higher concurrency and larger models on smaller GPUs.
- Continuous batching: Processes requests as they arrive rather than waiting for batches, keeping GPUs busy and reducing latency.
- OpenAI compatibility: Drop-in replacement for OpenAI’s API. Switch by changing the endpoint URL and API key.
- Hugging Face integration: Supports most models including Llama, Mistral, Qwen, Gemma, DeepSeek, and many more.
- Auto-scaling: Scales from zero to many workers based on demand, with per-second billing.
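Because the endpoint is OpenAI-compatible, a request is just a standard chat completions payload sent to your endpoint's URL. The sketch below builds such a payload with the standard library; the endpoint ID, URL pattern, and model name are placeholders for illustration, so substitute your own values from the Runpod console.

```python
import json

# Placeholder values: replace with your real endpoint ID and model.
ENDPOINT_ID = "your-endpoint-id"
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON body for a POST to BASE_URL + '/chat/completions'."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(json.dumps(payload))
```

An OpenAI client library can consume the same endpoint by pointing its base URL at `BASE_URL` and passing your Runpod API key as the API key.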
Deployment options
- Cached models (recommended): Fastest setup with lower storage costs. Best for most deployments.
- Baked-in models: Eliminates download time and reduces cold starts to seconds. Requires building a custom Docker image.
Configuration
Default settings work for many models, but some require additional environment variables (which map to vllm serve flags). Consult your model’s Hugging Face README and the vLLM documentation for model-specific requirements.
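To illustrate the mapping between environment variables and vllm serve flags, the sketch below converts UPPER_SNAKE_CASE variable names into kebab-case CLI flags. The conversion rule and the example variable names (MAX_MODEL_LEN, DTYPE) are illustrative assumptions; the worker's README is the authoritative list of supported variables.

```python
def env_to_flags(env: dict) -> list:
    """Translate UPPER_SNAKE env vars into --kebab-case CLI flags,
    e.g. MAX_MODEL_LEN=8192 -> --max-model-len=8192."""
    flags = []
    for key, value in env.items():
        flag = "--" + key.lower().replace("_", "-")
        flags.append(f"{flag}={value}")
    return flags

# Hypothetical settings shown only to demonstrate the naming convention.
print(env_to_flags({"MAX_MODEL_LEN": "8192", "DTYPE": "bfloat16"}))
# → ['--max-model-len=8192', '--dtype=bfloat16']
```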