vLLM workers serve large language models on Runpod Serverless with fast inference and automatic scaling. Deploy them directly from the Runpod Hub, or use the runpod-workers/worker-vllm repository as a base for customization.

What is vLLM?

vLLM is an open-source inference engine optimized for serving large language models. It maximizes throughput and minimizes latency through techniques like PagedAttention and continuous batching.
  • PagedAttention: Breaks KV cache into pages for efficient memory use, enabling higher concurrency and larger models on smaller GPUs.
  • Continuous batching: Processes requests as they arrive rather than waiting for batches, keeping GPUs busy and reducing latency.
  • OpenAI compatibility: Drop-in replacement for OpenAI’s API. Switch by changing the endpoint URL and API key.
  • Hugging Face integration: Supports most models on the Hugging Face Hub, including Llama, Mistral, Qwen, Gemma, and DeepSeek.
  • Auto-scaling: Scales from zero to many workers based on demand, with per-second billing.
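To illustrate the OpenAI-compatibility point above, the sketch below builds a chat-completions request against a Runpod Serverless endpoint using only the Python standard library. The endpoint ID, API key, and model name are hypothetical placeholders, and the `/openai/v1` base-URL shape is an assumption based on Runpod's OpenAI-compatible route; check your endpoint's details before use.

```python
import json
from urllib.request import Request

# Hypothetical placeholders; substitute your own endpoint ID and Runpod API key.
ENDPOINT_ID = "your_endpoint_id"
API_KEY = "your_runpod_api_key"

# A vLLM worker exposes an OpenAI-compatible API, so any OpenAI client can be
# pointed at it by swapping in this base URL and your Runpod API key.
base_url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1"

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # whichever model your worker serves
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = Request(
    f"{base_url}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# urlopen(req) would dispatch the request; it is omitted here because it
# requires a live endpoint and a valid API key.
print(req.full_url)
```

The same swap works with the official `openai` client by passing `base_url` and `api_key` to its constructor, which is what "drop-in replacement" means in practice.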

Deployment options

  • Cached models (recommended): Fastest setup with lower storage costs. Best for most deployments.
  • Baked-in models: Eliminates download time and reduces cold starts to seconds. Requires building a custom Docker image.

Configuration

Default settings work for many models, but some require additional environment variables (which map to vllm serve flags). Consult your model’s Hugging Face README and the vLLM documentation for model-specific requirements.
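As an illustration of how environment variables map to vllm serve flags, a few variables commonly set on worker-vllm endpoints are sketched below. The names and values here are examples, not a definitive reference; verify them against the worker-vllm README.

```shell
# Illustrative environment variables for a worker-vllm endpoint; confirm names
# and defaults in the worker-vllm README before relying on them.
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"   # Hugging Face model to serve
MAX_MODEL_LEN="8192"           # maps to the vllm serve --max-model-len flag
TRUST_REMOTE_CODE="1"          # required by some models that ship custom code
HF_TOKEN="your_hf_token"       # needed for gated models such as Llama
```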