Key features
vLLM workers offer several advantages that make them ideal for LLM deployment:
- Pre-built optimization: The workers come with the vLLM inference engine pre-configured, which includes PagedAttention technology for optimized memory usage and faster inference.
- OpenAI API compatibility: They provide a drop-in replacement for OpenAI’s API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key (see the example after this list).
- Hugging Face integration: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others.
- Configurable environments: Extensive customization options through environment variables allow you to adjust model parameters, performance settings, and other behaviors.
- Auto-scaling architecture: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis.
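For example, an existing OpenAI client can be pointed at a vLLM endpoint by swapping in your endpoint’s base URL and a Runpod API key. The endpoint ID, API key, and model name below are placeholders; substitute the values for your own deployment.

```python
from openai import OpenAI

# Placeholders: use your own Runpod API key and endpoint ID.
client = OpenAI(
    api_key="your_runpod_api_key",
    base_url="https://api.runpod.ai/v2/your_endpoint_id/openai/v1",
)

# The model name is the Hugging Face model your endpoint serves.
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Summarize vLLM in two sentences."}],
)
print(response.choices[0].message.content)
```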
Deployment options
There are two ways to deploy a vLLM worker:

Option 1: Quick deploy a vLLM endpoint
This is the simplest approach. Use Runpod’s UI to deploy a model directly from Hugging Face with minimal configuration. For step-by-step instructions, see Deploy a vLLM worker.

Quick-deployed workers will download models during initialization, which can take some time depending on the model selected. If you plan to run a vLLM endpoint in production, it’s best to package your model into a Docker image ahead of time (using the Docker image method below), as this can significantly reduce cold start times.
Option 2: Deploy using a Docker image
Deploy a packaged vLLM worker image from GitHub or Docker Hub, configuring your endpoint using environment variables. Follow the instructions in the vLLM worker README to build a model into your worker image. You can add new functionality to your vLLM worker deployment by customizing its handler function (see the sketch below).
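As a rough illustration, a custom handler built on the runpod Python SDK might look like the minimal sketch below. The actual vLLM worker handler is more involved (it streams tokens from the vLLM engine); the input fields and pre-processing shown here are hypothetical.

```python
import runpod

def handler(job):
    """Hypothetical custom handler: pre-process the prompt before inference.

    This sketch only shows where custom logic could be added; the real
    vLLM worker handler passes the request on to the vLLM engine.
    """
    prompt = job["input"].get("prompt", "")
    # Example customization: prepend an instruction to every request.
    prompt = f"Answer concisely.\n\n{prompt}"
    # ... pass the modified prompt to the vLLM engine and return its output ...
    return {"prompt_used": prompt}

# Start the Serverless worker loop with the custom handler.
runpod.serverless.start({"handler": handler})
```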
Compatible models

You can deploy almost any model on Hugging Face as a vLLM worker. You can find a full list of supported model architectures in the GitHub README.

How vLLM works
When deployed to a Serverless endpoint, vLLM workers:
- Download and load the specified LLM from Hugging Face or other compatible sources.
- Optimize the model for inference using vLLM’s techniques like continuous batching and PagedAttention.
- Expose API endpoints for both OpenAI-compatible requests and Runpod’s native endpoint request format (see the example after this list).
- Process incoming requests by dynamically allocating GPU resources.
- Scale workers up or down based on traffic patterns.
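To illustrate the native format, the request below sends a prompt to an endpoint’s runsync route with the requests library. The endpoint ID, API key, and sampling parameters are placeholders; check the vLLM worker README for the full input schema.

```python
import requests

# Placeholders: substitute your own endpoint ID and Runpod API key.
ENDPOINT_ID = "your_endpoint_id"
API_KEY = "your_runpod_api_key"

# Runpod's native request format wraps the payload in an "input" object.
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "Explain PagedAttention in one paragraph.",
            "sampling_params": {"max_tokens": 200, "temperature": 0.7},
        }
    },
    timeout=120,
)
print(response.json())
```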
Use cases
vLLM workers are an effective choice for:
- High-performance inference for text generation.
- Cost-effective scaling for LLM workloads.
- Integration with existing OpenAI-based applications.
- Deploying open-source models with commercial licenses.
- AI systems requiring both synchronous and streaming responses (see the streaming example below).
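For streaming, the OpenAI-compatible route accepts the standard stream parameter. As in the earlier example, the endpoint ID, API key, and model name are placeholders for your own deployment.

```python
from openai import OpenAI

# Placeholders: use your own Runpod API key and endpoint ID.
client = OpenAI(
    api_key="your_runpod_api_key",
    base_url="https://api.runpod.ai/v2/your_endpoint_id/openai/v1",
)

# Request a streamed response and print tokens as they arrive.
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```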
Performance considerations
The performance of vLLM workers depends on several factors:
- GPU selection: Larger models require more VRAM (A10G or better recommended for 7B+ parameter models). For a list of available GPUs, see GPU types.
- Model size: Affects both loading time and inference speed.
- Quantization: Options like AWQ or GPTQ can reduce memory requirements at a small quality cost (see the sketch after this list).
- Batch size settings: Impact throughput and latency tradeoffs.
- Context length: Longer contexts require more memory and processing time.
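To make a few of these settings concrete, the sketch below shows how they map onto vLLM engine parameters in Python. On a deployed worker you would normally set the equivalent environment variables described in the vLLM worker README rather than constructing the engine yourself; the model name and values here are illustrative only.

```python
from vllm import LLM, SamplingParams

# Illustrative values only; on a Runpod worker these settings are normally
# supplied through environment variables, not in code.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder AWQ-quantized model
    quantization="awq",            # reduces VRAM usage at a small quality cost
    max_model_len=4096,            # longer contexts need more memory
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may allocate
)

outputs = llm.generate(
    ["Explain continuous batching in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```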