vLLM worker overview
vLLM workers are specialized containers designed to efficiently deploy and serve large language models (LLMs) on RunPod's Serverless infrastructure. By leveraging RunPod's vLLM workers, you can quickly deploy state-of-the-art language models with optimized performance, flexible scaling, and cost-effective operation.
For detailed information on model compatibility and configuration options, check out the vLLM worker GitHub repository.
Key features
vLLM workers offer several advantages that make them ideal for LLM deployment:
- Pre-built optimization: The workers come with the vLLM inference engine pre-configured, which includes PagedAttention technology for optimized memory usage and faster inference.
- OpenAI API compatibility: They provide a drop-in replacement for OpenAI's API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key (see the example after this list).
- Hugging Face integration: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others.
- Configurable environments: Extensive customization options through environment variables allow you to adjust model parameters, performance settings, and other behaviors.
- Auto-scaling architecture: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis.
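For example, if you already use the official OpenAI Python client, pointing it at your endpoint is usually all that is needed. The sketch below assumes the endpoint exposes its OpenAI-compatible API at a URL of the form `https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1` and that the model name matches the Hugging Face model you deployed; the endpoint ID, API key, and model name are placeholders, so check your endpoint details before running it.

```python
from openai import OpenAI

# Placeholder values: substitute your own endpoint ID and RunPod API key.
client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)

# The model name generally matches the Hugging Face model the endpoint serves.
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    temperature=0.7,
    max_tokens=100,
)

print(response.choices[0].message.content)
```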
Deployment options
There are two ways to deploy a vLLM worker:
Option 1: Quick deploy a vLLM endpoint
This is the simplest approach. Use RunPod's UI to deploy a model directly from Hugging Face with minimal configuration. For step-by-step instructions, see Deploy a vLLM worker.
Quick-deployed workers will download models during initialization, which can take some time depending on the model selected. If you plan to run a vLLM endpoint in production, it’s best to package your model into a Docker image ahead of time (using the Docker image method below), as this can significantly reduce cold start times.
Option 2: Deploy using a Docker image
Deploy a packaged vLLM worker image from GitHub or Docker Hub, configuring your endpoint using environment variables.
Follow the instructions in the vLLM worker README to build a model into your worker image.
You can add new functionality to your vLLM worker deployment by customizing its handler function.
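For reference, RunPod Serverless workers are built around a handler function that receives each job and returns the result. The snippet below is a minimal, generic sketch of that pattern, not the vLLM worker's actual handler (which wraps the vLLM engine); it only illustrates where custom pre- and post-processing logic could go.

```python
import runpod

def handler(job):
    """Minimal handler sketch: read the job input, apply custom logic,
    and return a JSON-serializable result."""
    job_input = job["input"]
    prompt = job_input.get("prompt", "")

    # Hypothetical customization: prepend an instruction before inference
    # (the real vLLM worker passes the prompt to the vLLM engine here).
    prompt = "Answer concisely.\n" + prompt

    # Placeholder for the actual inference call performed by the vLLM worker.
    output_text = f"[generated text for: {prompt}]"

    return {"output": output_text}

runpod.serverless.start({"handler": handler})
```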
Compatible models
You can deploy almost any model on Hugging Face as a vLLM worker. A full list of supported model architectures is available in the GitHub README.
How vLLM works
When deployed to a Serverless endpoint, vLLM workers:
- Download and load the specified LLM from Hugging Face or other compatible sources.
- Optimize the model for inference using techniques like continuous batching and PagedAttention.
- Expose API endpoints for both OpenAI-compatible requests and RunPod's native endpoint request format (an example native request follows this list).
- Process incoming requests by dynamically allocating GPU resources.
- Scale workers up or down based on traffic patterns.
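For example, a native (non-OpenAI) request is a POST to the endpoint's `/runsync` or `/run` route with the request body wrapped in an `input` object. The input fields shown below (a prompt plus sampling parameters) are illustrative; check the vLLM worker README for the exact input schema your worker version accepts.

```python
import requests

ENDPOINT_ID = "ENDPOINT_ID"   # placeholder: your endpoint ID
API_KEY = "RUNPOD_API_KEY"    # placeholder: your RunPod API key

# /runsync waits for the result; /run queues the job and returns a job ID instead.
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

payload = {
    "input": {
        # Illustrative fields: see the vLLM worker README for the exact schema.
        "prompt": "Explain PagedAttention in one paragraph.",
        "sampling_params": {"max_tokens": 150, "temperature": 0.7},
    }
}

response = requests.post(
    url,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=120,
)
print(response.json())
```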
Use cases
vLLM workers are an effective choice for:
- High-performance inference for text generation.
- Cost-effective scaling for LLM workloads.
- Integration with existing OpenAI-based applications.
- Deploying open-source models with commercial licenses.
- AI systems requiring both synchronous and streaming responses.
Performance considerations
The performance of vLLM workers depends on several factors:
- GPU selection: Larger models require more VRAM (A10G or better recommended for 7B+ parameter models; a rough sizing sketch follows this list). For a list of available GPUs, see GPU types.
- Model size: Affects both loading time and inference speed.
- Quantization: Options like AWQ or GPTQ can reduce memory requirements at a small quality cost.
- Batch size settings: Impact throughput and latency tradeoffs.
- Context length: Longer contexts require more memory and processing time.
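As a rough rule of thumb, model weights alone take about 2 bytes per parameter at 16-bit precision, and 4-bit quantization cuts that to roughly a quarter; KV cache, activations, and framework overhead come on top and grow with batch size and context length. The back-of-envelope helper below is only a sketch for comparing GPU options, not an exact sizing tool.

```python
def rough_weight_vram_gb(num_params_billions: float, bits_per_param: int = 16) -> float:
    """Back-of-envelope estimate of VRAM needed just for model weights.

    Ignores KV cache, activations, and framework overhead, which grow with
    batch size and context length, so treat the result as a lower bound.
    """
    bytes_per_param = bits_per_param / 8
    return num_params_billions * 1e9 * bytes_per_param / (1024 ** 3)

# A 7B model: roughly 13 GB of weights at 16-bit, about 3.3 GB at 4-bit (e.g. AWQ/GPTQ).
print(f"7B @ 16-bit: {rough_weight_vram_gb(7, 16):.1f} GB")
print(f"7B @ 4-bit:  {rough_weight_vram_gb(7, 4):.1f} GB")
```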