Environment variables
Environment variables configure your vLLM Worker, giving you control over model selection, access credentials, and the operational parameters needed for optimal performance.
CUDA versions
Running your vLLM Worker with different CUDA versions improves compatibility and performance across hardware configurations. When deploying, choose an appropriate CUDA version based on your needs.
CUDA Version | Stable Image Tag | Development Image Tag | Note |
---|---|---|---|
12.1.0 | runpod/worker-v1-vllm:stable-cuda12.1.0 | runpod/worker-v1-vllm:dev-cuda12.1.0 | When creating an Endpoint, select CUDA Versions 12.2 and 12.1 in the GPU filter. |
This table maps each CUDA version to the image tag you should use, for both stable and development images. Be sure to follow the selection note for CUDA 12.1.0 compatibility.
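For example, here is a minimal sketch of running the stable CUDA 12.1.0 image locally for a smoke test. The model name, token, and context length are illustrative placeholders; on RunPod Serverless you would set these in the endpoint's environment variable settings instead:

```bash
# Local smoke test of the stable CUDA 12.1.0 image.
# MODEL_NAME, HF_TOKEN, and MAX_MODEL_LEN values here are placeholders.
docker run --gpus all \
  -e MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.2" \
  -e HF_TOKEN="hf_your_token_here" \
  -e MAX_MODEL_LEN=4096 \
  runpod/worker-v1-vllm:stable-cuda12.1.0
```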
Environment variables
For boolean environment variables, 0 is equivalent to False and 1 is equivalent to True.
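For instance (the specific variables chosen here are just illustrations):

```bash
# 1 is True and 0 is False; for example:
export ENABLE_PREFIX_CACHING=1   # enable automatic prefix caching
export TRUST_REMOTE_CODE=0       # keep remote code disabled (the default)
```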
Name | Default | Type/Choices | Description |
---|---|---|---|
MODEL_NAME | 'facebook/opt-125m' | str | Name or path of the Hugging Face model to use. |
TOKENIZER | None | str | Name or path of the Hugging Face tokenizer to use. |
SKIP_TOKENIZER_INIT | False | bool | Skip initialization of tokenizer and detokenizer. |
TOKENIZER_MODE | 'auto' | ['auto', 'slow'] | The tokenizer mode. |
TRUST_REMOTE_CODE | False | bool | Trust remote code from Hugging Face. |
DOWNLOAD_DIR | None | str | Directory to download and load the weights. |
LOAD_FORMAT | 'auto' | str | The format of the model weights to load. |
HF_TOKEN | - | str | Hugging Face token for private and gated models. |
DTYPE | 'auto' | ['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'] | Data type for model weights and activations. |
KV_CACHE_DTYPE | 'auto' | ['auto', 'fp8'] | Data type for KV cache storage. |
QUANTIZATION_PARAM_PATH | None | str | Path to the JSON file containing the KV cache scaling factors. |
MAX_MODEL_LEN | None | int | Model context length. |
GUIDED_DECODING_BACKEND | 'outlines' | ['outlines', 'lm-format-enforcer'] | Which engine will be used for guided decoding by default. |
DISTRIBUTED_EXECUTOR_BACKEND | None | ['ray', 'mp'] | Backend to use for distributed serving. |
WORKER_USE_RAY | False | bool | Deprecated; use DISTRIBUTED_EXECUTOR_BACKEND=ray instead. |
PIPELINE_PARALLEL_SIZE | 1 | int | Number of pipeline stages. |
TENSOR_PARALLEL_SIZE | 1 | int | Number of tensor parallel replicas. |
MAX_PARALLEL_LOADING_WORKERS | None | int | Load model sequentially in multiple batches. |
RAY_WORKERS_USE_NSIGHT | False | bool | If specified, use nsight to profile Ray workers. |
ENABLE_PREFIX_CACHING | False | bool | Enables automatic prefix caching. |
DISABLE_SLIDING_WINDOW | False | bool | Disables sliding window attention, capping the context to the sliding window size. |
USE_V2_BLOCK_MANAGER | False | bool | Use BlockSpaceManagerV2. |
NUM_LOOKAHEAD_SLOTS | 0 | int | Experimental scheduling config necessary for speculative decoding. |
SEED | 0 | int | Random seed for operations. |
NUM_GPU_BLOCKS_OVERRIDE | None | int | If specified, ignore GPU profiling result and use this number of GPU blocks. |
MAX_NUM_BATCHED_TOKENS | None | int | Maximum number of batched tokens per iteration. |
MAX_NUM_SEQS | 256 | int | Maximum number of sequences per iteration. |
MAX_LOGPROBS | 20 | int | Max number of log probs to return when logprobs is specified in SamplingParams. |
DISABLE_LOG_STATS | False | bool | Disable logging statistics. |
QUANTIZATION | None | ['awq', 'squeezellm', 'gptq'] | Method used to quantize the weights. |
ROPE_SCALING | None | dict | RoPE scaling configuration in JSON format (see the example after this table). |
ROPE_THETA | None | float | RoPE theta. Use with rope_scaling. |
TOKENIZER_POOL_SIZE | 0 | int | Size of tokenizer pool to use for asynchronous tokenization. |
TOKENIZER_POOL_TYPE | 'ray' | str | Type of tokenizer pool to use for asynchronous tokenization. |
TOKENIZER_POOL_EXTRA_CONFIG | None | dict | Extra config for tokenizer pool. |
ENABLE_LORA | False | bool | If True, enable handling of LoRA adapters. |
MAX_LORAS | 1 | int | Max number of LoRAs in a single batch. |
MAX_LORA_RANK | 16 | int | Max LoRA rank. |
LORA_EXTRA_VOCAB_SIZE | 256 | int | Maximum size of extra vocabulary for LoRA adapters. |
LORA_DTYPE | 'auto' | ['auto', 'float16', 'bfloat16', 'float32'] | Data type for LoRA. |
LONG_LORA_SCALING_FACTORS | None | tuple | Specify multiple scaling factors for LoRA adapters. |
MAX_CPU_LORAS | None | int | Maximum number of LoRAs to store in CPU memory. |
FULLY_SHARDED_LORAS | False | bool | Enable fully sharded LoRA layers. |
SCHEDULER_DELAY_FACTOR | 0.0 | float | Apply a delay before scheduling next prompt. |
ENABLE_CHUNKED_PREFILL | False | bool | Enable chunked prefill requests. |
SPECULATIVE_MODEL | None | str | The name of the draft model to be used in speculative decoding. |
NUM_SPECULATIVE_TOKENS | None | int | The number of speculative tokens to sample from the draft model. |
SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE | None | int | Number of tensor parallel replicas for the draft model. |
SPECULATIVE_MAX_MODEL_LEN | None | int | The maximum sequence length supported by the draft model. |
SPECULATIVE_DISABLE_BY_BATCH_SIZE | None | int | Disable speculative decoding if the number of enqueued requests is larger than this value. |
NGRAM_PROMPT_LOOKUP_MAX | None | int | Max size of window for ngram prompt lookup in speculative decoding. |
NGRAM_PROMPT_LOOKUP_MIN | None | int | Min size of window for ngram prompt lookup in speculative decoding. |
SPEC_DECODING_ACCEPTANCE_METHOD | 'rejection_sampler' | ['rejection_sampler', 'typical_acceptance_sampler'] | Specify the acceptance method for draft token verification in speculative decoding. |
TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD | None | float | Set the lower bound threshold for the posterior probability of a token to be accepted. |
TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA | None | float | A scaling factor for the entropy-based threshold for token acceptance. |
MODEL_LOADER_EXTRA_CONFIG | None | dict | Extra config for model loader. |
PREEMPTION_MODE | None | str | If 'recompute', the engine performs preemption-aware recomputation. If 'save', the engine saves activations into the CPU memory as preemption happens. |
PREEMPTION_CHECK_PERIOD | 1.0 | float | How frequently the engine checks if a preemption happens. |
PREEMPTION_CPU_CAPACITY | 2 | float | The percentage of CPU memory used for the saved activations. |
DISABLE_LOGGING_REQUEST | False | bool | Disable logging requests. |
MAX_LOG_LEN | None | int | Max number of prompt characters or prompt ID numbers being printed in log. |
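As referenced above, ROPE_SCALING takes a JSON object. A sketch with illustrative values (the exact scaling type and factor depend on your model and are examples, not recommendations):

```bash
# Illustrative RoPE scaling configuration; values are examples only.
# ROPE_THETA is optional and model-dependent.
export ROPE_SCALING='{"type":"dynamic","factor":2.0}'
export ROPE_THETA=1000000.0
```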
Tokenizer Settings
Name | Default | Type/Choices | Description |
---|---|---|---|
TOKENIZER_NAME | None | str | Tokenizer repository to use a different tokenizer than the model's default. |
TOKENIZER_REVISION | None | str | Tokenizer revision to load. |
CUSTOM_CHAT_TEMPLATE | None | str of single-line Jinja template | Custom chat Jinja template (see the example below). |
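A custom template must be a single-line Jinja string. For example, a hypothetical minimal template (illustrative only; a real template should match the prompt format your model expects):

```bash
# A hypothetical single-line Jinja chat template (illustrative only;
# a real template should match the model's expected prompt format).
export CUSTOM_CHAT_TEMPLATE="{% for message in messages %}{{ message['role'] }}: {{ message['content'] }} {% endfor %}"
```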
System, GPU, and Tensor Parallelism (Multi-GPU) Settings
Name | Default | Type/Choices | Description |
---|---|---|---|
GPU_MEMORY_UTILIZATION | 0.95 | float | Sets GPU VRAM utilization. |
MAX_PARALLEL_LOADING_WORKERS | None | int | Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models. |
BLOCK_SIZE | 16 | [8, 16, 32] | Token block size for contiguous chunks of tokens. |
SWAP_SPACE | 4 | int | CPU swap space size (GiB) per GPU. |
ENFORCE_EAGER | False | bool | Always use eager-mode PyTorch. If False (0), uses eager mode and CUDA graphs in hybrid for maximal performance and flexibility. |
MAX_SEQ_LEN_TO_CAPTURE | 8192 | int | Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. |
DISABLE_CUSTOM_ALL_REDUCE | 0 | int | Set to 1 to disable the custom all-reduce kernel. |
Streaming Batch Size Settings
Name | Default | Type/Choices | Description |
---|---|---|---|
DEFAULT_BATCH_SIZE | 50 | int | Default and Maximum batch size for token streaming to reduce HTTP calls. |
DEFAULT_MIN_BATCH_SIZE | 1 | int | Batch size for the first request, which will be multiplied by the growth factor every subsequent request. |
DEFAULT_BATCH_SIZE_GROWTH_FACTOR | 3 | float | Growth factor for dynamic batch size. |
The first request uses a batch size of DEFAULT_MIN_BATCH_SIZE, and each subsequent request multiplies the previous batch size by DEFAULT_BATCH_SIZE_GROWTH_FACTOR, continuing until the batch size reaches DEFAULT_BATCH_SIZE. For the default values, the batch sizes are 1, 3, 9, 27, 50, 50, 50, and so on. You can also specify this per request with the inputs max_batch_size, min_batch_size, and batch_size_growth_factor. This has nothing to do with vLLM's internal batching; it only controls the number of tokens sent in each HTTP call from the worker.
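For example, a sketch of a per-request override, assuming the standard RunPod Serverless request shape with an input object (ENDPOINT_ID, RUNPOD_API_KEY, the prompt, and the batch values are placeholders):

```bash
# Per-request streaming batch size overrides (illustrative values).
# ENDPOINT_ID and RUNPOD_API_KEY are placeholders.
curl -s "https://api.runpod.ai/v2/${ENDPOINT_ID}/run" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "Tell me a story.",
      "stream": true,
      "min_batch_size": 1,
      "batch_size_growth_factor": 2,
      "max_batch_size": 32
    }
  }'
```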
OpenAI Settings
Name | Default | Type/Choices | Description |
---|---|---|---|
RAW_OPENAI_OUTPUT | 1 | boolean as int | Enables raw OpenAI SSE format string output when streaming. Must be enabled (the default) for OpenAI compatibility. |
OPENAI_SERVED_MODEL_NAME_OVERRIDE | None | str | Overrides the served model name (normally the model repo/path) with the specified name, which you can then pass as the model parameter in OpenAI requests. |
OPENAI_RESPONSE_ROLE | assistant | str | Role of the LLM's Response in OpenAI Chat Completions. |
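For example, a sketch of an OpenAI-compatible chat completion request, assuming the worker's openai/v1 route on a deployed endpoint (ENDPOINT_ID, RUNPOD_API_KEY, and the model name are placeholders; if OPENAI_SERVED_MODEL_NAME_OVERRIDE is set, pass that value as model instead):

```bash
# OpenAI-compatible chat completion against a deployed endpoint.
# ENDPOINT_ID, RUNPOD_API_KEY, and the model name are placeholders.
curl -s "https://api.runpod.ai/v2/${ENDPOINT_ID}/openai/v1/chat/completions" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```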
Serverless Settings
Name | Default | Type/Choices | Description |
---|---|---|---|
MAX_CONCURRENCY | 300 | int | Max concurrent requests per worker. vLLM has an internal queue, so you don't need to limit this based on VRAM; it exists to improve scaling and load-balancing efficiency. |
DISABLE_LOG_STATS | False | bool | Disables vLLM stats logging. |
DISABLE_LOG_REQUESTS | False | bool | Disables vLLM request logging. |
If you are facing issues when using Mixtral 8x7B, quantized models, or handling unusual models/architectures, try setting TRUST_REMOTE_CODE to 1.