Overview

You can use vLLM workers to deploy and serve large language models on Runpod Serverless, delivering fast and efficient inference with automatic scaling. The vLLM worker image can be deployed directly from the Runpod Hub or customized and built from the GitHub repository.

What is vLLM?

vLLM is an open-source inference engine designed to serve large language models efficiently. It maximizes throughput and minimizes latency when running LLM inference workloads. The vLLM worker image includes the vLLM engine with GPU optimizations and support for both OpenAI’s API and Runpod’s native API. You can deploy any supported model from Hugging Face with minimal configuration and start serving requests immediately. The workers run on Runpod Serverless, which automatically scales based on demand.

Pre-built optimization: vLLM workers come with the vLLM inference engine pre-configured, which includes PagedAttention technology for optimized memory usage and faster inference.
OpenAI API compatibility: They provide a drop-in replacement for OpenAI’s API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key.
Hugging Face integration: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others.
Configurable environments: Extensive customization options through environment variables allow you to adjust model parameters, performance settings, and other behaviors.
Auto-scaling architecture: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis.

vLLM uses several advanced techniques to achieve high performance when serving LLMs. Understanding these can help you optimize your deployments and troubleshoot issues.

PagedAttention for memory efficiency

PagedAttention is the key innovation in vLLM. It dramatically improves how GPU memory is used during inference. Traditional LLM serving wastes memory by pre-allocating large contiguous blocks for key-value (KV) caches. PagedAttention breaks the KV cache into smaller pages, similar to how operating systems manage memory. This reduces memory waste and allows vLLM to serve more requests concurrently on the same GPU. You can handle higher throughput or serve larger models on smaller GPUs.

Continuous batching

vLLM uses continuous batching (also called dynamic batching) to process multiple requests simultaneously. Unlike traditional batching, which waits for a batch to fill up before processing, continuous batching processes requests as they arrive and adds new requests to the batch as soon as previous ones complete. This keeps your GPU busy and reduces latency for individual requests, especially during periods of variable traffic.

Request lifecycle

When you send a request to a vLLM worker endpoint:

The request arrives at Runpod Serverless infrastructure.
If no worker is available, the request is queued and a worker starts automatically.
The worker loads your model from Hugging Face (or from the pre-baked Docker image).
vLLM processes the request using PagedAttention and continuous batching.
The response is returned to your application.
If there are no more requests, the worker scales down to zero after a configured timeout.

vLLM endpoints use the same /run and /runsync operations as other Runpod Serverless endpoints. The only difference is the input format and the specialized LLM processing inside the worker.

Why use vLLM?

vLLM workers offer several advantages over other LLM deployment options.

Performance and efficiency

vLLM’s PagedAttention and continuous batching deliver significantly better throughput than traditional serving methods. You can serve 2-3x more requests per GPU compared to naive implementations, which directly translates to lower costs and better user experiences.

OpenAI API compatibility

vLLM workers provide a drop-in replacement for OpenAI’s API. If you’re already using the OpenAI Python client or any other OpenAI-compatible library, you can switch to your Runpod endpoint by changing just two lines of code: the API key and the base URL. Your existing prompts, parameters, and response handling code continue to work without modification.

Model flexibility

You can deploy virtually any model available on Hugging Face, including popular options like Llama, Mistral, Qwen, Gemma, and thousands of others. vLLM supports a wide range of model architectures out of the box, and new architectures are added regularly.

Auto-scaling and cost efficiency

Runpod Serverless automatically scales your vLLM workers from zero to many based on demand. You only pay for the seconds when workers are actively processing requests. This makes vLLM workers ideal for workloads with variable traffic patterns or when you’re getting started and don’t want to pay for idle capacity.

Production-ready features

vLLM workers come with features that make them suitable for production deployments, including streaming responses, configurable context lengths, quantization support (AWQ, GPTQ), multi-GPU tensor parallelism, and comprehensive error handling.

Deployment options

There are two ways to deploy vLLM workers on Runpod.

Using cached models

If your model is available on Hugging Face, we strongly recommend enabling cached models instead of baking/downloading the model into your Docker image. Cached models provide faster startup times, lower costs, and uses less storage.

Building custom Docker images with models baked in

For production deployments where cold start time matters, you can build a custom Docker image that includes your model weights. This eliminates download time and can reduce cold starts from minutes to seconds. This approach requires more upfront work but provides the best performance for production workloads with consistent traffic.

Compatible models

vLLM supports most model architectures available on Hugging Face. You can deploy models from families including Llama (1, 2, 3, 3.1, 3.2), Mistral and Mixtral, Qwen2 and Qwen2.5, Gemma and Gemma 2, Phi (2, 3, 3.5, 4), DeepSeek (V2, V3, R1), GPT-2, GPT-J, OPT, BLOOM, Falcon, MPT, StableLM, Yi, and many others. For a complete and up-to-date list of supported model architectures, see the vLLM supported models documentation.

Configuration

vLLM supports hundreds of models, but default settings only work out of the box for a subset of them. Depending on the model you’re deploying, you may need to configure your endpoint with additional environment variables (which map directly to vllm serve command line flags) to get it working properly. You will likely need to consult the README for your model on Hugging Face and the vLLM documentation for more details.

Use cases

vLLM workers are ideal for several types of applications. Production LLM APIs benefit from vLLM’s high throughput and OpenAI compatibility. You can build scalable APIs for chatbots, content generation, code completion, or any other LLM-powered feature. Cost-effective scaling is enabled by Serverless auto-scaling. If your traffic varies significantly throughout the day or week, vLLM workers automatically scale down to zero during quiet periods, saving costs compared to always-on servers. OpenAI migration is straightforward because vLLM provides API compatibility. You can migrate existing OpenAI-based applications to open-source models by changing only your endpoint URL and API key. Custom model hosting lets you deploy fine-tuned or specialized models. If you’ve trained a custom model or fine-tuned an existing one, vLLM workers make it easy to serve it at scale. Development and experimentation is cheaper with pay-per-second billing. You can test multiple models and configurations without worrying about idle costs.

Next steps

Ready to deploy your first vLLM worker? Start with the get started guide to deploy a model in minutes. Once your endpoint is running, learn how to send requests using Runpod’s native API or the OpenAI-compatible API. For advanced configuration options, see the environment variables documentation. For a hands-on example, see Create a chatbot with Gemma 3 to deploy a vLLM endpoint and build an interactive chatbot using the OpenAI-compatible API.

Get started

Serverless

Pods

Storage

Public Endpoints

Instant Clusters

Integrations

Hub

Fine-tuning

Reference

What is vLLM?

PagedAttention for memory efficiency

Continuous batching

Request lifecycle

Why use vLLM?

Performance and efficiency

OpenAI API compatibility

Model flexibility

Auto-scaling and cost efficiency

Production-ready features

Deployment options

Using cached models

Building custom Docker images with models baked in

Compatible models

Configuration

Use cases

Next steps

Get started

Serverless

Pods

Storage

Public Endpoints

Instant Clusters

Integrations

Hub

Fine-tuning

Reference

​What is vLLM?

​PagedAttention for memory efficiency

​Continuous batching

​Request lifecycle

​Why use vLLM?

​Performance and efficiency

​OpenAI API compatibility

​Model flexibility

​Auto-scaling and cost efficiency

​Production-ready features

​Deployment options

​Using cached models

​Building custom Docker images with models baked in

​Compatible models

​Configuration

​Use cases

​Next steps

What is vLLM?

PagedAttention for memory efficiency

Continuous batching

Request lifecycle

Why use vLLM?

Performance and efficiency

OpenAI API compatibility

Model flexibility

Auto-scaling and cost efficiency

Production-ready features

Deployment options

Using cached models

Building custom Docker images with models baked in

Compatible models

Configuration

Use cases

Next steps