> ## Documentation Index
> Fetch the complete documentation index at: https://docs.runpod.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Overview

> Deploy scalable LLM inference endpoints using vLLM workers.

<div className="overview-page-wrapper" />

vLLM workers deploy and serve large language models on Runpod Serverless with fast inference and automatic scaling. Deploy directly from the [Runpod Hub](https://console.runpod.io/hub/runpod-workers/worker-vllm) or customize using the [runpod-workers/worker-vllm](https://github.com/runpod-workers/worker-vllm) repository as a base.

<CardGroup cols={2}>
  <Card title="Get started" href="/serverless/vllm/get-started" icon="bolt">
    Deploy your first vLLM worker in minutes.
  </Card>

  <Card title="Configuration" href="/serverless/vllm/configuration" icon="gear">
    Configure your vLLM endpoint with environment variables.
  </Card>

  <Card title="Send requests" href="/serverless/vllm/vllm-requests" icon="paper-plane">
    Send requests using Runpod's native API.
  </Card>

  <Card title="OpenAI compatibility" href="/serverless/vllm/openai-compatibility" icon="plug">
    Integrate vLLM with OpenAI-compatible tools.
  </Card>
</CardGroup>

## What is vLLM?

vLLM is an open-source inference engine optimized for serving large language models. It maximizes throughput and minimizes latency through techniques like PagedAttention and continuous batching.

* **[PagedAttention](https://docs.vllm.ai/en/latest/design/paged_attention.html)**: Breaks KV cache into pages for efficient memory use, enabling higher concurrency and larger models on smaller GPUs.
* **Continuous batching**: Processes requests as they arrive rather than waiting for batches, keeping GPUs busy and reducing latency.
* **OpenAI compatibility**: Drop-in replacement for OpenAI's API. Switch by changing the endpoint URL and API key.
* **Hugging Face integration**: Supports most models including Llama, Mistral, Qwen, Gemma, DeepSeek, and [many more](https://docs.vllm.ai/en/latest/models/supported_models.html).
* **Auto-scaling**: Scales from zero to many workers based on demand, with per-second billing.

## Deployment options

* **[Cached models](/serverless/endpoints/model-caching)** (recommended): Fastest setup with lower storage costs. Best for most deployments.
* **[Baked-in models](/serverless/workers/create-dockerfile#including-models-and-files)**: Eliminates download time and reduces cold starts to seconds. Requires building a custom Docker image.

## Configuration

Default settings work for many models, but some require additional [environment variables](/serverless/vllm/environment-variables) (which map to `vllm serve` flags). Consult your model's Hugging Face README and the [vLLM documentation](https://docs.vllm.ai/en/latest/usage/) for model-specific requirements.
