Environment variables
vLLM is configured using command-line flags. On Runpod, set these as environment variables instead: convert each flag name to uppercase and replace hyphens with underscores (for example, --tokenizer-mode becomes TOKENIZER_MODE).

Example: Deploying Mistral
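For reference, a sketch of the CLI command this example corresponds to, assuming vLLM's standard `vllm serve` flag names:

```shell
vllm serve mistralai/Ministral-8B-Instruct-2410 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
```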
The corresponding Runpod environment variables:

| Variable | Value |
|---|---|
| MODEL_NAME | mistralai/Ministral-8B-Instruct-2410 |
| TOKENIZER_MODE | mistral |
| CONFIG_FORMAT | mistral |
| LOAD_FORMAT | mistral |
| ENABLE_AUTO_TOOL_CHOICE | true |
| TOOL_CALL_PARSER | mistral |
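The flag-to-variable conversion above can be sketched in a few lines (`flag_to_env` is a hypothetical helper name, not part of vLLM or Runpod):

```python
def flag_to_env(flag: str) -> str:
    """Convert a vLLM CLI flag to its Runpod environment-variable name:
    strip leading dashes, replace hyphens with underscores, uppercase."""
    return flag.lstrip("-").replace("-", "_").upper()

print(flag_to_env("--tokenizer-mode"))    # TOKENIZER_MODE
print(flag_to_env("--tool-call-parser"))  # TOOL_CALL_PARSER
```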
Model-specific configurations
Recommended environment variables for popular model families. Check your model's documentation for exact requirements.

| Model family | Example model | Key environment variables | Notes |
|---|---|---|---|
| Qwen3 | Qwen/Qwen3-8B | ENABLE_AUTO_TOOL_CHOICE=true TOOL_CALL_PARSER=hermes | For AWQ/GPTQ versions, set QUANTIZATION accordingly. |
| OpenChat | openchat/openchat-3.5-0106 | None required | Use CUSTOM_CHAT_TEMPLATE if default templates produce poor results. |
| Gemma | google/gemma-3-1b-it | HF_TOKEN | Gated model; requires a Hugging Face access token. Set DTYPE=bfloat16 for best results. |
| DeepSeek-R1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | REASONING_PARSER=deepseek_r1 | Enables reasoning mode for chain-of-thought outputs. |
| Phi-4 | microsoft/Phi-4-mini-instruct | None required | ENFORCE_EAGER=true can resolve initialization issues on older CUDA versions. |
| Llama 3 | meta-llama/Llama-3.2-3B-Instruct | TOOL_CALL_PARSER=llama3_json ENABLE_AUTO_TOOL_CHOICE=true | Use MAX_MODEL_LEN to prevent KV cache from exceeding GPU VRAM. |
| Mistral | mistralai/Ministral-8B-Instruct-2410 | TOKENIZER_MODE=mistral CONFIG_FORMAT=mistral LOAD_FORMAT=mistral TOOL_CALL_PARSER=mistral ENABLE_AUTO_TOOL_CHOICE=true | Mistral models require specialized tokenizers. |
GPU selection
vLLM pre-allocates memory for its KV cache, so you need more VRAM than the minimum required to load the model weights.

VRAM estimation
- FP16/BF16: 2 bytes per parameter.
- INT8: 1 byte per parameter.
- INT4 (AWQ/GPTQ): 0.5 bytes per parameter.
- KV cache: vLLM reserves 10-30% of remaining VRAM for concurrent requests.
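The estimates above can be sketched as a quick back-of-the-envelope check (helper names and the 0.90 default are illustrative; vLLM's actual allocation is more involved):

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory: billions of params x bytes per param = GB."""
    return params_billions * bytes_per_param

def fits(params_billions: float, bytes_per_param: float, gpu_vram_gb: float,
         gpu_memory_utilization: float = 0.90) -> bool:
    """vLLM claims gpu_memory_utilization x total VRAM; the KV cache gets
    what remains after the weights load, so the weights alone must fit
    inside that budget with headroom left over for the cache."""
    budget = gpu_vram_gb * gpu_memory_utilization
    return weights_gb(params_billions, bytes_per_param) < budget

print(weights_gb(7, 2))   # 14 -- a 7B model in FP16 needs ~14 GB for weights
print(fits(7, 2, 24))     # True -- weights fit in a 24 GB card's 0.90 budget
print(fits(70, 2, 24))    # False -- a 70B FP16 model far exceeds 24 GB
```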
| Model size | Recommended GPUs | VRAM |
|---|---|---|
| Small (<10B) | RTX 4090, A6000, L4 | 16-24 GB |
| Medium (10B-30B) | A6000, L40S | 32-48 GB |
| Large (30B-70B) | A100, H100, B200 | 80-180 GB |
Troubleshooting memory issues
- OOM errors: Lower GPU_MEMORY_UTILIZATION from 0.90 to 0.85, or reduce MAX_MODEL_LEN.
- Context window limits: More context means more KV cache. A 7B model that OOMs at 32k context often runs fine at 16k.
- Limited VRAM: Use quantized models (AWQ/GPTQ) to reduce memory by 50-75%.
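Putting the first two fixes together, an endpoint hitting OOM at defaults might be retried with settings like these (the specific values are examples, not universal recommendations):

```shell
GPU_MEMORY_UTILIZATION=0.85
MAX_MODEL_LEN=16384
```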
Additional resources
- vLLM recipes: Step-by-step deployment guides.
- Mistral + vLLM guide.
- Qwen + vLLM guide.