Most LLMs need specific configuration to run properly on vLLM. Default settings work for some models, but many require custom tokenization, attention mechanisms, or feature flags. Without the right settings, workers may fail to load or produce incorrect outputs. When deploying a model, check its Hugging Face README and the vLLM documentation for required settings.

Environment variables

vLLM is configured using command-line flags. On Runpod, set these as environment variables instead: drop the leading dashes, convert the flag name to uppercase, and replace hyphens with underscores. For example:
--tokenizer_mode mistral
Becomes:
TOKENIZER_MODE=mistral
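The conversion rule can be sketched as a small helper. This is an illustrative function (not part of Runpod or vLLM), assuming that valueless boolean switches become `VAR=true` as in the Mistral example below:

```python
def flag_to_env(flag: str, value: str = "") -> str:
    """Convert a vLLM CLI flag to its Runpod environment-variable form.

    Rule from the docs: strip leading dashes, uppercase the name, and
    replace hyphens with underscores. Boolean switches (flags passed
    without a value) become VAR=true.
    """
    name = flag.lstrip("-").replace("-", "_").upper()
    return f"{name}={value}" if value else f"{name}=true"

print(flag_to_env("--tokenizer_mode", "mistral"))  # TOKENIZER_MODE=mistral
print(flag_to_env("--enable-auto-tool-choice"))    # ENABLE_AUTO_TOOL_CHOICE=true
```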

Example: Deploying Mistral

CLI command:
vllm serve mistralai/Ministral-8B-Instruct-2410 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
Equivalent Runpod environment variables:
| Variable | Value |
| --- | --- |
| MODEL_NAME | mistralai/Ministral-8B-Instruct-2410 |
| TOKENIZER_MODE | mistral |
| CONFIG_FORMAT | mistral |
| LOAD_FORMAT | mistral |
| ENABLE_AUTO_TOOL_CHOICE | true |
| TOOL_CALL_PARSER | mistral |
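With ENABLE_AUTO_TOOL_CHOICE and TOOL_CALL_PARSER set, the worker can serve tool-calling requests in the OpenAI chat-completions format. The sketch below builds such a request payload; the endpoint URL pattern, the placeholder endpoint ID, and the `get_weather` tool are illustrative assumptions, not values from this guide:

```python
import json

RUNPOD_ENDPOINT_ID = "your-endpoint-id"  # hypothetical placeholder
# Assumed OpenAI-compatible route for a Runpod serverless vLLM worker;
# confirm the exact URL on your endpoint's details page.
base_url = f"https://api.runpod.ai/v2/{RUNPOD_ENDPOINT_ID}/openai/v1"

payload = {
    "model": "mistralai/Ministral-8B-Instruct-2410",
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

# Send with any OpenAI-compatible client:
#   POST {base_url}/chat/completions
#   Authorization: Bearer <RUNPOD_API_KEY>
print(json.dumps(payload, indent=2))
```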

Model-specific configurations

Recommended environment variables for popular model families. Check your model’s documentation for exact requirements.
| Model family | Example model | Key environment variables | Notes |
| --- | --- | --- | --- |
| Qwen3 | Qwen/Qwen3-8B | ENABLE_AUTO_TOOL_CHOICE=true, TOOL_CALL_PARSER=hermes | For AWQ/GPTQ versions, set QUANTIZATION accordingly. |
| OpenChat | openchat/openchat-3.5-0106 | None required | Use CUSTOM_CHAT_TEMPLATE if default templates produce poor results. |
| Gemma | google/gemma-3-1b-it | None required | Requires HF_TOKEN. Set DTYPE=bfloat16 for best results. |
| DeepSeek-R1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | REASONING_PARSER=deepseek_r1 | Enables reasoning mode for chain-of-thought outputs. |
| Phi-4 | microsoft/Phi-4-mini-instruct | None required | ENFORCE_EAGER=true can resolve initialization issues on older CUDA versions. |
| Llama 3 | meta-llama/Llama-3.2-3B-Instruct | TOOL_CALL_PARSER=llama3_json, ENABLE_AUTO_TOOL_CHOICE=true | Use MAX_MODEL_LEN to prevent the KV cache from exceeding GPU VRAM. |
| Mistral | mistralai/Ministral-8B-Instruct-2410 | TOKENIZER_MODE=mistral, CONFIG_FORMAT=mistral, LOAD_FORMAT=mistral, TOOL_CALL_PARSER=mistral, ENABLE_AUTO_TOOL_CHOICE=true | Mistral models require specialized tokenizers. |

GPU selection

vLLM pre-allocates GPU memory for its KV cache, so you need more VRAM than the minimum required to load the model weights alone.

VRAM estimation

  • FP16/BF16: 2 bytes per parameter.
  • INT8: 1 byte per parameter.
  • INT4 (AWQ/GPTQ): 0.5 bytes per parameter.
  • KV cache: vLLM reserves 10-30% of remaining VRAM for concurrent requests.
| Model size | Recommended GPUs | VRAM |
| --- | --- | --- |
| Small (<10B) | RTX 4090, A6000, L4 | 16-24 GB |
| Medium (10B-30B) | A6000, L40S | 32-48 GB |
| Large (30B-70B) | A100, H100, B200 | 80-180 GB |
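The estimation rules above can be turned into a rough calculator. This is a back-of-the-envelope sketch, not vLLM's actual allocator: the 20% KV-cache overhead default is an assumption drawn from the 10-30% range stated above, and real usage also depends on context length and GPU_MEMORY_UTILIZATION.

```python
def estimate_vram_gb(params_billions: float, dtype: str = "fp16",
                     kv_cache_fraction: float = 0.2) -> float:
    """Rough VRAM estimate (GB) for serving a model with vLLM.

    Weights = parameters * bytes per parameter; vLLM then reserves an
    additional share of VRAM for the KV cache (10-30% per the
    guidelines above; 20% is used here as a middle-ground assumption).
    """
    bytes_per_param = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}
    weights_gb = params_billions * bytes_per_param[dtype]
    return weights_gb * (1 + kv_cache_fraction)

# 8B model in BF16: 16 GB of weights plus KV-cache headroom
print(round(estimate_vram_gb(8, "bf16"), 1))   # → 19.2
# 70B model quantized to INT4 (AWQ/GPTQ): 35 GB of weights plus headroom
print(round(estimate_vram_gb(70, "int4"), 1))  # → 42.0
```

The two examples line up with the table: an 8B BF16 model fits the "Small/Medium" boundary, while a quantized 70B model drops from the "Large" tier into L40S/A100 territory.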

Troubleshooting memory issues

  • OOM errors: Lower GPU_MEMORY_UTILIZATION from 0.90 to 0.85, or reduce MAX_MODEL_LEN.
  • Context window limits: More context means more KV cache. A 7B model that OOMs at 32k context often runs fine at 16k.
  • Limited VRAM: Use quantized models (AWQ/GPTQ) to reduce memory by 50-75%.
For production workloads, select multiple GPU types in your endpoint configuration for hardware fallback.
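The troubleshooting steps above suggest a simple retry ladder after an OOM: lower GPU_MEMORY_UTILIZATION first, then shrink the context window. The ordering and the halving step in this sketch are illustrative choices, not a Runpod feature:

```python
def relax_on_oom(cfg: dict) -> dict:
    """Return a copy of the env-var config with one OOM remedy applied.

    Assumed order (an illustrative policy): first drop
    GPU_MEMORY_UTILIZATION from 0.90 to 0.85, then halve MAX_MODEL_LEN
    on subsequent OOMs.
    """
    cfg = dict(cfg)
    util = float(cfg.get("GPU_MEMORY_UTILIZATION", "0.90"))
    if util > 0.85:
        cfg["GPU_MEMORY_UTILIZATION"] = "0.85"
    else:
        max_len = int(cfg.get("MAX_MODEL_LEN", "32768"))
        cfg["MAX_MODEL_LEN"] = str(max_len // 2)
    return cfg

cfg = {"GPU_MEMORY_UTILIZATION": "0.90", "MAX_MODEL_LEN": "32768"}
cfg = relax_on_oom(cfg)  # first OOM: lower utilization to 0.85
cfg = relax_on_oom(cfg)  # second OOM: halve context to 16384
print(cfg)
```

This mirrors the observation above that a model which OOMs at 32k context often runs fine at 16k.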

Additional resources