
Environment variables

Environment variables configure your vLLM Worker by providing control over model selection, access credentials, and operational parameters necessary for optimal Worker performance.

CUDA versions

Operating your vLLM Worker with different CUDA versions enhances compatibility and performance across various hardware configurations. When deploying, choose an appropriate CUDA version based on your needs.

| CUDA Version | Stable Image Tag | Development Image Tag | Note |
|---|---|---|---|
| 11.8.0 | `runpod/worker-vllm:stable-cuda11.8.0` | `runpod/worker-vllm:dev-cuda11.8.0` | Available on all RunPod Workers without additional selection needed. |
| 12.1.0 | `runpod/worker-vllm:stable-cuda12.1.0` | `runpod/worker-vllm:dev-cuda12.1.0` | When creating an Endpoint, select CUDA Versions 12.2 and 12.1 in the GPU filter. |

This table lists the image tags to use based on the desired CUDA version and image stability (stable or development). Be sure to follow the selection note for CUDA 12.1.0 compatibility.

Environment variables


0 is equivalent to False and 1 is equivalent to True for boolean values.
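For example, worker-side code can parse these flags with a small helper. This is a minimal sketch of the 0/1 convention; the `env_flag` helper is illustrative, not part of the worker's API:

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    # "1" -> True; "0", any other value, or unset -> False
    return os.environ.get(name, default) == "1"

os.environ["TRUST_REMOTE_CODE"] = "1"
print(env_flag("TRUST_REMOTE_CODE"))  # True
print(env_flag("SOME_UNSET_FLAG"))    # False (unset, default "0")
```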

LLM Settings

| Name | Default | Type/Options | Description |
|---|---|---|---|
| `MODEL_NAME` (required) | - | `str` | Hugging Face model repository (e.g., `openchat/openchat-3.5-1210`). |
| `MODEL_REVISION` | `None` | `str` | Model revision (branch) to load. |
| `MAX_MODEL_LEN` | Model's maximum | `int` | Maximum number of tokens for the engine to handle per request. |
| `BASE_PATH` | `/runpod-volume` | `str` | Storage directory for the Hugging Face cache and model. If a network volume is attached and this points at `/runpod-volume`, a single worker downloads the model once and all workers load it from there. If no network volume is present, each worker creates a local directory. |
| `LOAD_FORMAT` | `auto` | `str` | Format to load the model in. |
| `HF_TOKEN` | - | `str` | Hugging Face token for private and gated models. |
| `QUANTIZATION` | `None` | `awq`, `squeezellm`, `gptq` | Quantization of the given model. The model must already be quantized. |
| `TRUST_REMOTE_CODE` | `0` | boolean as int | Trust remote code for Hugging Face models. Can help with Mixtral 8x7B, quantized models, and unusual models/architectures. |
| `SEED` | `0` | `int` | Sets the random seed for operations. |
| `KV_CACHE_DTYPE` | `auto` | `auto`, `fp8_e5m2` | Data type for KV cache storage. Uses `DTYPE` if set to `auto`. |
| `DTYPE` | `auto` | `auto`, `half`, `float16`, `bfloat16`, `float`, `float32` | Sets the data type/precision for model weights and activations. |
Tokenizer Settings

| Name | Default | Type/Options | Description |
|---|---|---|---|
| `TOKENIZER_NAME` | `None` | `str` | Tokenizer repository, to use a tokenizer other than the model's default. |
| `TOKENIZER_REVISION` | `None` | `str` | Tokenizer revision to load. |
| `CUSTOM_CHAT_TEMPLATE` | `None` | `str` of single-line Jinja template | Custom chat Jinja template. |
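Because `CUSTOM_CHAT_TEMPLATE` must be a single line, any chat template you set this way has to be written without newlines. A minimal sketch, assuming the standard `messages`/`role`/`content` structure used by chat templates; the template itself is an illustrative example, not the worker's default:

```python
import os

# Illustrative single-line Jinja chat template (not a real model's template).
os.environ["CUSTOM_CHAT_TEMPLATE"] = (
    "{% for m in messages %}{{ m.role }}: {{ m.content }} {% endfor %}"
)

# The value must contain no literal newlines.
assert "\n" not in os.environ["CUSTOM_CHAT_TEMPLATE"]
```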
System, GPU, and Tensor Parallelism (Multi-GPU) Settings

| Name | Default | Type/Options | Description |
|---|---|---|---|
| `GPU_MEMORY_UTILIZATION` | `0.95` | `float` | Fraction of GPU VRAM to utilize. |
| `MAX_PARALLEL_LOADING_WORKERS` | `None` | `int` | Load the model sequentially in multiple batches, to avoid RAM OOM when using tensor parallelism with large models. |
| `BLOCK_SIZE` | `16` | `8`, `16`, `32` | Token block size for contiguous chunks of tokens. |
| `SWAP_SPACE` | `4` | `int` | CPU swap space size (GiB) per GPU. |
| `ENFORCE_EAGER` | `0` | boolean as int | Always use eager-mode PyTorch. If False (`0`), uses eager mode and CUDA graphs in hybrid for maximal performance and flexibility. |
| `MAX_CONTEXT_LEN_TO_CAPTURE` | `8192` | `int` | Maximum context length covered by CUDA graphs. Sequences with a longer context fall back to eager mode. |
| `DISABLE_CUSTOM_ALL_REDUCE` | `0` | `int` | When set to `1`, disables the custom all-reduce kernel. |
Streaming Batch Size Settings

| Name | Default | Type/Options | Description |
|---|---|---|---|
| `DEFAULT_BATCH_SIZE` | `50` | `int` | Default and maximum batch size for token streaming, used to reduce HTTP calls. |
| `DEFAULT_MIN_BATCH_SIZE` | `1` | `int` | Batch size for the first request; multiplied by the growth factor on every subsequent request. |
| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | `3` | `float` | Growth factor for the dynamic batch size. |
The first request uses a batch size of `DEFAULT_MIN_BATCH_SIZE`, and each subsequent request uses `previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR`, until the batch size reaches `DEFAULT_BATCH_SIZE`. With the default values, the batch sizes are 1, 3, 9, 27, 50, 50, 50, .... You can also set these per request, with the inputs `max_batch_size`, `min_batch_size`, and `batch_size_growth_factor`. This has nothing to do with vLLM's internal batching; it only controls the number of tokens sent in each HTTP request from the worker.
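The growth schedule above can be sketched as follows. This is a minimal illustration of the arithmetic, not the worker's actual code:

```python
def streaming_batch_sizes(n_requests, min_batch=1, growth=3, max_batch=50):
    """Yield the token batch size used for each successive HTTP response."""
    size = min_batch
    for _ in range(n_requests):
        yield min(int(size), max_batch)  # capped at the maximum batch size
        size *= growth

print(list(streaming_batch_sizes(7)))  # [1, 3, 9, 27, 50, 50, 50]
```

With the defaults, the sequence grows geometrically until it is capped at `DEFAULT_BATCH_SIZE`, matching the 1, 3, 9, 27, 50, ... progression described above.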
OpenAI Settings

| Name | Default | Type/Options | Description |
|---|---|---|---|
| `RAW_OPENAI_OUTPUT` | `1` | boolean as int | Enables raw OpenAI SSE-format string output when streaming. Must be enabled (the default) for OpenAI compatibility. |
| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | `None` | `str` | Overrides the served model name from the model repo/path to the specified name, which you can then use as the `model` parameter when making OpenAI requests. |
| `OPENAI_RESPONSE_ROLE` | `assistant` | `str` | Role of the LLM's response in OpenAI chat completions. |
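For instance, with `OPENAI_SERVED_MODEL_NAME_OVERRIDE` set, an OpenAI-style request references the overridden name in its `model` field. A sketch of the request body only; the `my-model` value is a placeholder that must match whatever override you configured on the endpoint:

```python
import json

# Assumed override value, i.e. what OPENAI_SERVED_MODEL_NAME_OVERRIDE is set to.
served_model_name = "my-model"

payload = {
    "model": served_model_name,  # overridden name instead of the repo/path
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}
body = json.dumps(payload)
```

A client would POST this body to the endpoint's OpenAI-compatible route; when `RAW_OPENAI_OUTPUT` is enabled, streamed chunks arrive in raw OpenAI SSE format.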
Serverless Settings

| Name | Default | Type/Options | Description |
|---|---|---|---|
| `MAX_CONCURRENCY` | `300` | `int` | Maximum concurrent requests per worker. vLLM has an internal queue, so you don't need to limit this by VRAM; it is for improving scaling/load-balancing efficiency. |
| `DISABLE_LOG_STATS` | `1` | boolean as int | When set to `1`, disables vLLM stats logging. |
| `DISABLE_LOG_REQUESTS` | `1` | boolean as int | When set to `1`, disables vLLM request logging. |

If you are facing issues when using Mixtral 8x7B, Quantized models, or handling unusual models/architectures, try setting TRUST_REMOTE_CODE to 1.