Skip to main content

Environment variables

Environment variables configure your vLLM Worker by providing control over model selection, access credentials, and operational parameters necessary for optimal Worker performance.

CUDA versions

Operating your vLLM Worker with different CUDA versions enhances compatibility and performance across various hardware configurations. When deploying, ensure you choose an appropriate CUDA version based on your needs.

CUDA VersionStable Image TagDevelopment Image TagNote
12.1.0runpod/worker-v1-vllm:stable-cuda12.1.0runpod/worker-v1-vllm:dev-cuda12.1.0When creating an Endpoint, select CUDA Version 12.2 and 12.1 in the GPU filter.

This table provides a reference to the image tags you should use based on the desired CUDA version and image stability, stable or development. Ensure you follow the selection note for CUDA 12.1.0 compatibility.

Environment variables

note

0 is equivalent to False and 1 is equivalent to True for boolean values.

NameDefaultType/ChoicesDescription
MODEL_NAME'facebook/opt-125m'strName or path of the Hugging Face model to use.
TOKENIZERNonestrName or path of the Hugging Face tokenizer to use.
SKIP_TOKENIZER_INITFalseboolSkip initialization of tokenizer and detokenizer.
TOKENIZER_MODE'auto'['auto', 'slow']The tokenizer mode.
TRUST_REMOTE_CODEFalseboolTrust remote code from Hugging Face.
DOWNLOAD_DIRNonestrDirectory to download and load the weights.
LOAD_FORMAT'auto'strThe format of the model weights to load.
HF_TOKEN-strHugging Face token for private and gated models.
DTYPE'auto'['auto', 'half', 'float16', 'bfloat16', 'float', 'float32']Data type for model weights and activations.
KV_CACHE_DTYPE'auto'['auto', 'fp8']Data type for KV cache storage.
QUANTIZATION_PARAM_PATHNonestrPath to the JSON file containing the KV cache scaling factors.
MAX_MODEL_LENNoneintModel context length.
GUIDED_DECODING_BACKEND'outlines'['outlines', 'lm-format-enforcer']Which engine will be used for guided decoding by default.
DISTRIBUTED_EXECUTOR_BACKENDNone['ray', 'mp']Backend to use for distributed serving.
WORKER_USE_RAYFalseboolDeprecated, use --distributed-executor-backend=ray.
PIPELINE_PARALLEL_SIZE1intNumber of pipeline stages.
TENSOR_PARALLEL_SIZE1intNumber of tensor parallel replicas.
MAX_PARALLEL_LOADING_WORKERSNoneintLoad model sequentially in multiple batches.
RAY_WORKERS_USE_NSIGHTFalseboolIf specified, use nsight to profile Ray workers.
ENABLE_PREFIX_CACHINGFalseboolEnables automatic prefix caching.
DISABLE_SLIDING_WINDOWFalseboolDisables sliding window, capping to sliding window size.
USE_V2_BLOCK_MANAGERFalseboolUse BlockSpaceMangerV2.
NUM_LOOKAHEAD_SLOTS0intExperimental scheduling config necessary for speculative decoding.
SEED0intRandom seed for operations.
NUM_GPU_BLOCKS_OVERRIDENoneintIf specified, ignore GPU profiling result and use this number of GPU blocks.
MAX_NUM_BATCHED_TOKENSNoneintMaximum number of batched tokens per iteration.
MAX_NUM_SEQS256intMaximum number of sequences per iteration.
MAX_LOGPROBS20intMax number of log probs to return when logprobs is specified in SamplingParams.
DISABLE_LOG_STATSFalseboolDisable logging statistics.
QUANTIZATIONNone['awq', 'squeezellm', 'gptq']Method used to quantize the weights.
ROPE_SCALINGNonedictRoPE scaling configuration in JSON format.
ROPE_THETANonefloatRoPE theta. Use with rope_scaling.
TOKENIZER_POOL_SIZE0intSize of tokenizer pool to use for asynchronous tokenization.
TOKENIZER_POOL_TYPE'ray'strType of tokenizer pool to use for asynchronous tokenization.
TOKENIZER_POOL_EXTRA_CONFIGNonedictExtra config for tokenizer pool.
ENABLE_LORAFalseboolIf True, enable handling of LoRA adapters.
MAX_LORAS1intMax number of LoRAs in a single batch.
MAX_LORA_RANK16intMax LoRA rank.
LORA_EXTRA_VOCAB_SIZE256intMaximum size of extra vocabulary for LoRA adapters.
LORA_DTYPE'auto'['auto', 'float16', 'bfloat16', 'float32']Data type for LoRA.
LONG_LORA_SCALING_FACTORSNonetupleSpecify multiple scaling factors for LoRA adapters.
MAX_CPU_LORASNoneintMaximum number of LoRAs to store in CPU memory.
FULLY_SHARDED_LORASFalseboolEnable fully sharded LoRA layers.
SCHEDULER_DELAY_FACTOR0.0floatApply a delay before scheduling next prompt.
ENABLE_CHUNKED_PREFILLFalseboolEnable chunked prefill requests.
SPECULATIVE_MODELNonestrThe name of the draft model to be used in speculative decoding.
NUM_SPECULATIVE_TOKENSNoneintThe number of speculative tokens to sample from the draft model.
SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZENoneintNumber of tensor parallel replicas for the draft model.
SPECULATIVE_MAX_MODEL_LENNoneintThe maximum sequence length supported by the draft model.
SPECULATIVE_DISABLE_BY_BATCH_SIZENoneintDisable speculative decoding if the number of enqueue requests is larger than this value.
NGRAM_PROMPT_LOOKUP_MAXNoneintMax size of window for ngram prompt lookup in speculative decoding.
NGRAM_PROMPT_LOOKUP_MINNoneintMin size of window for ngram prompt lookup in speculative decoding.
SPEC_DECODING_ACCEPTANCE_METHOD'rejection_sampler'['rejection_sampler', 'typical_acceptance_sampler']Specify the acceptance method for draft token verification in speculative decoding.
TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLDNonefloatSet the lower bound threshold for the posterior probability of a token to be accepted.
TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHANonefloatA scaling factor for the entropy-based threshold for token acceptance.
MODEL_LOADER_EXTRA_CONFIGNonedictExtra config for model loader.
PREEMPTION_MODENonestrIf 'recompute', the engine performs preemption-aware recomputation. If 'save', the engine saves activations into the CPU memory as preemption happens.
PREEMPTION_CHECK_PERIOD1.0floatHow frequently the engine checks if a preemption happens.
PREEMPTION_CPU_CAPACITY2floatThe percentage of CPU memory used for the saved activations.
DISABLE_LOGGING_REQUESTFalseboolDisable logging requests.
MAX_LOG_LENNoneintMax number of prompt characters or prompt ID numbers being printed in log.
Tokenizer Settings
TOKENIZER_NAMENonestrTokenizer repository to use a different tokenizer than the model's default.
TOKENIZER_REVISIONNonestrTokenizer revision to load.
CUSTOM_CHAT_TEMPLATENonestr of single-line jinja templateCustom chat jinja template. More Info
System, GPU, and Tensor Parallelism(Multi-GPU) Settings
GPU_MEMORY_UTILIZATION0.95floatSets GPU VRAM utilization.
MAX_PARALLEL_LOADING_WORKERSNoneintLoad model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models.
BLOCK_SIZE168, 16, 32Token block size for contiguous chunks of tokens.
SWAP_SPACE4intCPU swap space size (GiB) per GPU.
ENFORCE_EAGERFalseboolAlways use eager-mode PyTorch. If False(0), will use eager mode and CUDA graph in hybrid for maximal performance and flexibility.
MAX_SEQ_LEN_TO_CAPTURE8192intMaximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode.
DISABLE_CUSTOM_ALL_REDUCE0intEnables or disables custom all reduce.
Streaming Batch Size Settings:
DEFAULT_BATCH_SIZE50intDefault and Maximum batch size for token streaming to reduce HTTP calls.
DEFAULT_MIN_BATCH_SIZE1intBatch size for the first request, which will be multiplied by the growth factor every subsequent request.
DEFAULT_BATCH_SIZE_GROWTH_FACTOR3floatGrowth factor for dynamic batch size.
The way this works is that the first request will have a batch size of DEFAULT_MIN_BATCH_SIZE, and each subsequent request will have a batch size of previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR. This will continue until the batch size reaches DEFAULT_BATCH_SIZE. E.g. for the default values, the batch sizes will be 1, 3, 9, 27, 50, 50, 50, .... You can also specify this per request, with inputs max_batch_size, min_batch_size, and batch_size_growth_factor. This has nothing to do with vLLM's internal batching, but rather the number of tokens sent in each HTTP request from the worker
OpenAI Settings
RAW_OPENAI_OUTPUT1boolean as intEnables raw OpenAI SSE format string output when streaming. Required to be enabled (which it is by default) for OpenAI compatibility.
OPENAI_SERVED_MODEL_NAME_OVERRIDENonestrOverrides the name of the served model from model repo/path to specified name, which you will then be able to use the value for the model parameter when making OpenAI requests
OPENAI_RESPONSE_ROLEassistantstrRole of the LLM's Response in OpenAI Chat Completions.
Serverless Settings
MAX_CONCURRENCY300intMax concurrent requests per worker. vLLM has an internal queue, so you don't have to worry about limiting by VRAM, this is for improving scaling/load balancing efficiency
DISABLE_LOG_STATSFalseboolEnables or disables vLLM stats logging.
DISABLE_LOG_REQUESTSFalseboolEnables or disables vLLM request logging.
note

If you are facing issues when using Mixtral 8x7B, Quantized models, or handling unusual models/architectures, try setting TRUST_REMOTE_CODE to 1.