Environment variables let you configure your vLLM worker without rebuilding your Docker image. You can customize model behavior, performance settings, and other deployment options. To set environment variables, go to your endpoint settings and add them under Environment Variables.

LLM settings

These variables control the core language model configuration.
| Variable | Default | Type(s) | Description |
|---|---|---|---|
| MODEL_NAME | facebook/opt-125m | str | The name or path of the Hugging Face model to use. |
| MODEL_REVISION | main | str | The model revision to load. |
| TOKENIZER | None | str | The name or path of the Hugging Face tokenizer to use. |
| SKIP_TOKENIZER_INIT | False | bool | If True, skips the initialization of the tokenizer and detokenizer. |
| TOKENIZER_MODE | auto | auto, slow | The tokenizer mode. |
| TRUST_REMOTE_CODE | False | bool | If True, trusts remote code from Hugging Face. |
| DOWNLOAD_DIR | None | str | The directory to download and load the model weights from. |
| LOAD_FORMAT | auto | str | The format of the model weights to load. |
| HF_TOKEN | - | str | Your Hugging Face token, used for private and gated models. |
| DTYPE | auto | auto, half, float16, bfloat16, float, float32 | The data type for model weights and activations. |
| KV_CACHE_DTYPE | auto | auto, fp8 | The data type for KV cache storage. |
| QUANTIZATION_PARAM_PATH | None | str | The path to the JSON file containing the KV cache scaling factors. |
| MAX_MODEL_LEN | None | int | The maximum model context length. |
| GUIDED_DECODING_BACKEND | outlines | outlines, lm-format-enforcer | The default engine for guided decoding. |
| DISTRIBUTED_EXECUTOR_BACKEND | None | ray, mp | The backend to use for distributed serving. |
| WORKER_USE_RAY | False | bool | Deprecated. Use DISTRIBUTED_EXECUTOR_BACKEND=ray instead. |
| PIPELINE_PARALLEL_SIZE | 1 | int | The number of pipeline stages. |
| TENSOR_PARALLEL_SIZE | 1 | int | The number of tensor parallel replicas. |
| MAX_PARALLEL_LOADING_WORKERS | None | int | The number of workers to use for parallel model loading. |
| RAY_WORKERS_USE_NSIGHT | False | bool | If True, uses nsight to profile Ray workers. |
| ENABLE_PREFIX_CACHING | False | bool | If True, enables automatic prefix caching. |
| DISABLE_SLIDING_WINDOW | False | bool | If True, disables the sliding window, capping to the sliding window size. |
| USE_V2_BLOCK_MANAGER | False | bool | If True, uses BlockSpaceManagerV2. |
| NUM_LOOKAHEAD_SLOTS | 0 | int | The number of lookahead slots, an experimental scheduling setting for speculative decoding. |
| SEED | 0 | int | The random seed for operations. |
| NUM_GPU_BLOCKS_OVERRIDE | None | int | If specified, overrides the GPU profiling result for the number of GPU blocks. |
| MAX_NUM_BATCHED_TOKENS | None | int | The maximum number of batched tokens per iteration. |
| MAX_NUM_SEQS | 256 | int | The maximum number of sequences per iteration. |
| MAX_LOGPROBS | 20 | int | The maximum number of log probabilities to return when logprobs is specified in SamplingParams. |
| DISABLE_LOG_STATS | False | bool | If True, disables logging statistics. |
| QUANTIZATION | None | awq, squeezellm, gptq, bitsandbytes | The method used to quantize the model weights. |
| ROPE_SCALING | None | dict | The RoPE scaling configuration in JSON format (see the example after this table). |
| ROPE_THETA | None | float | The RoPE theta value. Use with ROPE_SCALING. |
| TOKENIZER_POOL_SIZE | 0 | int | The size of the tokenizer pool for asynchronous tokenization. |
| TOKENIZER_POOL_TYPE | ray | str | The type of the tokenizer pool for asynchronous tokenization. |
| TOKENIZER_POOL_EXTRA_CONFIG | None | dict | Extra configuration for the tokenizer pool. |
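For dictionary-valued variables such as ROPE_SCALING, the value is a JSON string. Here is a minimal sketch, assuming a dynamic RoPE scaling configuration; the exact keys accepted depend on your vLLM version and model, so treat the values as placeholders:

```python
import json

# Illustrative only: keys such as "rope_type" and "factor" vary by vLLM
# version and model; check the vLLM documentation for the exact schema.
rope_scaling = {"rope_type": "dynamic", "factor": 2.0}

# The resulting string is what you would paste into the ROPE_SCALING
# environment variable in your endpoint settings.
print(json.dumps(rope_scaling))  # {"rope_type": "dynamic", "factor": 2.0}
```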

LoRA settings

Configure LoRA (Low-Rank Adaptation) adapters for your model.
| Variable | Default | Type | Description |
|---|---|---|---|
| ENABLE_LORA | False | bool | If True, enables the handling of LoRA adapters. |
| MAX_LORAS | 1 | int | The maximum number of LoRAs in a single batch. |
| MAX_LORA_RANK | 16 | int | The maximum LoRA rank. |
| LORA_EXTRA_VOCAB_SIZE | 256 | int | The maximum size of the extra vocabulary for LoRA adapters. |
| LORA_DTYPE | auto | auto, float16, bfloat16, float32 | The data type for LoRA. |
| LONG_LORA_SCALING_FACTORS | None | tuple | Specifies multiple scaling factors for LoRA adapters. |
| MAX_CPU_LORAS | None | int | The maximum number of LoRAs to store in CPU memory. |
| FULLY_SHARDED_LORAS | False | bool | If True, enables fully sharded LoRA layers. |
| LORA_MODULES | [] | list[dict] | A list of LoRA adapters to add from Hugging Face, for example [{"name": "adapter1", "path": "user/adapter1"}] (see the example after this table). |
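A minimal sketch of building a LORA_MODULES value; the adapter name and repository below are hypothetical placeholders, not real repositories. Once the worker loads an adapter, you can typically select it by passing its name as the model in a request:

```python
import json

# Hypothetical adapter entry; "sql-assistant" and "your-user/sql-lora-adapter"
# are placeholders for your own adapter name and Hugging Face repository.
lora_modules = [
    {"name": "sql-assistant", "path": "your-user/sql-lora-adapter"},
]

# Paste this JSON string into the LORA_MODULES environment variable.
print(json.dumps(lora_modules))
```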

Speculative decoding settings

Configure speculative decoding to improve inference performance.
| Variable | Default | Type(s) | Description |
|---|---|---|---|
| SCHEDULER_DELAY_FACTOR | 0.0 | float | Applies a delay before scheduling the next prompt. |
| ENABLE_CHUNKED_PREFILL | False | bool | If True, enables chunked prefill requests. |
| SPECULATIVE_MODEL | None | str | The name of the draft model for speculative decoding. |
| NUM_SPECULATIVE_TOKENS | None | int | The number of speculative tokens to sample from the draft model. |
| SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE | None | int | The number of tensor parallel replicas for the draft model. |
| SPECULATIVE_MAX_MODEL_LEN | None | int | The maximum sequence length supported by the draft model. |
| SPECULATIVE_DISABLE_BY_BATCH_SIZE | None | int | Disables speculative decoding if the number of enqueued requests is larger than this value. |
| NGRAM_PROMPT_LOOKUP_MAX | None | int | The maximum window size for ngram prompt lookup in speculative decoding. |
| NGRAM_PROMPT_LOOKUP_MIN | None | int | The minimum window size for ngram prompt lookup in speculative decoding. |
| SPEC_DECODING_ACCEPTANCE_METHOD | rejection_sampler | rejection_sampler, typical_acceptance_sampler | The acceptance method for draft token verification in speculative decoding. |
| TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD | None | float | Sets the lower bound threshold for the posterior probability of a token to be accepted. |
| TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA | None | float | A scaling factor for the entropy-based threshold for token acceptance. |
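As a rough sketch, a draft-model configuration might combine the variables above as shown below. The model name and token count are placeholders, not recommendations from this documentation:

```python
# Illustrative values only; choose a draft model that is compatible with
# your target model (shared tokenizer and vocabulary).
speculative_settings = {
    "SPECULATIVE_MODEL": "your-org/small-draft-model",  # hypothetical repo
    "NUM_SPECULATIVE_TOKENS": "5",
    "SPEC_DECODING_ACCEPTANCE_METHOD": "rejection_sampler",
}

# Each name/value pair corresponds to one environment variable on the endpoint.
for name, value in speculative_settings.items():
    print(f"{name}={value}")
```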

System performance settings

Configure GPU memory and system resource utilization.
| Variable | Default | Type(s) | Description |
|---|---|---|---|
| GPU_MEMORY_UTILIZATION | 0.95 | float | The GPU VRAM utilization. |
| MAX_PARALLEL_LOADING_WORKERS | None | int | Loads the model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and large models. |
| BLOCK_SIZE | 16 | 8, 16, 32 | The token block size for contiguous chunks of tokens. |
| SWAP_SPACE | 4 | int | The CPU swap space size (in GiB) per GPU. |
| ENFORCE_EAGER | False | bool | If True, always uses eager-mode PyTorch. If False, uses a hybrid of eager mode and CUDA graphs for maximal performance and flexibility. |
| MAX_SEQ_LEN_TO_CAPTURE | 8192 | int | The maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, the system falls back to eager mode. |
| DISABLE_CUSTOM_ALL_REDUCE | 0 | int | If 0, enables custom all-reduce. If 1, disables it. |

Tokenizer settings

Customize tokenizer behavior and chat templates.
| Variable | Default | Type(s) | Description |
|---|---|---|---|
| TOKENIZER_NAME | None | str | The tokenizer repository to use when you need a different tokenizer than the model’s default. |
| TOKENIZER_REVISION | None | str | The tokenizer revision to load. |
| CUSTOM_CHAT_TEMPLATE | None | str of single-line Jinja template | A custom chat Jinja template (see the example after this table). See the Hugging Face documentation for more information. |
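A minimal sketch of a single-line Jinja chat template for CUSTOM_CHAT_TEMPLATE. This generic role/content format is only an illustration; real chat templates are model-specific:

```python
# Built in Python here only to keep the value readable; the final string
# (one physical line) is what goes into CUSTOM_CHAT_TEMPLATE.
template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }} "
    "{% endfor %}"
    "assistant:"
)
print(template)
```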

Streaming and batch settings

Control how tokens are batched in HTTP responses when streaming. The batch size starts at DEFAULT_MIN_BATCH_SIZE and increases by a factor of DEFAULT_BATCH_SIZE_GROWTH_FACTOR with each request until it reaches DEFAULT_BATCH_SIZE. For example, with the default values the batch sizes are 1, 3, 9, 27, and then 50 for all subsequent requests (see the sketch after the table below). These settings do not affect vLLM’s internal batching.
| Variable | Default | Type(s) | Description |
|---|---|---|---|
| DEFAULT_BATCH_SIZE | 50 | int | The default and maximum batch size for token streaming. |
| DEFAULT_MIN_BATCH_SIZE | 1 | int | The initial batch size for the first request. |
| DEFAULT_BATCH_SIZE_GROWTH_FACTOR | 3 | float | The growth factor for the dynamic batch size. |
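A short sketch of the growth rule described above, using the default values. It reproduces the 1, 3, 9, 27, 50 sequence and is only an illustration of the rule, not the worker's actual code:

```python
def stream_batch_sizes(min_size=1, growth_factor=3, max_size=50, steps=6):
    """Yield the token batch size used for each successive streamed batch."""
    size = float(min_size)
    for _ in range(steps):
        # The batch size is capped at max_size (DEFAULT_BATCH_SIZE).
        yield min(int(size), max_size)
        size *= growth_factor

print(list(stream_batch_sizes()))  # [1, 3, 9, 27, 50, 50]
```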

OpenAI compatibility settings

Configure OpenAI API compatibility features.
| Variable | Default | Type(s) | Description |
|---|---|---|---|
| RAW_OPENAI_OUTPUT | 1 | boolean as int | If 1, enables raw OpenAI SSE format string output when streaming. This is required for OpenAI compatibility. |
| OPENAI_SERVED_MODEL_NAME_OVERRIDE | None | str | Overrides the served model name, letting you use a custom name in the model parameter of your OpenAI requests (see the example after this table). |
| OPENAI_RESPONSE_ROLE | assistant | str | The role of the LLM’s response in OpenAI chat completions. |
| ENABLE_AUTO_TOOL_CHOICE | false | bool | If true, enables automatic tool selection for supported models. |
| TOOL_CALL_PARSER | None | str | The parser for tool calls. |
| REASONING_PARSER | None | str | The parser for reasoning-capable models. Setting this enables reasoning mode. |
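For example, a request against the worker's OpenAI-compatible API might look like the sketch below. The base URL is an assumption; substitute your endpoint's actual OpenAI-compatible URL and your RunPod API key. The model name assumes OPENAI_SERVED_MODEL_NAME_OVERRIDE is set to my-model:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",  # placeholder
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # assumed URL pattern
)

# If OPENAI_SERVED_MODEL_NAME_OVERRIDE is unset, pass the Hugging Face
# model ID (the MODEL_NAME value) instead of the override name.
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```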

Serverless and concurrency settings

Configure concurrency and logging for Serverless deployments.
| Variable | Default | Type(s) | Description |
|---|---|---|---|
| MAX_CONCURRENCY | 300 | int | The maximum number of concurrent requests per worker. vLLM’s internal queue handles VRAM limitations, so this setting is primarily for scaling and load balancing efficiency. |
| DISABLE_LOG_STATS | False | bool | If False, enables vLLM stats logging. |
| DISABLE_LOG_REQUESTS | False | bool | If False, enables vLLM request logging. |

Advanced settings

Additional configuration options for specialized use cases.
| Variable | Default | Type | Description |
|---|---|---|---|
| MODEL_LOADER_EXTRA_CONFIG | None | dict | Extra configuration for the model loader. |
| PREEMPTION_MODE | None | str | The preemption mode. If recompute, the engine performs preemption-aware recomputation. If save, the engine saves activations to CPU memory during preemption. |
| PREEMPTION_CHECK_PERIOD | 1.0 | float | The frequency (in seconds) at which the engine checks for preemption. |
| PREEMPTION_CPU_CAPACITY | 2 | float | The percentage of CPU memory to use for saved activations. |
| DISABLE_LOGGING_REQUEST | False | bool | If True, disables logging requests. |
| MAX_LOG_LEN | None | int | The maximum number of prompt characters or prompt ID numbers to print in the log. |

Docker build arguments

These variables are used when building custom Docker images with models baked in.
| Variable | Default | Type | Description |
|---|---|---|---|
| BASE_PATH | /runpod-volume | str | The storage directory for the Hugging Face cache and model. |
| WORKER_CUDA_VERSION | 12.1.0 | str | The CUDA version for the worker image. |