LLM settings
These variables control the core language model configuration.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| MODEL_NAME | facebook/opt-125m | str | The name or path of the Hugging Face model to use. |
| MODEL_REVISION | main | str | The model revision to load. |
| TOKENIZER | None | str | The name or path of the Hugging Face tokenizer to use. |
| SKIP_TOKENIZER_INIT | False | bool | If True, skips the initialization of the tokenizer and detokenizer. |
| TOKENIZER_MODE | auto | auto, slow | The tokenizer mode. |
| TRUST_REMOTE_CODE | False | bool | If True, trusts remote code from Hugging Face. |
| DOWNLOAD_DIR | None | str | The directory to download and load the model weights from. |
| LOAD_FORMAT | auto | str | The format of the model weights to load. |
| HF_TOKEN | - | str | Your Hugging Face token, used for private and gated models. |
| DTYPE | auto | auto, half, float16, bfloat16, float, float32 | The data type for model weights and activations. |
| KV_CACHE_DTYPE | auto | auto, fp8 | The data type for KV cache storage. |
| QUANTIZATION_PARAM_PATH | None | str | The path to the JSON file containing the KV cache scaling factors. |
| MAX_MODEL_LEN | None | int | The maximum model context length. |
| GUIDED_DECODING_BACKEND | outlines | outlines, lm-format-enforcer | The default engine for guided decoding. |
| DISTRIBUTED_EXECUTOR_BACKEND | None | ray, mp | The backend to use for distributed serving. |
| WORKER_USE_RAY | False | bool | Deprecated. Use DISTRIBUTED_EXECUTOR_BACKEND=ray instead. |
| PIPELINE_PARALLEL_SIZE | 1 | int | The number of pipeline stages. |
| TENSOR_PARALLEL_SIZE | 1 | int | The number of tensor parallel replicas. |
| MAX_PARALLEL_LOADING_WORKERS | None | int | The number of workers to use for parallel model loading. |
| RAY_WORKERS_USE_NSIGHT | False | bool | If True, uses nsight to profile Ray workers. |
| ENABLE_PREFIX_CACHING | False | bool | If True, enables automatic prefix caching. |
| DISABLE_SLIDING_WINDOW | False | bool | If True, disables the sliding window, capping the context to the sliding window size. |
| USE_V2_BLOCK_MANAGER | False | bool | If True, uses BlockSpaceManagerV2. |
| NUM_LOOKAHEAD_SLOTS | 0 | int | The number of lookahead slots, an experimental scheduling configuration for speculative decoding. |
| SEED | 0 | int | The random seed for operations. |
| NUM_GPU_BLOCKS_OVERRIDE | None | int | If specified, this value overrides the GPU profiling result for the number of GPU blocks. |
| MAX_NUM_BATCHED_TOKENS | None | int | The maximum number of batched tokens per iteration. |
| MAX_NUM_SEQS | 256 | int | The maximum number of sequences per iteration. |
| MAX_LOGPROBS | 20 | int | The maximum number of log probabilities to return when logprobs is specified in SamplingParams. |
| DISABLE_LOG_STATS | False | bool | If True, disables logging statistics. |
| QUANTIZATION | None | awq, squeezellm, gptq, bitsandbytes | The method used to quantize the model weights. |
| ROPE_SCALING | None | dict | The RoPE scaling configuration in JSON format. |
| ROPE_THETA | None | float | The RoPE theta value. Use with ROPE_SCALING. |
| TOKENIZER_POOL_SIZE | 0 | int | The size of the tokenizer pool for asynchronous tokenization. |
| TOKENIZER_POOL_TYPE | ray | str | The type of the tokenizer pool for asynchronous tokenization. |
| TOKENIZER_POOL_EXTRA_CONFIG | None | dict | Extra configuration for the tokenizer pool. |
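As a minimal sketch, the snippet below shows how a few of these variables might be set. In practice they are defined in your endpoint or container configuration; the model name, context length, and parallelism values here are illustrative assumptions, not recommendations.

```python
import os

# Hypothetical configuration sketch: placeholder values only.
os.environ["MODEL_NAME"] = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model for illustration
os.environ["MAX_MODEL_LEN"] = "8192"          # cap the context length
os.environ["DTYPE"] = "bfloat16"              # weight/activation precision
os.environ["TENSOR_PARALLEL_SIZE"] = "2"      # shard the model across 2 GPUs
os.environ["ENABLE_PREFIX_CACHING"] = "1"     # reuse KV cache for shared prompt prefixes
```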
LoRA settings
Configure LoRA (Low-Rank Adaptation) adapters for your model.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| ENABLE_LORA | False | bool | If True, enables the handling of LoRA adapters. |
| MAX_LORAS | 1 | int | The maximum number of LoRAs in a single batch. |
| MAX_LORA_RANK | 16 | int | The maximum LoRA rank. |
| LORA_EXTRA_VOCAB_SIZE | 256 | int | The maximum size of the extra vocabulary for LoRA adapters. |
| LORA_DTYPE | auto | auto, float16, bfloat16, float32 | The data type for LoRA. |
| LONG_LORA_SCALING_FACTORS | None | tuple | Specifies multiple scaling factors for LoRA adapters. |
| MAX_CPU_LORAS | None | int | The maximum number of LoRAs to store in CPU memory. |
| FULLY_SHARDED_LORAS | False | bool | If True, enables fully sharded LoRA layers. |
| LORA_MODULES | [] | list[dict] | A list of LoRA adapters to add from Hugging Face. Example: [{"name": "adapter1", "path": "user/adapter1"}] |
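For illustration, a LoRA setup might combine these variables as sketched below. The adapter name and path mirror the placeholder example in the table and do not refer to a real repository; the list is serialized here with json.dumps to match the example format shown above.

```python
import json
import os

# Hypothetical sketch: enable LoRA handling and register one adapter from Hugging Face.
os.environ["ENABLE_LORA"] = "1"
os.environ["MAX_LORA_RANK"] = "32"  # should be at least the rank of the adapters you load
os.environ["LORA_MODULES"] = json.dumps([
    {"name": "adapter1", "path": "user/adapter1"},  # placeholder adapter from the table
])
```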
Speculative decoding settings
Configure speculative decoding to improve inference performance.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| SCHEDULER_DELAY_FACTOR | 0.0 | float | Applies a delay before scheduling the next prompt. |
| ENABLE_CHUNKED_PREFILL | False | bool | If True, enables chunked prefill requests. |
| SPECULATIVE_MODEL | None | str | The name of the draft model for speculative decoding. |
| NUM_SPECULATIVE_TOKENS | None | int | The number of speculative tokens to sample from the draft model. |
| SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE | None | int | The number of tensor parallel replicas for the draft model. |
| SPECULATIVE_MAX_MODEL_LEN | None | int | The maximum sequence length supported by the draft model. |
| SPECULATIVE_DISABLE_BY_BATCH_SIZE | None | int | Disables speculative decoding if the number of enqueued requests is larger than this value. |
| NGRAM_PROMPT_LOOKUP_MAX | None | int | The maximum window size for ngram prompt lookup in speculative decoding. |
| NGRAM_PROMPT_LOOKUP_MIN | None | int | The minimum window size for ngram prompt lookup in speculative decoding. |
| SPEC_DECODING_ACCEPTANCE_METHOD | rejection_sampler | rejection_sampler, typical_acceptance_sampler | The acceptance method for draft token verification in speculative decoding. |
| TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD | None | float | Sets the lower bound threshold for the posterior probability of a token to be accepted. |
| TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA | None | float | A scaling factor for the entropy-based threshold for token acceptance. |
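As a rough sketch, a draft-model configuration might look like the following. The draft model name is an illustrative assumption; in practice you would choose a small model compatible with your main model.

```python
import os

# Hypothetical sketch: speculative decoding with a small draft model.
os.environ["SPECULATIVE_MODEL"] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed draft model
os.environ["NUM_SPECULATIVE_TOKENS"] = "5"               # tokens proposed per step
os.environ["SPEC_DECODING_ACCEPTANCE_METHOD"] = "rejection_sampler"
os.environ["SPECULATIVE_DISABLE_BY_BATCH_SIZE"] = "32"   # fall back to normal decoding under heavy load
```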
System performance settings
Configure GPU memory and system resource utilization.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| GPU_MEMORY_UTILIZATION | 0.95 | float | The fraction of GPU VRAM to use. |
| MAX_PARALLEL_LOADING_WORKERS | None | int | Loads the model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and large models. |
| BLOCK_SIZE | 16 | 8, 16, 32 | The token block size for contiguous chunks of tokens. |
| SWAP_SPACE | 4 | int | The CPU swap space size (in GiB) per GPU. |
| ENFORCE_EAGER | False | bool | If True, always uses eager-mode PyTorch. If False, uses a hybrid of eager mode and CUDA graphs for maximal performance and flexibility. |
| MAX_SEQ_LEN_TO_CAPTURE | 8192 | int | The maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, the system falls back to eager mode. |
| DISABLE_CUSTOM_ALL_REDUCE | 0 | int | If 0, enables custom all-reduce. If 1, disables it. |
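For example, a memory-conservative configuration (sketched below with assumed values) trades some throughput for extra headroom:

```python
import os

# Hypothetical sketch: favor stability over raw throughput on a single GPU.
os.environ["GPU_MEMORY_UTILIZATION"] = "0.90"  # leave roughly 10% VRAM headroom
os.environ["SWAP_SPACE"] = "8"                 # 8 GiB of CPU swap space per GPU
os.environ["ENFORCE_EAGER"] = "1"              # skip CUDA graph capture entirely
```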
Tokenizer settings
Customize tokenizer behavior and chat templates.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| TOKENIZER_NAME | None | str | The tokenizer repository to use, if you want a different tokenizer than the model's default. |
| TOKENIZER_REVISION | None | str | The tokenizer revision to load. |
| CUSTOM_CHAT_TEMPLATE | None | str (single-line Jinja template) | A custom chat Jinja template. See the Hugging Face documentation for more information. |
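Because CUSTOM_CHAT_TEMPLATE must be supplied as a single-line string, it is often easiest to assemble in code. The template below is a simplified illustration, not the template of any particular model.

```python
import os

# Hypothetical sketch: a minimal single-line Jinja chat template.
os.environ["CUSTOM_CHAT_TEMPLATE"] = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }} "
    "{% endfor %}"
)
```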
Streaming and batch settings
Control how tokens are batched in HTTP responses when streaming. The batch size starts at DEFAULT_MIN_BATCH_SIZE and increases by a factor of DEFAULT_BATCH_SIZE_GROWTH_FACTOR with each request until it reaches DEFAULT_BATCH_SIZE.

For example, with the default values, the batch sizes would be 1, 3, 9, 27, and then 50 for all subsequent requests. These settings do not affect vLLM's internal batching.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| DEFAULT_BATCH_SIZE | 50 | int | The default and maximum batch size for token streaming. |
| DEFAULT_MIN_BATCH_SIZE | 1 | int | The initial batch size for the first request. |
| DEFAULT_BATCH_SIZE_GROWTH_FACTOR | 3 | float | The growth factor for the dynamic batch size. |
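The growth pattern described above can be sketched as follows. This is a simplified illustration of the arithmetic, not the worker's actual streaming code.

```python
def stream_batch_sizes(min_size=1, growth_factor=3, max_size=50, steps=6):
    """Yield the token batch size used for each successive streaming response."""
    size = min_size
    for _ in range(steps):
        yield min(int(size), max_size)  # cap at DEFAULT_BATCH_SIZE
        size *= growth_factor           # grow by DEFAULT_BATCH_SIZE_GROWTH_FACTOR

print(list(stream_batch_sizes()))  # [1, 3, 9, 27, 50, 50]
```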
OpenAI compatibility settings
Configure OpenAI API compatibility features.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| RAW_OPENAI_OUTPUT | 1 | boolean as int | If 1, enables raw OpenAI SSE format string output when streaming. This is required for OpenAI compatibility. |
| OPENAI_SERVED_MODEL_NAME_OVERRIDE | None | str | Overrides the served model name. This allows you to use a custom name in the model parameter of your OpenAI requests. |
| OPENAI_RESPONSE_ROLE | assistant | str | The role of the LLM's response in OpenAI chat completions. |
| ENABLE_AUTO_TOOL_CHOICE | false | bool | If true, enables automatic tool selection for supported models. |
| TOOL_CALL_PARSER | None | str | The parser for tool calls. |
| REASONING_PARSER | None | str | The parser for reasoning-capable models. Setting this enables reasoning mode. |
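For context, an OpenAI-style request against the worker might look like the sketch below. The endpoint ID, API key, and model name are placeholders; if OPENAI_SERVED_MODEL_NAME_OVERRIDE is set, pass that name in the model field instead of the Hugging Face model name.

```python
from openai import OpenAI

# Hypothetical sketch: calling the worker's OpenAI-compatible API with placeholder credentials.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",  # placeholder endpoint ID
    api_key="YOUR_RUNPOD_API_KEY",                              # placeholder key
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed MODEL_NAME, unless overridden
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```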
Serverless and concurrency settings
Configure concurrency and logging for Serverless deployments.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| MAX_CONCURRENCY | 300 | int | The maximum number of concurrent requests per worker. vLLM's internal queue handles VRAM limitations, so this setting is primarily for scaling and load balancing efficiency. |
| DISABLE_LOG_STATS | False | bool | If False, enables vLLM stats logging. |
| DISABLE_LOG_REQUESTS | False | bool | If False, enables vLLM request logging. |
Advanced settings
Additional configuration options for specialized use cases.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| MODEL_LOADER_EXTRA_CONFIG | None | dict | Extra configuration for the model loader. |
| PREEMPTION_MODE | None | str | The preemption mode. If recompute, the engine performs preemption-aware recomputation. If save, the engine saves activations to CPU memory during preemption. |
| PREEMPTION_CHECK_PERIOD | 1.0 | float | The frequency (in seconds) at which the engine checks for preemption. |
| PREEMPTION_CPU_CAPACITY | 2 | float | The percentage of CPU memory to use for saved activations. |
| DISABLE_LOGGING_REQUEST | False | bool | If True, disables logging requests. |
| MAX_LOG_LEN | None | int | The maximum number of prompt characters or prompt ID numbers to print in the log. |
Docker build arguments
These variables are used when building custom Docker images with models baked in.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| BASE_PATH | /runpod-volume | str | The storage directory for the Hugging Face cache and model. |
| WORKER_CUDA_VERSION | 12.1.0 | str | The CUDA version for the worker image. |