LLM settings
These variables control the core language model configuration.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| MODEL_NAME | facebook/opt-125m | str | The name or path of the Hugging Face model to use. |
| MODEL_REVISION | main | str | The model revision to load. |
| TOKENIZER | None | str | The name or path of the Hugging Face tokenizer to use. |
| SKIP_TOKENIZER_INIT | False | bool | If True, skips the initialization of the tokenizer and detokenizer. |
| TOKENIZER_MODE | auto | auto, slow | The tokenizer mode. |
| TRUST_REMOTE_CODE | False | bool | If True, trusts remote code from Hugging Face. |
| DOWNLOAD_DIR | None | str | The directory to download and load the model weights from. |
| LOAD_FORMAT | auto | str | The format of the model weights to load. |
| HF_TOKEN | - | str | Your Hugging Face token, used for private and gated models. |
| DTYPE | auto | auto, half, float16, bfloat16, float, float32 | The data type for model weights and activations. |
| KV_CACHE_DTYPE | auto | auto, fp8 | The data type for KV cache storage. |
| QUANTIZATION_PARAM_PATH | None | str | The path to the JSON file containing the KV cache scaling factors. |
| MAX_MODEL_LEN | None | int | The maximum model context length. |
| GUIDED_DECODING_BACKEND | outlines | outlines, lm-format-enforcer | The default engine for guided decoding. |
| DISTRIBUTED_EXECUTOR_BACKEND | None | ray, mp | The backend to use for distributed serving. |
| WORKER_USE_RAY | False | bool | Deprecated. Use DISTRIBUTED_EXECUTOR_BACKEND=ray instead. |
| PIPELINE_PARALLEL_SIZE | 1 | int | The number of pipeline stages. |
| TENSOR_PARALLEL_SIZE | 1 | int | The number of tensor parallel replicas. |
| MAX_PARALLEL_LOADING_WORKERS | None | int | The number of workers to use for parallel model loading. |
| RAY_WORKERS_USE_NSIGHT | False | bool | If True, uses nsight to profile Ray workers. |
| ENABLE_PREFIX_CACHING | False | bool | If True, enables automatic prefix caching. |
| DISABLE_SLIDING_WINDOW | False | bool | If True, disables the sliding window, capping the context to the sliding window size. |
| USE_V2_BLOCK_MANAGER | False | bool | If True, uses BlockSpaceManagerV2. |
| NUM_LOOKAHEAD_SLOTS | 0 | int | The number of lookahead slots, an experimental scheduling configuration for speculative decoding. |
| SEED | 0 | int | The random seed for operations. |
| NUM_GPU_BLOCKS_OVERRIDE | None | int | If specified, this value overrides the GPU profiling result for the number of GPU blocks. |
| MAX_NUM_BATCHED_TOKENS | None | int | The maximum number of batched tokens per iteration. |
| MAX_NUM_SEQS | 256 | int | The maximum number of sequences per iteration. |
| MAX_LOGPROBS | 20 | int | The maximum number of log probabilities to return when logprobs is specified in SamplingParams. |
| DISABLE_LOG_STATS | False | bool | If True, disables logging statistics. |
| QUANTIZATION | None | awq, squeezellm, gptq, bitsandbytes | The method used to quantize the model weights. |
| ROPE_SCALING | None | dict | The RoPE scaling configuration in JSON format. |
| ROPE_THETA | None | float | The RoPE theta value. Use with ROPE_SCALING. |
| TOKENIZER_POOL_SIZE | 0 | int | The size of the tokenizer pool for asynchronous tokenization. |
| TOKENIZER_POOL_TYPE | ray | str | The type of the tokenizer pool for asynchronous tokenization. |
| TOKENIZER_POOL_EXTRA_CONFIG | None | dict | Extra configuration for the tokenizer pool. |
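As a minimal sketch, the snippet below shows how a few of these variables might be set. In practice they are defined in your endpoint or container configuration; the model name, context length, and parallelism values here are illustrative assumptions, not recommendations.

```python
import os

# Hypothetical configuration sketch: placeholder values only.
os.environ["MODEL_NAME"] = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model for illustration
os.environ["MAX_MODEL_LEN"] = "8192"          # cap the context length
os.environ["DTYPE"] = "bfloat16"              # weight/activation precision
os.environ["TENSOR_PARALLEL_SIZE"] = "2"      # shard the model across 2 GPUs
os.environ["ENABLE_PREFIX_CACHING"] = "1"     # reuse KV cache for shared prompt prefixes
```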
LoRA settings
Configure LoRA (Low-Rank Adaptation) adapters for your model.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| ENABLE_LORA | False | bool | If True, enables the handling of LoRA adapters. |
| MAX_LORAS | 1 | int | The maximum number of LoRAs in a single batch. |
| MAX_LORA_RANK | 16 | int | The maximum LoRA rank. |
| LORA_EXTRA_VOCAB_SIZE | 256 | int | The maximum size of the extra vocabulary for LoRA adapters. |
| LORA_DTYPE | auto | auto, float16, bfloat16, float32 | The data type for LoRA. |
| LONG_LORA_SCALING_FACTORS | None | tuple | Specifies multiple scaling factors for LoRA adapters. |
| MAX_CPU_LORAS | None | int | The maximum number of LoRAs to store in CPU memory. |
| FULLY_SHARDED_LORAS | False | bool | If True, enables fully sharded LoRA layers. |
| LORA_MODULES | [] | list[dict] | A list of LoRA adapters to add from Hugging Face. Example: [{"name": "adapter1", "path": "user/adapter1"}] |
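For illustration, a LoRA setup might combine these variables as sketched below. The adapter name and path mirror the placeholder example in the table and do not refer to a real repository; the list is serialized here with json.dumps to match the example format shown above.

```python
import json
import os

# Hypothetical sketch: enable LoRA handling and register one adapter from Hugging Face.
os.environ["ENABLE_LORA"] = "1"
os.environ["MAX_LORA_RANK"] = "32"  # should be at least the rank of the adapters you load
os.environ["LORA_MODULES"] = json.dumps([
    {"name": "adapter1", "path": "user/adapter1"},  # placeholder adapter from the table
])
```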
Speculative decoding settings
Configure speculative decoding to improve inference performance.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| SCHEDULER_DELAY_FACTOR | 0.0 | float | Applies a delay before scheduling the next prompt. |
| ENABLE_CHUNKED_PREFILL | False | bool | If True, enables chunked prefill requests. |
| SPECULATIVE_MODEL | None | str | The name of the draft model for speculative decoding. |
| NUM_SPECULATIVE_TOKENS | None | int | The number of speculative tokens to sample from the draft model. |
| SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE | None | int | The number of tensor parallel replicas for the draft model. |
| SPECULATIVE_MAX_MODEL_LEN | None | int | The maximum sequence length supported by the draft model. |
| SPECULATIVE_DISABLE_BY_BATCH_SIZE | None | int | Disables speculative decoding if the number of enqueued requests is larger than this value. |
| NGRAM_PROMPT_LOOKUP_MAX | None | int | The maximum window size for ngram prompt lookup in speculative decoding. |
| NGRAM_PROMPT_LOOKUP_MIN | None | int | The minimum window size for ngram prompt lookup in speculative decoding. |
| SPEC_DECODING_ACCEPTANCE_METHOD | rejection_sampler | rejection_sampler, typical_acceptance_sampler | The acceptance method for draft token verification in speculative decoding. |
| TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD | None | float | Sets the lower bound threshold for the posterior probability of a token to be accepted. |
| TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA | None | float | A scaling factor for the entropy-based threshold for token acceptance. |
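As a rough sketch, a draft-model configuration might look like the following. The draft model name is an illustrative assumption; in practice you would choose a small model compatible with your main model.

```python
import os

# Hypothetical sketch: speculative decoding with a small draft model.
os.environ["SPECULATIVE_MODEL"] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed draft model
os.environ["NUM_SPECULATIVE_TOKENS"] = "5"               # tokens proposed per step
os.environ["SPEC_DECODING_ACCEPTANCE_METHOD"] = "rejection_sampler"
os.environ["SPECULATIVE_DISABLE_BY_BATCH_SIZE"] = "32"   # fall back to normal decoding under heavy load
```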
System performance settings
Configure GPU memory and system resource utilization.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| GPU_MEMORY_UTILIZATION | 0.95 | float | The fraction of GPU VRAM to use. |
| MAX_PARALLEL_LOADING_WORKERS | None | int | Loads the model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and large models. |
| BLOCK_SIZE | 16 | 8, 16, 32 | The token block size for contiguous chunks of tokens. |
| SWAP_SPACE | 4 | int | The CPU swap space size (in GiB) per GPU. |
| ENFORCE_EAGER | False | bool | If True, always uses eager-mode PyTorch. If False, uses a hybrid of eager mode and CUDA graphs for maximal performance and flexibility. |
| MAX_SEQ_LEN_TO_CAPTURE | 8192 | int | The maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, the system falls back to eager mode. |
| DISABLE_CUSTOM_ALL_REDUCE | 0 | int | If 0, enables custom all-reduce. If 1, disables it. |
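For example, a memory-conservative configuration (sketched below with assumed values) trades some throughput for extra headroom:

```python
import os

# Hypothetical sketch: favor stability over raw throughput on a single GPU.
os.environ["GPU_MEMORY_UTILIZATION"] = "0.90"  # leave roughly 10% VRAM headroom
os.environ["SWAP_SPACE"] = "8"                 # 8 GiB of CPU swap space per GPU
os.environ["ENFORCE_EAGER"] = "1"              # skip CUDA graph capture entirely
```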
Tokenizer settings
Customize tokenizer behavior and chat templates.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| TOKENIZER_NAME | None | str | The tokenizer repository to use, if you want a different tokenizer than the model's default. |
| TOKENIZER_REVISION | None | str | The tokenizer revision to load. |
| CUSTOM_CHAT_TEMPLATE | None | str (single-line Jinja template) | A custom chat Jinja template. See the Hugging Face documentation for more information. |
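Because CUSTOM_CHAT_TEMPLATE must be supplied as a single-line string, it is often easiest to assemble in code. The template below is a simplified illustration, not the template of any particular model.

```python
import os

# Hypothetical sketch: a minimal single-line Jinja chat template.
os.environ["CUSTOM_CHAT_TEMPLATE"] = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }} "
    "{% endfor %}"
)
```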
Streaming and batch settings
Control how tokens are batched in HTTP responses when streaming. The batch size starts at DEFAULT_MIN_BATCH_SIZE and increases by a factor of DEFAULT_BATCH_SIZE_GROWTH_FACTOR with each request until it reaches DEFAULT_BATCH_SIZE.

For example, with the default values, the batch sizes would be 1, 3, 9, 27, and then 50 for all subsequent requests. These settings do not affect vLLM's internal batching.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| DEFAULT_BATCH_SIZE | 50 | int | The default and maximum batch size for token streaming. |
| DEFAULT_MIN_BATCH_SIZE | 1 | int | The initial batch size for the first request. |
| DEFAULT_BATCH_SIZE_GROWTH_FACTOR | 3 | float | The growth factor for the dynamic batch size. |
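The growth pattern described above can be sketched as follows. This is a simplified illustration of the arithmetic, not the worker's actual streaming code.

```python
def stream_batch_sizes(min_size=1, growth_factor=3, max_size=50, steps=6):
    """Yield the token batch size used for each successive streaming response."""
    size = min_size
    for _ in range(steps):
        yield min(int(size), max_size)  # cap at DEFAULT_BATCH_SIZE
        size *= growth_factor           # grow by DEFAULT_BATCH_SIZE_GROWTH_FACTOR

print(list(stream_batch_sizes()))  # [1, 3, 9, 27, 50, 50]
```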
OpenAI compatibility settings
Configure OpenAI API compatibility features.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| RAW_OPENAI_OUTPUT | 1 | boolean as int | If 1, enables raw OpenAI SSE format string output when streaming. This is required for OpenAI compatibility. |
| OPENAI_SERVED_MODEL_NAME_OVERRIDE | None | str | Overrides the served model name. This allows you to use a custom name in the model parameter of your OpenAI requests. |
| OPENAI_RESPONSE_ROLE | assistant | str | The role of the LLM's response in OpenAI chat completions. |
| ENABLE_AUTO_TOOL_CHOICE | false | bool | If true, enables automatic tool selection for supported models. |
| TOOL_CALL_PARSER | None | str | The parser for tool calls. |
| REASONING_PARSER | None | str | The parser for reasoning-capable models. Setting this enables reasoning mode. |
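For context, an OpenAI-style request against the worker might look like the sketch below. The endpoint ID, API key, and model name are placeholders; if OPENAI_SERVED_MODEL_NAME_OVERRIDE is set, pass that name in the model field instead of the Hugging Face model name.

```python
from openai import OpenAI

# Hypothetical sketch: calling the worker's OpenAI-compatible API with placeholder credentials.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",  # placeholder endpoint ID
    api_key="YOUR_RUNPOD_API_KEY",                              # placeholder key
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed MODEL_NAME, unless overridden
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```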
Serverless and concurrency settings
Configure concurrency and logging for Serverless deployments.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| MAX_CONCURRENCY | 300 | int | The maximum number of concurrent requests per worker. vLLM's internal queue handles VRAM limitations, so this setting is primarily for scaling and load balancing efficiency. |
| DISABLE_LOG_STATS | False | bool | If False, enables vLLM stats logging. |
| DISABLE_LOG_REQUESTS | False | bool | If False, enables vLLM request logging. |
Advanced settings
Additional configuration options for specialized use cases.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| MODEL_LOADER_EXTRA_CONFIG | None | dict | Extra configuration for the model loader. |
| PREEMPTION_MODE | None | str | The preemption mode. If recompute, the engine performs preemption-aware recomputation. If save, the engine saves activations to CPU memory during preemption. |
| PREEMPTION_CHECK_PERIOD | 1.0 | float | The frequency (in seconds) at which the engine checks for preemption. |
| PREEMPTION_CPU_CAPACITY | 2 | float | The percentage of CPU memory to use for saved activations. |
| DISABLE_LOGGING_REQUEST | False | bool | If True, disables logging requests. |
| MAX_LOG_LEN | None | int | The maximum number of prompt characters or prompt ID numbers to print in the log. |
Docker build arguments
These variables are used when building custom Docker images with models baked in.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| BASE_PATH | /runpod-volume | str | The storage directory for the Hugging Face cache and model. |
| WORKER_CUDA_VERSION | 12.1.0 | str | The CUDA version for the worker image. |