vLLM workers support the same /run and /runsync operations as other Runpod Serverless endpoints. The difference is the input format: vLLM expects prompts, messages, and sampling parameters for text generation.
Input formats

Messages (chat models)
Use for instruction-tuned models. The worker automatically applies the model's chat template.

Prompt (text completion)
Use for base models or when providing raw text without a chat template. To apply the model's chat template to a raw prompt, set "apply_chat_template": true.
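The two input formats above can be sketched as request payloads. This is a minimal illustration assuming the standard Runpod request envelope (`{"input": {...}}`); the model names and message contents are placeholders.

```python
# Messages format (chat models): the worker applies the chat template.
chat_payload = {
    "input": {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "sampling_params": {"max_tokens": 64},
    }
}

# Prompt format (text completion): raw text, no chat template.
completion_payload = {
    "input": {
        "prompt": "The capital of France is",
        "sampling_params": {"max_tokens": 16, "temperature": 0.0},
        # Optionally apply the model's chat template to a raw prompt:
        # "apply_chat_template": True,
    }
}
```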
Send requests

- Async (/run): Submit a job that processes in the background. Poll /status/{job_id} for results.
- Sync (/runsync): Submit a job and wait for the result in the same response.

Streaming
Receive tokens as they're generated instead of waiting for the complete response.

Sampling parameters
Add parameters to control how the model generates text. Include these in the sampling_params object in your request.
Common parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_tokens | int | 16 | Maximum tokens to generate. |
| temperature | float | 1.0 | Randomness of sampling. Lower = more deterministic. |
| top_p | float | 1.0 | Cumulative probability of top tokens to consider. |
| top_k | int | -1 | Number of top tokens to consider. -1 = all. |
| stop | string or list | None | Stop generation when these strings are produced. |
| presence_penalty | float | 0.0 | Penalize tokens based on presence in output. |
| frequency_penalty | float | 0.0 | Penalize tokens based on frequency in output. |
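A sampling_params object combining the common parameters above might look like this; the values are illustrative, not recommendations.

```python
# Example request using the common sampling parameters from the table above.
payload = {
    "input": {
        "prompt": "List three uses for a paperclip:",
        "sampling_params": {
            "max_tokens": 128,       # cap the completion length
            "temperature": 0.7,      # mildly creative sampling
            "top_p": 0.9,            # nucleus sampling over top 90% of mass
            "top_k": 50,             # restrict to the 50 most likely tokens
            "stop": ["\n\n"],        # stop at the first blank line
            "presence_penalty": 0.2,
            "frequency_penalty": 0.2,
        },
    }
}
```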
Advanced parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| n | int | 1 | Number of output sequences to generate. |
| best_of | int | n | Generate this many sequences, return top n. |
| repetition_penalty | float | 1.0 | Penalize repeated tokens. Values > 1 discourage repetition. |
| min_p | float | 0.0 | Minimum probability threshold relative to top token. |
| min_tokens | int | 0 | Minimum tokens before allowing EOS. |
| use_beam_search | bool | false | Use beam search instead of sampling. |
| length_penalty | float | 1.0 | Length penalty for beam search. |
| early_stopping | bool | false | Stop beam search early. |
| stop_token_ids | list[int] | None | Token IDs that stop generation. |
| ignore_eos | bool | false | Continue generating after EOS token. |
| skip_special_tokens | bool | true | Omit special tokens from output. |
| spaces_between_special_tokens | bool | true | Add spaces between special tokens. |
| truncate_prompt_tokens | int | None | Truncate prompt to this many tokens. |
Streaming parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| stream | bool | false | Enable streaming output. |
| max_batch_size | int | env default | Max tokens per streaming chunk. |
| min_batch_size | int | env default | Min tokens per streaming chunk. |
| batch_size_growth_factor | int | env default | Growth factor for batch size. |
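A streaming request enables the stream flag inside the input object; the batch-size parameters above then control how many tokens arrive per chunk. This sketch assumes the flag sits alongside the prompt in the input payload, as with the other options.

```python
# Streaming request sketch: submit with stream enabled, then read
# incremental chunks from the job's stream endpoint as they arrive.
payload = {
    "input": {
        "prompt": "Write a haiku about GPUs.",
        "stream": True,                       # enable token streaming
        "max_batch_size": 8,                  # at most 8 tokens per chunk
        "sampling_params": {"max_tokens": 64},
    }
}
```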