vLLM workers use the same /run and /runsync operations as other Runpod Serverless endpoints. The difference is the input format: vLLM expects prompts, messages, and sampling parameters for text generation.

Input formats

Messages (chat models)

Use for instruction-tuned models. The worker automatically applies the model’s chat template.
{
  "input": {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "sampling_params": {
      "temperature": 0.7,
      "max_tokens": 100
    }
  }
}

Prompt (text completion)

Use for base models or when providing raw text without a chat template.
{
  "input": {
    "prompt": "The capital of France is",
    "sampling_params": {
      "temperature": 0.7,
      "max_tokens": 50
    }
  }
}
To apply the model’s chat template to a prompt, add "apply_chat_template": true.
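For example, the same prompt with the chat template applied (values are illustrative):

```json
{
  "input": {
    "prompt": "What is the capital of France?",
    "apply_chat_template": true,
    "sampling_params": {"temperature": 0.7, "max_tokens": 50}
  }
}
```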

Send requests

Submit a job that processes in the background. Poll /status/{job_id} for results.
import requests
import time

response = requests.post(
    "https://api.runpod.ai/v2/ENDPOINT_ID/run",
    headers={
        "Authorization": "Bearer RUNPOD_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "input": {
            "messages": [{"role": "user", "content": "Explain quantum computing."}],
            "sampling_params": {"temperature": 0.7, "max_tokens": 200}
        }
    }
)

job_id = response.json()["id"]
print(f"Job ID: {job_id}")

# Poll until the job reaches a terminal state
while True:
    status = requests.get(
        f"https://api.runpod.ai/v2/ENDPOINT_ID/status/{job_id}",
        headers={"Authorization": "Bearer RUNPOD_API_KEY"}
    )
    result = status.json()
    if result["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(1)

print(result)
For more on request operations, see Send requests to Serverless endpoints.
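For shorter jobs, `/runsync` blocks until the job finishes and returns the output directly, so no polling is needed. A minimal sketch, using the same `ENDPOINT_ID` and API key placeholders as above (the helper names are illustrative, not part of any SDK):

```python
import requests

def build_payload(prompt: str, temperature: float = 0.7, max_tokens: int = 50) -> dict:
    """Assemble the input body the vLLM worker expects."""
    return {
        "input": {
            "prompt": prompt,
            "sampling_params": {"temperature": temperature, "max_tokens": max_tokens},
        }
    }

def run_sync(endpoint_id: str, api_key: str, payload: dict) -> dict:
    """POST to /runsync, which waits for the job and returns the result."""
    response = requests.post(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=120,
    )
    response.raise_for_status()
    return response.json()
```

Use `/runsync` when responses arrive within the HTTP timeout; fall back to `/run` plus status polling for longer generations.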

Streaming

Receive tokens as they’re generated instead of waiting for the complete response.
import requests
import json

# Submit with streaming enabled
response = requests.post(
    "https://api.runpod.ai/v2/ENDPOINT_ID/run",
    headers={
        "Authorization": "Bearer RUNPOD_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "input": {
            "prompt": "Write a short story about a robot.",
            "sampling_params": {"temperature": 0.8, "max_tokens": 500},
            "stream": True
        }
    }
)

job_id = response.json()["id"]

# Stream results
stream_url = f"https://api.runpod.ai/v2/ENDPOINT_ID/stream/{job_id}"
with requests.get(stream_url, headers={"Authorization": "Bearer RUNPOD_API_KEY"}, stream=True) as r:
    for line in r.iter_lines():
        if line:
            print(json.loads(line))
See streaming documentation for more details.
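To reassemble the streamed chunks into the final text, you can accumulate them as they arrive. This sketch assumes each streamed line is a JSON object with a `stream` list whose entries carry an `output` string; the exact chunk shape can vary by worker version, so verify it against what your endpoint actually returns:

```python
import json

def collect_stream_text(lines):
    """Join token chunks from a /stream response into one string.

    Assumes each non-empty line is a JSON object with a "stream" list of
    {"output": "..."} entries -- an assumed shape, not a guaranteed one.
    """
    parts = []
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        for item in chunk.get("stream", []):
            output = item.get("output")
            if isinstance(output, str):
                parts.append(output)
    return "".join(parts)

# Mocked stream lines; in practice these come from r.iter_lines()
sample = [
    b'{"stream": [{"output": "Once upon "}]}',
    b'{"stream": [{"output": "a time..."}]}',
]
print(collect_stream_text(sample))  # Once upon a time...
```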

Sampling parameters

Add parameters to control how the model generates text. Include these in the sampling_params object in your request.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `max_tokens` | int | 16 | Maximum tokens to generate. |
| `temperature` | float | 1.0 | Randomness of sampling. Lower = more deterministic. |
| `top_p` | float | 1.0 | Cumulative probability of top tokens to consider. |
| `top_k` | int | -1 | Number of top tokens to consider. -1 = all. |
| `stop` | string or list | None | Stop generation when these strings are produced. |
| `presence_penalty` | float | 0.0 | Penalize tokens based on presence in output. |
| `frequency_penalty` | float | 0.0 | Penalize tokens based on frequency in output. |
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `n` | int | 1 | Number of output sequences to generate. |
| `best_of` | int | n | Generate this many sequences, return top n. |
| `repetition_penalty` | float | 1.0 | Penalize repeated tokens. Values > 1 discourage repetition. |
| `min_p` | float | 0.0 | Minimum probability threshold relative to top token. |
| `min_tokens` | int | 0 | Minimum tokens before allowing EOS. |
| `use_beam_search` | bool | false | Use beam search instead of sampling. |
| `length_penalty` | float | 1.0 | Length penalty for beam search. |
| `early_stopping` | bool | false | Stop beam search early. |
| `stop_token_ids` | list[int] | None | Token IDs that stop generation. |
| `ignore_eos` | bool | false | Continue generating after EOS token. |
| `skip_special_tokens` | bool | true | Omit special tokens from output. |
| `spaces_between_special_tokens` | bool | true | Add spaces between special tokens. |
| `truncate_prompt_tokens` | int | None | Truncate prompt to this many tokens. |
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `stream` | bool | false | Enable streaming output. |
| `max_batch_size` | int | env default | Max tokens per streaming chunk. |
| `min_batch_size` | int | env default | Min tokens per streaming chunk. |
| `batch_size_growth_factor` | int | env default | Growth factor for batch size. |
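Putting several of these together, a `sampling_params` object combining nucleus sampling, a repetition penalty, and stop sequences might look like this (all values are illustrative):

```python
# Illustrative sampling_params combining several options from the tables above
sampling_params = {
    "max_tokens": 256,
    "temperature": 0.7,         # moderate randomness
    "top_p": 0.9,               # nucleus sampling over the top 90% probability mass
    "repetition_penalty": 1.1,  # > 1 discourages repeated tokens
    "stop": ["\n\n", "User:"],  # halt when either string is produced
}

payload = {
    "input": {
        "prompt": "The capital of France is",
        "sampling_params": sampling_params,
    }
}
print(payload["input"]["sampling_params"]["top_p"])  # 0.9
```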

Error handling

Implement retry logic with exponential backoff to handle network issues, rate limits, and cold starts.
import requests
import time

def send_request(url, headers, payload, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.post(url, headers=headers, json=payload, timeout=300)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429:  # Rate limit
                time.sleep(5)
            elif e.response.status_code >= 500:
                time.sleep(2 ** attempt)
            else:
                raise
        except requests.exceptions.RequestException:
            time.sleep(2 ** attempt)
    raise Exception("Max retries exceeded")
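The `2 ** attempt` term above produces an exponential delay schedule. A quick sketch of the delays it yields (the helper name is illustrative):

```python
def backoff_delays(max_retries: int, base: float = 2.0) -> list:
    """Delay in seconds before each retry, mirroring the 2 ** attempt schedule."""
    return [base ** attempt for attempt in range(max_retries)]

print(backoff_delays(3))  # [1.0, 2.0, 4.0]
```

Adding a small random jitter to each delay can also help avoid many clients retrying in lockstep after an outage.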