Setup
Configure your OpenAI-compatible client (Python or JavaScript) to point at your endpoint, replacing ENDPOINT_ID and RUNPOD_API_KEY with your actual values.
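As a minimal sketch in Python using only the standard library (the base URL pattern below is an assumption; verify it against your endpoint's details in the Runpod console):

```python
# Replace these with your actual values.
ENDPOINT_ID = "your-endpoint-id"
RUNPOD_API_KEY = "your-runpod-api-key"

def base_url(endpoint_id: str) -> str:
    """Assumed OpenAI-compatible base URL for a Runpod endpoint."""
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1"

def auth_headers(api_key: str) -> dict:
    """Headers for every request: the Runpod API key as a Bearer token."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Note that authentication uses your Runpod API key, not an OpenAI key.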
Supported endpoints
| Endpoint | Description |
|---|---|
| /chat/completions | Chat model completions (instruction-tuned models) |
| /completions | Text completions (base models) |
| /models | List available models |
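For example, listing available models is a simple GET request. The helper below is a hypothetical sketch (the URL pattern is an assumption; check your endpoint's details in the Runpod console):

```python
import json
import urllib.request

def models_url(endpoint_id: str) -> str:
    # Assumed URL pattern for the /models endpoint.
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1/models"

def list_models(endpoint_id: str, api_key: str) -> dict:
    """GET /models and return the parsed JSON response."""
    req = urllib.request.Request(
        models_url(endpoint_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```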
Chat completions
For instruction-tuned models that follow a chat format. Both standard and streaming requests are supported.
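A standard (non-streaming) request could be sketched as follows; the URL pattern is an assumption, and `build_chat_body` is a hypothetical helper that passes optional fields such as `temperature` through to the request body:

```python
import json
import urllib.request

def build_chat_body(model: str, messages: list, **params) -> dict:
    """Request body for /chat/completions; params adds optional fields."""
    return {"model": model, "messages": messages, **params}

def chat_completion(endpoint_id: str, api_key: str, body: dict) -> dict:
    """POST the body to /chat/completions and return the parsed JSON."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1/chat/completions"
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For streaming, set `"stream": True` in the body and parse the response as server-sent events rather than a single JSON object.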
Text completions
For base models and raw text completion. Both standard and streaming requests are supported.
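The request shape mirrors chat completions, but takes a `prompt` string instead of a `messages` list. A sketch (the URL pattern is an assumption; `build_completion_body` is a hypothetical helper):

```python
import json
import urllib.request

def build_completion_body(model: str, prompt: str, **params) -> dict:
    """Request body for /completions; params adds optional fields."""
    return {"model": model, "prompt": prompt, **params}

def text_completion(endpoint_id: str, api_key: str, body: dict) -> dict:
    """POST the body to /completions and return the parsed JSON."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1/completions"
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```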
Model name
The model parameter must match either:
- The Hugging Face model you deployed (e.g., mistralai/Mistral-7B-Instruct-v0.2)
- A custom name set via the OPENAI_SERVED_MODEL_NAME_OVERRIDE environment variable
Parameters
Standard OpenAI parameters are supported. Include them directly in your request.
Common parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | Required | Your deployed model name. |
| messages | list | Required | Chat messages with role and content (/chat/completions only). |
| prompt | string | Required | Text completion prompt (/completions only). |
| temperature | float | 0.7 | Sampling randomness. Lower = more deterministic. |
| max_tokens | int | 16 | Maximum tokens to generate. |
| top_p | float | 1.0 | Nucleus sampling threshold. |
| n | int | 1 | Number of completions to generate. |
| stop | string or list | None | Stop sequences. |
| stream | bool | false | Enable streaming. |
| presence_penalty | float | 0.0 | Penalize tokens already present. |
| frequency_penalty | float | 0.0 | Penalize frequent tokens. |
Additional vLLM parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| best_of | int | None | Generate this many candidates, return the top n. |
| top_k | int | -1 | Top-k sampling. -1 = consider all tokens. |
| repetition_penalty | float | 1.0 | Penalize repeated tokens. |
| min_p | float | 0.0 | Minimum probability threshold. |
| use_beam_search | bool | false | Use beam search instead of sampling. |
| length_penalty | float | 1.0 | Length penalty for beam search. |
| ignore_eos | bool | false | Continue generating after the EOS token. |
| skip_special_tokens | bool | true | Omit special tokens from output. |
| echo | bool | false | Include the prompt in the output. |
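Since requests are plain JSON, vLLM-specific parameters sit alongside standard OpenAI parameters in the same request body. (If you use the official OpenAI Python SDK, non-standard fields typically go through its `extra_body` argument instead.) A sketch, with `build_completion_body` as a hypothetical helper:

```python
def build_completion_body(model: str, prompt: str, **params) -> dict:
    """Request body for /completions; params adds optional fields."""
    return {"model": model, "prompt": prompt, **params}

body = build_completion_body(
    "mistralai/Mistral-7B-Instruct-v0.2",
    "Once upon a time",
    max_tokens=64,           # standard OpenAI parameter
    top_k=40,                # vLLM-specific: sample from the top 40 tokens
    repetition_penalty=1.1,  # vLLM-specific: discourage repeats
    min_p=0.05,              # vLLM-specific: minimum probability cutoff
)
```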
Environment variables
Use these environment variables to customize the OpenAI compatibility:

| Variable | Default | Description |
|---|---|---|
| RAW_OPENAI_OUTPUT | 1 | Enable raw OpenAI SSE format for streaming. |
| OPENAI_SERVED_MODEL_NAME_OVERRIDE | None | Override the model name in responses. |
| OPENAI_RESPONSE_ROLE | assistant | Role for chat completion responses. |
Differences from OpenAI
- Token counting may differ due to different tokenizers.
- Rate limits follow Runpod’s policies, not OpenAI’s.
- Function/tool calling depends on model and vLLM support.
- Vision/multimodal depends on underlying model support.
Troubleshooting
| Issue | Solution |
|---|---|
| "Invalid model" error | Verify the model name matches your deployment. |
| Authentication error | Use your Runpod API key, not an OpenAI key. |
| Timeout errors | Increase client timeout for large models. |
| Unexpected response format | Set RAW_OPENAI_OUTPUT=1. |