vLLM workers implement OpenAI API compatibility, so you can use OpenAI client libraries with your deployed models. To integrate with OpenAI-compatible tools, configure the client's base URL and API key using your Serverless endpoint ID and Runpod API key.

Setup

```python
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)
```
Replace ENDPOINT_ID and RUNPOD_API_KEY with your actual values.

Supported endpoints

| Endpoint | Description |
| --- | --- |
| `/chat/completions` | Chat model completions (instruction-tuned models) |
| `/completions` | Text completions (base models) |
| `/models` | List available models |

Chat completions

For instruction-tuned models that follow a chat format.
```python
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, who are you?"},
    ],
    temperature=0.7,
    max_tokens=500,
)

print(response.choices[0].message.content)
```

Text completions

For base models and raw text completion.
```python
response = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    prompt="Write a poem about artificial intelligence:",
    temperature=0.7,
    max_tokens=150,
)

print(response.choices[0].text)
```

Model name

The model parameter must match either:
  • The Hugging Face model you deployed (e.g., mistralai/Mistral-7B-Instruct-v0.2)
  • A custom name set via the OPENAI_SERVED_MODEL_NAME_OVERRIDE environment variable
List available models:
```python
models = client.models.list()
print([model.id for model in models])
```

Parameters

Standard OpenAI parameters are supported. Include them directly in your request.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | string | Required | Your deployed model name. |
| `messages` | list | Required | Chat messages with `role` and `content` (chat completions only). |
| `prompt` | string | Required | Text completion prompt (text completions only). |
| `temperature` | float | 0.7 | Sampling randomness. Lower = more deterministic. |
| `max_tokens` | int | 16 | Maximum tokens to generate. |
| `top_p` | float | 1.0 | Nucleus sampling threshold. |
| `n` | int | 1 | Number of completions to generate. |
| `stop` | string or list | None | Stop sequences. |
| `stream` | bool | false | Enable streaming. |
| `presence_penalty` | float | 0.0 | Penalize tokens already present. |
| `frequency_penalty` | float | 0.0 | Penalize frequent tokens. |
In addition to the standard parameters, vLLM supports these extended sampling options:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `best_of` | int | None | Generate this many completions, return the top `n`. |
| `top_k` | int | -1 | Top-k sampling. -1 = all tokens. |
| `repetition_penalty` | float | 1.0 | Penalize repeated tokens. |
| `min_p` | float | 0.0 | Minimum probability threshold. |
| `use_beam_search` | bool | false | Use beam search instead of sampling. |
| `length_penalty` | float | 1.0 | Length penalty for beam search. |
| `ignore_eos` | bool | false | Continue after EOS token. |
| `skip_special_tokens` | bool | true | Omit special tokens from output. |
| `echo` | bool | false | Include prompt in output. |

Environment variables

Use these environment variables to customize OpenAI compatibility behavior:

| Variable | Default | Description |
| --- | --- | --- |
| `RAW_OPENAI_OUTPUT` | 1 | Enable raw OpenAI SSE format for streaming. |
| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override model name in responses. |
| `OPENAI_RESPONSE_ROLE` | assistant | Role for chat completion responses. |
See environment variables reference for all options.

Differences from OpenAI

  • Token counting may differ due to different tokenizers.
  • Rate limits follow Runpod’s policies, not OpenAI’s.
  • Function/tool calling depends on model and vLLM support.
  • Vision/multimodal depends on underlying model support.

Troubleshooting

| Issue | Solution |
| --- | --- |
| "Invalid model" error | Verify model name matches your deployment. |
| Authentication error | Use your Runpod API key, not an OpenAI key. |
| Timeout errors | Increase client timeout for large models. |
| Unexpected response format | Set `RAW_OPENAI_OUTPUT=1`. |