Send requests to vLLM workers
This guide covers how to send requests to vLLM workers on RunPod using RunPod's native API format, with code examples and best practices. Use it to integrate LLMs into your applications while maintaining control over performance and cost.
Requirements
- You’ve created a RunPod account.
- You’ve created a RunPod API key.
- You’ve installed Python.
- (For gated models) You’ve created a Hugging Face access token.
Many of the code samples below require you to input your endpoint ID. You can find your endpoint ID on the endpoint details page.
RunPod API requests
RunPod’s native API provides additional flexibility and control over your requests. These requests follow RunPod’s standard endpoint operations format.
Python example
Replace `[RUNPOD_API_KEY]` with your RunPod API key and `[RUNPOD_ENDPOINT_ID]` with your vLLM endpoint ID.
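The following is a minimal sketch using the `requests` library and RunPod's synchronous `/runsync` operation; the `prompt` and `sampling_params` keys reflect the vLLM worker's expected input schema, so check the worker README if your configuration differs.

```python
import requests

RUNPOD_API_KEY = "[RUNPOD_API_KEY]"      # your RunPod API key
ENDPOINT_ID = "[RUNPOD_ENDPOINT_ID]"     # your vLLM endpoint ID

# Synchronous request to RunPod's native API
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
headers = {
    "Authorization": f"Bearer {RUNPOD_API_KEY}",
    "Content-Type": "application/json",
}
payload = {
    "input": {
        "prompt": "Explain what a vLLM worker is in one paragraph.",
        "sampling_params": {
            "temperature": 0.7,
            "max_tokens": 256,
        },
    }
}

response = requests.post(url, headers=headers, json=payload, timeout=300)
response.raise_for_status()
print(response.json())
```

The `/runsync` operation blocks until the job completes; for long-running generations you can submit with `/run` instead and poll `/status` or `/stream`.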
cURL example
Run the following command in your local terminal, replacing `[RUNPOD_API_KEY]` with your RunPod API key and `[RUNPOD_ENDPOINT_ID]` with your vLLM endpoint ID.
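This is a minimal sketch of the same request as a cURL command against the synchronous `/runsync` operation; as above, the `sampling_params` contents are illustrative.

```sh
curl -X POST "https://api.runpod.ai/v2/[RUNPOD_ENDPOINT_ID]/runsync" \
  -H "Authorization: Bearer [RUNPOD_API_KEY]" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "Explain what a vLLM worker is in one paragraph.",
      "sampling_params": {
        "temperature": 0.7,
        "max_tokens": 256
      }
    }
  }'
```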
Request formats
vLLM workers accept two primary input formats:
- Messages format (for chat models)
- Prompt format (for text completion)
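As a sketch, here is the shape of the `input` object for each format, written as Python dictionaries. The `messages`, `prompt`, and `sampling_params` keys follow the worker's OpenAI-style chat schema and its text-completion schema; treat the exact field names as assumptions to verify against the worker README.

```python
# Messages format (chat models): a list of role/content messages
chat_input = {
    "input": {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize vLLM in two sentences."},
        ],
        "sampling_params": {"temperature": 0.7, "max_tokens": 200},
    }
}

# Prompt format (text completion): a single prompt string
completion_input = {
    "input": {
        "prompt": "Write a haiku about GPUs.",
        "sampling_params": {"temperature": 0.7, "max_tokens": 60},
    }
}
```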
Request input parameters
vLLM workers support various parameters to control generation behavior. Here are some commonly used parameters:
| Parameter | Type | Description |
|---|---|---|
| `temperature` | float | Controls randomness of sampling (0.0-1.0) |
| `max_tokens` | int | Maximum number of tokens to generate |
| `top_p` | float | Nucleus sampling threshold (0.0-1.0) |
| `top_k` | int | Limits sampling to the top k most likely tokens |
| `stop` | string or array | Sequence(s) at which to stop generation |
| `repetition_penalty` | float | Penalizes repeated tokens (1.0 = no penalty) |
| `presence_penalty` | float | Penalizes tokens that have already appeared in the generated text |
| `frequency_penalty` | float | Penalizes tokens in proportion to how often they appear |
| `min_p` | float | Minimum token probability, relative to the most likely token |
| `best_of` | int | Number of completions to generate server-side |
| `use_beam_search` | boolean | Whether to use beam search instead of sampling |
You can find a complete list of request input parameters on the GitHub README.
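As a sketch, these parameters are typically grouped under `sampling_params` in the request input (an assumption based on the input schema shown earlier):

```python
sampling_params = {
    "temperature": 0.6,
    "max_tokens": 512,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "stop": ["</s>", "\n\nUser:"],
}

payload = {
    "input": {
        "prompt": "List three uses of serverless GPU inference.",
        "sampling_params": sampling_params,
    }
}
```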
Error handling
When working with vLLM workers, it’s crucial to implement proper error handling to address potential issues such as network timeouts, rate limiting, worker initialization delays, and model loading errors.
Here is an example error handling implementation:
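The sketch below assumes the `requests` library and the `/runsync` operation shown earlier; it retries network timeouts, HTTP 429 rate limits, and transient 5xx errors (which can occur during worker initialization or model loading) with exponential backoff, and raises immediately on non-retryable errors.

```python
import time
import requests

def send_request(payload, api_key, endpoint_id, max_retries=3, timeout=300):
    """Send a request to a vLLM endpoint, retrying transient failures with backoff."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

    for attempt in range(max_retries):
        try:
            response = requests.post(url, headers=headers, json=payload, timeout=timeout)
        except (requests.Timeout, requests.ConnectionError) as err:
            # Network timeout or connection failure: retry with backoff
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Network error ({err}); retrying in {wait}s...")
            time.sleep(wait)
            continue

        if response.status_code == 429 or response.status_code >= 500:
            # Rate limited or transient server error: retry with backoff
            if attempt == max_retries - 1:
                response.raise_for_status()
            wait = 2 ** attempt
            print(f"Received status {response.status_code}; retrying in {wait}s...")
            time.sleep(wait)
            continue

        # Non-retryable client errors (e.g., an invalid API key) raise immediately
        response.raise_for_status()
        return response.json()
```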
Best practices
Here are some best practices to keep in mind when creating your requests:
- Use appropriate timeouts: Set timeouts based on your model size and complexity.
- Implement retry logic: Add exponential backoff for failed requests.
- Optimize batch size: Adjust request frequency based on model inference speed.
- Monitor response times: Track performance to identify optimization opportunities.
- Use streaming for long responses: Improve user experience for lengthy content generation (see the streaming sketch after this list).
- Cache frequent requests: Reduce redundant API calls for common queries.
- Handle rate limits: Implement queuing for high-volume applications.
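As a sketch of the streaming recommendation above, the example below submits a job with the asynchronous `/run` operation and polls the `/stream/{job_id}` operation for partial output. The `"stream": True` input flag and the exact shape of the streamed response are assumptions to verify against the worker README.

```python
import time
import requests

API_BASE = "https://api.runpod.ai/v2/[RUNPOD_ENDPOINT_ID]"
HEADERS = {"Authorization": "Bearer [RUNPOD_API_KEY]"}

# Submit the job asynchronously, then poll /stream for partial results.
payload = {"input": {"prompt": "Write a short story about a GPU.", "stream": True}}
job = requests.post(f"{API_BASE}/run", headers=HEADERS, json=payload, timeout=30).json()
job_id = job["id"]

status = "IN_QUEUE"
while status not in ("COMPLETED", "FAILED", "CANCELLED"):
    chunk = requests.get(f"{API_BASE}/stream/{job_id}", headers=HEADERS, timeout=120).json()
    status = chunk.get("status", "")
    for part in chunk.get("stream", []):
        # The shape of "output" depends on the worker's handler configuration
        print(part.get("output", ""), end="", flush=True)
    time.sleep(0.5)
```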