Load balancing endpoints are currently in beta. We’re actively addressing issues and working to improve the user experience. Join our Discord if you’d like to provide feedback.
To get a basic understanding of how to build a load balancing worker (or for more general use cases), see Build a load balancing worker.

What you'll learn

In this tutorial, you'll learn how to:
- Create a FastAPI application to serve your vLLM endpoints.
- Implement proper health checks for your vLLM workers.
- Deploy your vLLM application as a load balancing Serverless endpoint.
- Test and interact with your vLLM APIs.
Requirements
Before you begin, you'll need:
- A Runpod account.
- Basic familiarity with Python, REST APIs, and vLLM.
- Docker installed on your local machine.
Step 1: Create your project files
You can download a preconfigured repository containing the completed code for this tutorial on GitHub.
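If you prefer to create the files yourself, the project structure assembled over the following steps looks roughly like this (the top-level directory name is arbitrary):

```
your-project/
├── src/
│   ├── models.py       # Pydantic request/response models (Step 2)
│   ├── utils.py        # Helper functions (Step 3)
│   └── handler.py      # FastAPI application (Step 4)
├── requirements.txt    # Python dependencies (Step 5)
└── Dockerfile          # Container build steps (Step 5)
```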
Step 2: Define data models
We'll start by creating the data models that define the structure of your API. These models specify what data your endpoints expect to receive and what they'll return. Add the following code to src/models.py:
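The completed file is in the GitHub repository linked above. A minimal sketch of what these models might look like is shown below; the field names, defaults, and validation limits here are illustrative assumptions rather than the exact repository code.

```python
# src/models.py (illustrative sketch -- see the GitHub repository for the complete file)
from typing import Optional
from pydantic import BaseModel, Field


class GenerationRequest(BaseModel):
    """Request body for the /v1/completions endpoint."""
    prompt: str = Field(..., description="The text prompt to complete.")
    max_tokens: int = Field(default=256, ge=1, le=2048)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)


class Message(BaseModel):
    """A single chat message with a role (system, user, or assistant) and content."""
    role: str
    content: str


class ChatCompletionRequest(BaseModel):
    """Request body for the /v1/chat/completions endpoint."""
    messages: list[Message]
    max_tokens: int = Field(default=256, ge=1, le=2048)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)


class GenerationResponse(BaseModel):
    """Successful response returned by both generation endpoints."""
    text: str


class ErrorResponse(BaseModel):
    """Standardized error payload returned when a request fails."""
    error: str
    detail: Optional[str] = None
```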
The GenerationRequest and ChatCompletionRequest models specify what data clients need to send, while GenerationResponse and ErrorResponse define what they'll receive back.
Each data model includes validation rules using Pydantic’s Field function to ensure parameters stay within acceptable ranges.
Step 3: Create utility functions
Next, we'll create a few helper functions to support the main application. These utilities handle common tasks like formatting chat prompts and creating standardized error responses. Add the following code to src/utils.py:
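As with the models, the complete file is in the repository; the sketch below shows one plausible shape for these helpers (exact signatures are assumptions).

```python
# src/utils.py (illustrative sketch -- see the GitHub repository for the complete file)
from fastapi.responses import JSONResponse


def format_chat_prompt(messages, tokenizer=None) -> str:
    """Convert a list of chat messages into a single prompt string.

    If the tokenizer provides a chat template (as Hugging Face tokenizers do),
    use it; otherwise fall back to a simple generic format.
    """
    message_dicts = [{"role": m.role, "content": m.content} for m in messages]
    if tokenizer is not None and getattr(tokenizer, "chat_template", None):
        return tokenizer.apply_chat_template(
            message_dicts, tokenize=False, add_generation_prompt=True
        )
    # Generic fallback: "role: content" lines followed by the assistant turn.
    lines = [f"{m['role']}: {m['content']}" for m in message_dicts]
    lines.append("assistant:")
    return "\n".join(lines)


def create_error_response(message: str, status_code: int = 500) -> JSONResponse:
    """Return a consistent JSON error payload with the given HTTP status code."""
    return JSONResponse(status_code=status_code, content={"error": message})
```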
The format_chat_prompt function converts chat-style conversations into the text format expected by language models. It first tries to use the model's built-in chat template, then falls back to a generic format if that's not available.
The create_error_response function provides a consistent way to generate error messages throughout your application.
Step 4: Build the main FastAPI application
Now we'll build the main application file, src/handler.py. This file acts as the orchestrator, bringing together the models and utilities we just created. It uses FastAPI to create the server, defines the API endpoints, and manages the vLLM engine's lifecycle.
Add the following code to src/handler.py:
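The complete handler is in the GitHub repository; the condensed sketch below shows the overall structure. It assumes vLLM's AsyncLLMEngine interface, a MODEL_NAME environment variable for selecting the model (defaulting to facebook/opt-125m purely for illustration), and a PORT environment variable for the listening port; all of these are assumptions you should adjust to match your setup.

```python
# src/handler.py (condensed sketch -- see the GitHub repository for the complete file)
import os
import uuid

import uvicorn
from fastapi import FastAPI

from models import GenerationRequest, ChatCompletionRequest, GenerationResponse
from utils import format_chat_prompt, create_error_response

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = None  # Initialized on startup so /ping can report readiness.


@app.on_event("startup")
async def startup():
    """Load the vLLM engine once when the worker starts."""
    global engine
    engine_args = AsyncEngineArgs(model=os.environ.get("MODEL_NAME", "facebook/opt-125m"))
    engine = AsyncLLMEngine.from_engine_args(engine_args)


@app.get("/ping")
async def ping():
    """Health check used by the load balancer to decide when the worker is ready."""
    if engine is None:
        return create_error_response("Engine not initialized", status_code=503)
    return {"status": "healthy"}


async def run_generation(prompt: str, max_tokens: int, temperature: float) -> str:
    """Submit a prompt to the engine and return the final generated text."""
    params = SamplingParams(max_tokens=max_tokens, temperature=temperature)
    final_output = None
    async for output in engine.generate(prompt, params, request_id=str(uuid.uuid4())):
        final_output = output
    return final_output.outputs[0].text


@app.post("/v1/completions", response_model=GenerationResponse)
async def completions(request: GenerationRequest):
    """Plain text completion endpoint."""
    try:
        text = await run_generation(request.prompt, request.max_tokens, request.temperature)
        return GenerationResponse(text=text)
    except Exception as exc:  # Catch-all so clients get a structured error.
        return create_error_response(str(exc))


@app.post("/v1/chat/completions", response_model=GenerationResponse)
async def chat_completions(request: ChatCompletionRequest):
    """OpenAI-style chat endpoint: formats messages into a prompt, then generates."""
    try:
        # The complete version passes the engine's tokenizer so the model's
        # own chat template is used; this sketch relies on the generic fallback.
        prompt = format_chat_prompt(request.messages)
        text = await run_generation(prompt, request.max_tokens, request.temperature)
        return GenerationResponse(text=text)
    except Exception as exc:
        return create_error_response(str(exc))


if __name__ == "__main__":
    # The port is illustrative; use whatever port your endpoint configuration expects.
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", "8000")))
```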
The handler defines three endpoints:
- A health check at /ping that tells the load balancer when your worker is ready.
- A text completion endpoint at /v1/completions.
- An OpenAI-compatible chat endpoint at /v1/chat/completions.
Step 5: Set up dependencies and build steps
With the application code complete, we still need to define its dependencies and create a Dockerfile to package it into a container image.

- Add the project dependencies to requirements.txt.
- Add the build steps to your Dockerfile.

Sketches of both files are shown below.
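The exact files are in the GitHub repository. As a rough sketch, requirements.txt needs at least the libraries used above (unpinned here; you'll likely want to pin versions):

```
fastapi
uvicorn
vllm
pydantic
```

And a Dockerfile along these lines, where the base image, exposed port, and start command are assumptions to adjust for your setup:

```dockerfile
# Dockerfile (illustrative sketch -- swap the base image for one that matches your vLLM/CUDA setup)
FROM pytorch/pytorch:latest

WORKDIR /app

# Install Python dependencies first so they're cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code.
COPY src/ /app/

# Expose the port the FastAPI server listens on (must match your endpoint configuration).
EXPOSE 8000

CMD ["python", "handler.py"]
```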
Step 6: Build and push your Docker image
Build and push your Docker image to a container registry. A sketch of the commands is shown below.
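Something like the following, where the image name and tag are placeholders for your own; build for linux/amd64, since that's the architecture Runpod workers run on:

```bash
# Replace DOCKER_USERNAME and the image name/tag with your own values.
docker build --platform linux/amd64 -t DOCKER_USERNAME/vllm-loadbalancer:latest .
docker push DOCKER_USERNAME/vllm-loadbalancer:latest
```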
Step 7: Deploy to Runpod

Now, let's deploy our application to a Serverless endpoint:
- Go to the Serverless page in the Runpod console.
- Click New Endpoint.
- Click Import from Docker Registry.
- In the Container Image field, enter your Docker image URL, then click Next.
- Give your endpoint a name.
- Under Endpoint Type, select Load Balancer.
- Under GPU Configuration, select at least one GPU type (16 GB or 24 GB GPUs are fine for this example).
- Leave all other settings at their defaults.
- Click Create Endpoint.
Step 8: Test your endpoints
You can find a Python script to test your vLLM load balancer locally on GitHub.
Your endpoints are available at the following URLs:

- Health check: https://ENDPOINT_ID.api.runpod.ai/ping
- Generate text: https://ENDPOINT_ID.api.runpod.ai/v1/completions
- Chat completions: https://ENDPOINT_ID.api.runpod.ai/v1/chat/completions

Replace ENDPOINT_ID and RUNPOD_API_KEY with your actual values.
To run a health check, send a GET request to your endpoint's /ping path. A minimal sketch using Python's requests library is shown below; the full test script is in the GitHub repository linked above.
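This sketch assumes the Runpod API key is sent as a Bearer token in the Authorization header and reads the endpoint ID and key from environment variables.

```python
# Illustrative test sketch -- the complete test script is in the GitHub repository.
import os

import requests

# Replace these with your actual endpoint ID and API key (here read from env vars).
ENDPOINT_ID = os.environ["ENDPOINT_ID"]
API_KEY = os.environ["RUNPOD_API_KEY"]

BASE_URL = f"https://{ENDPOINT_ID}.api.runpod.ai"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Health check: a 200 response means a worker is up and the engine is ready.
health = requests.get(f"{BASE_URL}/ping", headers=HEADERS, timeout=30)
print("Health check:", health.status_code, health.text)

# Text completion request.
completion = requests.post(
    f"{BASE_URL}/v1/completions",
    headers=HEADERS,
    json={"prompt": "The capital of France is", "max_tokens": 32},
    timeout=120,
)
print("Completion:", completion.json())
```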
If you see {"error":"no workers available"} after running the request, it means your workers did not initialize in time to process it. Running the request again will usually resolve the issue.

For production applications, implement a health check with retries before sending requests. See Handling cold start errors for a complete code example.
Next steps

Now that you've deployed a load balancing vLLM endpoint, you can try:
- Experimenting with different models and frameworks.
- Adding authentication to your API.
- Exploring advanced FastAPI features like background tasks and WebSockets.
- Optimizing your application for performance and reliability.