Load balancing endpoints are currently in beta. We’re actively addressing issues and working to improve the user experience. Join our Discord if you’d like to provide feedback.
What you’ll learn
In this tutorial you’ll learn how to:
- Create a FastAPI application to serve your API endpoints.
- Implement proper health checks for your workers.
- Deploy your application as a load balancing Serverless endpoint.
- Test and interact with your custom APIs.
Requirements
Before you begin you’ll need:
- A Runpod account.
- Basic familiarity with Python and REST APIs.
- Docker installed on your local machine.
Step 1: Create a basic FastAPI application
You can download a preconfigured repository containing the completed code for this tutorial on GitHub.
First, create a file named app.py. The application includes:
- A health check endpoint at /ping.
- A text generation endpoint at /generate.
- A statistics endpoint at /stats.
Step 2: Create a Dockerfile
Now, let’s create a Dockerfile to package our application:
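A sketch of such a Dockerfile, assuming app.py starts the server itself when run directly (otherwise use a uvicorn command in CMD):

```dockerfile
# Base image and paths are assumptions; adjust for your setup.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

# The app reads PORT at runtime; 8000 is a common default.
ENV PORT=8000
EXPOSE 8000

CMD ["python", "app.py"]
```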
You’ll also need a requirements.txt file:
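A minimal requirements.txt for a FastAPI application like this might contain the following (versions are left unpinned for brevity; pin them for reproducible builds):

```text
fastapi
uvicorn[standard]
pydantic
```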
Step 3: Build and push the Docker image
Build and push your Docker image to a container registry.
Step 4: Deploy to Runpod
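Before deploying, make sure the image from Step 3 has been built and pushed. Assuming Docker Hub and a hypothetical username yourname, the commands might look like:

```shell
# Build for linux/amd64, the architecture Runpod workers run on.
docker build --platform linux/amd64 -t yourname/fastapi-tutorial:latest .

# Log in and push to Docker Hub.
docker login
docker push yourname/fastapi-tutorial:latest
```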
Now, let’s deploy our application to a Serverless endpoint:
- Go to the Serverless page in the Runpod console.
- Click New Endpoint.
- Click Import from Docker Registry.
- In the Container Image field, enter your Docker image URL:
Then click Next.
- Give your endpoint a name.
- Under Endpoint Type, select Load Balancer.
- Under GPU Configuration, select at least one GPU type (16 GB or 24 GB GPUs are fine for this example).
- Leave all other settings at their defaults.
- Click Create Endpoint.
Step 5: Access your custom API
Once your endpoint is created, you can access your custom APIs at:
- Health check: https://ENDPOINT_ID.api.runpod.ai/ping
- Generate text: https://ENDPOINT_ID.api.runpod.ai/generate
- Get request count: https://ENDPOINT_ID.api.runpod.ai/stats
Replace ENDPOINT_ID and RUNPOD_API_KEY with your actual endpoint ID and API key:
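For example, the requests can be built with Python’s standard library. This is a sketch; the endpoint ID and API key values here are placeholders:

```python
import json
import urllib.request

# Placeholders: substitute your real endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
RUNPOD_API_KEY = "your-api-key"

def build_request(path, payload=None):
    """Build an authenticated request to the load-balancing endpoint.

    GET for bare paths (e.g. /ping, /stats), POST when a JSON payload is given.
    """
    url = f"https://{ENDPOINT_ID}.api.runpod.ai{path}"
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        url,
        data=data,
        headers={
            "Authorization": f"Bearer {RUNPOD_API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST" if data else "GET",
    )

req = build_request("/generate", {"prompt": "Hello"})
print(req.full_url)  # https://your-endpoint-id.api.runpod.ai/generate
# Send with: urllib.request.urlopen(req)
```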
If you see the error {"error":"no workers available"}, this means your workers did not initialize in time to process the request. Running the request again will usually resolve the issue.
(Optional) Advanced endpoint definitions
For a more complex API, you can define multiple endpoints and organize them logically, for example by grouping related routes into routers.
Troubleshooting
Here are some common issues and methods for troubleshooting:
- No workers available: If your request returns {"error":"no workers available"}, this means your workers did not initialize in time to process the request. Running the request again will usually fix this issue.
- Worker unhealthy: Check your health endpoint implementation and ensure it’s returning proper status codes.
- API not accessible: If your request returns {"error":"not allowed for QB API"}, verify that your endpoint type is set to “Load Balancer”.
- Port issues: Make sure the PORT environment variable matches the port your application is using, and that the PORT_HEALTH variable is set to a different port.
- Model errors: Check your model’s requirements and whether it’s compatible with your GPU.
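As a quick sanity check for the port configuration, a startup snippet like this can catch a PORT/PORT_HEALTH mismatch early (the 8000/8001 fallbacks are assumptions):

```python
import os

# Read the ports the platform passes in; fallbacks are illustrative defaults.
port = int(os.environ.get("PORT", "8000"))
health_port = int(os.environ.get("PORT_HEALTH", "8001"))

# The health-check port must differ from the main API port.
if port == health_port:
    raise RuntimeError(f"PORT and PORT_HEALTH must differ (both are {port})")

print(f"API on port {port}, health checks on port {health_port}")
```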