Load balancing endpoints are currently in beta. We’re actively addressing issues and working to improve the user experience. Join our Discord if you’d like to provide feedback.
Unlike queue-based endpoints, load balancing endpoints don’t route requests through the `/run` or `/runsync` endpoints. Instead, you can create custom REST endpoints that are accessible via a unique URL: `https://ENDPOINT_ID.api.runpod.ai/PATH`
Get started
When you’re ready to get started, follow this tutorial to learn how to build and deploy a load balancing worker. Or, if you’re ready for a more advanced use case, you can jump straight into building a vLLM load balancer.

Key features
- Direct HTTP access: Connect directly to worker HTTP servers, bypassing queue infrastructure for lower latency.
- Custom REST API endpoints: Define your own API paths, methods, and contracts to match your specific application needs.
- Environment variable port configuration: Control which ports your API listens on through standardized environment variables.
- Framework agnostic: Build with FastAPI, Flask, Express.js, or any HTTP server framework of your choice.
- Multi-endpoint support: Expose multiple API endpoints through a single worker, creating complete REST API services.
- Health-based routing: Requests are only sent to healthy workers, with automatic removal of unhealthy instances.
Load balancing vs. queue-based endpoints
Here are the key differences between the two endpoint types:

Queue-based endpoints (traditional)
With queue-based endpoints, requests are placed in a queue and processed in order. They use the standard handler pattern (`def handler(job)`) and are accessed through fixed endpoints like `/run` and `/runsync`.
These endpoints are better for tasks that can be processed asynchronously and guarantee request processing, similar to how TCP guarantees packet delivery in networking.
Load balancing endpoints (new)
Load balancing endpoints send requests directly to workers without queuing. You can use any HTTP framework such as FastAPI or Flask, and define custom URL paths and API contracts to suit your specific needs. These endpoints are ideal for real-time applications and streaming, but provide no queuing mechanism for request backlog, similar to UDP’s behavior in networking.

Endpoint type comparison table
| Aspect | Load balancing | Queue-based |
| --- | --- | --- |
| Request flow | Direct to worker HTTP server | Through queueing system |
| Implementation | Custom HTTP server | Handler function |
| Protocol flexibility | Supports any HTTP capability | JSON input/output only |
| Backpressure handling | Requests dropped when overloaded | Queue buffering |
| Latency | Lower (single hop) | Higher (queue + worker) |
| Error recovery | No built-in retry mechanism | Automatic retries |
Worker implementation comparison
Queue-based Serverless worker
Traditional Serverless workers require a specific handler function structure:

- Requests are processed through Runpod’s queueing system.
- Access is available via the fixed endpoints `/run` and `/runsync`.
- You implement a single handler function.
- You’re limited to JSON input/output.
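For reference, here’s a minimal sketch of this handler pattern using the Runpod Python SDK (the echo logic is a hypothetical placeholder):

```python
import runpod

def handler(job):
    # job["input"] holds the JSON payload sent to /run or /runsync.
    prompt = job["input"].get("prompt", "")
    return {"output": f"echo: {prompt}"}

# Register the handler with Runpod's queue-based Serverless runtime.
runpod.serverless.start({"handler": handler})
```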
Load balancing worker
Load balancing workers don’t require a standardized handler, or the Runpod SDK at all. Instead, you can create full REST APIs using frameworks like FastAPI:

- Endpoint requests go directly to your HTTP server.
- You can define custom URL paths and endpoints.
- You have control over your entire API structure.
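For illustration, here’s a minimal load balancing worker sketch built with FastAPI and uvicorn. The `/generate` path and its schema are hypothetical; the `/ping` health check and `PORT` handling follow the conventions described in the sections below:

```python
import os

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: GenerateRequest):
    # Hypothetical custom endpoint; replace with your real inference logic.
    return {"output": f"echo: {req.prompt}"}

@app.get("/ping")
def ping():
    # Health check endpoint required by the load balancer.
    return {"status": "healthy"}

if __name__ == "__main__":
    # Listen on the port Runpod injects via the PORT environment variable.
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", "80")))
```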
When to use load balancing endpoints
Consider using load balancing endpoints when you need:

- Direct access to your model’s HTTP server.
- To leverage internal batching systems, like those provided by vLLM.
- The ability to return non-JSON payloads.
- To implement multiple endpoints within a single worker.
- Lower latency for real-time applications, where immediate processing is more important than guaranteed execution.
Worker health management
Runpod continuously monitors worker health through a dedicated health check mechanism. Workers must expose a `/ping` endpoint on the port specified by the `PORT_HEALTH` environment variable. The load balancer periodically sends requests to this endpoint, and workers respond with appropriate HTTP status codes:

- `200`: healthy
- `204`: initializing
- Any other code: unhealthy
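As a sketch of this contract (assuming FastAPI; the `model_ready` flag is a hypothetical stand-in for your startup logic):

```python
from fastapi import FastAPI, Response

app = FastAPI()
model_ready = False  # hypothetical flag, set to True once model loading completes

@app.get("/ping")
def ping():
    # 200 = healthy, 204 = initializing, anything else = unhealthy.
    return Response(status_code=200 if model_ready else 204)
```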
When calculating endpoint metrics, Runpod measures the cold start time for a load balancing worker as the time between `/ping` first returning `204` and first returning `200`.

Environment variables
You can use environment variables to configure ports and other settings for your load balancing worker:

- `PORT`: The port for the main application server (default: `80`).
- `PORT_HEALTH`: The port for the health check endpoint (defaults to the value of `PORT`).
If you don’t set `PORT` or `PORT_HEALTH` during deployment, both environment variables will automatically be set to `80`, and port 80 will be automatically exposed in the container configuration.
If you’re using a custom port, make sure to add it to your endpoint’s environment variables, and expose it in the container configuration of your endpoint settings (under Expose HTTP Ports (Max 10)).
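As a sketch, a worker might resolve these values like so (assuming the defaults described above):

```python
import os

# PORT defaults to 80; PORT_HEALTH falls back to PORT when unset.
port = int(os.environ.get("PORT", "80"))
port_health = int(os.environ.get("PORT_HEALTH", str(port)))
```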
Request timeouts
Requests made to a load balancing endpoint have two timeout scenarios:

- Request timeout (2 minutes): If no worker is available to process your request within 2 minutes (e.g., if a worker can’t be initialized fast enough, or the endpoint has reached `MAX_WORKERS`), the system returns a `400` error. To implement retries, account for this response code in your client-side application; see the sketch after this list.
- Processing timeout (5.5 minutes): Once a worker receives and begins processing your request, it has a maximum processing time of 5.5 minutes. If processing exceeds this limit, the connection is terminated with a `524` error. Load balancing endpoints may not be suitable for tasks that consistently take longer than 5.5 minutes.
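A hedged client-side retry sketch using Python’s `requests` library (the endpoint URL, path, payload, and backoff policy are hypothetical placeholders):

```python
import time

import requests

def call_with_retries(url: str, payload: dict, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        # Allow up to the 5.5-minute processing cap before giving up locally.
        resp = requests.post(url, json=payload, timeout=330)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code in (400, 524):
            # 400: no worker available within 2 minutes; 524: processing timed out.
            time.sleep(2 ** attempt)  # simple exponential backoff
            continue
        resp.raise_for_status()
    raise RuntimeError(f"Request failed after {max_retries} attempts")

result = call_with_retries("https://ENDPOINT_ID.api.runpod.ai/generate", {"prompt": "hello"})
```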
If your server is misconfigured and the ports are not correctly opened, your workers will stay up for 8 minutes before being terminated. In this case, requests will return a `502` error. This is a known issue and a fix is in progress.

Technical details
The load balancing system employs an HTTP load balancer that inspects application-level protocols to make routing decisions. When a request arrives at `https://ENDPOINT_ID.api.runpod.ai/PATH`, the system:
- Identifies available healthy workers within the endpoint’s worker pool.
- Routes the request to a worker’s exposed HTTP server.
- Returns the worker’s response directly to the client.
Each worker, in turn:

- Listens on ports specified via environment variables.
- Handles requests according to its custom API contract.
- Implements a required health check endpoint.