Load balancing endpoints route incoming traffic directly to available workers, bypassing the queueing system. Unlike queue-based endpoints, which process requests sequentially through a queue, load balancing endpoints distribute requests across your worker pool for lower latency. You can create custom REST endpoints accessible via a unique URL:
https://ENDPOINT_ID.api.runpod.ai/YOUR_CUSTOM_PATH

Build a worker

Create and deploy a load balancing worker.

vLLM load balancer

Deploy vLLM with load balancing.

Load balancing vs. queue-based endpoints

Queue-based endpoints

With queue-based endpoints, requests are placed in a queue and processed in order. They use the standard handler pattern (def handler(job)) and are accessed through fixed endpoints like /run and /runsync. These endpoints are better for tasks that can be processed asynchronously and guarantee request processing, similar to how TCP guarantees packet delivery in networking.
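For context, a queue-based request always goes through the fixed /runsync (or /run) path with the standard input envelope. A minimal sketch; the `run_sync` helper name is illustrative, and the URL follows Runpod's standard queue-based endpoint pattern:

```python
import requests

def run_sync(endpoint_id: str, api_key: str, prompt: str) -> dict:
    """Submit a job through the fixed /runsync path and wait for the result."""
    response = requests.post(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        json={"input": {"prompt": prompt}},  # standard {"input": ...} envelope
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()

# Usage (ENDPOINT_ID is a placeholder):
# run_sync("ENDPOINT_ID", "YOUR_API_KEY", "Hello world")
```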

Load balancing endpoints (new)

Load balancing endpoints send requests directly to workers without queuing. You can use any HTTP framework such as FastAPI or Flask, and define custom URL paths and API contracts to suit your specific needs. These endpoints are ideal for real-time applications and streaming, but provide no queuing mechanism for request backlog, similar to UDP’s behavior in networking.

Endpoint type comparison table

| Aspect | Load balancing | Queue-based |
| --- | --- | --- |
| Request flow | Direct to worker HTTP server | Through queueing system |
| Implementation | Custom HTTP server (FastAPI, Flask, etc.) | Handler function |
| API flexibility | Custom URL paths, any HTTP capability | Fixed /run and /runsync endpoints |
| Backpressure | Drops requests when overloaded | Queue buffering |
| Latency | Lower (single hop) | Higher (queue + worker) |
| Error handling | No built-in retry | Automatic retries |

Worker comparison

Queue-based worker (traditional):
import runpod

def handler(job):
    prompt = job["input"].get("prompt", "Hello world")
    return {"generated_text": f"Generated text for: {prompt}"}

runpod.serverless.start({"handler": handler})
Load balancing worker (custom HTTP server):
from fastapi import FastAPI
import os

app = FastAPI()

@app.get("/ping")
async def health_check():
    return {"status": "healthy"}

@app.post("/generate")
async def generate(request: dict):
    return {"generated_text": f"Generated text for: {request['prompt']}"}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", "80")))
This exposes custom endpoints at https://ENDPOINT_ID.api.runpod.ai/ping and https://ENDPOINT_ID.api.runpod.ai/generate.
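Calling these custom paths from a client is ordinary HTTP. A minimal sketch, assuming requests are authenticated with a Runpod API key; the `generate_text` helper name is illustrative:

```python
import requests

def generate_text(base_url: str, api_key: str, prompt: str) -> dict:
    """POST to the worker's custom /generate path directly (no queue)."""
    response = requests.post(
        f"{base_url}/generate",
        json={"prompt": prompt},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Usage (ENDPOINT_ID is a placeholder):
# generate_text("https://ENDPOINT_ID.api.runpod.ai", "YOUR_API_KEY", "Hello world")
```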

Health checks

Workers must expose a /ping endpoint on the PORT_HEALTH port. The load balancer periodically checks this endpoint:
| Response code | Status |
| --- | --- |
| 200 | Healthy |
| 204 | Initializing |
| Other | Unhealthy |
Unhealthy workers are automatically removed from the routing pool.
When calculating endpoint metrics, Runpod measures a load balancing worker's cold start time as the interval between /ping first returning 204 and /ping first returning 200.
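The 204-then-200 convention can be implemented in any HTTP framework. Below is a framework-agnostic sketch using Python's standard library; the `model_ready` flag is an assumption standing in for whatever signals that your model has finished loading:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag; a real worker would set this once the model loads.
model_ready = threading.Event()

def ping_status() -> int:
    """Map readiness to the health-check codes the load balancer expects."""
    return 200 if model_ready.is_set() else 204  # 204 = initializing

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ping":
            self.send_response(ping_status())
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    import os
    # Serve health checks on PORT_HEALTH (defaults to the main PORT).
    port = int(os.getenv("PORT_HEALTH", os.getenv("PORT", "80")))
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```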

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| PORT | 80 | Main application server port |
| PORT_HEALTH | Same as PORT | Health check endpoint port |
If using a custom port, add it to your endpoint’s environment variables and expose it in container configuration (under Expose HTTP Ports (Max 10)).

Timeouts and limits

| Limit | Value |
| --- | --- |
| Request timeout | 2 min (no worker available) |
| Processing timeout | 5.5 min (per request) |
| Payload limit | 30 MB (request and response) |
For payloads larger than 30 MB, use network volumes or implement chunking.
If your server ports are misconfigured, workers stay up for 8 minutes before terminating, and requests return 502 errors during that time.
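A chunking approach might look like the sketch below. This is a generic client-side split, not a Runpod API; the 25 MB chunk size is an assumption chosen to leave headroom under the 30 MB limit for encoding and header overhead:

```python
def chunk_payload(data: bytes, chunk_size: int = 25 * 1024 * 1024):
    """Split a payload into chunks that each stay under the 30 MB request limit.

    The default 25 MB chunk size leaves headroom for JSON/base64 overhead.
    The worker would reassemble chunks (e.g. keyed by an upload ID) before
    processing.
    """
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]
```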

Handling cold starts

When workers are initializing, you may get “no workers available” errors. Implement retry logic to handle this:
import requests
import time

def health_check_with_retry(base_url, api_key, max_retries=3, delay=5):
    headers = {"Authorization": f"Bearer {api_key}"}

    for attempt in range(max_retries):
        try:
            response = requests.get(f"{base_url}/ping", headers=headers, timeout=10)
            if response.status_code == 200:
                return True
        except Exception:
            pass
        if attempt < max_retries - 1:
            time.sleep(delay)
    return False

# Usage
if health_check_with_retry("https://ENDPOINT_ID.api.runpod.ai", "RUNPOD_API_KEY"):
    # Worker ready, send requests
    pass
Use at least 3 retries with 5-10 second delays.

When to use load balancing endpoints

Use load balancing endpoints when you need:
  • Direct access to your model’s HTTP server.
  • Internal batching systems (like vLLM).
  • Non-JSON payloads.
  • Multiple endpoints within a single worker.
  • Lower latency for real-time applications.