Load balancing endpoints route incoming traffic directly to available workers, bypassing the queueing system. Unlike queue-based endpoints, which process requests sequentially through a queue, load balancing endpoints distribute requests across your worker pool for lower latency. You can create custom REST endpoints accessible via a unique URL:
https://ENDPOINT_ID.api.runpod.ai/YOUR_CUSTOM_PATH

Build a worker

Create and deploy a load balancing worker.

vLLM load balancer

Deploy vLLM with load balancing.

Load balancing vs. queue-based endpoints

Queue-based endpoints

With queue-based endpoints, requests are placed in a queue and processed in order. They use the standard handler pattern (def handler(job)) and are accessed through fixed endpoints like /run and /runsync. These endpoints are better for tasks that can be processed asynchronously and guarantee request processing, similar to how TCP guarantees packet delivery in networking.
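For context, a queue-based request always goes through the fixed /runsync (or /run) path with the standard input envelope. A minimal sketch; the `run_sync` helper name is illustrative, and the URL follows Runpod's standard queue-based endpoint pattern:

```python
import requests

def run_sync(endpoint_id: str, api_key: str, prompt: str) -> dict:
    """Submit a job through the fixed /runsync path and wait for the result."""
    response = requests.post(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        json={"input": {"prompt": prompt}},  # standard {"input": ...} envelope
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()

# Usage (ENDPOINT_ID is a placeholder):
# run_sync("ENDPOINT_ID", "YOUR_API_KEY", "Hello world")
```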

Load balancing endpoints (new)

Load balancing endpoints send requests directly to workers without queuing. You can use any HTTP framework such as FastAPI or Flask, and define custom URL paths and API contracts to suit your specific needs. These endpoints are ideal for real-time applications and streaming, but provide no queuing mechanism for request backlog, similar to UDP’s behavior in networking.

Endpoint type comparison table

| Aspect | Load balancing | Queue-based |
| --- | --- | --- |
| Request flow | Direct to worker HTTP server | Through queueing system |
| Implementation | Custom HTTP server (FastAPI, Flask, etc.) | Handler function |
| API flexibility | Custom URL paths, any HTTP capability | Fixed /run and /runsync endpoints |
| Backpressure | Drops requests when overloaded | Queue buffering |
| Latency | Lower (single hop) | Higher (queue + worker) |
| Error handling | No built-in retry | Automatic retries |

Worker comparison

Queue-based worker (traditional):
import runpod

def handler(job):
    prompt = job["input"].get("prompt", "Hello world")
    return {"generated_text": f"Generated text for: {prompt}"}

runpod.serverless.start({"handler": handler})
Load balancing worker (custom HTTP server):
from fastapi import FastAPI
import os

app = FastAPI()

@app.get("/ping")
async def health_check():
    return {"status": "healthy"}

@app.post("/generate")
async def generate(request: dict):
    return {"generated_text": f"Generated text for: {request['prompt']}"}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", "80")))
This exposes custom endpoints at https://ENDPOINT_ID.api.runpod.ai/ping and https://ENDPOINT_ID.api.runpod.ai/generate.
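Calling these custom paths from a client is ordinary HTTP. A minimal sketch, assuming requests are authenticated with a Runpod API key; the `generate_text` helper name is illustrative:

```python
import requests

def generate_text(base_url: str, api_key: str, prompt: str) -> dict:
    """POST to the worker's custom /generate path directly (no queue)."""
    response = requests.post(
        f"{base_url}/generate",
        json={"prompt": prompt},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Usage (ENDPOINT_ID is a placeholder):
# generate_text("https://ENDPOINT_ID.api.runpod.ai", "YOUR_API_KEY", "Hello world")
```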

Health checks

Workers must expose a /ping endpoint on the PORT_HEALTH port. The load balancer periodically checks this endpoint:
| Response code | Status |
| --- | --- |
| 200 | Healthy |
| 204 | Initializing |
| Other | Unhealthy |
Unhealthy workers are automatically removed from the routing pool.
When calculating endpoint metrics, Runpod measures a load balancing worker's cold start time as the interval between /ping first returning 204 and /ping first returning 200.
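The 204-then-200 convention can be implemented in any HTTP framework. Below is a framework-agnostic sketch using Python's standard library; the `model_ready` flag is an assumption standing in for whatever signals that your model has finished loading:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag; a real worker would set this once the model loads.
model_ready = threading.Event()

def ping_status() -> int:
    """Map readiness to the health-check codes the load balancer expects."""
    return 200 if model_ready.is_set() else 204  # 204 = initializing

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ping":
            self.send_response(ping_status())
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    import os
    # Serve health checks on PORT_HEALTH (defaults to the main PORT).
    port = int(os.getenv("PORT_HEALTH", os.getenv("PORT", "80")))
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```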

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| PORT | 80 | Main application server port |
| PORT_HEALTH | Same as PORT | Health check endpoint port |
If using a custom port, add it to your endpoint’s environment variables and expose it in container configuration (under Expose HTTP Ports (Max 10)).

Timeouts and limits

| Limit | Value |
| --- | --- |
| Request timeout | 2 min (no worker available) |
| Processing timeout | 5.5 min (per request) |
| Payload limit | 30 MB (request and response) |
For payloads larger than 30 MB, use network volumes or implement chunking.
If your server ports are misconfigured, workers stay up for 8 minutes before terminating, and requests return 502 errors during that time.
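A chunking approach might look like the sketch below. This is a generic client-side split, not a Runpod API; the 25 MB chunk size is an assumption chosen to leave headroom under the 30 MB limit for encoding and header overhead:

```python
def chunk_payload(data: bytes, chunk_size: int = 25 * 1024 * 1024):
    """Split a payload into chunks that each stay under the 30 MB request limit.

    The default 25 MB chunk size leaves headroom for JSON/base64 overhead.
    The worker would reassemble chunks (e.g. keyed by an upload ID) before
    processing.
    """
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]
```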

Handling cold starts

When workers are initializing, you may get “no workers available” errors. Implement retry logic to handle this:
import requests
import time

def health_check_with_retry(base_url, api_key, max_retries=3, delay=5):
    headers = {"Authorization": f"Bearer {api_key}"}

    for attempt in range(max_retries):
        try:
            response = requests.get(f"{base_url}/ping", headers=headers, timeout=10)
            if response.status_code == 200:
                return True
        except Exception:
            pass
        if attempt < max_retries - 1:
            time.sleep(delay)
    return False

# Usage
if health_check_with_retry("https://ENDPOINT_ID.api.runpod.ai", "RUNPOD_API_KEY"):
    # Worker ready, send requests
    pass
Use at least 3 retries with 5-10 second delays.

When to use load balancing endpoints

Use load balancing endpoints when you need:
  • Direct access to your model’s HTTP server.
  • Internal batching systems (like vLLM).
  • Non-JSON payloads.
  • Multiple endpoints within a single worker.
  • Lower latency for real-time applications.