# Overview

> Deploy custom direct-access REST APIs with load balancing Serverless endpoints.

export const QueueBasedEndpointsTooltip = () => {
  return <Tooltip headline="Queue-based endpoint" tip="A Serverless endpoint that processes requests sequentially through a managed queue, providing guaranteed execution and automatic retries. Uses handler functions and standard operations like /run and /runsync." cta="Learn more about queue-based endpoints" href="/serverless/endpoints/overview#queue-based-endpoints">queue-based endpoints</Tooltip>;
};

export const RequestsTooltip = () => {
  return <Tooltip headline="Requests" tip="HTTP requests that you send to an endpoint, which can include parameters, payloads, and headers that define what the endpoint should process." cta="Learn more about requests" href="/serverless/endpoints/send-requests">requests</Tooltip>;
};

Load balancing endpoints route incoming traffic directly to available workers, bypassing the queueing system. Unlike <QueueBasedEndpointsTooltip />, which process requests sequentially, load balancing endpoints distribute requests across your worker pool for lower latency.

You can create custom REST endpoints accessible via a unique URL:

```
https://ENDPOINT_ID.api.runpod.ai/YOUR_CUSTOM_PATH
```

<CardGroup cols={2}>
  <Card title="Build a worker" href="/serverless/load-balancing/build-a-worker" icon="hammer" horizontal>
    Create and deploy a load balancing worker.
  </Card>

  <Card title="vLLM load balancer" href="/serverless/load-balancing/vllm-worker" icon="message-bot" horizontal>
    Deploy vLLM with load balancing.
  </Card>
</CardGroup>

## Load balancing vs. queue-based endpoints

### Queue-based endpoints

With queue-based endpoints, <RequestsTooltip /> are placed in a queue and processed in order. They use the standard handler pattern (`def handler(job)`) and are accessed through fixed endpoints like `/run` and `/runsync`.

These endpoints are better suited to tasks that can be processed asynchronously. They also guarantee that each request is processed, similar to how TCP guarantees packet delivery in networking.

### Load balancing endpoints (new)

Load balancing endpoints send requests directly to workers without queuing. You can use any HTTP framework such as FastAPI or Flask, and define custom URL paths and API contracts to suit your specific needs.

These endpoints are ideal for real-time applications and streaming, but provide no queuing mechanism for request backlog, similar to UDP's behavior in networking.

## Endpoint type comparison table

| Aspect              | Load balancing                            | Queue-based                           |
| ------------------- | ----------------------------------------- | ------------------------------------- |
| **Request flow**    | Direct to worker HTTP server              | Through queueing system               |
| **Implementation**  | Custom HTTP server (FastAPI, Flask, etc.) | Handler function                      |
| **API flexibility** | Custom URL paths, any HTTP capability     | Fixed `/run` and `/runsync` endpoints |
| **Backpressure**    | Drops requests when overloaded            | Queue buffering                       |
| **Latency**         | Lower (single-hop)                        | Higher (queue + worker)               |
| **Error handling**  | No built-in retry                         | Automatic retries                     |

## Worker comparison

**Queue-based worker** (traditional):

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
import runpod

def handler(job):
    prompt = job["input"].get("prompt", "Hello world")
    return {"generated_text": f"Generated text for: {prompt}"}

runpod.serverless.start({"handler": handler})
```

**Load balancing worker** (custom HTTP server):

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from fastapi import FastAPI
import os

app = FastAPI()

@app.get("/ping")
async def health_check():
    return {"status": "healthy"}

@app.post("/generate")
async def generate(request: dict):
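    # Fall back to a default prompt instead of failing when the key is missing.
    prompt = request.get("prompt", "Hello world")
    return {"generated_text": f"Generated text for: {prompt}"}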

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", "80")))
```

This exposes two custom endpoints: `https://ENDPOINT_ID.api.runpod.ai/ping` and `https://ENDPOINT_ID.api.runpod.ai/generate`.
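
As a rough sketch of how a client might call these paths, the snippet below builds the URL from an endpoint ID and prepares a JSON `POST` to `/generate`. It uses only the Python standard library so it runs without extra dependencies, and it assumes the same `Authorization: Bearer` header used in the cold-start example on this page. The helper names (`build_url`, `build_generate_request`, `call_generate`) are hypothetical and not part of any Runpod SDK.

```python
import json
import urllib.request


def build_url(endpoint_id: str, path: str) -> str:
    """Join an endpoint ID and a custom path into a full URL."""
    return f"https://{endpoint_id}.api.runpod.ai/{path.lstrip('/')}"


def build_generate_request(endpoint_id: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the worker's /generate route."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        build_url(endpoint_id, "/generate"),
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def call_generate(endpoint_id: str, api_key: str, prompt: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_generate_request(endpoint_id, api_key, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `call_generate("ENDPOINT_ID", "RUNPOD_API_KEY", "Hello")` with your own values would return the JSON produced by the `/generate` route of the worker above.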

## Health checks

Workers must expose a `/ping` endpoint on the `PORT_HEALTH` port. The load balancer periodically checks this endpoint:

| Response code | Status       |
| ------------- | ------------ |
| `200`         | Healthy      |
| `204`         | Initializing |
| Other         | Unhealthy    |

Unhealthy workers are automatically removed from the routing pool.
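
To illustrate the status codes in the table above, here is a minimal sketch of a `/ping` handler that returns `204` while the worker is still loading and `200` once it is ready. It uses Python's standard `http.server` module to stay dependency-free; the same logic would fit in a FastAPI or Flask route. The names `ready`, `ping_status_code`, `PingHandler`, and `start_health_server` are illustrative assumptions, not Runpod APIs.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set this event once the model (or any other slow resource) has loaded.
ready = threading.Event()


def ping_status_code(is_ready: bool) -> int:
    """Map worker readiness to a health check status code."""
    return 200 if is_ready else 204


class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ping":
            self.send_error(404)
            return
        self.send_response(ping_status_code(ready.is_set()))
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # Silence per-request logging in this sketch.


def start_health_server(port: int) -> ThreadingHTTPServer:
    """Serve /ping on a background thread and return the server object."""
    server = ThreadingHTTPServer(("0.0.0.0", port), PingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A worker following this pattern would call `start_health_server(...)` at startup, load its model, and then call `ready.set()` so that `/ping` switches from `204` to `200`.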

<Note>
  When calculating endpoint metrics, Runpod measures the cold start time for load balancing workers as the time between `/ping` first returning `204` and first returning `200`.
</Note>

## Environment variables

| Variable      | Default        | Description                  |
| ------------- | -------------- | ---------------------------- |
| `PORT`        | `80`           | Main application server port |
| `PORT_HEALTH` | Same as `PORT` | Health check endpoint port   |

If using a custom port, add it to your endpoint's environment variables and expose it in container configuration (under **Expose HTTP Ports (Max 10)**).
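
One way to apply these defaults in a worker is sketched below. The function name `resolve_ports` is hypothetical; the only assumption is that `PORT_HEALTH` falls back to `PORT`, and `PORT` falls back to `80`, as the table describes.

```python
import os


def resolve_ports(env=None):
    """Return (port, health_port) using the documented defaults."""
    env = os.environ if env is None else env
    port = int(env.get("PORT", "80"))
    # PORT_HEALTH falls back to PORT when it is not set.
    health_port = int(env.get("PORT_HEALTH", port))
    return port, health_port
```

For example, with only `PORT=8000` set, `resolve_ports()` returns `(8000, 8000)`, so the application and the health check share one port.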

## Timeouts and limits

| Limit                  | Value                        |
| ---------------------- | ---------------------------- |
| **Request timeout**    | 2 min (no worker available)  |
| **Processing timeout** | 5.5 min (per request)        |
| **Payload limit**      | 30 MB (request and response) |

For payloads larger than 30 MB, use [network volumes](/storage/network-volumes) or implement chunking.
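
If you choose chunking, a minimal client-side sketch might look like the following. It assumes "30 MB" means 30 × 1024 × 1024 bytes, which this page does not specify, and it ignores encoding and header overhead, so in practice you would likely pick a smaller chunk size. The names `MAX_PAYLOAD_BYTES`, `fits_in_one_request`, and `iter_chunks` are illustrative.

```python
from typing import Iterator

# Assumption: the 30 MB limit is interpreted as 30 MiB. Leave headroom in real use.
MAX_PAYLOAD_BYTES = 30 * 1024 * 1024


def fits_in_one_request(data: bytes, limit: int = MAX_PAYLOAD_BYTES) -> bool:
    """Return True when the raw payload is within the size limit."""
    return len(data) <= limit


def iter_chunks(data: bytes, chunk_size: int = MAX_PAYLOAD_BYTES) -> Iterator[bytes]:
    """Yield consecutive slices of data, each at most chunk_size bytes long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]
```

Each chunk can then be sent as its own request to a custom path on your worker, which is responsible for reassembling the chunks in order.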

<Warning>
  If your server ports are misconfigured, workers stay up for 8 minutes before terminating, and requests return `502` errors.
</Warning>

## Handling cold starts

When workers are initializing, you may get "no workers available" errors. Implement retry logic to handle this:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
import requests
import time

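def health_check_with_retry(base_url, api_key, max_retries=3, delay=5):
    """Poll /ping until it returns 200 or the retries run out."""
    headers = {"Authorization": f"Bearer {api_key}"}

    for attempt in range(max_retries):
        try:
            response = requests.get(f"{base_url}/ping", headers=headers, timeout=10)
            if response.status_code == 200:
                return True
        except requests.RequestException:
            # Connection errors and timeouts are expected while workers start up.
            pass
        # Wait before the next attempt, but not after the last one.
        if attempt < max_retries - 1:
            time.sleep(delay)
    return False

# Usage: replace ENDPOINT_ID and RUNPOD_API_KEY with your own values.
if health_check_with_retry("https://ENDPOINT_ID.api.runpod.ai", "RUNPOD_API_KEY"):
    # Worker ready, send requests
    pass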
```

Use at least 3 retries with 5-10 second delays.

## When to use load balancing endpoints

Use load balancing endpoints when you need:

* Direct access to your model's HTTP server.
* Internal batching systems (like vLLM).
* Non-JSON payloads.
* Multiple endpoints within a single worker.
* Lower latency for real-time applications.
