# Overview

> Deploy custom direct-access REST APIs with load balancing Serverless endpoints.

Load balancing endpoints route incoming traffic directly to available workers, bypassing the queueing system. Unlike queue-based endpoints, which process requests sequentially, load balancing endpoints distribute requests across your worker pool for lower latency.

You can create custom REST endpoints accessible via a unique URL:

```
https://ENDPOINT_ID.api.runpod.ai/YOUR_CUSTOM_PATH
```

## Load balancing vs. queue-based endpoints

### Queue-based endpoints

With queue-based endpoints, requests are placed in a queue and processed in order. They use the standard handler pattern (`def handler(job)`) and are accessed through fixed endpoints like `/run` and `/runsync`. These endpoints are better suited to tasks that can be processed asynchronously, and they guarantee request processing, similar to how TCP guarantees packet delivery in networking.

### Load balancing endpoints (new)

Load balancing endpoints send requests directly to workers without queuing. You can use any HTTP framework, such as FastAPI or Flask, and define custom URL paths and API contracts to suit your needs. These endpoints are ideal for real-time applications and streaming, but they provide no queuing mechanism for a request backlog, similar to UDP's behavior in networking.
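To make the custom URL paths concrete, here is a minimal client-side sketch that builds a POST request for a load balancing endpoint. It assumes your worker defines a `/generate` route that accepts a JSON body, and that requests carry a bearer token in the `Authorization` header; `build_request` is a hypothetical helper, not part of any Runpod SDK, and `ENDPOINT_ID` and `RUNPOD_API_KEY` are placeholders.

```python
import json
import urllib.request


def build_request(endpoint_id, path, payload, api_key):
    """Build (but do not send) a JSON POST request to a custom path."""
    # Custom paths hang directly off the endpoint's unique URL.
    url = f"https://{endpoint_id}.api.runpod.ai/{path.lstrip('/')}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholders: substitute your own endpoint ID and API key.
req = build_request("ENDPOINT_ID", "/generate", {"prompt": "Hello"}, "RUNPOD_API_KEY")

# To actually send it:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(resp.status, resp.read())
```

Because there is no queue in front of the worker, a caller like this would typically be paired with its own retry logic.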
## Endpoint type comparison table

| Aspect              | Load balancing                            | Queue-based                           |
| ------------------- | ----------------------------------------- | ------------------------------------- |
| **Request flow**    | Direct to worker HTTP server              | Through queueing system               |
| **Implementation**  | Custom HTTP server (FastAPI, Flask, etc.) | Handler function                      |
| **API flexibility** | Custom URL paths, any HTTP capability     | Fixed `/run` and `/runsync` endpoints |
| **Backpressure**    | Drops requests when overloaded            | Queue buffering                       |
| **Latency**         | Lower (single-hop)                        | Higher (queue + worker)               |
| **Error handling**  | No built-in retry                         | Automatic retries                     |

## Worker comparison

**Queue-based worker** (traditional):

```python
import runpod


def handler(job):
    # Read the prompt from the job input, falling back to a default.
    prompt = job["input"].get("prompt", "Hello world")
    return {"generated_text": f"Generated text for: {prompt}"}


# Start the Serverless worker with the handler registered.
runpod.serverless.start({"handler": handler})
```

**Load balancing worker** (custom HTTP server):

```python
import os

from fastapi import FastAPI

app = FastAPI()


@app.get("/ping")
async def health_check():
    # Health check route polled by the load balancer.
    return {"status": "healthy"}


@app.post("/generate")
async def generate(request: dict):
    # The JSON request body is parsed into a dict.
    return {"generated_text": f"Generated text for: {request['prompt']}"}


if __name__ == "__main__":
    import uvicorn

    # Listen on the port given by the PORT environment variable (default 80).
    uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", "80")))
```

This exposes two custom endpoints: `https://ENDPOINT_ID.api.runpod.ai/ping` and `https://ENDPOINT_ID.api.runpod.ai/generate`.

## Health checks

Workers must expose a `/ping` endpoint on the `PORT_HEALTH` port. The load balancer periodically checks this endpoint:

| Response code | Status       |
| ------------- | ------------ |
| `200`         | Healthy      |
| `204`         | Initializing |
| Other         | Unhealthy    |

Unhealthy workers are automatically removed from the routing pool.
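The status codes above suggest that a worker can report `204` while it is still loading and switch to `200` once it is ready. The following is a minimal sketch of that idea using only the Python standard library rather than FastAPI; `model_ready`, `ping_status_code`, `health_port`, and `PingHandler` are hypothetical names chosen for illustration, and the port fallback follows the `PORT_HEALTH` and `PORT` defaults documented on this page.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler

# Hypothetical readiness flag: set it once model loading has finished.
model_ready = threading.Event()


def ping_status_code() -> int:
    """Return 204 while initializing and 200 once the worker is ready."""
    return 200 if model_ready.is_set() else 204


def health_port() -> int:
    """Resolve the health check port: PORT_HEALTH, else PORT, else 80."""
    return int(os.getenv("PORT_HEALTH") or os.getenv("PORT") or "80")


class PingHandler(BaseHTTPRequestHandler):
    """Answers GET /ping with the current readiness status code."""

    def do_GET(self):
        if self.path != "/ping":
            self.send_error(404)
            return
        self.send_response(ping_status_code())
        self.send_header("Content-Length", "0")
        self.end_headers()


# To serve it (this call blocks, so it is left commented out in the sketch):
# from http.server import ThreadingHTTPServer
# ThreadingHTTPServer(("0.0.0.0", health_port()), PingHandler).serve_forever()
```

In a real worker you would call `model_ready.set()` after your model finishes loading; the same pattern carries over to a FastAPI or Flask route that returns the status code from `ping_status_code()`.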

For endpoint metrics, Runpod calculates the cold start time of a load balancing worker as the time between `/ping` first returning `204` and `/ping` first returning `200`.

## Environment variables

| Variable      | Default        | Description                  |
| ------------- | -------------- | ---------------------------- |
| `PORT`        | `80`           | Main application server port |
| `PORT_HEALTH` | Same as `PORT` | Health check endpoint port   |

If you use a custom port, add it to your endpoint's environment variables and expose it in the container configuration (under **Expose HTTP Ports (Max 10)**).

## Timeouts and limits

| Limit                  | Value                        |
| ---------------------- | ---------------------------- |
| **Request timeout**    | 2 min (no worker available)  |
| **Processing timeout** | 5.5 min (per request)        |
| **Payload limit**      | 30 MB (request and response) |

For payloads larger than 30 MB, use [network volumes](/storage/network-volumes) or implement chunking.

If your server ports are misconfigured, workers stay up for 8 minutes before terminating, and requests return `502` errors.

## Handling cold starts

While workers are initializing, you may get "no workers available" errors. Implement retry logic to handle this:

```python
import time

import requests


def health_check_with_retry(base_url, api_key, max_retries=3, delay=5):
    """Poll /ping until the worker reports healthy (200) or retries run out."""
    headers = {"Authorization": f"Bearer {api_key}"}

    for attempt in range(max_retries):
        try:
            response = requests.get(f"{base_url}/ping", headers=headers, timeout=10)
            if response.status_code == 200:
                return True
        except requests.RequestException:
            # Network errors and timeouts are treated as "not ready yet".
            pass

        # Wait before the next attempt, but not after the last one.
        if attempt < max_retries - 1:
            time.sleep(delay)

    return False


# Usage: replace ENDPOINT_ID and RUNPOD_API_KEY with your own values.
if health_check_with_retry("https://ENDPOINT_ID.api.runpod.ai", "RUNPOD_API_KEY"):
    # Worker ready, send requests
    pass
```

Use at least 3 retries with 5-10 second delays.

## When to use load balancing endpoints

Use load balancing endpoints when you need:

* Direct access to your model's HTTP server.
* Internal batching systems (like vLLM).
* Non-JSON payloads.
* Multiple endpoints within a single worker.
* Lower latency for real-time applications.
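
Non-JSON payloads are still subject to the 30 MB payload limit, so large binary uploads may need the chunking mentioned under "Timeouts and limits". The sketch below shows one way a client could split a payload; it assumes 30 MB means 30 * 1024 * 1024 bytes (the exact byte count is not stated here), and `needs_chunking`, `iter_chunks`, and the 8 MiB default chunk size are hypothetical choices for illustration.

```python
from typing import Iterator

# Assumption: "30 MB" is read as 30 * 1024 * 1024 bytes.
PAYLOAD_LIMIT_BYTES = 30 * 1024 * 1024


def needs_chunking(data: bytes, limit: int = PAYLOAD_LIMIT_BYTES) -> bool:
    """Return True when the payload exceeds the assumed size limit."""
    return len(data) > limit


def iter_chunks(data: bytes, chunk_size: int = 8 * 1024 * 1024) -> Iterator[bytes]:
    """Yield consecutive slices of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]
```

Each chunk would then be sent to a route that your worker defines for this purpose and reassembled server-side; how you order and identify chunks is part of your own API contract.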