Endpoints are the foundation of Runpod Serverless, serving as the gateway for deploying and managing your Serverless workers. They provide a consistent API interface that allows your applications to interact with powerful compute resources on demand. Endpoints are RESTful APIs that accept HTTP requests, process the input using your handler function, and return the result via HTTP response. Each endpoint provides a unique URL and abstracts away the complexity of managing individual GPUs and CPUs.

Endpoint types

Queue-based endpoints

Queue-based endpoints are the traditional endpoint type. They process requests sequentially in a queue (managed automatically by Runpod), providing guaranteed execution and automatic retries for failed requests. Queue-based endpoints offer two execution modes:
  • Asynchronous processing via the /run endpoint operation, which lets you submit jobs that run in the background and check results later (with /status), making this ideal for long-running tasks.
  • Synchronous operations through the /runsync endpoint operation, allowing you to receive immediate results in the same request, which is perfect for interactive applications.
To learn more about the available endpoint operations, see the Send API requests page.
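For example, here is a minimal sketch of the asynchronous flow using Python's requests library. The endpoint ID, the input fields, and the RUNPOD_API_KEY environment variable are placeholders you would replace with your own values:

  import os
  import time
  import requests

  API_KEY = os.environ["RUNPOD_API_KEY"]   # your Runpod API key
  ENDPOINT_ID = "your_endpoint_id"         # placeholder endpoint ID
  BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
  HEADERS = {"Authorization": f"Bearer {API_KEY}"}

  # Submit a job to the queue; /run returns immediately with a job ID.
  job = requests.post(f"{BASE_URL}/run", headers=HEADERS,
                      json={"input": {"prompt": "Hello"}}).json()

  # Poll /status until the job reaches a terminal state.
  while True:
      status = requests.get(f"{BASE_URL}/status/{job['id']}", headers=HEADERS).json()
      if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
          break
      time.sleep(2)

  print(status.get("output"))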

Load balancing endpoints

Load balancing endpoints offer direct HTTP access to your worker’s HTTP server, bypassing the queueing system. These are ideal for real-time applications and streaming, but provide no queuing mechanism for request backlog (similar to UDP’s behavior in networking). Load balancing endpoints don’t require a handler function, allowing you to define your own custom API endpoints using any HTTP framework (like FastAPI or Flask). To learn more, see the Load balancing endpoints page.
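As an illustration, below is a minimal sketch of a load balancing worker built with FastAPI. The /ping health-check route and the PORT environment variable are assumptions for this sketch; see the Load balancing endpoints page for the exact requirements:

  import os
  from fastapi import FastAPI
  import uvicorn

  app = FastAPI()

  @app.get("/ping")
  def ping():
      # Assumed health-check route so the load balancer can verify the worker is up.
      return {"status": "healthy"}

  @app.post("/generate")
  def generate(payload: dict):
      # Your own custom route; no Runpod handler function is involved.
      return {"echo": payload}

  if __name__ == "__main__":
      # Listen on the port the endpoint exposes (assumed to come from $PORT here).
      uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", 80)))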

Key features

Auto-scaling

Runpod endpoints (both queue-based and load balancing) can automatically scale from zero to hundreds of workers based on demand. You can customize your endpoint configuration to adjust the minimum and maximum worker count, GPU allocation, and memory settings. The system also offers GPU prioritization, allowing you to specify preferred GPU types in order of priority. To learn more, see Endpoint settings.

Integration options

Runpod endpoints support webhook notifications, allowing you to configure endpoints to call your webhook when jobs complete. Runpod also offers S3-compatible storage integration, letting your workers read larger inputs from and write larger outputs to object storage.
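For instance, a queue-based request can include a webhook URL in the request body; in this sketch, the endpoint ID and receiver URL are placeholders:

  import os
  import requests

  HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

  # When the job completes, Runpod POSTs the result to the webhook URL.
  requests.post(
      "https://api.runpod.ai/v2/your_endpoint_id/run",
      headers=HEADERS,
      json={
          "input": {"prompt": "Hello"},
          "webhook": "https://your-server.example.com/runpod-callback",  # placeholder receiver
      },
  )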

Create an endpoint

Before creating an endpoint, make sure you have a working handler function and Dockerfile.
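If you need a starting point, a minimal handler looks like the sketch below (the input field is a placeholder); your Dockerfile then installs the runpod package and runs this file:

  import runpod

  def handler(job):
      # job["input"] contains the JSON payload sent to the endpoint.
      name = job["input"].get("name", "World")
      return {"greeting": f"Hello, {name}!"}

  # Start the Serverless worker loop with your handler.
  runpod.serverless.start({"handler": handler})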
To create a new Serverless endpoint through the Runpod web interface:
  1. Navigate to the Serverless section of the Runpod console.
  2. Click New Endpoint.
  3. On the Deploy a New Serverless Endpoint screen, choose your deployment source:
    • Import Git Repository (if GitHub is connected). See Deploy from GitHub for details.
    • Import from Docker Registry. See Deploy from Docker Hub for details.
    • Or select a preconfigured endpoint under Ready-to-Deploy Repos.
  4. Follow the UI steps to configure your selected source (Docker image or GitHub repo), then click Next.
  5. Configure your endpoint settings:
    • Endpoint Name: The display name for your endpoint in the console.
    • Endpoint Type: Select Queue for traditional queue-based processing or Load balancer for direct HTTP access. See Load balancing endpoints for details.
    • GPU Configuration: Select the appropriate GPU types and configure worker settings.
    • Model: (Optional) Enter a model URL from Hugging Face to optimize worker startup times. See Cached models for details.
    • Container Configuration: Edit the container start command, specify the container disk size, and expose HTTP/TCP ports.
    • Environment Variables: Add environment variables for your worker containers.
  6. Click Deploy Endpoint to deploy.
You can optimize cost and availability by specifying GPU preferences in order of priority. Runpod attempts to allocate your first-choice GPU; if it is unavailable, it automatically falls back to the next GPU in your priority list, ensuring your workloads run on the best available resources. You can enable or disable particular GPU types under Advanced > Enabled GPU Types.
After deployment, your endpoint takes time to initialize before it is ready to process requests. You can monitor the deployment status on the endpoint details page, which shows worker status and initialization progress. Once active, your endpoint displays a unique API URL (https://api.runpod.ai/v2/{endpoint_id}/) that you can use to send requests.
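Once the endpoint is active, you can verify it with a quick synchronous request. This sketch again assumes a placeholder endpoint ID and the RUNPOD_API_KEY environment variable:

  import os
  import requests

  response = requests.post(
      "https://api.runpod.ai/v2/your_endpoint_id/runsync",
      headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
      json={"input": {"prompt": "Hello"}},
      timeout=90,  # /runsync holds the connection until the job finishes
  )
  print(response.json())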

Edit an endpoint

You can modify your endpoint’s configuration at any time:
  1. Navigate to the Serverless section in the Runpod console.
  2. Click the three dots in the top right corner of the endpoint you want to modify.
  3. Click Edit Endpoint.
  4. Update any endpoint settings as needed.
  5. Click Save Endpoint to save your changes.
Changes to some settings (like GPU types or worker counts) may require restarting active workers to take effect.

Delete an endpoint

To delete an endpoint:
  1. Navigate to the Serverless section in the Runpod console.
  2. Click the three dots in the top right corner of the endpoint you want to delete.
  3. Click Delete Endpoint.
  4. Type the name of the endpoint, then click Confirm.
Deleting an endpoint permanently removes all configuration, logs, and job history. This action cannot be undone.

Next steps