This guide details the configuration options available for Runpod Serverless endpoints. These settings control how your endpoint scales, how it utilizes hardware, and how it manages request lifecycles.

General configuration

Endpoint name

The name assigned to your endpoint helps you identify it within the Runpod console. This is a local display name and does not impact the endpoint ID used for API requests.

Endpoint type

Select the architecture that best fits your application’s traffic pattern:

- Queue-based endpoints use a built-in queueing system to manage requests. They are ideal for asynchronous tasks, batch processing, and long-running jobs where immediate synchronous responses are not required. These endpoints provide guaranteed execution and automatic retries for failed requests. Queue-based endpoints are implemented using handler functions (see the sketch below).
- Load balancing endpoints route traffic directly to available workers, bypassing the internal queue. They are designed for high-throughput, low-latency applications that require synchronous request/response cycles, such as real-time inference or custom REST APIs. For implementation details, see Load balancing endpoints.
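
For a queue-based endpoint, your inference code runs inside a handler function registered with the Runpod Python SDK. The sketch below is minimal; the inference logic is a placeholder.

```python
import runpod

def handler(job):
    """Process a single job pulled from the endpoint's queue.

    `job["input"]` contains the JSON payload sent with the request.
    """
    prompt = job["input"].get("prompt", "")
    # Placeholder for real inference logic.
    return {"output": f"echo: {prompt}"}

# Start the serverless worker loop; Runpod pulls jobs from the queue
# and passes each one to the handler.
runpod.serverless.start({"handler": handler})
```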

GPU configuration

This setting determines the hardware tier your workers will utilize. You can select multiple GPU categories to create a prioritized list. Runpod attempts to allocate the first category in your list. If that hardware is unavailable, it automatically falls back to the subsequent options. Selecting multiple GPU types significantly improves endpoint availability during periods of high demand.
| GPU type(s) | Memory | Flex cost per second | Active cost per second | Description |
| --- | --- | --- | --- | --- |
| A4000, A4500, RTX 4000 | 16 GB | $0.00016 | $0.00011 | The most cost-effective for small models. |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | Extreme throughput for small-to-medium models. |
| L4, A5000, 3090 | 24 GB | $0.00019 | $0.00013 | Great for small-to-medium sized inference workloads. |
| L40, L40S, 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | Extreme inference throughput on LLMs like Llama 3 7B. |
| A6000, A40 | 48 GB | $0.00034 | $0.00024 | A cost-effective option for running big models. |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | Extreme throughput for big models. |
| A100 | 80 GB | $0.00076 | $0.00060 | High throughput GPU, yet still very cost-effective. |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | Extreme throughput for huge models. |
| B200 | 180 GB | $0.00240 | $0.00190 | Maximum throughput for huge models. |
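
If you manage endpoints programmatically rather than through the console, the prioritized GPU list is part of the endpoint configuration. The sketch below uses the runpod Python SDK's create_endpoint helper; the parameter names and GPU pool IDs shown here are assumptions based on a recent SDK version and may differ from what your version exposes.

```python
import runpod

runpod.api_key = "YOUR_API_KEY"

# Sketch only: creates an endpoint that prefers 48 GB GPUs and falls back
# to 80 GB GPUs if none are available. Parameter names and pool IDs are
# assumptions; check your SDK version's reference before relying on them.
endpoint = runpod.create_endpoint(
    name="my-inference-endpoint",
    template_id="your-template-id",   # hypothetical template ID
    gpu_ids="ADA_48_PRO,AMPERE_80",   # prioritized list: first entry is preferred
    workers_min=0,                    # active (always-on) workers
    workers_max=3,                    # scaling ceiling
    idle_timeout=5,                   # seconds a worker stays warm after a job
    flashboot=True,
)
print(endpoint)
```

Listing more than one pool ID mirrors the console behavior described above: the first entry is attempted first, and later entries act as fallbacks.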

Worker scaling

Active workers

This setting defines the minimum number of workers that remain warm and ready to process requests at all times. Setting this to 1 or higher eliminates cold starts for the initial wave of requests. Active workers incur charges even when idle, but they receive a 20-30% discount compared to on-demand workers.

Max workers

This setting controls the maximum number of concurrent instances your endpoint can scale to. This acts as a safety limit for costs and a cap on concurrency. We recommend setting your max worker count approximately 20% higher than your expected maximum concurrency. This buffer allows for smoother scaling during traffic spikes.
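
As a quick illustration of that 20% buffer (simply the arithmetic implied above, not an official formula):

```python
import math

def suggested_max_workers(expected_peak_concurrency: int, buffer: float = 0.2) -> int:
    """Return a max worker count roughly 20% above expected peak concurrency."""
    return math.ceil(expected_peak_concurrency * (1 + buffer))

print(suggested_max_workers(10))  # -> 12
```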

GPUs per worker

This defines how many GPUs are assigned to a single worker instance. The default is 1. When deciding between several lower-tier GPUs and fewer high-end GPUs, generally prefer fewer, higher-end GPUs per worker when possible.

Auto-scaling type

This setting determines the logic used to scale workers up and down:

- Queue delay scaling adds workers based on wait times. If requests sit in the queue for longer than a defined threshold (4 seconds by default), the system provisions new workers. This is best for workloads where slight delays are acceptable in exchange for higher utilization.
- Request count scaling is more aggressive. It adjusts the worker count based on the total volume of pending and active work, using the formula Math.ceil((requestsInQueue + requestsInProgress) / scalerValue). Use a scaler value of 1 for maximum responsiveness, or increase it to scale more conservatively. This strategy is recommended for LLM workloads or applications with frequent, short requests (see the sketch after this list).
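
The request count formula translates directly into code; the sketch below simply restates it in Python so you can see how a given scaler value maps to a target worker count.

```python
import math

def target_workers(requests_in_queue: int, requests_in_progress: int, scaler_value: int) -> int:
    """Request count scaling: Math.ceil((requestsInQueue + requestsInProgress) / scalerValue)."""
    return math.ceil((requests_in_queue + requests_in_progress) / scaler_value)

print(target_workers(requests_in_queue=7, requests_in_progress=3, scaler_value=2))  # -> 5
```

In practice, the resulting worker count is still capped by your max workers setting.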

Lifecycle and timeouts

Idle timeout

The idle timeout determines how long a worker remains active after completing a request before shutting down. While a worker is idle, you are billed for the time, but the worker remains “warm,” allowing it to process subsequent requests immediately. The default is 5 seconds.

Execution timeout

The execution timeout acts as a failsafe that prevents runaway jobs from consuming resources indefinitely. It specifies the maximum duration a single job is allowed to run before being forcibly terminated. We strongly recommend keeping this enabled. The default is 600 seconds (10 minutes), and it can be extended up to 24 hours.

Job TTL (time-to-live)

This setting defines how long a job request remains valid in the queue before expiring. If a worker does not pick up the job within this window, the system discards it. The default is 24 hours.
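
Both of these limits can also be overridden for an individual request through the execution policy object in the request body. The sketch below assumes the policy fields executionTimeout and ttl (in milliseconds) accepted by queue-based endpoints; treat the exact field names as an assumption and confirm them against the current API reference.

```python
import requests

ENDPOINT_ID = "your-endpoint-id"   # hypothetical
API_KEY = "YOUR_API_KEY"

# Sketch: submit an async job with per-request overrides for the execution
# timeout and TTL. Field names inside "policy" are assumptions; values are ms.
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {"prompt": "Hello"},
        "policy": {
            "executionTimeout": 120000,  # allow this job to run for up to 2 minutes
            "ttl": 3600000,              # discard the job if not picked up within 1 hour
        },
    },
)
print(response.json())
```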

Performance features

FlashBoot

FlashBoot reduces cold start times by retaining the state of worker resources shortly after they spin down. This allows the system to “revive” a worker much faster than a standard fresh boot. FlashBoot is most effective on endpoints with consistent traffic, where workers frequently cycle between active and idle states.

Model

The Model field allows you to select from a list of cached models. When selected, Runpod schedules your workers on host machines that already have these large model files pre-loaded. This significantly reduces the time required to load models during worker initialization.

Advanced settings

Data centers

You can restrict your endpoint to specific geographical regions. For maximum reliability and availability, we recommend allowing all data centers. Restricting this list decreases the pool of available GPUs your endpoint can draw from.

Network volumes

Network volumes provide persistent storage that survives worker restarts. While they enable data sharing between workers, they introduce network latency and restrict your endpoint to the specific data center where the volume resides. Use network volumes only if your workload specifically requires shared persistence or datasets larger than the container limit.
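
When a network volume is attached, it is mounted into the worker's filesystem (typically at /runpod-volume on serverless workers). The sketch below assumes that mount path and a hypothetical model file; it only illustrates loading shared data from the volume once per worker.

```python
import os
import runpod

# Assumption: attached network volumes are mounted at /runpod-volume.
VOLUME_PATH = "/runpod-volume"
MODEL_PATH = os.path.join(VOLUME_PATH, "models", "my-model.safetensors")  # hypothetical file

model = None  # loaded outside the handler so it persists across jobs on the same worker

def handler(job):
    global model
    if model is None:
        # Placeholder for real loading logic (e.g. a framework-specific loader).
        with open(MODEL_PATH, "rb") as f:
            model = f.read()
    return {"model_bytes": len(model)}

runpod.serverless.start({"handler": handler})
```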

CUDA version selection

This filter ensures your workers are scheduled on host machines with compatible drivers. While you should select the version your code requires, we recommend also selecting all newer versions. CUDA is generally backward compatible, and selecting a wider range of versions increases the pool of available hardware.

Expose HTTP/TCP ports

Enabling this option exposes the public IP and port of the worker, allowing for direct external communication. This is required for applications that need persistent connections, such as WebSockets.