Endpoint configurations

This guide explains all configurable settings for RunPod Serverless endpoints, helping you optimize for performance, cost, and reliability.

Basic configurations

Endpoint name

The name you assign to your endpoint for easy identification in your dashboard. This name is only visible to you and doesn't affect the endpoint ID used for API calls.

GPU selection

Choose one or more GPU types for your endpoint in order of preference. RunPod prioritizes allocating the first GPU type in your list and falls back to subsequent GPU types if your first choice is unavailable. Selecting multiple GPU types improves availability, especially for high-demand GPUs.

Worker configuration

Active (min) workers

Sets the minimum number of workers that remain running at all times. Setting this to one or higher eliminates cold start delays for faster response times. Active workers incur charges as soon as they start, but they receive up to a 30% discount off regular pricing.

Default: 0
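
As a rough cost sketch, assuming a hypothetical per-hour worker rate (actual rates vary by GPU type), the always-on trade-off looks like this:

    # Rough cost sketch for one active (always-on) worker.
    # HOURLY_RATE is hypothetical; check your GPU type's actual pricing.
    HOURLY_RATE = 0.40        # hypothetical on-demand rate, USD/hour
    ACTIVE_DISCOUNT = 0.30    # active workers receive up to a 30% discount

    hours_per_month = 24 * 30
    active_cost = HOURLY_RATE * (1 - ACTIVE_DISCOUNT) * hours_per_month
    print(f"One active worker: ~${active_cost:.2f}/month")  # ~$201.60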

Max workers

The maximum number of concurrent workers your endpoint can scale to.

Default: 3

tip

We recommend that you set this value 20% higher than your expected maximum concurrency. If requests are frequently throttled, consider increasing this value to 5 or more.
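
As a quick sizing sketch of that guidance (expected_concurrency is whatever peak you observe; the 20% headroom and the floor of 3 mirror the recommendation and default above):

    import math

    def suggested_max_workers(expected_concurrency: int) -> int:
        """Suggest a max worker count with ~20% headroom over expected peak."""
        return max(3, math.ceil(expected_concurrency * 1.2))

    print(suggested_max_workers(10))  # 12 workers for an expected peak of 10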

GPUs per worker

The number of GPUs assigned to each worker instance.

Default: 1

Timeout settings

Idle timeout

The amount of time that a worker continues running after completing a request. You’re still charged for this time, even if the worker isn’t actively processing any requests.

By default, the idle timeout is set to 5 seconds to help avoid frequent start/stop cycles and reduce the likelihood of cold starts. Setting a longer idle timeout can help minimize cold starts for intermittent traffic, but it may also increase your costs.
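
To make that trade-off concrete, here is a minimal sketch estimating the extra cost a given idle timeout can add per request (the per-second rate is hypothetical, and the math assumes the worst case where every request is followed by a full idle period):

    # Worst-case idle cost per request for a given idle timeout.
    PER_SECOND_RATE = 0.00011  # hypothetical worker rate, USD/second

    def idle_cost_per_request(idle_timeout_s: float) -> float:
        return idle_timeout_s * PER_SECOND_RATE

    for timeout in (5, 30, 120):
        print(f"{timeout:>4}s idle timeout: up to ${idle_cost_per_request(timeout):.5f} per request")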

Execution timeout

The maximum time a job can run before automatic termination. This prevents runaway jobs from consuming excessive resources. You can turn off this setting, but we highly recommend keeping it on.

Default: 600 seconds (10 minutes)
Maximum: 24 hours (can be extended using job TTL)

Job TTL (time-to-live)

The maximum time a job remains in the queue before automatic termination.

Default: 86,400,000 milliseconds (24 hours)
Minimum: 10,000 milliseconds (10 seconds)

See Execution policies for more information.
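
As a sketch of how per-job overrides look in practice, the Execution policies guide describes a policy object that can be sent along with a job; the field names below follow that guide, values are in milliseconds, and the endpoint ID and API key are placeholders:

    import requests

    ENDPOINT_ID = "your_endpoint_id"  # placeholder
    API_KEY = "your_api_key"          # placeholder

    # Submit a job with per-job overrides for execution timeout and TTL.
    response = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "input": {"prompt": "example"},
            "policy": {
                "executionTimeout": 900_000,  # 15 minutes
                "ttl": 3_600_000,             # at most 1 hour in the queue
            },
        },
        timeout=30,
    )
    print(response.json())  # contains the job ID and initial status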

tip

You can configure the time-to-live (TTL) for an individual job by appending a ttl parameter when checking its status with the /status operation. For example, https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}?ttl=6000 sets the job's TTL to 6 seconds. Use this when you want the system to remove a job result sooner than the default retention time.
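
A minimal sketch of that call using Python's requests library (endpoint ID, job ID, and API key are placeholders):

    import requests

    ENDPOINT_ID = "your_endpoint_id"  # placeholder
    JOB_ID = "your_job_id"            # placeholder
    API_KEY = "your_api_key"          # placeholder

    # Check job status and ask the system to retain the result for only 6 seconds.
    response = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"ttl": 6000},  # TTL in milliseconds, as in the URL above
        timeout=30,
    )
    print(response.json())  # includes the job's current status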

FlashBoot

FlashBoot is RunPod's solution for reducing the average cold-start times on your endpoint. It works probabilistically. When your endpoint has consistent traffic, your workers have a higher chance of benefiting from FlashBoot for faster spin-ups. However, if your endpoint isn't receiving frequent requests, FlashBoot has fewer opportunities to optimize performance. There is no additional cost associated with FlashBoot.

Advanced configurations

Data centers

Control which data centers can deploy and cache your workers. Allowing multiple data centers improves availability, while using a network volume restricts your endpoint to a single data center.

Default: All data centers

Network volumes

Attach persistent storage to your workers. Network volumes have higher latency than local storage, and restrict workers to the data center containing your volume. However, they're very useful for sharing large models or data between workers on an endpoint.

See Create a network volume for more information.
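
As an illustration of the sharing pattern, a worker can load a model that was uploaded to the volume ahead of time. This sketch assumes the volume is mounted at /runpod-volume (the mount point serverless workers typically use), and the model file name is hypothetical:

    from pathlib import Path

    # Assumed mount point for a network volume attached to a serverless worker.
    VOLUME_ROOT = Path("/runpod-volume")
    MODEL_PATH = VOLUME_ROOT / "models" / "my-model.safetensors"  # hypothetical file

    def load_model_bytes() -> bytes:
        """Read a model shared by all workers on this endpoint."""
        if not MODEL_PATH.exists():
            raise FileNotFoundError(
                f"{MODEL_PATH} not found; upload it to the network volume first."
            )
        return MODEL_PATH.read_bytes()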

Auto-scaling type

Queue delay

The queue delay scaling strategy adjusts worker numbers based on how long requests wait in the queue. Workers are added when requests spend more than X seconds in the queue, where X is a threshold you define. By default, this threshold is set to 4 seconds.

Request count

The request count scaling strategy adjusts worker numbers according to total requests in the queue and in progress. It automatically adds workers as the number of requests increases, ensuring tasks are handled efficiently.

Total workers formula: Math.ceil((requestsInQueue + requestsInProgress) / 4)
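
Expressed in Python, the formula is a direct transcription, which is handy for predicting how many workers a given load will request:

    import math

    def target_workers(requests_in_queue: int, requests_in_progress: int) -> int:
        """Total workers requested by the request count strategy."""
        return math.ceil((requests_in_queue + requests_in_progress) / 4)

    print(target_workers(10, 3))  # ceil(13 / 4) = 4 workers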

Expose HTTP/TCP ports

Enables direct communication with your worker via its public IP and port. This can be useful for real-time applications requiring minimal latency, such as WebSocket applications.

Enabled GPU types

Here you can specify which GPU types to use within your selected GPU size categories. By default, all GPU types are enabled.

CUDA version selection

Specify which CUDA versions can be used with your workload to ensure your code runs on compatible GPU hardware. RunPod matches your workload to GPU instances running the selected CUDA versions.

tip

CUDA is generally backward compatible, so we recommend that you check for the version you need and any higher versions. For example, if your code requires CUDA 12.4, you should also try running it on 12.5, 12.6, and so on.

Limiting your endpoint to just one or two CUDA versions can significantly reduce GPU availability. RunPod continuously updates GPU drivers to support the latest CUDA versions, so keeping more CUDA versions selected gives you access to more resources.
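
To confirm the match from inside a worker, a minimal sketch using PyTorch (assuming your container image includes torch):

    import torch

    # Log the CUDA version this PyTorch build was compiled against and whether
    # the worker's GPU is visible, to verify compatibility at startup.
    print("Compiled for CUDA:", torch.version.cuda)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))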

Best practices

  • Start conservative with max workers and scale up as needed.
  • Monitor throttling and adjust max workers accordingly.
  • Use active workers for latency-sensitive applications.
  • Select multiple GPU types to improve availability.
  • Choose appropriate timeouts based on your workload characteristics.
  • Consider data locality when using network volumes.