This guide details the configuration options available for Runpod Serverless endpoints.
Some settings can only be updated after deploying your endpoint. See Edit an endpoint.

Quick reference

| Setting | Default | Description |
|---|---|---|
| Active workers | 0 | Always-on workers (eliminates cold starts) |
| Max workers | 3 | Maximum concurrent workers |
| GPUs per worker | 1 | GPU count per worker instance |
| Idle timeout | 5 s | Time before an idle worker shuts down |
| Execution timeout | 600 s (10 min) | Maximum job duration |
| Job TTL | 24 h | Total job lifespan in the system |
| FlashBoot | Enabled | Faster cold starts via state retention |

General configuration

Endpoint name

Display name for identifying your endpoint in the console. Does not affect the endpoint ID used for API requests.

Endpoint type

Queue-based endpoints use a built-in queueing system with guaranteed execution and automatic retries. They are ideal for async tasks, batch processing, and long-running jobs, and are implemented using handler functions.

Load balancing endpoints route traffic directly to workers, bypassing the queue. They are designed for low-latency applications like real-time or custom REST APIs. See Load balancing endpoints.
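A minimal sketch of a queue-based handler function (the greeting logic and input fields here are illustrative, not part of Runpod's API; in a deployed worker you would register the function with the `runpod` Python SDK via `runpod.serverless.start({"handler": handler})`):

```python
# Minimal queue-based handler sketch. In a deployed worker this function is
# registered with the runpod SDK: runpod.serverless.start({"handler": handler}).
def handler(job):
    # job["input"] carries the JSON payload sent to /run or /runsync.
    name = job["input"].get("name", "world")
    return {"greeting": f"Hello, {name}!"}

# Local invocation with a job-shaped dict, useful for testing before deploying:
print(handler({"id": "local-test", "input": {"name": "Runpod"}}))
```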

GPU configuration

Determines the hardware tier for your workers. Select multiple GPU categories to create a prioritized fallback list. If your first choice is unavailable, Runpod automatically uses the next option. Selecting multiple types improves availability during high demand.
| GPU type(s) | Memory | Flex cost per second | Active cost per second | Description |
|---|---|---|---|---|
| A4000, A4500, RTX 4000 | 16 GB | $0.00016 | $0.00011 | The most cost-effective option for small models. |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | Extreme throughput for small-to-medium models. |
| L4, A5000, 3090 | 24 GB | $0.00019 | $0.00013 | Great for small-to-medium inference workloads. |
| L40, L40S, 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | Extreme inference throughput on LLMs like Llama 3 7B. |
| A6000, A40 | 48 GB | $0.00034 | $0.00024 | A cost-effective option for running big models. |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | Extreme throughput for big models. |
| A100 | 80 GB | $0.00076 | $0.00060 | High-throughput GPU, yet still very cost-effective. |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | Extreme throughput for huge models. |
| B200 | 180 GB | $0.00240 | $0.00190 | Maximum throughput for huge models. |
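A quick sanity check on the per-second pricing: the compute cost of a job is simply duration times rate (figures below come from the table; actual billing also reflects idle time and the active-worker discount):

```python
# Rough flex-worker cost for a 30-second job on an A100 ($0.00076/s from the table).
a100_flex_rate = 0.00076  # dollars per second
job_seconds = 30
print(f"${job_seconds * a100_flex_rate:.4f}")  # $0.0228
```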

Worker scaling

Active workers

Minimum number of workers that remain warm and ready at all times. Setting this to 1+ eliminates cold starts. Active workers incur charges when idle but receive a 20-30% discount.

Max workers

Maximum concurrent instances your endpoint can scale to. Acts as a cost safety limit and concurrency cap. Set ~20% higher than expected max concurrency to handle traffic spikes smoothly.

GPUs per worker

Number of GPUs assigned to each worker instance. Default is 1. Generally prioritize fewer high-end GPUs over multiple lower-tier GPUs.

Auto-scaling type

Queue delay: adds workers when requests wait longer than the threshold (default: 4 seconds). Best when slight delays are acceptable in exchange for higher utilization.

Request count: more aggressive scaling based on pending plus active work, using the formula Math.ceil((requestsInQueue + requestsInProgress) / scalerValue). Use a scaler value of 1 for maximum responsiveness. Recommended for LLM workloads or frequent short requests.
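The request-count formula translates directly into code. In this sketch, capping the result at the endpoint's max workers setting is an inference from the scaling limits described above:

```python
import math

def target_workers(requests_in_queue, requests_in_progress, scaler_value, max_workers):
    # Request-count scaling: Math.ceil((requestsInQueue + requestsInProgress) / scalerValue),
    # never exceeding the endpoint's max workers cap.
    desired = math.ceil((requests_in_queue + requests_in_progress) / scaler_value)
    return min(desired, max_workers)

# 7 queued + 2 in progress with a scaler value of 4 asks for ceil(9/4) = 3 workers:
print(target_workers(7, 2, 4, max_workers=5))  # 3
```

A scaler value of 1 makes the target equal to the total outstanding work, which is why it gives maximum responsiveness.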

Lifecycle and timeouts

Idle timeout

How long a worker stays active after completing a request before shutting down. You’re billed during idle time, but the worker remains warm for immediate processing. Default: 5 seconds.

Execution timeout

Maximum duration for a single job. When exceeded, the job fails and the worker stops. Keep enabled to prevent runaway jobs. Default: 600s (10 min). Range: 5s to 7 days. Configure in Advanced settings, or override per-request via executionTimeout in the job policy.
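A hedged sketch of the per-request override. The payload shape follows the job policy mentioned above, but the millisecond unit for `executionTimeout` is an assumption to verify against the API reference:

```python
import json

# Hypothetical /run payload capping this one job at 5 minutes via the job policy.
# NOTE: the millisecond unit for executionTimeout is an assumption; check the API docs.
payload = {
    "input": {"prompt": "hello"},
    "policy": {"executionTimeout": 300_000},  # 5 minutes, assuming milliseconds
}
print(json.dumps(payload))
```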

Job TTL (time-to-live)

Total lifespan of a job in the system. When TTL expires, job data is deleted regardless of state (queued, running, or completed). Default: 24 hours. Range: 10s to 7 days. The timer starts at submission, not execution. If a job queues for 45 minutes with a 1-hour TTL, only 15 minutes remain for execution.
TTL is a hard limit. If it expires while a job is running, the job is immediately removed and status checks return 404. Set TTL to cover both expected queue time and execution time.
Override per-request via ttl in the job policy.
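Since the TTL timer starts at submission, a per-request override should budget for queue time plus execution time, as in the 45-minute example above. A sketch (the millisecond unit for `ttl` is an assumption to verify against the API reference):

```python
import json

# Hypothetical /run payload setting a one-hour TTL that covers both queue and
# execution time. NOTE: the millisecond unit for ttl is an assumption.
queue_budget_s = 45 * 60      # expected worst-case queue time
execution_budget_s = 15 * 60  # expected execution time
ttl_ms = (queue_budget_s + execution_budget_s) * 1000
payload = {"input": {"prompt": "hello"}, "policy": {"ttl": ttl_ms}}
print(json.dumps(payload))  # ttl = 3600000 (1 hour)
```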

Result retention

| Request type | Retention | Notes |
|---|---|---|
| Async (/run) | 30 min | Retrieve via /status/{job_id} |
| Sync (/runsync) | 1 min | Returned in the response; also available via /status/{job_id} |

Results are permanently deleted after retention expires.
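A sketch of polling for an async result before the 30-minute retention window closes. The `fetch_status` callable stands in for a real HTTP GET to /status/{job_id}, and the exact set of terminal status strings is an assumption to confirm against the API reference:

```python
import time

def poll_status(fetch_status, job_id, interval_s=1.0, deadline_s=30 * 60):
    # Poll until the job reaches a terminal state or the retention deadline passes.
    # fetch_status stands in for an HTTP GET to /status/{job_id} (hypothetical).
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        status = fetch_status(job_id)
        if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"result for {job_id} not retrieved before retention deadline")

# Stubbed fetcher that completes on the second poll:
responses = iter([{"status": "IN_QUEUE"}, {"status": "COMPLETED", "output": 42}])
print(poll_status(lambda job_id: next(responses), "job-123", interval_s=0.01))
```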

Performance features

FlashBoot

Reduces cold starts by retaining worker state after spin-down, allowing faster “revival” than fresh boots. Most effective on endpoints with consistent traffic where workers frequently cycle between active and idle.

Model

Select a cached model to schedule workers on hosts with the model files pre-loaded. This significantly reduces model loading time during worker initialization.

Advanced settings

Data centers

Restrict your endpoint to specific regions. For maximum availability, allow all data centers: restricting them decreases the available GPU pool.

Network volumes

Network volumes provide persistent storage across worker restarts. Tradeoffs: adds network latency and restricts your endpoint to the volume’s data center. Use only when you need shared persistence or datasets exceeding container limits.

CUDA version selection

Ensures workers run on hosts with compatible drivers. Select your required version plus all newer versions, since CUDA is backward compatible and a wider range increases the available hardware pool.

Expose HTTP/TCP ports

Exposes the worker’s public IP and port for direct external communication. Required for persistent connections like WebSockets.