This guide details the configuration options available for Runpod Serverless endpoints.
Some settings can only be updated after deploying your endpoint. See Edit an endpoint.
Quick reference
| Setting | Default | Description |
|---|---|---|
| Active workers | 0 | Always-on workers (eliminates cold starts) |
| Max workers | 3 | Maximum concurrent workers |
| GPUs per worker | 1 | GPU count per worker instance |
| Idle timeout | 5s | Time before idle worker shuts down |
| Execution timeout | 600s (10 min) | Max job duration |
| Job TTL | 24h | Total job lifespan in system |
| FlashBoot | Enabled | Faster cold starts via state retention |
General configuration
Endpoint name
Display name for identifying your endpoint in the console. Does not affect the endpoint ID used for API requests.
Endpoint type
Queue-based endpoints use a built-in queueing system with guaranteed execution and automatic retries. Ideal for async tasks, batch processing, and long-running jobs. Implemented using handler functions.
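As an illustration, a queue-based handler can be a small Python function (the input fields here are hypothetical; in a real worker image you would register the handler with the Runpod Python SDK):

```python
# Minimal sketch of a queue-based handler (hypothetical input fields).
# In a real worker image, register it with the Runpod Python SDK:
#   import runpod
#   runpod.serverless.start({"handler": handler})

def handler(job):
    # job["input"] carries the JSON payload sent to /run or /runsync.
    prompt = job.get("input", {}).get("prompt", "")
    # ... run your model or business logic here ...
    return {"output": prompt.upper()}
```

The queueing system calls the handler once per job and stores whatever it returns as the job result.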
Load balancing endpoints route traffic directly to workers, bypassing the queue. Designed for low-latency applications like real-time or custom REST APIs. See Load balancing endpoints.
GPU configuration
Determines the hardware tier for your workers. Select multiple GPU categories to create a prioritized fallback list. If your first choice is unavailable, Runpod automatically uses the next option. Selecting multiple types improves availability during high demand.
| GPU type(s) | Memory | Flex cost per second | Active cost per second | Description |
|---|---|---|---|---|
| A4000, A4500, RTX 4000 | 16 GB | $0.00016 | $0.00011 | The most cost-effective for small models. |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | Extreme throughput for small-to-medium models. |
| L4, A5000, 3090 | 24 GB | $0.00019 | $0.00013 | Great for small-to-medium sized inference workloads. |
| L40, L40S, 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | Extreme inference throughput on LLMs like Llama 3 7B. |
| A6000, A40 | 48 GB | $0.00034 | $0.00024 | A cost-effective option for running big models. |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | Extreme throughput for big models. |
| A100 | 80 GB | $0.00076 | $0.00060 | High throughput GPU, yet still very cost-effective. |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | Extreme throughput for huge models. |
| B200 | 180 GB | $0.00240 | $0.00190 | Maximum throughput for huge models. |
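As a rough illustration of how the per-second rates above translate into job cost, here is a small sketch (the duration and GPU choice are arbitrary examples):

```python
def job_cost(cost_per_second, duration_seconds):
    # Cost of a single job billed at the listed per-second rate.
    return cost_per_second * duration_seconds

# Example: a 10-minute (600 s) job on an A100 at the flex rate of $0.00076/s.
a100_flex = 0.00076
print(f"${job_cost(a100_flex, 600):.4f}")  # 600 * 0.00076 = $0.4560
```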
Worker scaling
Active workers
Minimum number of workers that remain warm and ready at all times. Setting this to 1+ eliminates cold starts. Active workers incur charges when idle but receive a 20-30% discount.
Max workers
Maximum concurrent instances your endpoint can scale to. Acts as a cost safety limit and concurrency cap. Set ~20% higher than expected max concurrency to handle traffic spikes smoothly.
GPUs per worker
Number of GPUs assigned to each worker instance. Default is 1. Generally prioritize fewer high-end GPUs over multiple lower-tier GPUs.
Auto-scaling type
Queue delay: Adds workers when requests wait longer than the threshold (default: 4 seconds). Best when slight delays are acceptable for higher utilization.
Request count: More aggressive scaling based on pending + active work. Formula: Math.ceil((requestsInQueue + requestsInProgress) / scalerValue). Use scaler value of 1 for max responsiveness. Recommended for LLM workloads or frequent short requests.
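The request-count formula can be sketched in Python (the argument names are descriptive stand-ins for the values Runpod tracks):

```python
import math

def workers_needed(requests_in_queue, requests_in_progress, scaler_value):
    # Request count scaling:
    # Math.ceil((requestsInQueue + requestsInProgress) / scalerValue)
    return math.ceil((requests_in_queue + requests_in_progress) / scaler_value)

# With a scaler value of 4, 7 queued + 3 active requests need ceil(10/4) = 3 workers.
print(workers_needed(7, 3, 4))  # 3
# A scaler value of 1 requests one worker per pending/active job (max responsiveness).
print(workers_needed(7, 3, 1))  # 10
```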
Lifecycle and timeouts
Idle timeout
How long a worker stays active after completing a request before shutting down. You’re billed during idle time, but the worker remains warm for immediate processing. Default: 5 seconds.
Execution timeout
Maximum duration for a single job. When exceeded, the job fails and the worker stops. Keep enabled to prevent runaway jobs. Default: 600s (10 min). Range: 5s to 7 days.
Configure in Advanced settings, or override per-request via executionTimeout in the job policy.
Job TTL (time-to-live)
Total lifespan of a job in the system. When TTL expires, job data is deleted regardless of state (queued, running, or completed). Default: 24 hours. Range: 10s to 7 days.
The timer starts at submission, not execution. If a job queues for 45 minutes with a 1-hour TTL, only 15 minutes remain for execution.
TTL is a hard limit. If it expires while a job is running, the job is immediately removed and status checks return 404. Set TTL to cover both expected queue time and execution time.
Override per-request via ttl in the job policy.
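A per-request override of both limits might look like the payload sketch below (the input is a placeholder, and the assumption that executionTimeout and ttl are expressed in milliseconds should be verified against the API reference):

```python
# Hypothetical /run payload overriding the endpoint defaults for one job.
# Field units assumed to be milliseconds; confirm against the API reference.
payload = {
    "input": {"prompt": "..."},       # placeholder input for your handler
    "policy": {
        "executionTimeout": 120_000,  # fail the job if it runs longer than 2 minutes
        "ttl": 3_600_000,             # remove the job from the system after 1 hour
    },
}
```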
Result retention
| Request type | Retention | Notes |
|---|---|---|
| Async (/run) | 30 min | Retrieve via /status/{job_id} |
| Sync (/runsync) | 1 min | Returned in response; also available via /status/{job_id} |
Results are permanently deleted after retention expires.
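The retrieval flow amounts to constructing URLs against the public serverless API, as in this sketch (the endpoint ID and job ID are placeholders):

```python
API_BASE = "https://api.runpod.ai/v2"

def run_url(endpoint_id):
    # Submit an async job; the response includes the job ID to poll.
    return f"{API_BASE}/{endpoint_id}/run"

def status_url(endpoint_id, job_id):
    # Poll within the retention window (30 min async, 1 min sync).
    return f"{API_BASE}/{endpoint_id}/status/{job_id}"

print(status_url("your-endpoint-id", "abc-123"))
```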
FlashBoot
Reduces cold starts by retaining worker state after spin-down, allowing faster “revival” than fresh boots. Most effective on endpoints with consistent traffic where workers frequently cycle between active and idle.
Model
Select a cached model to schedule workers on hosts with the model files pre-loaded. This significantly reduces model loading time during worker initialization.
Advanced settings
Data centers
Restrict your endpoint to specific regions. For maximum availability, allow all data centers; restricting them shrinks the available GPU pool.
Network volumes
Network volumes provide persistent storage across worker restarts. Tradeoffs: adds network latency and restricts your endpoint to the volume’s data center. Use only when you need shared persistence or datasets exceeding container limits.
CUDA version selection
Ensures workers run on hosts with compatible CUDA drivers. Select your required version plus all newer versions; CUDA is backward compatible, and a wider range increases the available hardware pool.
Expose HTTP/TCP ports
Exposes the worker’s public IP and port for direct external communication. Required for persistent connections like WebSockets.