General configuration
Endpoint name
The name assigned to your endpoint helps you identify it within the Runpod console. This is a local display name and does not impact the endpoint ID used for API requests.
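For example, API requests reference the endpoint ID in the URL; the display name never appears. Here is a minimal sketch using Python's requests library against the standard /runsync route (the endpoint ID, API key, and input payload below are placeholders):

```python
import requests

ENDPOINT_ID = "your_endpoint_id"   # placeholder: copy this from the endpoint's detail page
API_KEY = "your_runpod_api_key"    # placeholder: a Runpod API key with access to the endpoint

# The endpoint is addressed by its ID, never by its display name.
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello, world!"}},
)
print(response.json())
```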
Endpoint type
Select the architecture that best fits your application's traffic pattern.

Queue based endpoints use a built-in queueing system to manage requests. They are ideal for asynchronous tasks, batch processing, and long-running jobs where an immediate synchronous response is not required. These endpoints provide guaranteed execution and automatic retries for failed requests, and are implemented using handler functions.

Load balancing endpoints route traffic directly to available workers, bypassing the internal queue. They are designed for high-throughput, low-latency applications that require synchronous request/response cycles, such as real-time inference or custom REST APIs. For implementation details, see Load balancing endpoints.
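As a rough sketch of the handler-function model behind queue based endpoints, here is a minimal Python worker using the runpod SDK (the input schema is an assumption for illustration):

```python
import runpod

def handler(job):
    """Process a single job pulled from the endpoint's queue."""
    job_input = job["input"]                 # the "input" field of the request payload
    name = job_input.get("name", "World")    # hypothetical input field, for illustration only
    return {"greeting": f"Hello, {name}!"}   # the return value becomes the job's output

# Register the handler and start the serverless worker loop.
runpod.serverless.start({"handler": handler})
```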
GPU configuration
This setting determines the hardware tier your workers run on. You can select multiple GPU categories to create a prioritized list: Runpod attempts to allocate the first category in the list and, if that hardware is unavailable, automatically falls back to the subsequent options. Selecting multiple GPU types significantly improves endpoint availability during periods of high demand. (A short illustration of this fallback logic follows the pricing table below.)

| GPU type(s) | Memory | Flex cost per second | Active cost per second | Description |
|---|---|---|---|---|
| A4000, A4500, RTX 4000 | 16 GB | $0.00016 | $0.00011 | The most cost-effective for small models. |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | Extreme throughput for small-to-medium models. |
| L4, A5000, 3090 | 24 GB | $0.00019 | $0.00013 | Great for small-to-medium sized inference workloads. |
| L40, L40S, 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | Extreme inference throughput on LLMs like Llama 3 7B. |
| A6000, A40 | 48 GB | $0.00034 | $0.00024 | A cost-effective option for running big models. |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | Extreme throughput for big models. |
| A100 | 80 GB | $0.00076 | $0.00060 | High throughput GPU, yet still very cost-effective. |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | Extreme throughput for huge models. |
| B200 | 180 GB | $0.00240 | $0.00190 | Maximum throughput for huge models. |
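To make the fallback behavior concrete, here is a minimal sketch of the priority-list logic described above. It is an illustration only, not Runpod's actual allocator, and the GPU names are examples:

```python
# Illustration only: Runpod performs this selection internally. This sketch mirrors
# the priority-list behavior described above, not the real allocator.
def pick_gpu(priority_list, available):
    """Return the first GPU category in the priority list that is currently available."""
    for gpu in priority_list:
        if gpu in available:
            return gpu
    return None  # none of the configured categories are available right now

priority = ["H100 PRO", "A100", "A40"]      # order set in the endpoint's GPU configuration
print(pick_gpu(priority, {"A100", "A40"}))  # -> "A100" (first choice unavailable, falls back)
```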
Worker scaling
Active workers
This setting defines the minimum number of workers that stay warm and ready to process requests at all times. Setting it to 1 or higher eliminates cold starts for the first wave of requests. Active workers incur charges even while idle, but they receive a 20-30% discount compared to on-demand (flex) workers.
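As a quick check against the pricing table above, the A100 rates put the active-worker discount at roughly 21%:

```python
# A100 per-second rates from the GPU pricing table above.
flex_rate = 0.00076
active_rate = 0.00060

discount = (flex_rate - active_rate) / flex_rate
print(f"Active worker discount: {discount:.0%}")  # -> 21%
```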
Max workers
This setting controls the maximum number of concurrent worker instances your endpoint can scale to, acting both as a cost safety limit and as a cap on concurrency. We recommend setting your max worker count approximately 20% higher than your expected maximum concurrency; this buffer allows for smoother scaling during traffic spikes.
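For example, with a hypothetical expected peak of 10 concurrent requests, the 20% buffer works out as follows:

```python
import math

expected_peak_concurrency = 10  # hypothetical expected maximum concurrency
headroom = 1.2                  # the ~20% buffer recommended above

max_workers = math.ceil(expected_peak_concurrency * headroom)
print(max_workers)  # -> 12
```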
GPUs per worker
This setting defines how many GPUs are assigned to a single worker instance (the default is 1). When choosing between multiple lower-tier GPUs and fewer high-end GPUs, prefer high-end GPUs with a lower GPU count per worker whenever possible.
Auto-scaling type
This setting determines the logic used to scale workers up and down.

Queue delay scaling adds workers based on wait times: if requests sit in the queue for longer than a defined threshold (4 seconds by default), the system provisions new workers. This is best for workloads where slight delays are acceptable in exchange for higher utilization.

Request count scaling is more aggressive. It adjusts the worker count based on the total volume of pending and active work, using the formula `Math.ceil((requestsInQueue + requestsInProgress) / scalerValue)`. Use a scaler value of 1 for maximum responsiveness, or increase it to scale more conservatively. This strategy is recommended for LLM workloads or applications with frequent, short requests.
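To illustrate the request count formula, here is the same calculation in Python with hypothetical queue figures:

```python
import math

def target_workers(requests_in_queue: int, requests_in_progress: int, scaler_value: int) -> int:
    """Worker count under request count scaling:
    Math.ceil((requestsInQueue + requestsInProgress) / scalerValue)."""
    return math.ceil((requests_in_queue + requests_in_progress) / scaler_value)

print(target_workers(7, 5, 1))  # -> 12: one worker per pending or active request
print(target_workers(7, 5, 4))  # -> 3: a larger scaler value scales more conservatively
```

The result is still capped by the endpoint's max worker setting.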