# Optimize your endpoints

> Implement strategies to reduce latency and cost for your Serverless endpoints.

export const InferenceTooltip = () => {
  return <Tooltip headline="AI inference" tip="The execution phase where a trained model makes predictions on new data. When you prompt a model and it responds, that's inference.">inference</Tooltip>;
};

Optimization involves measuring performance with [benchmarking](/serverless/development/benchmarking), identifying bottlenecks, and tuning your [endpoint configurations](/serverless/endpoints/endpoint-configurations).

## Quick optimization checklist

| Strategy                                                                                       | Impact                     | When to use            |
| ---------------------------------------------------------------------------------------------- | -------------------------- | ---------------------- |
| [Use cached models](/serverless/endpoints/model-caching)                                       | ⬇️ Cold start (major)      | Models on Hugging Face |
| [Bake models into image](/serverless/workers/create-dockerfile#including-models-and-files)     | ⬇️ Cold start              | Private models         |
| [Set active workers > 0](/serverless/endpoints/endpoint-configurations#active-workers)         | ⬇️ Cold start (eliminates) | Latency-sensitive apps |
| [Select multiple GPU types](/serverless/endpoints/endpoint-configurations#gpu-configuration)   | ⬆️ Availability            | Production workloads   |
| [Increase max workers](/serverless/endpoints/endpoint-configurations#max-workers)              | ⬆️ Throughput              | High concurrency       |
| [Lower queue delay threshold](/serverless/endpoints/endpoint-configurations#auto-scaling-type) | ⬇️ Response time           | Traffic spikes         |

## Understanding delay time

Two metrics affect request response time:

| Metric             | Description                                | Optimization                     |
| ------------------ | ------------------------------------------ | -------------------------------- |
| **Delay time**     | Waiting for a worker (includes cold start) | Model caching, active workers    |
| **Execution time** | GPU processing the request                 | Code optimization, GPU selection |

**Delay time** breaks down into:

* **Initialization time**: Downloading Docker image
* **Cold start time**: Loading model into GPU memory

<Tip>
  Use [benchmarking](/serverless/development/benchmarking) to measure these metrics for your workload.
</Tip>
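To see which metric dominates for a given request, you can compare the two values directly. The sketch below assumes a job result shaped like a dictionary with `delayTime` and `executionTime` fields in milliseconds. Those field names and units are assumptions for illustration, so check them against the actual response from your endpoint.

```python
def summarize_timing(job: dict) -> dict:
    """Split a job's response time into delay and execution shares."""
    delay_ms = job["delayTime"]          # assumed field: time spent waiting for a worker (ms)
    execution_ms = job["executionTime"]  # assumed field: time spent processing the request (ms)
    total_ms = delay_ms + execution_ms
    return {
        "delay_s": delay_ms / 1000,
        "execution_s": execution_ms / 1000,
        # Fraction of the total response time spent waiting for a worker
        "delay_share": delay_ms / total_ms if total_ms else 0.0,
    }


# Hypothetical sample values, not real measurements
sample_job = {"delayTime": 3000, "executionTime": 1000}
print(summarize_timing(sample_job))
```

A high `delay_share` suggests focusing on model caching or active workers, while a low one points toward code optimization or GPU selection.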

<Note>
  If a cold start exceeds 7 minutes, the worker is marked unhealthy. To extend this limit, set the `RUNPOD_INIT_TIMEOUT` environment variable to a larger value in seconds, for example `RUNPOD_INIT_TIMEOUT=800`.
</Note>

## Reduce cold starts

### Use cached models (recommended)

For models on Hugging Face, [cached models](/serverless/endpoints/model-caching) provide the fastest cold starts and lowest cost.

### Bake models into images

For private models, [embed them in your Docker image](/serverless/workers/create-dockerfile#including-models-and-files). Models load from high-speed local NVMe storage instead of downloading at runtime.

### Maintain active workers

Set [active workers](/serverless/endpoints/endpoint-configurations#active-workers) > 0 to eliminate cold starts entirely.

**Formula**: `Active workers = (Requests/min × Request duration in seconds) / 60`

Example: (6 requests/min × 30 seconds) / 60 = 3 active workers needed.
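The formula above can be sketched as a small helper. Rounding up to a whole number of workers is an assumption made here, since fractional workers cannot be configured, and the function name is hypothetical.

```python
import math


def active_workers_needed(requests_per_minute: float, request_duration_s: float) -> int:
    """Estimate active workers: (requests/min x duration in seconds) / 60, rounded up."""
    if requests_per_minute < 0 or request_duration_s < 0:
        raise ValueError("inputs must be non-negative")
    # Rounding up is an assumption: a partial worker's worth of load still needs a worker
    return math.ceil(requests_per_minute * request_duration_s / 60)


print(active_workers_needed(6, 30))   # the example from the text
print(active_workers_needed(10, 15))  # 2.5 workers of load, rounded up
```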

## Improve availability

### Select multiple GPU types

Specify multiple [GPU types](/references/gpu-types) in priority order. A single high-end GPU often outperforms multiple lower-tier cards for <InferenceTooltip />.

For endpoints with five or more workers, Runpod [distributes workers across your GPU priorities](/serverless/endpoints/endpoint-configurations#gpu-priority-and-worker-distribution) to reduce throttling when your primary GPU type is constrained.

### Add headroom to max workers

Set [max workers](/serverless/endpoints/endpoint-configurations#max-workers) about 20% above your expected concurrency so that load spikes do not cause throttling.
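As a rough sketch of the headroom calculation, the helper below adds a percentage on top of expected concurrency and rounds up. The function name and the choice to round up are assumptions for illustration.

```python
def max_workers_with_headroom(expected_concurrency: int, headroom_percent: int = 20) -> int:
    """Return expected concurrency plus headroom, rounded up to a whole worker."""
    if expected_concurrency < 0 or headroom_percent < 0:
        raise ValueError("inputs must be non-negative")
    # Integer ceiling division avoids floating-point rounding surprises
    return (expected_concurrency * (100 + headroom_percent) + 99) // 100


print(max_workers_with_headroom(10))  # 10 concurrent requests plus 20% headroom
print(max_workers_with_headroom(8))   # 9.6 rounds up to the next whole worker
```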

### Tune auto-scaling

Lower the [queue delay threshold](/serverless/endpoints/endpoint-configurations#auto-scaling-type) to 2-3 seconds (default: 4 seconds) so that workers are provisioned sooner when requests start to queue.
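To get a feel for how the threshold changes scaling behavior, the sketch below counts how many queued requests would have waited longer than a given threshold. This is a simplified model for illustration only, not Runpod's actual scheduler, and the wait times are hypothetical.

```python
def requests_over_threshold(queue_waits_s: list[float], threshold_s: float = 4.0) -> int:
    """Count requests whose queue wait exceeded the threshold (simplified model)."""
    return sum(1 for wait in queue_waits_s if wait > threshold_s)


# Hypothetical queue wait times in seconds
waits = [0.5, 2.5, 3.2, 4.5, 6.0]
print(requests_over_threshold(waits, threshold_s=4.0))
print(requests_over_threshold(waits, threshold_s=2.0))
```

In this model, a lower threshold flags more requests as candidates for scale-up, which means faster provisioning at the cost of starting more workers.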

## Architecture considerations

| Choice                 | Tradeoff                                         |
| ---------------------- | ------------------------------------------------ |
| **Baked models**       | Fastest loading, but larger images               |
| **Network volumes**    | Flexible, but limited to specific data centers   |
| **Multiple GPU types** | Higher availability, variable performance        |