# Endpoint parameters

> Complete reference for all Endpoint class parameters.

This page documents every parameter accepted by the `Endpoint` class, the related `PodTemplate` and `EndpointJob` objects, and how configuration changes affect deployed workers.

## Parameter overview

| Parameter              | Type                           | Description                                       | Default                       |
| ---------------------- | ------------------------------ | ------------------------------------------------- | ----------------------------- |
| `name`                 | `str`                          | Endpoint name (required unless `id=` is used)     | -                             |
| `id`                   | `str`                          | Connect to existing endpoint by ID                | `None`                        |
| `gpu`                  | `GpuGroup`, `GpuType`, or list | GPU type(s) for the endpoint                      | `GpuGroup.ANY`                |
| `cpu`                  | `str` or `CpuInstanceType`     | CPU instance type (mutually exclusive with `gpu`) | `None`                        |
| `workers`              | `int` or `(min, max)`          | Worker scaling configuration                      | `(0, 1)`                      |
| `idle_timeout`         | `int`                          | Seconds before scaling down idle workers          | `60`                          |
| `dependencies`         | `list[str]`                    | Python packages to install                        | `None`                        |
| `system_dependencies`  | `list[str]`                    | System packages to install (apt)                  | `None`                        |
| `accelerate_downloads` | `bool`                         | Enable download acceleration                      | `True`                        |
| `volume`               | `NetworkVolume` or list        | Network volume(s) for persistent storage          | `None`                        |
| `datacenter`           | `DataCenter`, `str`, or list   | Datacenter(s) for deployment                      | `None` (all DCs)              |
| `env`                  | `dict[str, str]`               | Environment variables                             | `None`                        |
| `gpu_count`            | `int`                          | GPUs per worker                                   | `1`                           |
| `execution_timeout_ms` | `int`                          | Max execution time in milliseconds                | `0` (no limit)                |
| `flashboot`            | `bool`                         | Enable Flashboot fast startup                     | `True`                        |
| `image`                | `str`                          | Custom Docker image to deploy                     | `None`                        |
| `scaler_type`          | `ServerlessScalerType`         | Scaling strategy                                  | auto                          |
| `scaler_value`         | `int`                          | Scaling threshold                                 | `4`                           |
| `template`             | `PodTemplate`                  | Pod template overrides                            | `None`                        |
| `min_cuda_version`     | `str` or `CudaVersion`         | Minimum CUDA version for GPU host selection       | `"12.8"` (GPU) / `None` (CPU) |
| `python_version`       | `str`                          | Python version for the worker image               | Local Python                  |

## Parameter details

### name

**Type**: `str`
**Required**: Yes (unless `id=` is specified)

The endpoint name visible in the [Runpod console](https://www.runpod.io/console/serverless). Use descriptive names to easily identify endpoints.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuGroup

@Endpoint(name="ml-inference-prod", gpu=GpuGroup.ANY)
async def infer(data): ...
```

<Tip>
  Use naming conventions like `image-generation-prod` or `batch-processor-dev` to organize your endpoints.
</Tip>

### id

**Type**: `str`
**Default**: `None`

Connect to an existing deployed endpoint by its ID. When `id` is specified, `name` is not required.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
# Connect to existing endpoint
ep = Endpoint(id="abc123xyz")

# Make requests
job = await ep.run({"prompt": "hello"})
result = await ep.post("/inference", {"data": "..."})
```

### gpu

**Type**: `GpuGroup`, `GpuType`, or `list[GpuGroup | GpuType]`
**Default**: `GpuGroup.ANY` (if neither `gpu` nor `cpu` is specified)

Specifies GPU hardware for the endpoint. Accepts a single GPU type/group or a list for fallback strategies.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuType, GpuGroup

# Specific GPU type
@Endpoint(name="inference", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
async def infer(data): ...

# Another specific GPU type
@Endpoint(name="rtx-worker", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090)
async def process(data): ...

# Multiple types for fallback
@Endpoint(name="flexible", gpu=[GpuType.NVIDIA_A100_80GB_PCIe, GpuType.NVIDIA_RTX_A6000, GpuType.NVIDIA_GEFORCE_RTX_4090])
async def flexible_infer(data): ...
```

See [GPU types](/flash/configuration/gpu-types) for all available options.

### cpu

**Type**: `str` or `CpuInstanceType`
**Default**: `None`

Specifies a CPU instance type. Mutually exclusive with `gpu`.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, CpuInstanceType

# String shorthand
@Endpoint(name="data-processor", cpu="cpu5c-4-8")
async def process(data): ...

# Using enum
@Endpoint(name="data-processor", cpu=CpuInstanceType.CPU5C_4_8)
async def process(data): ...
```

See [CPU types](/flash/configuration/cpu-types) for all available options.

### workers

**Type**: `int` or `tuple[int, int]`
**Default**: `(0, 1)`

Controls worker scaling. Accepts either a single integer (max workers with min=0) or a tuple of (min, max).

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
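from runpod_flash import Endpoint, GpuGroup

# Just max: scales from 0 to 5
@Endpoint(name="elastic", gpu=GpuGroup.ANY, workers=5)
async def elastic(data): ...

# Min and max: always keep 2 warm, scale up to 10
@Endpoint(name="always-on", gpu=GpuGroup.ANY, workers=(2, 10))
async def always_on(data): ...

# Default: (0, 1)
@Endpoint(name="default", gpu=GpuGroup.ANY)
async def default(data): ...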
```

**Recommendations**:

* `workers=N` or `workers=(0, N)`: Cost-optimized, allows scale to zero
* `workers=(1, N)`: Avoid cold starts by keeping at least one worker warm
* `workers=(N, N)`: Fixed worker count for consistent performance
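
Both accepted forms reduce to a single `(min, max)` pair. The sketch below uses a hypothetical helper, `normalize_workers`, which is not part of the Flash SDK. It only illustrates how the integer shorthand maps onto the tuple form described above; the validation rules are assumptions.

```python
from typing import Tuple, Union

WorkersSpec = Union[int, Tuple[int, int]]


def normalize_workers(workers: WorkersSpec = (0, 1)) -> Tuple[int, int]:
    """Return (min_workers, max_workers) for either accepted form.

    Hypothetical helper for illustration; the SDK may validate differently.
    """
    if isinstance(workers, int):
        # A bare integer means "scale from 0 up to this many workers".
        min_workers, max_workers = 0, workers
    else:
        min_workers, max_workers = workers
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError(f"invalid workers spec: {workers!r}")
    return min_workers, max_workers


print(normalize_workers(5))        # (0, 5)
print(normalize_workers((2, 10)))  # (2, 10)
print(normalize_workers())         # (0, 1)
```

A minimum of `0` in the result means the endpoint can scale to zero, so the first request after an idle period pays a cold start.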

### idle\_timeout

**Type**: `int`
**Default**: `60`
**Valid range**: 1-3600 seconds

Number of seconds a worker stays active after completing a request, waiting for additional requests, before the endpoint scales back down to its minimum worker count.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
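from runpod_flash import Endpoint, GpuGroup

# Quick scale-down for cost savings
@Endpoint(name="batch", gpu=GpuGroup.ANY, idle_timeout=30)
async def run_batch(data): ...

# Keep workers longer for variable traffic
@Endpoint(name="api", gpu=GpuGroup.ANY, idle_timeout=120)
async def handle_request(data): ...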
```

**Recommendations**:

* `30-60 seconds`: Cost-optimized, infrequent traffic
* `60-120 seconds`: Balanced, variable traffic patterns
* `120-300 seconds`: Latency-optimized, consistent traffic

### dependencies

**Type**: `list[str]`
**Default**: `None`

Python packages to install on the remote worker before executing your function. Supports standard pip syntax.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
@Endpoint(
    name="ml-worker",
    gpu=GpuGroup.ANY,
    dependencies=["torch>=2.0.0", "transformers==4.36.0", "pillow"]
)
async def process(data):
    # Import inside the function so the package resolves on the worker
    import torch
    ...
```

<Warning>
  Packages must be imported **inside** the function body, not at the top of your file.
</Warning>

### system\_dependencies

**Type**: `list[str]`
**Default**: `None`

System-level packages to install via apt before your function runs.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
@Endpoint(
    name="video-processor",
    gpu=GpuGroup.ANY,
    dependencies=["opencv-python"],
    system_dependencies=["libgl1-mesa-glx", "libglib2.0-0"]
)
async def process_video(data): ...
```

### accelerate\_downloads

**Type**: `bool`
**Default**: `True`

Enables faster downloads for dependencies, models, and large files. Disable if you encounter compatibility issues.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
@Endpoint(
    name="standard-downloads",
    gpu=GpuGroup.ANY,
    accelerate_downloads=False
)
async def process(data): ...
```

### volume

**Type**: `NetworkVolume` or `list[NetworkVolume]`
**Default**: `None`

Attaches network volume(s) for persistent storage. Volumes are mounted at `/runpod-volume/`. Flash uses the volume `name` to find an existing volume or create a new one. Each volume is tied to a specific datacenter.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuGroup, DataCenter, NetworkVolume

# Single volume in a specific datacenter
vol = NetworkVolume(name="model-cache", size=100, datacenter=DataCenter.US_GA_2)

@Endpoint(
    name="model-server",
    gpu=GpuGroup.ANY,
    datacenter=DataCenter.US_GA_2,
    volume=vol
)
async def serve(data):
    # Access files at /runpod-volume/
    model = load_model("/runpod-volume/models/bert")
    ...
```

For multi-datacenter deployments, pass a list of volumes (one per datacenter):

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuGroup, DataCenter, NetworkVolume

volumes = [
    NetworkVolume(name="models-us", size=100, datacenter=DataCenter.US_GA_2),
    NetworkVolume(name="models-eu", size=100, datacenter=DataCenter.EU_RO_1),
]

@Endpoint(
    name="global-server",
    gpu=GpuGroup.ANY,
    datacenter=[DataCenter.US_GA_2, DataCenter.EU_RO_1],
    volume=volumes
)
async def serve(data):
    ...
```

<Warning>
  Only one network volume is allowed per datacenter. If you specify multiple volumes in the same datacenter, deployment will fail.
</Warning>
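
Because a duplicate datacenter only surfaces at deploy time, it can be useful to check a volume list locally first. The following is a minimal sketch under stated assumptions: `VolumeSpec` is a hypothetical stand-in for `NetworkVolume` that carries only a name and a datacenter string, and `find_datacenter_conflicts` is not part of the Flash SDK.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class VolumeSpec:
    """Hypothetical stand-in for NetworkVolume with only the fields checked here."""
    name: str
    datacenter: str


def find_datacenter_conflicts(volumes: Iterable[VolumeSpec]) -> Dict[str, List[str]]:
    """Map each datacenter holding more than one volume to the volume names."""
    by_datacenter: Dict[str, List[str]] = defaultdict(list)
    for volume in volumes:
        by_datacenter[volume.datacenter].append(volume.name)
    return {dc: names for dc, names in by_datacenter.items() if len(names) > 1}


volumes = [
    VolumeSpec(name="models-us", datacenter="US-GA-2"),
    VolumeSpec(name="models-eu", datacenter="EU-RO-1"),
    VolumeSpec(name="cache-us", datacenter="US-GA-2"),  # second volume in US-GA-2
]
print(find_datacenter_conflicts(volumes))
# {'US-GA-2': ['models-us', 'cache-us']}
```

An empty result means every volume sits in its own datacenter, which is the layout the warning above requires.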

**Use cases**:

* Share large models across workers
* Persist data between runs
* Share datasets across endpoints

See [Storage](/flash/configuration/storage) for setup instructions.

### datacenter

**Type**: `DataCenter`, `list[DataCenter]`, `str`, `list[str]`, or `None`
**Default**: `None` (all available datacenters)

Specifies the datacenter(s) for worker deployment. When set to `None`, the endpoint is available in all datacenters.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuGroup, DataCenter

# Deploy to all available datacenters (default)
@Endpoint(name="global", gpu=GpuGroup.ANY)
async def process(data): ...

# Deploy to a single datacenter
@Endpoint(
    name="us-workers",
    gpu=GpuGroup.ANY,
    datacenter=DataCenter.US_GA_2
)
async def process(data): ...

# Deploy to multiple datacenters
@Endpoint(
    name="multi-region",
    gpu=GpuGroup.ANY,
    datacenter=[DataCenter.US_GA_2, DataCenter.EU_RO_1]
)
async def process(data): ...

# String DC IDs also work
@Endpoint(
    name="us-workers",
    gpu=GpuGroup.ANY,
    datacenter="US-GA-2"
)
async def process(data): ...
```

**Available datacenters**:

| Value                 | Location                |
| --------------------- | ----------------------- |
| `DataCenter.US_CA_2`  | US - California         |
| `DataCenter.US_GA_2`  | US - Georgia            |
| `DataCenter.US_IL_1`  | US - Illinois           |
| `DataCenter.US_KS_2`  | US - Kansas             |
| `DataCenter.US_MD_1`  | US - Maryland           |
| `DataCenter.US_MO_1`  | US - Missouri           |
| `DataCenter.US_MO_2`  | US - Missouri           |
| `DataCenter.US_NC_1`  | US - North Carolina     |
| `DataCenter.US_NC_2`  | US - North Carolina     |
| `DataCenter.US_NE_1`  | US - Nebraska           |
| `DataCenter.US_WA_1`  | US - Washington         |
| `DataCenter.EU_CZ_1`  | Europe - Czech Republic |
| `DataCenter.EU_RO_1`  | Europe - Romania        |
| `DataCenter.EUR_IS_1` | Europe - Iceland        |
| `DataCenter.EUR_NO_1` | Europe - Norway         |

<Note>
  CPU endpoints are restricted to `CPU_DATACENTERS`, which currently only includes `EU_RO_1`.
</Note>

### env

**Type**: `dict[str, str]`
**Default**: `None`

Environment variables passed to all workers. Useful for API keys, configuration, and feature flags.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
@Endpoint(
    name="ml-worker",
    gpu=GpuGroup.ANY,
    env={
        "HF_TOKEN": "your_huggingface_token",
        "MODEL_ID": "gpt2",
        "LOG_LEVEL": "INFO"
    }
)
async def load_model():
    import os
    token = os.getenv("HF_TOKEN")
    model_id = os.getenv("MODEL_ID")
    ...
```

<Warning>
  Values in your project's `.env` file are only available locally for CLI commands and development. They are **not** passed to deployed endpoints. You must declare environment variables explicitly using the `env` parameter.
</Warning>

To pass a local environment variable to your deployed endpoint, read it from `os.environ`:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
import os

from runpod_flash import Endpoint, GpuGroup

@Endpoint(
    name="ml-worker",
    gpu=GpuGroup.ANY,
    env={"HF_TOKEN": os.environ["HF_TOKEN"]}  # Read from local env, pass to workers
)
async def load_model():
    ...
```
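
Note that `os.environ["HF_TOKEN"]` raises a bare `KeyError` when the variable is not set locally. One way to fail with a clearer message is a small helper like the sketch below. `collect_env` is a hypothetical name, not part of the Flash SDK, and the optional `source` argument exists only to make the helper easy to test.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def collect_env(
    names: Iterable[str],
    source: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build an env dict from local variables, failing early on missing ones.

    Hypothetical helper for illustration; `source` defaults to os.environ.
    """
    names = list(names)  # allow generators; we iterate twice below
    source = os.environ if source is None else source
    missing = [name for name in names if name not in source]
    if missing:
        raise RuntimeError(
            "missing local environment variables: " + ", ".join(missing)
        )
    return {name: source[name] for name in names}


# Example with an explicit mapping instead of the real environment
print(collect_env(["HF_TOKEN"], {"HF_TOKEN": "secret", "OTHER": "x"}))
# {'HF_TOKEN': 'secret'}
```

With this in place, the `env` argument could be written as `env=collect_env(["HF_TOKEN"])`.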

<Note>
  Environment variables are excluded from configuration hashing. Changing environment values won't trigger endpoint recreation, making it easy to rotate API keys.
</Note>

### gpu\_count

**Type**: `int`
**Default**: `1`

Number of GPUs per worker. Use for multi-GPU workloads.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
@Endpoint(
    name="multi-gpu-training",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    gpu_count=4,  # Each worker gets 4 GPUs
    workers=2     # Maximum 2 workers = 8 GPUs total
)
async def train(data): ...
```

### execution\_timeout\_ms

**Type**: `int`
**Default**: `0` (no limit)

Maximum execution time for a single job in milliseconds. Jobs exceeding this timeout are terminated.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
# 5 minute timeout
@Endpoint(
    name="training",
    gpu=GpuGroup.ANY,
    execution_timeout_ms=300000  # 5 * 60 * 1000
)
async def train(data): ...

# 30 second timeout for quick inference
@Endpoint(
    name="quick-inference",
    gpu=GpuGroup.ANY,
    execution_timeout_ms=30000
)
async def infer(data): ...
```

<Note>
  The Flash SDK's `runsync()` method uses your `execution_timeout_ms` value as the client-side HTTP timeout. If it is set to a positive value, the SDK waits up to that duration for the job to complete. If it is unset or set to `0`, the SDK falls back to a 60-second timeout. For long-running inference jobs, set `execution_timeout_ms` to at least the expected job duration so that `runsync()` does not time out prematurely.
</Note>
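
The rule in the note can be sketched in a few lines. Both helpers below are hypothetical and are not part of the Flash SDK. They only mirror the behavior described above: a positive `execution_timeout_ms` becomes the client-side wait, and anything else falls back to 60 seconds.

```python
DEFAULT_CLIENT_TIMEOUT_S = 60.0  # fallback described in the note above


def minutes_to_ms(minutes: float) -> int:
    """Convert minutes to the millisecond value execution_timeout_ms expects."""
    return int(minutes * 60 * 1000)


def client_timeout_seconds(execution_timeout_ms: int = 0) -> float:
    """Return how long a runsync()-style client would wait, in seconds."""
    if execution_timeout_ms > 0:
        return execution_timeout_ms / 1000
    return DEFAULT_CLIENT_TIMEOUT_S


print(minutes_to_ms(5))                         # 300000
print(client_timeout_seconds(minutes_to_ms(5))) # 300.0
print(client_timeout_seconds(0))                # 60.0
```

The last line shows the pitfall: leaving the timeout at `0` removes the server-side limit but still leaves a 60-second client-side wait.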

### flashboot

**Type**: `bool`
**Default**: `True`

Enables Flashboot for faster cold starts by pre-loading container images.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
@Endpoint(
    name="fast-startup",
    gpu=GpuGroup.ANY,
    flashboot=True  # Default
)
async def process(data): ...
```

Set to `False` for debugging or compatibility reasons.

### image

**Type**: `str`
**Default**: `None`

Custom Docker image to deploy. When specified, the endpoint runs your Docker image instead of Flash's managed workers.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuType

vllm = Endpoint(
    name="vllm-server",
    image="runpod/worker-vllm:stable-cuda12.1.0",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    env={"MODEL_NAME": "meta-llama/Llama-3.2-3B-Instruct"}
)

# Make HTTP calls to the deployed image
result = await vllm.post("/v1/completions", {"prompt": "Hello"})
```

See [Custom Docker images](/flash/custom-docker-images) for complete documentation.

### scaler\_type

**Type**: `ServerlessScalerType`
**Default**: Auto-selected based on endpoint type

Scaling algorithm strategy. Defaults are automatically set:

* Queue-based: `QUEUE_DELAY` (scales based on queue depth)
* Load-balanced: `REQUEST_COUNT` (scales based on active requests)

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuGroup, ServerlessScalerType

@Endpoint(
    name="custom-scaler",
    gpu=GpuGroup.ANY,
    scaler_type=ServerlessScalerType.QUEUE_DELAY
)
async def process(data): ...
```

### scaler\_value

**Type**: `int`
**Default**: `4`

Parameter value for the scaling algorithm. With `QUEUE_DELAY`, represents target jobs per worker before scaling up.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
# Scale up when > 2 jobs per worker (more aggressive)
@Endpoint(
    name="responsive",
    gpu=GpuGroup.ANY,
    scaler_value=2
)
async def process(data): ...
```

### template

**Type**: `PodTemplate`
**Default**: `None`

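Advanced pod configuration overrides, such as container disk size and pod-level environment variables. The available fields are listed in the PodTemplate section below.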

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuGroup, PodTemplate

@Endpoint(
    name="custom-pod",
    gpu=GpuGroup.ANY,
    template=PodTemplate(
        containerDiskInGb=100,
        env=[{"key": "PYTHONPATH", "value": "/workspace"}]
    )
)
async def process(data): ...
```

#### PodTemplate

`PodTemplate` provides advanced pod configuration options:

| Parameter           | Type         | Description                                                       | Default |
| ------------------- | ------------ | ----------------------------------------------------------------- | ------- |
| `containerDiskInGb` | `int`        | Container disk size in GB                                         | 64      |
| `env`               | `list[dict]` | Environment variables as list of `{"key": "...", "value": "..."}` | `None`  |

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import PodTemplate

template = PodTemplate(
    containerDiskInGb=100,
    env=[
        {"key": "PYTHONPATH", "value": "/workspace"},
        {"key": "CUDA_VISIBLE_DEVICES", "value": "0"}
    ]
)
```

<Tip>
  For simple environment variables, use the `env` parameter on `Endpoint` instead of `PodTemplate.env`.
</Tip>
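
`PodTemplate.env` takes a list of key/value records, while `Endpoint`'s `env` takes a plain dict. If the variables already live in a dict, a conversion like the sketch below produces the list form. `env_dict_to_template_env` is a hypothetical helper, not part of the Flash SDK.

```python
from typing import Dict, List


def env_dict_to_template_env(env: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert {"NAME": "value"} into [{"key": "NAME", "value": "value"}, ...]."""
    return [{"key": key, "value": value} for key, value in env.items()]


print(env_dict_to_template_env({"PYTHONPATH": "/workspace"}))
# [{'key': 'PYTHONPATH', 'value': '/workspace'}]
```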

### min\_cuda\_version

**Type**: `str` or `CudaVersion`
**Default**: `"12.8"` for GPU endpoints, `None` for CPU endpoints

Specifies the minimum CUDA driver version required on the host machine. GPU endpoints default to `"12.8"` to ensure workers run on hosts with recent CUDA drivers.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuType, CudaVersion

# Use the default (12.8)
@Endpoint(name="ml-inference", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
async def infer(data): ...

# Override with string value
@Endpoint(
    name="legacy-compatible",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    min_cuda_version="12.4"
)
async def infer_legacy(data): ...

# Override with CudaVersion enum
@Endpoint(
    name="cuda-12",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    min_cuda_version=CudaVersion.V12_0
)
async def infer_cuda12(data): ...
```

This parameter has no effect on CPU endpoints.

<Note>
  Valid CUDA versions: `CudaVersion.V11_1`, `V11_4`, `V11_7`, `V11_8`, `V12_0`, `V12_1`, `V12_2`, `V12_3`, `V12_4`, `V12_6`, `V12_8` (or equivalent strings like `"12.4"`). Invalid values raise a `ValueError`.
</Note>
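
To make the meaning of "minimum" concrete, the sketch below checks whether a host's CUDA version satisfies a requested minimum. It is an illustration only: `host_meets_minimum` is a hypothetical helper, and how Runpod actually matches hosts is not described on this page. The list of valid versions is copied from the note above.

```python
from typing import Tuple

VALID_CUDA_VERSIONS = (
    "11.1", "11.4", "11.7", "11.8",
    "12.0", "12.1", "12.2", "12.3", "12.4", "12.6", "12.8",
)


def _as_tuple(version: str) -> Tuple[int, int]:
    """Parse "12.4" into (12, 4) so versions compare numerically, not as text."""
    major, minor = version.split(".")
    return int(major), int(minor)


def host_meets_minimum(host_version: str, min_cuda_version: str = "12.8") -> bool:
    """Return True if the host version is at least the requested minimum."""
    if min_cuda_version not in VALID_CUDA_VERSIONS:
        raise ValueError(f"unsupported min_cuda_version: {min_cuda_version!r}")
    return _as_tuple(host_version) >= _as_tuple(min_cuda_version)


print(host_meets_minimum("12.8", "12.4"))  # True
print(host_meets_minimum("12.1", "12.4"))  # False
print(host_meets_minimum("12.4"))          # False, default minimum is 12.8
```

Lowering `min_cuda_version` widens the pool of eligible hosts; the last call shows a 12.4 host being excluded under the default.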

### python\_version

**Type**: `str`
**Default**: Local Python version

Sets the Python version for the worker image. Supported values: `"3.10"`, `"3.11"`, `"3.12"`, and `"3.13"`.

When you don't specify a Python version, Flash matches your local interpreter (the Python version you run Flash from). The resolution order is:

1. `--python-version` CLI flag (highest priority)
2. `python_version` declared on resource configs
3. Your local Python version (`sys.version_info`)

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuGroup

# Explicitly set Python 3.11 for this endpoint
@Endpoint(
    name="legacy-model",
    gpu=GpuGroup.ANY,
    python_version="3.11"
)
async def process(data): ...

# Uses your local Python version (e.g., 3.12 if that's what you're running)
@Endpoint(name="modern-model", gpu=GpuGroup.ANY)
async def infer(data): ...
```

All resources in a Flash app must use the same Python version because Flash ships a single tarball for the entire app. If resources declare conflicting versions, the build fails.

<Warning>
  **Breaking change:** Flash now matches your local Python version by default instead of always defaulting to Python 3.12. If your local Python differs from 3.12, your first deploy after upgrading Flash will trigger a rolling release. For consistent behavior across team members, declare `python_version` explicitly or use the `--python-version` CLI flag.
</Warning>

<Warning>
  Python 3.10, 3.11, and 3.13 workers incur approximately 7 GB of additional cold-start overhead on GPU endpoints because the alternative Python interpreter must be installed alongside the base image's PyTorch environment.
</Warning>

<Note>
  Python 3.10 reaches end-of-life on 2026-10-31. Consider migrating to Python 3.11 or later.
</Note>

If your local Python version is not supported (for example, 3.9 or 3.14), the build fails with an actionable error message listing the supported versions.

The `--python-version` CLI flag on `flash build` and `flash deploy` overrides both per-resource declarations and local interpreter detection.
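
The resolution order and failure cases described in this section can be summarized in a short sketch. `resolve_python_version` is a hypothetical function, not Flash's implementation. In particular, it assumes that a CLI flag also bypasses the conflict check between resources, which this page does not state explicitly.

```python
import sys
from typing import Iterable, Optional

SUPPORTED_PYTHON_VERSIONS = ("3.10", "3.11", "3.12", "3.13")


def resolve_python_version(
    cli_flag: Optional[str] = None,
    declared: Iterable[Optional[str]] = (),
    local: Optional[str] = None,
) -> str:
    """Pick a version: CLI flag, then resource declarations, then local Python."""
    if local is None:
        local = f"{sys.version_info.major}.{sys.version_info.minor}"
    # Ignore resources that declare nothing; duplicates collapse in the set.
    declared_versions = sorted({v for v in declared if v is not None})
    if cli_flag is not None:
        chosen = cli_flag  # assumption: the flag wins even over conflicts
    elif len(declared_versions) > 1:
        raise ValueError(f"conflicting python_version values: {declared_versions}")
    elif declared_versions:
        chosen = declared_versions[0]
    else:
        chosen = local
    if chosen not in SUPPORTED_PYTHON_VERSIONS:
        supported = ", ".join(SUPPORTED_PYTHON_VERSIONS)
        raise ValueError(f"Python {chosen} is not supported; use one of {supported}")
    return chosen


print(resolve_python_version(cli_flag="3.11", declared=["3.12"], local="3.13"))  # 3.11
print(resolve_python_version(declared=["3.12", None], local="3.10"))             # 3.12
print(resolve_python_version(local="3.13"))                                      # 3.13
```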

## EndpointJob

When using `Endpoint(id=...)` or `Endpoint(image=...)`, the `.run()` method returns an `EndpointJob` object for async operations:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
ep = Endpoint(id="abc123")

# Submit a job
job = await ep.run({"prompt": "hello"})

# Check status
status = await job.status()  # "IN_PROGRESS", "COMPLETED", etc.

# Wait for completion
await job.wait(timeout=60)  # Optional timeout in seconds

# Access results
print(job.id)      # Job ID
print(job.output)  # Result payload
print(job.error)   # Error message if failed
print(job.done)    # True if completed/failed

# Cancel a job
await job.cancel()
```

## Configuration change behavior

When you change configuration and redeploy, Flash automatically updates your endpoint.

### Changes that recreate workers

These changes restart all workers:

* GPU configuration (`gpu`, `gpu_count`)
* CPU instance type (`cpu`)
* Docker image (`image`)
* Storage (`volume`)
* Datacenter (`datacenter`)
* Flashboot setting (`flashboot`)
* CUDA version requirement (`min_cuda_version`)
* Python version (`python_version`)

Workers are temporarily unavailable during recreation (typically 30-90 seconds).

### Changes that update settings only

These changes apply immediately with no downtime:

* Worker scaling (`workers`)
* Timeouts (`idle_timeout`, `execution_timeout_ms`)
* Scaler settings (`scaler_type`, `scaler_value`)
* Environment variables (`env`)
* Endpoint name (`name`)

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
# First deployment
@Endpoint(
    name="inference-api",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=5,
    env={"MODEL": "v1"}
)
async def infer(data): ...

# Update scaling - no worker recreation
@Endpoint(
    name="inference-api",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,  # Same GPU
    workers=10,                          # Changed - updates settings only
    env={"MODEL": "v2"}                  # Changed - updates settings only
)
async def infer(data): ...

# Change GPU type - workers recreated
@Endpoint(
    name="inference-api",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,  # Changed - triggers recreation
    workers=10,
    env={"MODEL": "v2"}
)
async def infer(data): ...
```
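
The two lists above can be turned into a quick local check of what a redeploy would do. This is a minimal sketch, assuming configurations are compared as plain dicts keyed by parameter name; `classify_changes` is a hypothetical helper, not part of the Flash SDK. Parameters that appear in neither list are reported separately because this page does not say how they behave.

```python
from typing import Any, Dict, List

# Parameter groups copied from the two lists above
RECREATE_KEYS = frozenset({
    "gpu", "gpu_count", "cpu", "image", "volume",
    "datacenter", "flashboot", "min_cuda_version", "python_version",
})
SETTINGS_ONLY_KEYS = frozenset({
    "workers", "idle_timeout", "execution_timeout_ms",
    "scaler_type", "scaler_value", "env", "name",
})


def classify_changes(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group the parameters that differ between two configs by their effect."""
    changed = sorted(k for k in set(old) | set(new) if old.get(k) != new.get(k))
    known = RECREATE_KEYS | SETTINGS_ONLY_KEYS
    return {
        "recreates_workers": [k for k in changed if k in RECREATE_KEYS],
        "settings_only": [k for k in changed if k in SETTINGS_ONLY_KEYS],
        "unlisted": [k for k in changed if k not in known],
    }


v1 = {"name": "inference-api", "gpu": "NVIDIA_A100_80GB_PCIe", "workers": 5, "env": {"MODEL": "v1"}}
v2 = {**v1, "workers": 10, "env": {"MODEL": "v2"}}
v3 = {**v2, "gpu": "NVIDIA_GEFORCE_RTX_4090"}

print(classify_changes(v1, v2))
# {'recreates_workers': [], 'settings_only': ['env', 'workers'], 'unlisted': []}
print(classify_changes(v2, v3))
# {'recreates_workers': ['gpu'], 'settings_only': [], 'unlisted': []}
```

A non-empty `recreates_workers` list signals that workers will restart on the next deploy.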
