# Execution model

> Understand how Flash executes your code on Runpod's infrastructure.

export const MachineTooltip = () => {
  return <Tooltip headline="Machine" tip="The physical server hardware within a data center that hosts your compute resources.">machine</Tooltip>;
};

Flash runs your Python functions on remote GPU/CPU workers while you maintain local control flow. This page explains what happens when you call an `@Endpoint` function.

## What runs where

The `@Endpoint` decorator marks functions for remote execution. Everything else runs locally.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
import asyncio
from runpod_flash import Endpoint, GpuType

@Endpoint(name="demo", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090)
def process_on_gpu(data):
    # This runs on a Runpod worker
    import torch
    return {"result": "processed"}

async def main():
    # This runs on your machine
    result = await process_on_gpu({"input": "data"})
    print(result)  # This runs on your machine

if __name__ == "__main__":
    asyncio.run(main())  # This runs on your machine
```

| Code                    | Location                      |
| ----------------------- | ----------------------------- |
| `@Endpoint` decorator   | Your machine (marks function) |
| Inside `process_on_gpu` | Runpod worker                 |
| Everything else         | Your machine                  |

### Flash apps

When you build a [Flash app](/flash/apps/overview):

**Development (`flash dev`)**:

* FastAPI server runs **locally**.
* `@Endpoint` functions run on **Runpod workers**.

**Production (`flash deploy`)**:

* Each endpoint configuration becomes a **separate Serverless endpoint**.
* All endpoints run on **Runpod**.

## Execution flow

Here's what happens when you call an `@Endpoint` function:

```mermaid theme={"theme":{"light":"github-light","dark":"github-dark"}}
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%

sequenceDiagram
    participant Local as Your Machine
    participant Flash as Flash SDK
    participant Runpod as Runpod API
    participant Worker as Remote Worker

    Local->>Flash: Call remote function
    Flash->>Flash: Look up endpoint by name
    Flash->>Runpod: Check for existing endpoint

    alt Endpoint exists
        Runpod-->>Flash: Return endpoint ID
    else New endpoint needed
        Flash->>Runpod: Create endpoint
        Runpod-->>Flash: Return endpoint ID
    end

    Flash->>Flash: Serialize function + args
    Flash->>Runpod: Submit job
    Runpod->>Worker: Route to worker

    Worker->>Worker: Execute function
    Worker->>Runpod: Return result

    Runpod-->>Flash: Return result
    Flash-->>Local: Return Python object
```
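
The serialize, submit, execute, and return steps in the diagram can be sketched locally. This is a minimal illustration, not Flash's actual wire format or API: `serialize_call`, `fake_worker`, and the JSON payload layout are all hypothetical, and the "worker" here is just a local function.

```python
import json

# Hypothetical function source that the client wants to run remotely.
FUNCTION_SOURCE = """
def add(a, b):
    return a + b
"""

def serialize_call(source: str, name: str, args: list) -> str:
    """Bundle function source and arguments into one JSON job payload (assumed layout)."""
    return json.dumps({"source": source, "name": name, "args": args})

def fake_worker(job: str) -> str:
    """Local stand-in for a remote worker: rebuild the function, run it, serialize the result."""
    payload = json.loads(job)
    namespace = {}
    exec(payload["source"], namespace)  # define the function in an isolated namespace
    result = namespace[payload["name"]](*payload["args"])
    return json.dumps({"output": result})

job = serialize_call(FUNCTION_SOURCE, "add", [2, 3])  # "Serialize function + args"
response = fake_worker(job)                           # "Submit job" and "Execute function"
result = json.loads(response)["output"]               # "Return Python object"
print(result)
```

The point of the sketch is that only the function body and its arguments cross the boundary. The caller gets back an ordinary Python object and continues its local control flow.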

## Endpoint naming

Flash identifies endpoints by their `name` parameter:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuType

@Endpoint(
    name="inference",  # This identifies the endpoint
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=3
)
def run_inference(data): ...
```

* **Same name, same config**: Reuses the existing endpoint.
* **Same name, different config**: Updates the endpoint automatically.
* **New name**: Creates a new endpoint.

This means you can change parameters like `workers` without creating a new endpoint—Flash detects the change and updates it.
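
The three naming rules amount to a lookup keyed on `name`. The sketch below models that decision with plain Python. `EndpointConfig` and `resolve_endpoint` are hypothetical names used for illustration only, not part of the Flash SDK, and the real comparison may consider more settings than the two shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EndpointConfig:
    """Hypothetical stand-in for the settings passed to @Endpoint."""
    gpu: str
    workers: int

def resolve_endpoint(existing: dict, name: str, config: EndpointConfig) -> str:
    """Return the action taken for `name`: 'reuse', 'update', or 'create'."""
    current = existing.get(name)
    if current is None:
        action = "create"   # new name: create a new endpoint
    elif current == config:
        action = "reuse"    # same name, same config: reuse as-is
    else:
        action = "update"   # same name, different config: update in place
    existing[name] = config  # record the config now associated with this name
    return action

deployed = {"inference": EndpointConfig(gpu="A100", workers=3)}
print(resolve_endpoint(deployed, "inference", EndpointConfig("A100", 3)))
print(resolve_endpoint(deployed, "inference", EndpointConfig("A100", 5)))
print(resolve_endpoint(deployed, "training", EndpointConfig("A100", 1)))
```

Because the name is the key, renaming an endpoint in your code creates a new endpoint rather than updating the old one.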

## Worker lifecycle

Workers scale up and down based on demand and your configuration.

### Worker states

| State            | Description                                                                   | Billing                |
| ---------------- | ----------------------------------------------------------------------------- | ---------------------- |
| **Initializing** | Downloading image, loading code                                               | Yes                    |
| **Idle**         | Scaled down, waiting for requests                                             | No                     |
| **Running**      | Processing requests                                                           | Yes                    |
| **Throttled**    | Temporarily unable to run due to host <MachineTooltip /> resource constraints | No                     |
| **Outdated**     | Marked for replacement after update                                           | Yes (while processing) |
| **Unhealthy**    | Crashed; auto-retries for up to 7 days                                        | No                     |

### Scaling behavior

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuGroup

@Endpoint(
    name="demo",
    gpu=GpuGroup.ANY,
    workers=(0, 5),   # (min, max) - Scale to zero when idle, up to 5 workers
    idle_timeout=60   # Seconds before running workers scale down
)
def process(data): ...
```

**Example**:

1. First job arrives → Scale to 1 worker (cold start).
2. More jobs arrive while worker busy → Scale up to max workers.
3. Jobs complete → Workers stay running for `idle_timeout` seconds before scaling down to idle.
4. No new jobs → Scale down to min workers.
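
As a rough mental model, the worker count in this example is the number of active jobs clamped to the `(min, max)` range. The sketch below assumes one job per worker and ignores the `idle_timeout` delay before scale-down. `target_workers` is a hypothetical helper, not Runpod's autoscaling algorithm.

```python
def target_workers(active_jobs: int, min_workers: int, max_workers: int) -> int:
    """Clamp the worker count to the (min, max) range from `workers=(min, max)`."""
    return max(min_workers, min(active_jobs, max_workers))

# workers=(0, 5): walk through the example as the number of active jobs changes.
for jobs in [1, 8, 2, 0]:
    print(jobs, "jobs ->", target_workers(jobs, 0, 5), "workers")
```

With `workers=(0, 5)`, eight simultaneous jobs are capped at five workers under this model, and the remaining jobs wait for a worker to become free.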

## Cold starts and warm starts

Understanding cold and warm starts helps you predict latency and set expectations.

### Cold start

A cold start occurs when no workers are available to handle your job. This happens when:

* You're calling an endpoint for the first time.
* All workers have been scaled down after not processing requests for `idle_timeout` seconds.
* All running workers are busy processing requests.

**What happens during a cold start**:

1. Runpod provisions a new worker with your configured GPU/CPU.
2. The worker image starts (dependencies are pre-installed during build).
3. Your function executes.

**Typical timing**: 10-60 seconds total, depending on GPU availability and image size.

<Note>
  When using `flash build` or `flash deploy`, dependencies are pre-installed in the worker image, eliminating pip installation at request time. When running standalone scripts with `@Endpoint` functions outside of a Flash app, dependencies may be installed on the worker at request time.
</Note>

### Warm start

A warm start occurs when a worker is still running but not currently processing a job:

* Worker completed a previous job and is waiting for more work.
* Worker is within its `idle_timeout` period.

**What happens during a warm start**:

1. Job is routed immediately to the waiting worker.
2. Your function executes.

**Typical timing**: \~1 second + your function's execution time.

### The relationship between configuration and starts

Your `workers` and `idle_timeout` settings directly affect cold start frequency:

* `workers=(0, n)`: Workers scale to zero when not processing. The first request that arrives after workers have scaled down (once `idle_timeout` elapses with no traffic) triggers a cold start.
* `workers=(1, n)`: At least one worker stays ready. The first concurrent request gets a warm start; additional concurrent requests may trigger cold starts.
* Higher `idle_timeout`: Workers stay running longer before scaling down, reducing cold starts for sporadic traffic.
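
To see how these settings interact, the sketch below labels each request in a sequence as cold or warm. It is a simplified model under stated assumptions: a single worker, a fixed job duration, requests that do not overlap, and cold-start latency ignored. `classify_starts` is a hypothetical helper, not a Flash API.

```python
def classify_starts(arrival_times, job_seconds, idle_timeout, min_workers=0):
    """Label each request 'cold' or 'warm' for a single-worker endpoint (simplified model)."""
    labels = []
    # With min_workers >= 1 a worker is always ready; otherwise none is ready at first.
    warm_until = float("inf") if min_workers >= 1 else float("-inf")
    for arrival in arrival_times:
        labels.append("warm" if arrival <= warm_until else "cold")
        if min_workers == 0:
            # The worker stays running for idle_timeout seconds after the job finishes.
            warm_until = arrival + job_seconds + idle_timeout
    return labels

arrivals = [0, 30, 200]  # request arrival times in seconds
print(classify_starts(arrivals, job_seconds=10, idle_timeout=60))
print(classify_starts(arrivals, job_seconds=10, idle_timeout=60, min_workers=1))
```

In the first call, the request at 200 seconds arrives after the worker has scaled down (10 seconds of work plus 60 seconds of `idle_timeout` after the request at 30 seconds), so it is cold. Raising `idle_timeout` above 160 or setting a minimum of one worker would make it warm in this model.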

See [configuration best practices](/flash/configuration/best-practices) for specific recommendations based on your workload.
