Flash runs your Python functions on remote GPU/CPU workers while you maintain local control flow. This page explains what happens when you call an @Endpoint function.

What runs where

The @Endpoint decorator marks functions for remote execution. Everything else runs locally.
import asyncio
from runpod_flash import Endpoint, GpuType

@Endpoint(name="demo", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090)
def process_on_gpu(data):
    # This runs on a Runpod worker
    import torch
    return {"result": "processed"}

async def main():
    # This runs on your machine
    result = await process_on_gpu({"input": "data"})
    print(result)  # This runs on your machine

if __name__ == "__main__":
    asyncio.run(main())  # This runs on your machine
Code | Location
@Endpoint decorator | Your machine (marks the function)
Inside process_on_gpu | Runpod worker
Everything else | Your machine

Flash apps

When you build a Flash app:

Development (flash run):
  • The FastAPI server runs locally.
  • @Endpoint functions run on Runpod workers.

Production (flash deploy):
  • Each endpoint configuration becomes a separate Serverless endpoint.
  • All endpoints run on Runpod.

Execution flow

Here’s what happens when you call an @Endpoint function:
  1. Flash looks up the endpoint by its name, creating or updating it if needed.
  2. The job is routed to an idle worker (warm start), or a new worker is provisioned (cold start).
  3. Your function executes on the worker.
  4. The result is returned to your local code.

Endpoint naming

Flash identifies endpoints by their name parameter:
@Endpoint(
    name="inference",  # This identifies the endpoint
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=3
)
def run_inference(data): ...
  • Same name, same config: Reuses the existing endpoint.
  • Same name, different config: Updates the endpoint automatically.
  • New name: Creates a new endpoint.
This means you can change parameters like workers without creating a new endpoint—Flash detects the change and updates it.
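The name-matching rules above can be sketched as a small reconciliation function. This is an illustration only: the `existing` registry and the "reuse"/"update"/"create" labels are assumptions for the example, not Flash internals.

```python
def reconcile(existing: dict, name: str, config: dict) -> str:
    """Decide what name-based matching would do for a new endpoint config.

    `existing` maps endpoint names to their current configs (illustrative).
    Returns "create", "reuse", or "update".
    """
    if name not in existing:
        return "create"          # new name: provision a new endpoint
    if existing[name] == config:
        return "reuse"           # same name, same config: nothing to do
    existing[name] = config      # same name, new config: apply the change in place
    return "update"
```

For example, calling `reconcile` again with the same name but `workers` changed from 3 to 5 returns "update", mirroring how Flash detects a config change without creating a new endpoint.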

Worker lifecycle

Workers scale up and down based on demand and your configuration.

Worker states

  • Initializing: The worker is starting up and downloading dependencies.
  • Idle: The worker is ready but not processing requests.
  • Running: The worker is actively processing requests.
  • Throttled: The worker is temporarily unable to run due to host resource constraints.
  • Outdated: The system marks the worker for replacement after endpoint updates. It continues processing current jobs during rolling updates (10% of max workers at a time).
  • Unhealthy: The worker has crashed due to Docker image issues, incorrect start commands, or machine problems. The system automatically retries with exponential backoff for up to 7 days.
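The unhealthy-worker retry policy can be illustrated with a simple exponential backoff schedule. Only the 7-day window comes from the behavior described above; the base delay, growth factor, and per-attempt cap below are assumptions for the sketch, not Runpod's actual parameters.

```python
SEVEN_DAYS = 7 * 24 * 3600  # retry window for unhealthy workers, in seconds

def backoff_schedule(base: float = 30.0, factor: float = 2.0, cap: float = 3600.0):
    """Yield retry delays that double each attempt (capped per attempt),
    stopping once the cumulative wait would exceed the 7-day window."""
    delay, total = base, 0.0
    while total + delay <= SEVEN_DAYS:
        yield delay
        total += delay
        delay = min(delay * factor, cap)
```

With these (assumed) parameters, retries start 30 seconds apart, double until they plateau at one hour, and stop within seven days.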

Scaling behavior

@Endpoint(
    name="demo",
    gpu=GpuGroup.ANY,
    workers=(0, 5),   # (min, max) - Scale to zero when idle, up to 5 workers
    idle_timeout=60   # Seconds before idle workers scale down
)
def process(data): ...
Example:
  1. First job arrives → Scale to 1 worker (cold start).
  2. More jobs arrive while worker busy → Scale up to max workers.
  3. Jobs complete → Workers stay idle for idle_timeout.
  4. No new jobs → Scale down to min workers.
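The four steps above can be sketched as two small decision functions. This is a simplification of the described behavior, not Runpod's actual autoscaler; the function names and parameters are assumptions for the example.

```python
def desired_workers(in_flight: int, queued: int,
                    min_workers: int, max_workers: int) -> int:
    """Target worker count: one worker per in-flight or queued job,
    clamped to the configured (min, max) range."""
    demand = in_flight + queued
    return max(min_workers, min(demand, max_workers))

def should_scale_down(idle_seconds: float, idle_timeout: float,
                      current: int, min_workers: int) -> bool:
    """Idle workers are only released after idle_timeout elapses,
    and never below the configured minimum."""
    return current > min_workers and idle_seconds >= idle_timeout
```

For `workers=(0, 5)`, a first job scales the pool to 1, a burst of jobs scales it to 5, and after 60 idle seconds it drains back to 0.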

Cold starts and warm starts

Understanding cold and warm starts helps you predict latency and set expectations.

Cold start

A cold start occurs when no workers are available to handle your job:
  • You’re calling an endpoint for the first time.
  • All workers scaled down after being idle beyond idle_timeout.
  • All active workers are busy and a new one must spin up.
What happens during a cold start:
  1. Runpod provisions a new worker with your configured GPU/CPU.
  2. The worker image starts (dependencies are pre-installed during build).
  3. Your function executes.
Typical timing: 10-60 seconds total, depending on GPU availability and image size.
When using flash build or flash deploy, dependencies are pre-installed in the worker image, eliminating pip installation at request time. When running standalone scripts with @Endpoint functions outside of a Flash app, dependencies may be installed on the worker at request time.
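Because a cold start adds tens of seconds while a warm start adds roughly one, you can estimate which path a call took from wall-clock latency. The helper below is illustrative and not part of Flash; the 5-second threshold is an assumption you would tune to your image size and function runtime.

```python
import asyncio
import time

async def timed_call(endpoint, payload, cold_threshold: float = 5.0):
    """Await an endpoint call and classify its latency.

    Returns (result, elapsed_seconds, likely_cold_start), where the
    last value is a heuristic based on the assumed threshold.
    """
    start = time.monotonic()
    result = await endpoint(payload)
    elapsed = time.monotonic() - start
    return result, elapsed, elapsed >= cold_threshold
```

You could wrap `process_on_gpu` from the earlier example this way to log how often your traffic pattern triggers cold starts.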

Warm start

A warm start occurs when a worker is already running and idle:
  • Worker completed a previous job and is waiting for more work.
  • Worker is within its idle_timeout period.
What happens during a warm start:
  1. Job is routed immediately to the idle worker.
  2. Your function executes.
Typical timing: ~1 second + your function’s execution time.

The relationship between configuration and starts

Your workers and idle_timeout settings directly affect cold start frequency:
  • workers=(0, n): Workers scale to zero when idle. Every request after idle period triggers a cold start.
  • workers=(1, n): At least one worker stays ready. The first request hits a warm worker; additional concurrent requests may trigger cold starts.
  • Higher idle_timeout: Workers stay idle longer before scaling down, reducing cold starts for sporadic traffic.
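The interaction of `workers` and `idle_timeout` can be summarized in a small predicate. It mirrors the bullet points above rather than Flash internals, and its name and parameters are assumptions for the sketch.

```python
def warm_worker_available(min_workers: int, idle_timeout: float,
                          seconds_idle: float, workers_up: int) -> bool:
    """True if a request arriving now would hit a warm worker.

    Idle workers scale down to min_workers after idle_timeout seconds,
    so with min_workers == 0 a long gap between requests means every
    worker is gone and the next request cold-starts.
    """
    if workers_up == 0:
        return False              # nothing running yet: first call cold-starts
    if seconds_idle >= idle_timeout:
        return min_workers > 0    # idle workers drained; only the minimum remains
    return True                   # still within the idle window
```

With `workers=(0, 5)` and `idle_timeout=60`, a request arriving 120 seconds after the last one cold-starts; raising `min_workers` to 1 or increasing `idle_timeout` keeps that request warm.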
See configuration best practices for specific recommendations based on your workload.