# Deploy Flash apps to Runpod

> Build and deploy your Flash app for production serving.

When you're satisfied with your endpoint functions and ready to move to production, use `flash deploy` to build and deploy your Flash application:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
flash deploy

# If using uv:
uv run flash deploy
```

This command performs the following steps:

1. **Build**: Packages your code, dependencies, and manifest.
2. **Upload**: Sends the artifact to Runpod's storage.
3. **Provision**: Creates or updates Serverless endpoints.
4. **Configure**: Sets up environment variables and service discovery.

### Deployment architecture

Flash deploys your application as multiple independent Serverless endpoints. Each endpoint configuration in your worker files becomes a separate endpoint.

**How Flash deployments work:**

* **One Endpoint class = one Serverless endpoint**: Each unique endpoint configuration (defined by its `name` parameter) creates a separate Serverless endpoint with its own URL.
* **Call any endpoint**: After deployment, you can call whichever endpoint you need—`lb_worker` for API requests, `gpu_worker` for GPU tasks, `cpu_worker` for CPU tasks.
* [Load-balanced endpoints](/flash/create-endpoints#load-balanced-endpoints): Create HTTP APIs with custom routes using the `.get()`, `.post()`, and similar decorators.
* [Queue-based endpoints](/flash/create-endpoints#queue-based-endpoints): Run compute tasks using the `/runsync` or `/run` routes.
* **Inter-endpoint communication**: Endpoints can call each other's functions when needed, using the Runpod GraphQL service for discovery.

### Deploy to a specific environment

Flash organizes deployments using [apps and environments](/flash/apps/apps-and-environments). Deploy to a specific environment using the `--env` flag:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
# Deploy to staging
flash deploy --env staging

# Deploy to production
flash deploy --env production

# If using uv:
uv run flash deploy --env staging
uv run flash deploy --env production
```

If the app doesn't exist, Flash creates it along with the target environment. If only the environment doesn't exist, Flash creates it within the existing app.

### Post-deployment

After a successful deployment, Flash displays all deployed endpoints grouped by type:

```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
✓ Deployment Complete

Load-balanced endpoints:
  https://abc123xyz.api.runpod.ai  (lb_worker)
    POST   /process
    GET    /health

  Try it:
    curl -X POST https://abc123xyz.api.runpod.ai/process \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $RUNPOD_API_KEY" \
        -d '{"input_data": {"message": "Hello from Flash"}}'

Queue-based endpoints:
  https://api.runpod.ai/v2/def456xyz  (gpu_worker)
  https://api.runpod.ai/v2/ghi789xyz  (cpu_worker)

  Try it:
    curl -X POST https://api.runpod.ai/v2/def456xyz/runsync \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $RUNPOD_API_KEY" \
        -d '{"input": {"input_data": {"message": "Hello from the GPU"}}}'
```

Each endpoint is independent with its own URL and authentication.
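The `curl` commands in the output above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names (`build_queue_request`, `build_lb_request`) are hypothetical and not part of Flash, and the endpoint IDs are the placeholder values from the sample output. It shows the one structural difference between the two endpoint types: queue-based requests wrap the payload in an `"input"` object, while load-balanced requests send the payload as-is to a custom route.

```python
import json
import urllib.request


def _post_json(url: str, api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def build_queue_request(endpoint_id: str, api_key: str, payload: dict,
                        sync: bool = True) -> urllib.request.Request:
    """Hypothetical helper: target /runsync (sync) or /run (async)."""
    route = "runsync" if sync else "run"
    url = f"https://api.runpod.ai/v2/{endpoint_id}/{route}"
    # Queue-based endpoints expect the payload under an "input" key.
    return _post_json(url, api_key, {"input": payload})


def build_lb_request(endpoint_id: str, api_key: str, route: str,
                     payload: dict) -> urllib.request.Request:
    """Hypothetical helper: target a custom route on a load-balanced endpoint."""
    url = f"https://{endpoint_id}.api.runpod.ai{route}"
    # Load-balanced routes receive the payload unwrapped.
    return _post_json(url, api_key, payload)


# To send a request, pass it to urllib.request.urlopen:
#   import os
#   req = build_queue_request("def456xyz", os.environ["RUNPOD_API_KEY"],
#                             {"input_data": {"message": "Hello from the GPU"}})
#   with urllib.request.urlopen(req) as response:
#       print(json.load(response))
```

Building the request separately from sending it makes the URL and body easy to inspect before any network call is made.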

<Accordion title="Understanding endpoint architecture">
  The relationship between endpoint configurations and deployed endpoints differs between load-balanced and queue-based endpoints:

  ### Queue-based endpoints (one function per endpoint)

  For queue-based endpoints, each `@Endpoint` function must have its own unique name:

  ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
  from runpod_flash import Endpoint, GpuType

  # Each function needs its own endpoint name
  @Endpoint(
      name="run-model",
      gpu=GpuType.NVIDIA_A100_80GB_PCIe,
      dependencies=["torch"]
  )
  def run_model(input: dict): ...

  @Endpoint(
      name="preprocess",
      gpu=GpuType.NVIDIA_A100_80GB_PCIe,
      dependencies=["transformers"]
  )
  def preprocess(data: dict): ...
  ```

  This creates two separate Serverless endpoints:

  * `https://api.runpod.ai/v2/abc123xyz` (run-model)
  * `https://api.runpod.ai/v2/def456xyz` (preprocess)

  **Calling queue-based endpoints:**

  ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
  # Call run_model endpoint (synchronous):
  curl -X POST https://api.runpod.ai/v2/abc123xyz/runsync \
      -H "Authorization: Bearer $RUNPOD_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"input": {"your": "data"}}'

  # Or call asynchronously with /run:
  curl -X POST https://api.runpod.ai/v2/abc123xyz/run \
      -H "Authorization: Bearer $RUNPOD_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"input": {"your": "data"}}'
  ```

  <Warning>
    **Important:** For deployed queue-based endpoints, you must use **one function per endpoint name**. Each function creates its own Serverless endpoint. Do not create multiple `@Endpoint` functions with the same `name` when building Flash apps.
  </Warning>

  ### Load-balanced endpoints (multiple routes per endpoint)

  For load-balanced endpoints, you can define multiple HTTP routes on a single endpoint:

  ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
  from runpod_flash import Endpoint

  api = Endpoint(name="api", cpu="cpu5c-4-8", workers=(1, 5))

  # Multiple routes on a single Serverless endpoint:
  @api.post("/generate")
  def generate_text(prompt: str): ...

  @api.post("/translate")
  def translate_text(text: str): ...

  @api.get("/health")
  def health_check(): ...
  ```

  This creates:

  * **One Serverless endpoint**: `https://abc123xyz.api.runpod.ai` (named "api")
  * **Three HTTP routes**: `POST /generate`, `POST /translate`, `GET /health`

  **Calling load-balanced endpoints:**

  ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
  # Call the /generate route:
  curl -X POST https://abc123xyz.api.runpod.ai/generate \
      -H "Authorization: Bearer $RUNPOD_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"prompt": "hello"}'

  # Call the /health route (same endpoint URL):
  curl -X GET https://abc123xyz.api.runpod.ai/health \
      -H "Authorization: Bearer $RUNPOD_API_KEY"
  ```
</Accordion>

## Preview before deploying

Use the `--preview` flag to test your deployment locally with Docker before pushing to production:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
flash deploy --preview

# If using uv:
uv run flash deploy --preview
```

This command:

1. Builds your project (creates the deployment artifact and manifest).
2. Creates a Docker network for inter-container communication.
3. Starts one container per endpoint configuration (`lb_worker`, `gpu_worker`, `cpu_worker`, etc.).
4. Exposes all endpoints for local testing.

Press `Ctrl+C` to stop the preview environment.

## Managing deployment size

Runpod Serverless has a **1.5 GB deployment limit**. Flash automatically excludes packages that are pre-installed in the base image:

* `torch`, `torchvision`, `torchaudio`
* `numpy`, `triton`

If your deployment still exceeds the limit, use the `--exclude` flag to skip additional packages:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
flash deploy --exclude scipy,pandas

# If using uv:
uv run flash deploy --exclude scipy,pandas
```

### Base image packages

| Configuration type | Base image      | Auto-excluded packages                                  |
| ------------------ | --------------- | ------------------------------------------------------- |
| GPU (`gpu=`)       | PyTorch base    | `torch`, `torchvision`, `torchaudio`, `numpy`, `triton` |
| CPU (`cpu=`)       | Python slim     | `torch`, `torchvision`, `torchaudio`, `numpy`, `triton` |
| Load-balanced      | Same as GPU/CPU | Same as GPU/CPU                                         |

<Tip>
  Check the [worker-flash repository](https://github.com/runpod-workers/worker-flash) for current base images and pre-installed packages.
</Tip>

## Build process

When you run `flash deploy` (or `flash build`), Flash:

1. **Discovers** all `@Endpoint` decorated functions.
2. **Groups** functions by their endpoint name.
3. **Generates** handler files for each endpoint.
4. **Creates** a `flash_manifest.json` file for service discovery.
5. **Installs** dependencies with Linux x86\_64 compatibility.
6. **Packages** everything into `.flash/artifact.tar.gz`.
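To make the discover and group steps concrete, here is a rough sketch that scans Python source for functions decorated with `@Endpoint(name=...)` and groups them by name. It is an illustration only and not Flash's actual discovery logic: it uses the standard `ast` module, matches only the direct decorator form, and ignores route decorators such as `@api.post(...)`. The function names `group_by_endpoint_name` and `duplicate_names` are hypothetical.

```python
import ast
from collections import defaultdict

# Sample source to scan; it is parsed, never executed, so no imports are needed.
SAMPLE = '''
@Endpoint(name="run-model", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
def run_model(input: dict): ...

@Endpoint(name="preprocess", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
async def preprocess(data: dict): ...
'''


def group_by_endpoint_name(source: str) -> dict[str, list[str]]:
    """Map each @Endpoint(name=...) value to the functions decorated with it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for deco in node.decorator_list:
            # Match only the direct call form: @Endpoint(...)
            is_endpoint_call = (
                isinstance(deco, ast.Call)
                and isinstance(deco.func, ast.Name)
                and deco.func.id == "Endpoint"
            )
            if not is_endpoint_call:
                continue
            for kw in deco.keywords:
                if kw.arg == "name" and isinstance(kw.value, ast.Constant):
                    groups[kw.value.value].append(node.name)
    return dict(groups)


def duplicate_names(groups: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return endpoint names that are shared by more than one function."""
    return {name: funcs for name, funcs in groups.items() if len(funcs) > 1}


if __name__ == "__main__":
    print(group_by_endpoint_name(SAMPLE))
```

A check like `duplicate_names` can flag queue-based functions that share a `name` before you build.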

### Build artifacts

After building, these artifacts are created in the `.flash/` directory:

| Artifact                     | Description                                    |
| ---------------------------- | ---------------------------------------------- |
| `.flash/artifact.tar.gz`     | Deployment package                             |
| `.flash/flash_manifest.json` | Service discovery configuration                |
| `.flash/.build/`             | Temporary build directory (removed by default) |
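Once the artifact exists, you can check how close it is to the deployment limit. The sketch below assumes the 1.5 GB limit is measured in decimal gigabytes (1,500,000,000 bytes), which this page does not specify, and `usage_fraction` is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: "1.5 GB" is interpreted as decimal gigabytes.
LIMIT_BYTES = 1_500_000_000


def usage_fraction(size_bytes: int, limit_bytes: int = LIMIT_BYTES) -> float:
    """Return the artifact size as a fraction of the deployment limit."""
    return size_bytes / limit_bytes


if __name__ == "__main__":
    artifact = Path(".flash/artifact.tar.gz")
    if artifact.exists():
        fraction = usage_fraction(artifact.stat().st_size)
        print(f"{artifact}: {fraction:.1%} of the limit")
    else:
        print(f"{artifact} not found; build the project first")
```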

## What gets deployed

When you deploy a Flash app, you're deploying a **build artifact** (tarball) onto pre-built Flash Docker images. This architecture is similar to AWS Lambda layers: the base runtime is pre-built, and your code and dependencies are layered on top.

### The build artifact

The `.flash/artifact.tar.gz` file (max 1.5 GB) contains:

<Tree>
  <Tree.Folder name="artifact.tar.gz" defaultOpen>
    <Tree.File name="lb_worker.py" />

    <Tree.File name="gpu_worker.py" />

    <Tree.File name="cpu_worker.py" />

    <Tree.File name="flash_manifest.json" />

    <Tree.File name="requirements.txt" />

    <Tree.Folder name="[installed dependencies]" defaultOpen>
      <Tree.Folder name="torch" />

      <Tree.Folder name="transformers" />

      <Tree.File name="..." />
    </Tree.Folder>
  </Tree.Folder>
</Tree>

Dependencies are installed locally during the build process and bundled into the tarball. They are **not** installed at runtime on endpoints.
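Because dependencies are bundled into the tarball, inspecting it is one way to decide which packages are worth passing to `--exclude`. The sketch below sums uncompressed file sizes by first path component using the standard `tarfile` module. It assumes packages sit at the top level of the archive; if they are nested under a dependencies folder, the grouping would need to look one level deeper. The helper names are hypothetical.

```python
import tarfile
from collections import Counter


def top_level_sizes(members) -> Counter:
    """Sum uncompressed sizes by the first path component of each member."""
    totals = Counter()
    for name, size in members:
        # Drop a leading "./" if the archive was created with relative paths.
        top = name.removeprefix("./").split("/", 1)[0]
        totals[top] += size
    return totals


def read_members(artifact_path: str):
    """Return (name, size) pairs for every regular file in the tarball."""
    with tarfile.open(artifact_path, "r:gz") as tar:
        return [(m.name, m.size) for m in tar.getmembers() if m.isfile()]


# Usage, after a build has produced the artifact:
#   sizes = top_level_sizes(read_members(".flash/artifact.tar.gz"))
#   for name, size in sizes.most_common(10):
#       print(f"{size / 1_000_000:8.1f} MB  {name}")
```

The sizes reported are uncompressed, so they indicate which packages dominate the artifact rather than the exact compressed total.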

### The deployment manifest

The `flash_manifest.json` file is the brain of your deployment. It tells each endpoint:

* Which functions to execute.
* Which Docker image to use.
* How to configure resources (GPUs, workers, scaling).
* Which environment variables to set on workers.
* How to route HTTP requests (for load-balanced endpoints).

```json theme={"theme":{"light":"github-light","dark":"github-dark"}}
{
  "resources": {
    "lb_worker": {
      "is_load_balanced": true,
      "imageName": "runpod/flash-lb-cpu:latest",
      "workersMin": 1,
      "functions": [
        {"name": "process", "module": "lb_worker"},
        {"name": "health", "module": "lb_worker"}
      ]
    },
    "gpu_worker": {
      "imageName": "runpod/flash:latest",
      "gpuIds": "AMPERE_16",
      "workersMax": 3,
      "env": {
        "HF_TOKEN": "your_token",
        "MODEL_ID": "gpt2"
      },
      "functions": [
        {"name": "gpu_hello", "module": "gpu_worker"}
      ]
    },
    "cpu_worker": {
      "imageName": "runpod/flash-cpu:latest",
      "workersMax": 2,
      "functions": [
        {"name": "cpu_hello", "module": "cpu_worker"}
      ]
    }
  },
  "routes": {
    "lb_worker": {
      "POST /process": "process",
      "GET /health": "health"
    }
  }
}
```
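To review what a build will deploy, you can summarize the manifest with a few lines of Python. This sketch relies only on the field names visible in the example above (`resources`, `routes`, `is_load_balanced`, `imageName`, `functions`); the real schema may contain more fields or differ between Flash versions, and `summarize_manifest` is a hypothetical helper.

```python
import json

# Trimmed copy of the example manifest above, used as sample input.
SAMPLE_MANIFEST = """
{
  "resources": {
    "lb_worker": {
      "is_load_balanced": true,
      "imageName": "runpod/flash-lb-cpu:latest",
      "functions": [
        {"name": "process", "module": "lb_worker"},
        {"name": "health", "module": "lb_worker"}
      ]
    },
    "gpu_worker": {
      "imageName": "runpod/flash:latest",
      "functions": [{"name": "gpu_hello", "module": "gpu_worker"}]
    }
  },
  "routes": {
    "lb_worker": {"POST /process": "process", "GET /health": "health"}
  }
}
"""


def summarize_manifest(manifest: dict) -> list[str]:
    """Return one line per resource, followed by its HTTP routes (if any)."""
    lines = []
    routes = manifest.get("routes", {})
    for name, resource in manifest.get("resources", {}).items():
        kind = "load-balanced" if resource.get("is_load_balanced") else "queue-based"
        functions = ", ".join(fn["name"] for fn in resource.get("functions", []))
        lines.append(f"{name} ({kind}, {resource.get('imageName')}): {functions}")
        for route, handler in routes.get(name, {}).items():
            lines.append(f"  {route} -> {handler}")
    return lines


if __name__ == "__main__":
    # For a real build, load .flash/flash_manifest.json instead.
    print("\n".join(summarize_manifest(json.loads(SAMPLE_MANIFEST))))
```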

### What gets created on Runpod

For each endpoint configuration in the manifest, Flash creates an independent Serverless endpoint, identified by its `name` parameter.

### Cross-endpoint communication

When one endpoint needs to call a function on another endpoint:

1. **Manifest lookup**: The calling endpoint checks `flash_manifest.json` for the function-to-resource mapping.
2. **Service discovery**: It queries the state manager (Runpod GraphQL API) for the target endpoint's URL.
3. **Direct call**: It makes an HTTP request directly to the target endpoint.
4. **Response**: The target endpoint executes the function and returns the result.

Each endpoint maintains its own connection to the state manager, querying for peer endpoint URLs as needed and caching results for 300 seconds to minimize API calls.
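The caching behavior described above can be pictured as a small time-to-live (TTL) cache in front of the discovery lookup. The sketch below is a generic illustration, not Flash's implementation: `EndpointUrlCache` and its `lookup` callback are hypothetical, and only the 300-second TTL comes from this page.

```python
import time
from typing import Callable


class EndpointUrlCache:
    """Cache endpoint URLs and refresh each entry after a fixed TTL."""

    def __init__(
        self,
        lookup: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup  # Stand-in for the state manager query
        self._ttl = ttl_seconds
        self._clock = clock  # Injectable so expiry can be tested
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, endpoint_name: str) -> str:
        """Return a cached URL, querying the lookup only when it has expired."""
        now = self._clock()
        cached = self._entries.get(endpoint_name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        url = self._lookup(endpoint_name)
        self._entries[endpoint_name] = (now, url)
        return url
```

With a 300-second TTL, repeated calls to the same peer within five minutes reuse the cached URL, and a redeployed peer is picked up on the first call after the entry expires.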

#### Calling another endpoint from your code

To call one endpoint from another, import the target endpoint function **inside** your function body. Flash automatically detects these imports and generates the necessary dispatch stubs.

For example, if you have a GPU worker for inference:

```python gpu_worker.py theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuType

@Endpoint(
    name="gpu-inference",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
    dependencies=["torch"]
)
async def gpu_inference(payload: dict) -> dict:
    import torch
    # GPU inference logic
    return {"result": "processed"}
```

You can call it from a CPU-based pipeline endpoint:

```python cpu_worker.py theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint

@Endpoint(name="pipeline", cpu="cpu5c-4-8")
async def classify(text: str) -> dict:
    # Import the GPU endpoint inside the function body
    from gpu_worker import gpu_inference

    # Flash routes this call to the gpu-inference endpoint
    result = await gpu_inference({"text": text})
    return {"classification": result}
```

## Call deployed endpoints from scripts

After deploying your Flash app, you can call your `@Endpoint` functions directly from Python scripts. Flash automatically resolves the app context from your project structure, so in most cases you can run scripts without any additional configuration.

### How it works

When you run a script that calls an `@Endpoint` function, Flash:

1. Detects the app context from the project directory structure.
2. Looks up the deployed endpoint by name within the resolved app and environment.
3. Routes the request to that endpoint using Flash's sentinel service.
4. Returns the result to your script.

This lets you reuse the same `@Endpoint` function definitions to interact with deployed endpoints without modifying your code.

### Example: calling within the same script

The simplest approach is to call the endpoint directly in the same file where it's defined:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
# gpu_worker.py
import asyncio
from runpod_flash import Endpoint, GpuType

@Endpoint(
    name="inference",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
    dependencies=["torch"]
)
async def run_inference(data: dict) -> dict:
    import torch
    # Inference logic
    return {"result": "processed"}

async def main():
    result = await run_inference({"input": "data"})
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```

Run the script:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
python gpu_worker.py
```

### Example: importing from another script

You can also import and call endpoints from a separate script:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
# call_inference.py
import asyncio
from gpu_worker import run_inference

async def main():
    # Flash resolves the app context automatically
    result = await run_inference({"input": "data"})
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```

Run the script:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
python call_inference.py
```

### Override the resolved context

Flash resolves the app name from your project's directory structure. Use `FLASH_APP` and `FLASH_ENV` environment variables to override this automatic resolution when needed.

A common use case is when you move a script to a different directory. Since the resolved app name depends on the directory location, moving the script changes the resolved context. To continue targeting the original app, set `FLASH_APP` explicitly:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
FLASH_APP=my-app python call_inference.py
```

You can also override the environment:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
FLASH_APP=my-app FLASH_ENV=production python call_inference.py
```
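If you launch scripts from Python rather than a shell, for example to run the same script against several environments, you can pass the same variables through the child process environment. This is a minimal sketch using the standard `subprocess` module; `context_env` and `run_script` are hypothetical helpers, and `my-app` and `call_inference.py` are the placeholder names used above.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional


def context_env(app: str, env: Optional[str] = None,
                base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy the base environment and add FLASH_APP (and optionally FLASH_ENV)."""
    merged = dict(os.environ if base is None else base)
    merged["FLASH_APP"] = app
    if env is not None:
        merged["FLASH_ENV"] = env
    return merged


def run_script(script: str, app: str, env: Optional[str] = None) -> int:
    """Run a script with the overrides applied; equivalent to the shell form."""
    completed = subprocess.run(
        [sys.executable, script],
        env=context_env(app, env),
        check=False,
    )
    return completed.returncode


# Usage:
#   run_script("call_inference.py", app="my-app", env="production")
```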

### Error without context

If Flash cannot resolve the app context and you haven't set the environment variables, it raises an error:

```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
RuntimeError: no flash context for endpoint 'inference'. either:
  - use 'flash dev' for local development
  - set FLASH_APP and FLASH_ENV to target a deployed environment
```

### Automatic context in deployed workers

When Flash deploys your app, it automatically sets `FLASH_APP` and `FLASH_ENV` environment variables on each worker. This enables cross-endpoint communication within your deployed application without additional configuration.

## Troubleshooting

### No @Endpoint functions found

If the build process can't find your endpoint functions:

* Ensure functions are decorated with `@Endpoint(...)`.
* Check that Python files aren't excluded by `.gitignore` or Flash's [built-in ignore patterns](/flash/cli/build#built-in-ignore-patterns).
* Verify decorator syntax is correct.

### Deployment size limit exceeded

Base image packages are auto-excluded. If your deployment still exceeds 1.5 GB, use `--exclude` to skip additional packages:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
flash deploy --exclude scipy,pandas
```

### Authentication errors

Verify your API key is set correctly:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
echo $RUNPOD_API_KEY
```

If not set, add it to your `.env` file or export it:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
export RUNPOD_API_KEY=your_api_key_here
```
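To confirm the key is visible to a Python process without printing the whole secret, a small check like the following can help. It is a sketch with a hypothetical helper name, and it reads only the process environment; a key stored in a `.env` file will not appear here unless something has loaded that file first.

```python
import os
from typing import Mapping


def describe_api_key(environ: Mapping[str, str] = os.environ) -> str:
    """Report whether RUNPOD_API_KEY is set, showing only its last 4 characters."""
    key = environ.get("RUNPOD_API_KEY", "").strip()
    if not key:
        return "RUNPOD_API_KEY is not set"
    return f"RUNPOD_API_KEY is set ({len(key)} characters, ending in ...{key[-4:]})"


if __name__ == "__main__":
    print(describe_api_key())
```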

### Import errors in endpoint functions

Import packages inside the endpoint function, not at the top of the file:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from runpod_flash import Endpoint, GpuGroup

@Endpoint(name="fetch-data", gpu=GpuGroup.ANY, dependencies=["requests"])
def fetch_data(url):
    import requests  # Import here
    return requests.get(url).json()
```

## Next steps

* [Learn about apps and environments](/flash/apps/apps-and-environments) for managing deployments.
* [View the CLI reference](/flash/cli/overview) for all available commands.
* [Configure hardware resources](/flash/configuration/parameters) for your endpoints.
* [Monitor and troubleshoot](/flash/troubleshooting) your deployments.
