This guide covers how to monitor your Flash deployments, debug issues, and resolve common errors.

Monitoring and debugging

Viewing logs

When running Flash functions, logs are displayed in your terminal:
2025-11-19 12:35:15,109 | INFO  | Created endpoint: rb50waqznmn2kg - flash-quickstart-fb
2025-11-19 12:35:15,114 | INFO  | Endpoint:rb50waqznmn2kg | API /run
2025-11-19 12:35:15,655 | INFO  | Endpoint:rb50waqznmn2kg | Started Job:b0b341e7-...
2025-11-19 12:35:15,762 | INFO  | Job:b0b341e7-... | Status: IN_QUEUE
2025-11-19 12:36:09,983 | INFO  | Job:b0b341e7-... | Status: COMPLETED
2025-11-19 12:36:10,068 | INFO  | Worker:icmkdgnrmdf8gz | Delay Time: 51842 ms
2025-11-19 12:36:10,068 | INFO  | Worker:icmkdgnrmdf8gz | Execution Time: 1533 ms
Control log verbosity with the LOG_LEVEL environment variable:
LOG_LEVEL=DEBUG python your_script.py
Available levels: DEBUG, INFO, WARNING, ERROR.
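These level names follow standard Python logging conventions. If you want your own scripts to honor the same LOG_LEVEL variable, a minimal sketch using only the standard library (this mirrors the level names, not Flash's internal logger configuration):

```python
import logging
import os

# Read the same LOG_LEVEL variable Flash uses; default to INFO.
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=getattr(logging, level_name, logging.INFO),
    format="%(asctime)s | %(levelname)-5s | %(message)s",
)
logging.info("Logging configured at level %s", level_name)
```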

Runpod console

View detailed metrics and logs in the Runpod console:
  1. Navigate to the Serverless section.
  2. Click on your endpoint to view:
    • Active workers and queue depth.
    • Request history and job status.
    • Worker logs and execution details.
The console provides metrics including request rate, queue depth, latency, worker count, and error rate.

View worker logs

Access detailed logs for specific workers:
  1. Go to the Serverless console.
  2. Select your endpoint.
  3. Click on a worker to view its logs.
Logs include dependency installation output, function execution output (print statements, errors), and system-level messages.

Add logging to functions

Include print statements in your endpoint functions for debugging:
from runpod_flash import Endpoint, GpuGroup

@Endpoint(name="processor", gpu=GpuGroup.ANY)
async def process(data: dict) -> dict:
    print(f"Received data: {data}")  # Visible in worker logs

    result = do_processing(data)
    print(f"Processing complete: {result}")

    return result

Configuration errors

API key not set

Error:
RUNPOD_API_KEY environment variable is required but not set
Cause: Flash requires a valid Runpod API key to provision and manage endpoints. Solution:
  1. Generate an API key from Settings > API Keys in the Runpod console. The key needs All access permissions.
  2. Set the key using one of these methods:

     Option 1: Environment variable
     export RUNPOD_API_KEY=your_api_key

     Option 2: .env file in your project root
     echo "RUNPOD_API_KEY=your_api_key" > .env

     Option 3: Shell profile (~/.bashrc or ~/.zshrc)
     echo 'export RUNPOD_API_KEY=your_api_key' >> ~/.bashrc
     source ~/.bashrc
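To confirm the key is actually visible to your process before launching Flash, a small stdlib check (the variable name comes from the error message above; `require_api_key` is a hypothetical helper, not part of Flash, and a `.env` file is only visible once your tooling has loaded it into the environment):

```python
import os

def require_api_key() -> str:
    """Return the Runpod API key, failing with the same message Flash reports."""
    key = os.environ.get("RUNPOD_API_KEY")
    if not key:
        raise SystemExit("RUNPOD_API_KEY environment variable is required but not set")
    return key
```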
    

Invalid route configuration

Error:
Load-balanced endpoints require route decorators
Cause: Load-balanced endpoints require HTTP method decorators for each route. Solution: Ensure all routes use the correct decorator pattern:
from runpod_flash import Endpoint

api = Endpoint(name="api", cpu="cpu5c-4-8", workers=(1, 5))

# Correct - using route decorators
@api.post("/process")
async def process_data(data: dict) -> dict:
    return {"result": "processed"}

@api.get("/health")
async def health_check() -> dict:
    return {"status": "healthy"}

Invalid HTTP method

Error:
method must be one of {'GET', 'POST', 'PUT', 'DELETE', 'PATCH'}
Cause: The HTTP method specified is not supported. Solution: Use one of the supported HTTP methods: GET, POST, PUT, DELETE, or PATCH.
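A hypothetical sketch of the validation behind this error, useful for pre-checking route definitions in your own tooling (`check_method` is not part of the Flash API):

```python
VALID_METHODS = {"GET", "POST", "PUT", "DELETE", "PATCH"}

def check_method(method: str) -> str:
    """Normalize a method name, rejecting anything outside the supported set."""
    normalized = method.upper()
    if normalized not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}")
    return normalized
```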

Invalid path format

Error:
path must start with '/'
Cause: HTTP paths must begin with a forward slash. Solution: Ensure paths start with /:
# Correct
@api.get("/health")

# Incorrect
@api.get("health")

Duplicate routes

Error:
Duplicate route 'POST /process' in endpoint 'my-api'
Cause: Two functions define the same HTTP method and path combination. Solution: Ensure each route is unique within an endpoint. Either change the path or method of one function.
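Uniqueness is keyed on the method-and-path pair, so the same path can be reused with a different method. A hypothetical sketch of how such a check can work (not Flash's implementation):

```python
def find_duplicate_routes(routes):
    """Return (METHOD, path) pairs registered more than once."""
    seen = set()
    duplicates = []
    for method, path in routes:
        key = (method.upper(), path)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates
```

For example, `[("POST", "/process"), ("GET", "/process"), ("POST", "/process")]` flags only the repeated `("POST", "/process")`; `GET /process` on the same path is fine.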

Deployment errors

Tarball too large

Error:
Tarball exceeds maximum size. File size: 512.5MB, Max: 500MB
Cause: The deployment package exceeds the 500MB limit. Solution:
  1. Check for large files that shouldn’t be included (datasets, model weights, logs).
  2. Add large files to .flashignore to exclude them from the build.
  3. Use network volumes to store large models instead of bundling them.
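For example, assuming .flashignore accepts gitignore-style patterns (one pattern per line), an ignore file excluding common large artifacts might look like:

```
# Exclude large local artifacts from the deployment tarball
*.ckpt
*.safetensors
data/
logs/
.venv/
```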

Invalid tarball format

Error:
File is not a valid gzip file. Expected magic bytes (31, 139)
Cause: The build artifact is corrupted or not a valid gzip file. Solution: Delete the .flash directory and rebuild:
rm -rf .flash
flash build

Resource provisioning failed

Error:
Failed to provision resources: [error details]
Cause: Flash couldn’t create the Serverless endpoint on Runpod. Solutions:
  1. Check GPU availability: The requested GPU types may not be available. Add fallback options:
    gpu=[GpuType.NVIDIA_A100_80GB_PCIe, GpuType.NVIDIA_RTX_A6000, GpuType.NVIDIA_GEFORCE_RTX_4090]
    
  2. Check account limits: You may have hit worker capacity limits. Contact Runpod support to increase limits.
  3. Check network volume: If using volume=, verify the volume exists and is in a compatible datacenter.

Runtime errors

Endpoint not deployed

Error:
Endpoint URL not available - endpoint may not be deployed
Cause: The endpoint function was called before the endpoint finished provisioning. Solutions:
  1. For standalone scripts: Ensure the endpoint has time to provision. Flash handles this automatically, but network issues can cause delays.
  2. For Flash apps: Deploy the app first with flash deploy, then call the endpoint.
  3. Check endpoint status: View your endpoints in the Serverless console.

Execution timeout

Error:
Execution timeout on [endpoint] after [N]s
Cause: The endpoint function took longer than the configured timeout. Solutions:
  1. Increase timeout: Set execution_timeout_ms in your configuration:
    @Endpoint(
        name="long-running",
        gpu=GpuType.NVIDIA_A100_80GB_PCIe,
        execution_timeout_ms=600000  # 10 minutes
    )
    
  2. Optimize function: Profile your function to identify bottlenecks.
  3. Use queue-based endpoints: For long-running tasks, use the @Endpoint decorator pattern. Queue-based endpoints are designed for longer operations.

Connection failed

Error:
Failed to connect to endpoint [name] ([url])
Cause: Network connectivity issue between your local environment and the Runpod endpoint. Solutions:
  1. Check internet connection: Verify you have network access.
  2. Retry: Transient network issues often resolve on retry. Flash includes automatic retry logic.
  3. Check endpoint status: Verify the endpoint is running in the Serverless console.
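Flash retries automatically, but if you are calling an endpoint over plain HTTP yourself, a minimal retry helper with exponential backoff can be sketched like this (`call_with_retry` is a generic illustration, not a Flash API; pass it any zero-argument callable that performs the request):

```python
import time

def call_with_retry(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying transient connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts; surface the real error
            time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
```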

HTTP errors from endpoint

Error:
HTTP error from endpoint [name]: 500 - Internal Server Error
Cause: The endpoint function raised an exception during execution. Solutions:
  1. Check logs: View worker logs in the Serverless console for detailed error messages.
  2. Test locally: Use flash run to test your function locally before deploying.
  3. Add error handling: Wrap your function logic in try/except to provide better error messages:
    @Endpoint(name="processor", gpu=GpuGroup.ANY)
    async def process(data: dict) -> dict:
        try:
            # Your logic here
            return {"result": "success"}
        except Exception as e:
            return {"error": str(e)}
    

Serialization errors

Error:
Failed to deserialize result: [error]
Cause: The function’s return value cannot be serialized/deserialized. Solutions:
  1. Use simple types: Return dictionaries, lists, strings, numbers, and other JSON-serializable types.
  2. Avoid complex objects: Don’t return PyTorch tensors, NumPy arrays, or custom classes directly. Convert them first:
    # Correct
    return {"result": tensor.tolist()}
    
    # Incorrect - tensor is not serializable
    return {"result": tensor}
    
  3. Check argument types: Input arguments must also be serializable.
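One way to catch these problems before deploying is to verify that a payload round-trips through JSON, since inputs and outputs must be JSON-serializable (`assert_serializable` is a hypothetical helper, not part of Flash):

```python
import json

def assert_serializable(payload) -> None:
    """Raise TypeError with a clear message if payload is not JSON-serializable."""
    try:
        json.dumps(payload)
    except TypeError as e:
        raise TypeError(f"Payload is not JSON-serializable: {e}") from e

assert_serializable({"result": [1.0, 2.0]})  # plain types pass silently
```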

Circuit breaker open

Error:
Circuit breaker is open. Retry in [N] seconds
Cause: Too many consecutive failures to the endpoint triggered the circuit breaker protection. Solutions:
  1. Wait and retry: The circuit breaker will automatically attempt recovery after the timeout (typically 60 seconds).
  2. Check endpoint health: Multiple failures usually indicate an underlying issue. Check logs and endpoint status.
  3. Fix the root cause: Address whatever is causing the repeated failures before retrying.
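The pattern behind this error can be sketched as a small state machine: after a threshold of consecutive failures the breaker opens and rejects calls until a cooldown elapses. A hypothetical illustration (not Flash's implementation; the 60-second figure is the typical timeout mentioned above):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a retry after `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            remaining = self.cooldown - (self.clock() - self.opened_at)
            if remaining > 0:
                raise RuntimeError(f"Circuit breaker is open. Retry in {remaining:.0f} seconds")
            self.opened_at = None  # cooldown elapsed: allow a trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```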

GPU availability issues

Job stuck in queue

Symptom: Job status shows IN_QUEUE for extended periods. Cause: The requested GPU types are not available. Solutions:
  1. Add fallback GPUs: Expand your gpu list with additional options:
    @Endpoint(
        name="flexible",
        gpu=[
            GpuType.NVIDIA_A100_80GB_PCIe,    # First choice
            GpuType.NVIDIA_RTX_A6000,         # First fallback
            GpuType.NVIDIA_GEFORCE_RTX_4090   # Second fallback
        ]
    )
    
  2. Use GpuGroup.ANY: For development, accept any available GPU:
    gpu=GpuGroup.ANY
    
  3. Check availability: View GPU availability in the Serverless console.
  4. Contact support: For guaranteed capacity, contact Runpod support.

Dependency errors

Module not found

Error (in worker logs):
ModuleNotFoundError: No module named 'transformers'
Cause: A required dependency was not specified in the @Endpoint decorator. Solution: Add all required packages to the dependencies parameter:
@Endpoint(
    name="processor",
    gpu=GpuGroup.ANY,
    dependencies=["transformers", "torch", "pillow"]
)
async def process(data: dict) -> dict:
    from transformers import pipeline
    # ...

Version conflicts

Symptom: Function fails with import errors or unexpected behavior. Cause: Dependency version conflicts between packages. Solution: Pin specific versions:
@Endpoint(
    name="processor",
    gpu=GpuGroup.ANY,
    dependencies=[
        "transformers==4.36.0",
        "torch==2.1.0",
        "accelerate>=0.25.0"
    ]
)

Getting help

If you’re still stuck:
  1. Discord: Join the Runpod Discord for community support.
  2. GitHub Issues: Report bugs or request features on the Flash repository.
  3. Support: Contact Runpod support for account-specific issues.