Flash follows the same pricing model as Runpod Serverless. You pay per second of compute time, with no charges when your code isn’t running. Pricing depends on the GPU or CPU type you configure for your endpoints.

How pricing works

You’re billed from when a worker starts until it completes your request, plus any idle time before scaling down. If a worker is already warm, you skip the start time and pay only for execution and idle time.

Compute cost breakdown

Flash workers incur charges during these periods:
  1. Start time: The time required to initialize a worker and load models into GPU memory. This includes starting the container, installing dependencies, and preparing the runtime environment.
  2. Execution time: The time spent processing your request (running your @Endpoint-decorated function).
  3. Idle time: The period a worker remains active after completing a request, waiting for additional requests before scaling down.
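Putting the three periods together, the per-request cost is roughly their combined duration multiplied by the per-second rate. A minimal sketch of the arithmetic (the rate and durations below are illustrative placeholders, not actual Runpod Flash prices or measured times):

```python
# Sketch: per-request cost from the three billable periods.
# The rate and durations are illustrative placeholders,
# not actual Runpod Flash prices or measured start times.
def request_cost(start_s, execution_s, idle_s, rate_per_s):
    """Cost = (start + execution + idle) seconds x per-second rate."""
    return (start_s + execution_s + idle_s) * rate_per_s

# Cold request: pays for start time, execution, and idle before scale-down.
cold = request_cost(start_s=10.0, execution_s=2.0, idle_s=5.0, rate_per_s=0.00031)

# Warm request on the same worker: start time is skipped.
warm = request_cost(start_s=0.0, execution_s=2.0, idle_s=5.0, rate_per_s=0.00031)

print(f"{cold:.5f}")  # 0.00527
print(f"{warm:.5f}")  # 0.00217
```

This is why warm workers are so much cheaper per request: for short tasks, start time can dominate the bill.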

Pricing by resource type

Flash supports both GPU and CPU workers. Pricing varies based on the hardware type:
  • GPU workers: Use @Endpoint(gpu=...) configuration. Pricing depends on the GPU type (e.g., RTX 4090, A100 80GB).
  • CPU workers: Use @Endpoint(cpu=...) configuration. Pricing depends on the CPU instance type.
See the Serverless pricing page for current rates by GPU and CPU type.

How to estimate and optimize costs

To estimate costs for your Flash workloads, consider:
  • How long each function takes to execute.
  • How many concurrent workers you need (workers setting).
  • Which GPU or CPU types you’ll use.
  • Your idle timeout configuration (idle_timeout setting).
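The factors above can be combined into a rough monthly estimate. A minimal sketch, using hypothetical request volume, cold-start fraction, and rate (check the Serverless pricing page for real numbers), and assuming the worst case where every request is followed by a full idle period:

```python
# Sketch: rough monthly cost estimate for a Flash endpoint.
# All numbers below are hypothetical assumptions, not real Runpod rates.
def estimate_monthly_cost(
    requests_per_day,
    avg_execution_s,
    cold_start_fraction,  # fraction of requests that hit a cold start
    avg_start_s,
    idle_timeout_s,
    rate_per_s,
):
    # Worst case: every request pays a full idle period before scale-down.
    billed_s_per_request = (
        avg_execution_s
        + cold_start_fraction * avg_start_s
        + idle_timeout_s
    )
    return requests_per_day * 30 * billed_s_per_request * rate_per_s

cost = estimate_monthly_cost(
    requests_per_day=1_000,
    avg_execution_s=3.0,
    cold_start_fraction=0.1,
    avg_start_s=10.0,
    idle_timeout_s=5.0,
    rate_per_s=0.00031,
)
print(f"${cost:.2f}")  # $83.70
```

In practice, steady traffic reuses warm workers, so consecutive requests share idle periods and the real cost lands below this ceiling.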

Cost optimization strategies

Choose appropriate hardware

Select the smallest GPU or CPU that meets your performance requirements. For example, if your workload fits in 24GB of VRAM, use an RTX 4090 or L4 instead of larger GPUs like the A100.
from runpod_flash import Endpoint, GpuType

# Cost-effective configuration for workloads that fit in 24GB VRAM
@Endpoint(
    name="cost-optimized",
    gpu=[GpuType.NVIDIA_GEFORCE_RTX_4090, GpuType.NVIDIA_L4]
)
def process(data): ...

Configure idle timeouts

Balance responsiveness and cost by adjusting the idle_timeout parameter. Shorter timeouts reduce idle costs but increase cold starts for sporadic traffic.
from runpod_flash import Endpoint, GpuGroup

# Lower idle timeout for cost savings (more cold starts)
@Endpoint(
    name="low-idle",
    gpu=GpuGroup.ANY,
    idle_timeout=5  # 5 seconds
)
def process(data): ...

# Higher idle timeout for responsiveness (higher idle costs)
@Endpoint(
    name="responsive",
    gpu=GpuGroup.ANY,
    idle_timeout=30  # 30 seconds
)
def process(data): ...

Use CPU workers for non-GPU tasks

For data preprocessing, postprocessing, or other tasks that don’t require GPU acceleration, use CPU workers instead of GPU workers.
from runpod_flash import Endpoint

# CPU configuration for non-GPU tasks
@Endpoint(
    name="data-processor",
    cpu="cpu5c-2-4"  # 2 vCPU, 4GB RAM
)
def process_data(data): ...

Limit maximum workers

Set workers to prevent runaway scaling and unexpected costs:
from runpod_flash import Endpoint, GpuGroup

@Endpoint(
    name="controlled-scaling",
    gpu=GpuGroup.ANY,
    workers=3  # Limit to 3 concurrent workers (same as workers=(0, 3))
)
def process(data): ...

Monitoring costs

Monitor your usage in the Runpod console to track:
  • Total compute time across endpoints.
  • Worker utilization and idle time.
  • Cost breakdown by endpoint.

Next steps