Production workloads
Here are some best practices for production deployments requiring reliability and consistent performance:

General recommendations
- Pin specific GPU types instead of using `GpuGroup.ANY` for predictable performance and costs.
- Use network volumes for large models to avoid downloading them on each worker startup.
- Set an appropriate `execution_timeout_ms` to prevent runaway jobs and control costs.
- Use environment variables for configuration and secrets, not hardcoded values.
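The environment-variable recommendation can be sketched as follows; `MODEL_DIR`, `HF_TOKEN`, and the default volume path are illustrative names, not SDK requirements:

```python
# Reading configuration and secrets from environment variables instead of
# hardcoding them. MODEL_DIR and HF_TOKEN are illustrative names only.
import os

def load_config() -> dict:
    return {
        # Default to a network-volume path so large models persist across workers.
        "model_dir": os.environ.get("MODEL_DIR", "/runpod-volume/models"),
        # Fail fast (KeyError) if a required secret is missing.
        "hf_token": os.environ["HF_TOKEN"],
    }
```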
Queue-based endpoints
Queue-based endpoints handle asynchronous batch processing where jobs can wait in a queue:

- `workers=(1, n)`: Set min to 1 to avoid cold starts for the first job in the queue.
- `workers=(n, max)`: Set max based on expected peak concurrent jobs.
- `idle_timeout`: 900-1800 seconds (15-30 minutes) for production workloads.
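Put together, a production queue-based endpoint might look like the sketch below. The import path and `LiveServerless` class name are assumptions to check against the SDK docs; the parameter names mirror those used in this guide:

```python
# Hedged sketch of a production queue-based endpoint configuration.
# Import path and class name are assumptions, not a confirmed API.
from flash import LiveServerless, GpuType  # assumed import path

batch_endpoint = LiveServerless(
    name="prod-batch",
    gpus=[GpuType.NVIDIA_GEFORCE_RTX_4090],  # pinned type for predictable cost
    workers=(1, 5),                # min 1 avoids a cold start for the first job
    idle_timeout=1200,             # 20 minutes, within the 15-30 minute range
    execution_timeout_ms=600_000,  # stop runaway jobs after 10 minutes
)
```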
Load-balanced endpoints
Load-balanced endpoints handle synchronous HTTP requests where an immediate response is critical:

- `workers=(min, n)`: Set min ≥ 1 for production APIs to avoid cold starts. Unlike queue-based endpoints where jobs can wait, API clients expect immediate responses.
- `workers=(n, max)`: Set max based on expected peak concurrent requests.
- `idle_timeout`: 1200-1800 seconds (20-30 minutes) to keep workers ready.
- Include health check routes (e.g., `GET /health`) for monitoring.
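A minimal `GET /health` route can be sketched with nothing but the standard library; in a real deployment you would register the same route with whatever framework serves your endpoint. The response shape is an illustrative assumption:

```python
# Minimal standard-library sketch of a GET /health route for monitoring.
# The {"status": "ok"} payload is an illustrative convention, not an SDK
# requirement.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep request logging quiet
```

Serve it with `HTTPServer(("0.0.0.0", 8000), HealthHandler).serve_forever()`, or port the handler body into your framework's router.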
Development
Here are some best practices for development and testing environments prioritizing fast iteration:

General recommendations
- Use `GpuGroup.ANY` for the fastest GPU provisioning during development.
- Set `workers=(0, n)` to minimize costs when not actively testing.
- Keep max workers low (1-3) to control development expenses.
- Use a short `idle_timeout` (300 seconds / 5 minutes) to scale down quickly between test runs.
- Test locally with `flash run` before deploying to production.
Example configuration
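The development settings above might look like this in code; the import path and `LiveServerless` class name are assumptions to check against the SDK docs, while the parameters mirror those used in this guide:

```python
# Hedged sketch of a development endpoint configuration.
# Import path and class name are assumptions, not a confirmed API.
from flash import LiveServerless, GpuGroup  # assumed import path

dev_endpoint = LiveServerless(
    name="dev-endpoint",
    gpus=[GpuGroup.ANY],  # fastest provisioning during development
    workers=(0, 2),       # scale to zero when idle; low max caps spend
    idle_timeout=300,     # 5 minutes: scale down quickly between test runs
)
```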
Cost optimization
Here are some best practices for minimizing costs on infrequent or batch workloads:

General recommendations
- Set `workers=(0, n)` to scale to zero when idle (no usage = no cost).
- Use smaller GPU types when the workload allows (e.g., `GpuType.NVIDIA_GEFORCE_RTX_4090` instead of `GpuType.NVIDIA_A100_80GB_PCIe`).
- Use CPU endpoints when GPU acceleration isn’t needed.
- Reduce `idle_timeout` for sporadic workloads (300-600 seconds / 5-10 minutes).
- Batch operations into fewer job submissions when possible.
Cost-optimized queue-based endpoint
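A sketch of the cost-oriented settings above; the import path and `LiveServerless` class name are assumptions, and the parameters mirror those used in this guide:

```python
# Hedged sketch of a cost-optimized queue-based endpoint.
# Import path and class name are assumptions, not a confirmed API.
from flash import LiveServerless, GpuType  # assumed import path

nightly_batch = LiveServerless(
    name="nightly-batch",
    gpus=[GpuType.NVIDIA_GEFORCE_RTX_4090],  # smaller GPU when workload allows
    workers=(0, 3),    # scale to zero when idle: no usage = no cost
    idle_timeout=300,  # 5 minutes for sporadic workloads
)
```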
Cost-optimized CPU endpoint
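A hedged sketch of the CPU variant; the `CpuServerless` class name and import path are assumptions for illustration only:

```python
# Hedged sketch of a CPU-only endpoint for work that doesn't need a GPU.
# CpuServerless and the import path are illustrative assumptions.
from flash import CpuServerless  # assumed import path and class

cpu_endpoint = CpuServerless(
    name="cpu-preprocess",
    workers=(0, 2),    # scale to zero when idle
    idle_timeout=300,  # 5 minutes for sporadic workloads
)
```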
Use CPU endpoints for workloads that don’t require GPU acceleration.

Configuration trade-offs
Understanding the trade-offs helps you balance cost, latency, and performance:

| Configuration | Cost | Cold Start Latency | Best For |
|---|---|---|---|
| `workers=(0, n)` | Lowest | 20-90 seconds first run | Batch jobs, development, infrequent workloads |
| `workers=(1, n)` | Medium | <1 second for queued jobs | Production batch, variable traffic |
| `workers=(3, n)` | Highest | Always ready | Production APIs, high-traffic endpoints |
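To make the table above concrete, here is the monthly cost floor implied by the min-workers setting; the $0.40/hour rate is an assumed figure for illustration, not a quoted price:

```python
# Back-of-the-envelope arithmetic for the worker-floor trade-off.
# The $0.40/hour rate is an illustrative assumption, not a quoted price.
HOURS_PER_MONTH = 730  # average hours in a month

def idle_floor_cost(min_workers: int, hourly_rate: float) -> float:
    """Monthly cost of keeping `min_workers` always running, before any traffic."""
    return min_workers * hourly_rate * HOURS_PER_MONTH

# workers=(0, n) has no floor cost; workers=(3, n) bills three workers
# around the clock whether or not any requests arrive.
zero_floor = idle_floor_cost(0, 0.40)  # nothing billed while idle
warm_floor = idle_floor_cost(1, 0.40)  # ~292/month at the assumed rate
always_on = idle_floor_cost(3, 0.40)   # ~876/month at the assumed rate
```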

| GPU Choice | Cost | Availability | Best For |
|---|---|---|---|
| `GpuGroup.ANY` | Variable | Highest | Development, fastest provisioning |
| Specific consumer type (e.g., `GpuType.NVIDIA_GEFORCE_RTX_4090`) | Predictable | Medium | Cost-effective production |
| Specific data-center type (e.g., `GpuType.NVIDIA_A100_80GB_PCIe`) | Predictable | Lower | Production requiring high-end hardware (e.g., 80 GB VRAM) |
Configuration checklist
Before deploying to production, verify:

- GPU selection: Using specific GPU types (not `GpuGroup.ANY`) for predictable performance
- Worker scaling: `workers=(1, n)` or a higher min for load balancers and latency-sensitive workloads
- Timeouts: `execution_timeout_ms` set appropriately for your workload
- Storage: Network volume attached if using large models or datasets
- Environment variables: All configuration and secrets passed via the `env` parameter
- Monitoring: Health check routes implemented (load balancers)
- Testing: Tested locally with `flash run` before production deployment