Deployment workflow
After creating your handler function, package it into a Docker image and deploy it to an endpoint:
Create a Dockerfile
Package your handler function and all its dependencies into a Docker image.
Deploy to an endpoint
Push your image and create an endpoint using one of two methods:
- Deploy from Docker Hub: Build locally and push to a container registry.
- Deploy from GitHub: Auto-build and deploy directly from your repository.
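The workflow above starts from a handler function. A minimal sketch of one, assuming the RunPod Python SDK (`runpod` package, `runpod.serverless.start`); the echo logic is purely illustrative, not a real workload:

```python
# handler.py -- minimal serverless handler sketch.
# Assumes the RunPod Python SDK is installed in the image (pip install runpod).

def handler(job):
    """Receive a job dict and return a JSON-serializable result."""
    prompt = job["input"].get("prompt", "")
    return {"echo": prompt.upper()}

def run():
    # Called from the container entrypoint; starts the serverless worker loop.
    import runpod
    runpod.serverless.start({"handler": handler})
```

In the Dockerfile, the container's command would typically invoke this module (for example `python -u handler.py` with `run()` called under a `__main__` guard) so the worker loop starts when the container boots.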
Model deployment
To deploy workers with AI/ML models, follow this order of preference:
- Use cached models: For models on Hugging Face (public or gated), this is the recommended approach. Cached models provide the fastest cold starts and persist across worker restarts.
- Bake the model into your Docker image: For private models not on Hugging Face, embed them directly in your container image. This ensures the model is always available but increases image size.
- Use network volumes: For development workflows or very large models (500GB+), store models on a network volume. This is slower than cached or baked models but offers flexibility for iteration.
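The order of preference above can be sketched as a resolver that checks each location in turn. The concrete paths below are illustrative assumptions for this sketch, not fixed RunPod locations:

```python
import os

def resolve_model_dir(candidates):
    """Return the first existing model directory from a preference-ordered
    list of candidate paths, or None if nothing is found."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None

# Preference order from the list above: cached -> baked into image -> network volume.
# These paths are assumptions for illustration.
PREFERRED = [
    os.path.expanduser("~/.cache/huggingface/hub"),  # cached Hugging Face models
    "/app/models",                                   # baked into the Docker image
    "/runpod-volume/models",                         # network volume
]
```

At worker startup you would call `resolve_model_dir(PREFERRED)` once and load the model from whichever tier is present.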
Worker types
Workers can run in two modes depending on your latency and cost requirements:
- Active workers run continuously (24/7) and are always ready to process requests instantly. They eliminate cold starts entirely and receive a discounted rate, making them ideal for latency-sensitive or high-traffic applications.
- Flex workers scale dynamically based on demand, spinning down to zero when idle. They incur cold starts when scaling up but cost nothing when not in use, making them ideal for variable or sporadic workloads.
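One way to choose between the two modes is a break-even estimate: an active worker bills every hour at a discounted rate, while a flex worker bills only the hours it spends processing. The rates and the discount below are placeholders, not real RunPod pricing:

```python
def monthly_cost_active(hourly_rate, discount, hours=730):
    """Active worker: billed for every hour of the month at a discounted rate."""
    return hourly_rate * (1 - discount) * hours

def monthly_cost_flex(hourly_rate, busy_hours):
    """Flex worker: billed at full rate, but only for hours spent processing."""
    return hourly_rate * busy_hours

def breakeven_utilization(discount):
    """Fraction of the month a flex worker must be busy before an
    always-on active worker becomes the cheaper option."""
    return 1 - discount
```

For example, with a hypothetical 20% active-worker discount, flex stays cheaper until the endpoint is busy more than 80% of the time.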
Worker states
| State | Description | Billing |
|---|---|---|
| Initializing | Downloading image, loading code | Yes |
| Idle | Ready, waiting for requests | No |
| Running | Processing requests | Yes |
| Throttled | Ready but host constrained | No |
| Outdated | Marked for replacement after update | Yes (while processing) |
| Unhealthy | Crashed; auto-retries for up to 7 days | No |
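The billing column above boils down to a simple lookup. A sketch for estimating billable time from a worker's state history; the state names follow the table, while the `(state, seconds)` interval format is a made-up assumption for illustration:

```python
# Whether time in each worker state is billed, per the table above.
# "Outdated" is billed only while the worker is still processing requests.
BILLABLE_STATES = {
    "Initializing": True,
    "Idle": False,
    "Running": True,
    "Throttled": False,
    "Outdated": True,   # while processing
    "Unhealthy": False,
}

def billable_seconds(intervals):
    """Sum the durations of (state, seconds) pairs spent in billed states."""
    return sum(secs for state, secs in intervals if BILLABLE_STATES[state])
```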
Max worker limits
Account balance determines your maximum workers (flex + active combined):
| Balance | Max workers |
|---|---|
| Default | 5 |
| $100+ | 10 |
| $200+ | 20 |
| $300+ | 30 |
| $500+ | 40 |
| $700+ | 50 |
| $900+ | 60 |
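The tiers above map directly to a threshold lookup. A sketch with the tier values copied from the table (the function name is ours):

```python
# Balance thresholds (USD) -> combined flex + active worker cap,
# from the table above, checked from highest tier down.
TIERS = [(900, 60), (700, 50), (500, 40), (300, 30), (200, 20), (100, 10)]

def max_workers(balance):
    """Return the worker cap for a given account balance in USD."""
    for threshold, cap in TIERS:
        if balance >= threshold:
            return cap
    return 5  # default tier
```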
Best practices
| Practice | Benefit |
|---|---|
| Optimize image size | Faster downloads, reduced cold starts |
| Use model caching | Fastest cold starts |
| Test locally first | Catch issues before deployment |
| Use logs and SSH | Debug and optimize effectively |