Learn how to create endpoints and configure their hardware and scaling behavior with the Flash Endpoint class.
In Flash, endpoints are the bridge between your local Python functions and Runpod’s cloud infrastructure. When you decorate a function with @Endpoint, you’re marking it to run remotely on Runpod instead of your local machine:
```python
from runpod_flash import Endpoint, GpuType

@Endpoint(
    name="my-inference",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
    dependencies=["torch"],
)
def run_model(data):
    import torch
    # This code runs on a Runpod GPU, not locally
    return {"result": "processed"}
```
When you call run_model(data), Flash provisions a GPU on Runpod (or reuses an existing one), sends your function code and input to the worker, executes it, and returns the result to your local environment.

Each unique endpoint name creates one Serverless endpoint on Runpod with its own URL, scaling configuration, and hardware allocation. The endpoint manages workers that scale up and down based on demand.
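Conceptually, the decorator intercepts the call and routes it through Runpod instead of executing it in-process. The sketch below is a toy local stand-in for that call-interception pattern, not Flash's actual implementation (the dispatch here is simulated and the function runs locally):

```python
import json

def endpoint(name):
    """Toy stand-in for @Endpoint: wraps a function so each call is
    serialized and 'dispatched' instead of invoked directly."""
    def decorator(fn):
        def wrapper(data):
            # In Flash this payload would be sent to a remote worker;
            # here we only build it, then run the function locally.
            payload = json.dumps({"endpoint": name, "input": data})
            print(f"dispatching {len(payload)} bytes to {name}")
            return fn(data)
        return wrapper
    return decorator

@endpoint("my-inference")
def run_model(data):
    return {"result": "processed"}

print(run_model({"x": 1}))
```

The caller keeps the plain function-call syntax; the wrapper decides where the body actually runs.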
```python
from runpod_flash import Endpoint, GpuType, GpuGroup

# Use a specific GPU type
@Endpoint(name="ml-inference", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
async def infer(data: dict) -> dict: ...

# Use another specific GPU type
@Endpoint(name="rtx-worker", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090)
async def render(data: dict) -> dict: ...

# Use multiple GPU types for better availability
@Endpoint(name="flexible", gpu=[GpuType.NVIDIA_GEFORCE_RTX_4090, GpuType.NVIDIA_RTX_A5000])
async def process(data: dict) -> dict: ...
```
If neither gpu= nor cpu= is specified, gpu defaults to GpuGroup.ANY.
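To see why listing multiple GPU types improves availability, consider first-match selection over a preference list. The helper and the availability set below are purely illustrative (Flash performs this matching internally; pick_gpu is not part of its API):

```python
def pick_gpu(preferred, available):
    """Return the first preferred GPU type currently available, else None.
    Illustrative only; not a Flash function."""
    if isinstance(preferred, str):
        preferred = [preferred]  # a single type is a one-item preference list
    for gpu in preferred:
        if gpu in available:
            return gpu
    return None

# Hypothetical capacity snapshot: only A5000s are free right now.
available_now = {"NVIDIA_RTX_A5000"}
print(pick_gpu(["NVIDIA_GEFORCE_RTX_4090", "NVIDIA_RTX_A5000"], available_now))
```

With a single GPU type, a capacity shortage for that type stalls the endpoint; with a list, the scheduler can fall back to the next acceptable type.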
Control how many workers run for your endpoint with the workers parameter:
```python
from runpod_flash import Endpoint, GpuGroup

# Just a max: scales from 0 to 5
@Endpoint(name="elastic", gpu=GpuGroup.ANY, workers=5)
def elastic_task(data): ...

# Min and max tuple: always keep 2 warm, scale up to 10
@Endpoint(name="always-on", gpu=GpuGroup.ANY, workers=(2, 10))
def always_on_task(data): ...

# Default is (0, 1) if not specified
@Endpoint(name="default", gpu=GpuGroup.ANY)
def default_task(data): ...
```
Setting workers=(1, N) keeps at least one worker warm, avoiding cold starts.
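The rules above map each accepted form of workers to a (min, max) pair. The helper below just restates that mapping for clarity (normalize_workers is illustrative, not a Flash function):

```python
def normalize_workers(workers=None):
    """Map the documented workers forms to a (min, max) pair:
    None -> (0, 1); int N -> (0, N); tuple (lo, hi) -> unchanged."""
    if workers is None:
        return (0, 1)
    if isinstance(workers, int):
        return (0, workers)
    lo, hi = workers
    return (lo, hi)

print(normalize_workers(5))        # (0, 5)
print(normalize_workers((2, 10)))  # (2, 10)
print(normalize_workers())         # (0, 1)
```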
You must import packages inside the decorated function body, not at the top of your file. This ensures imports happen on the remote worker.

Correct: imports inside the function.
```python
from runpod_flash import Endpoint, GpuGroup

@Endpoint(name="compute", gpu=GpuGroup.ANY, dependencies=["numpy"])
def compute(data):
    import numpy as np  # Import here
    return np.sum(data)
```
Incorrect: imports at top of file won’t work.
```python
import numpy as np  # This import happens locally, not on the worker

from runpod_flash import Endpoint, GpuGroup

@Endpoint(name="compute", gpu=GpuGroup.ANY, dependencies=["numpy"])
def compute(data):
    return np.sum(data)  # numpy not available on the remote worker
```
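The underlying distinction is when an import statement executes: a top-level import runs where the file is loaded (your machine), while an import inside the function runs wherever the function body runs (the worker). A plain-Python illustration of the deferred-import pattern, using only a standard-library module:

```python
def compute(data):
    # Deferred import: this line executes only when compute() is called,
    # which on Flash means on the remote worker, not where the file loads.
    import statistics
    return statistics.mean(data)

print(compute([1, 2, 3]))
```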
Pass environment variables using the env parameter:
```python
from runpod_flash import Endpoint, GpuGroup

@Endpoint(
    name="api-worker",
    gpu=GpuGroup.ANY,
    env={
        "HF_TOKEN": "your_huggingface_token",
        "MODEL_ID": "gpt2",
    },
)
async def load_model():
    import os
    from transformers import AutoModel

    hf_token = os.getenv("HF_TOKEN")
    model_id = os.getenv("MODEL_ID")
    model = AutoModel.from_pretrained(model_id, token=hf_token)
    return {"model_loaded": model_id}
```
Environment variables are excluded from configuration hashing. Changing environment values won’t trigger endpoint recreation, making it easy to rotate API keys.
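Because values are read with os.getenv at run time, rotating a key changes only the environment mapping, never the function body. A local sketch of that pattern (the variable names mirror the example above; read_config is an illustrative helper):

```python
import os

def read_config():
    # The worker reads configuration from its environment at run time,
    # so this code is identical before and after a key rotation.
    return {
        "model_id": os.getenv("MODEL_ID", "gpt2"),
        "token_set": os.getenv("HF_TOKEN") is not None,
    }

os.environ["MODEL_ID"] = "gpt2"
os.environ["HF_TOKEN"] = "old_token"
print(read_config())

os.environ["HF_TOKEN"] = "new_token"  # rotate the key; no code change needed
print(read_config())
```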