To convert a serverless template into an endpoint, navigate to Serverless > Endpoints and press
New Endpoint to open the endpoint creation dialog, then configure your endpoint using the settings described below and confirm to deploy it.
Enter any name you would like to use for this endpoint configuration; the resulting endpoint will be assigned a random ID to use when making calls. This name is visible only to you.
Select a serverless template that you would like to use for this particular endpoint.
Select one or more GPUs you want your endpoint to run on.
When multiple GPU sizes are selected, they are prioritized in the chosen order. When an endpoint is created, RunPod allocates as many workers as possible to the first available GPU size in your priority list. Roughly every 60 minutes, the number of allocated workers is reviewed and rebalanced, either moving workers back to your first-priority size or spilling over to other sizes, until your active and max worker counts are met.
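The prioritized allocation with spillover described above can be sketched as follows. This is an illustrative model only; the GPU size names, availability figures, and function name are assumptions, not RunPod's actual implementation.

```python
# Hypothetical sketch of priority-based worker allocation with spillover.
# Sizes, capacities, and the function name are illustrative assumptions.

def allocate_workers(gpu_priority, availability, max_workers):
    """Assign up to max_workers, filling the highest-priority GPU size
    first and spilling over to lower priorities when capacity runs out."""
    allocation = {}
    remaining = max_workers
    for size in gpu_priority:  # iterate in the user's chosen priority order
        take = min(availability.get(size, 0), remaining)
        if take:
            allocation[size] = take
            remaining -= take
        if remaining == 0:
            break
    return allocation

# Example: 5 workers requested, but only 3 first-choice GPUs are available,
# so 2 workers spill over to the second choice.
print(allocate_workers(["48GB", "24GB"], {"48GB": 3, "24GB": 10}, 5))
# -> {'48GB': 3, '24GB': 2}
```

A periodic rebalance then amounts to re-running this allocation against current availability, which is how workers can drift back to your first-priority size as capacity frees up.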
Setting this amount to >1 will result in "always on" workers. This will allow you to have a worker ready to respond to job requests without incurring any cold start delay.
You will incur the cost of any active workers you have set, regardless of whether they are working on a job.
This sets an upper limit on the number of workers your endpoint will have running at any given point.
The number of GPUs you would like assigned to your worker.
Note: Currently only available for 48GB GPUs
The amount of time in seconds a worker not currently processing a job will remain active until it is put back into standby. During the idle period, your worker is considered running and will incur a charge.
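The cost implication of the idle timeout can be sketched with a small billing model. This is a hedged illustration, not RunPod's billing code; the function name and parameters are assumptions.

```python
# Illustrative sketch (not RunPod code): a worker is billed for its job
# time plus any idle time before it returns to standby, and the idle
# period is capped by the configured idle timeout.

def billed_seconds(job_seconds, seconds_until_next_job, idle_timeout):
    """Worker stays running (and billable) while idle, up to idle_timeout."""
    idle = min(seconds_until_next_job, idle_timeout)
    return job_seconds + idle

# A 10s job followed by a 30s gap, with a 5s idle timeout, bills 15s:
print(billed_seconds(10, 30, 5))  # -> 15
```

A longer idle timeout trades extra cost for a better chance that a follow-up request hits a warm worker instead of a cold start.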
RunPod magic to further reduce the average cold-start time of your endpoint. FlashBoot works best when an endpoint receives consistent utilization. There is no additional cost associated with FlashBoot.
Additional settings to help you control where your endpoint is deployed and how it responds to incoming requests.
Control which data centers your workers are deployed to and cached in. By default, all data centers are selected.
Attach a network storage volume to your deployed workers.
Network volumes will be mounted to
- The Queue Delay scaling strategy adjusts worker numbers based on request wait times. With zero workers initially, the first request adds one worker. Subsequent requests add workers only after waiting in the queue for the defined number of delay seconds.
- The Request Count scaling strategy adjusts worker numbers according to the total number of requests in the queue and in progress. It automatically adds workers as the number of requests increases, ensuring tasks are handled efficiently. Total Workers Formula: `Math.ceil((requestsInQueue + requestsInProgress) / <configured request count>)`
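The two strategies above can be sketched as follows. The divisor in the request-count formula is the value configured on the endpoint; the function names and the `target` parameter are illustrative assumptions, not RunPod's internals.

```python
import math

# Hedged sketch of the two scaling strategies described above.

def workers_request_count(requests_in_queue, requests_in_progress, target):
    """Request Count strategy: total workers from the documented formula,
    where target is the request count configured on the endpoint."""
    return math.ceil((requests_in_queue + requests_in_progress) / target)

def should_add_worker_queue_delay(oldest_wait_seconds, delay_seconds,
                                  current_workers):
    """Queue Delay strategy: the first request adds a worker immediately;
    afterwards, a worker is added only once a request has waited in the
    queue past the configured delay."""
    if current_workers == 0:
        return True
    return oldest_wait_seconds >= delay_seconds

# 7 queued + 2 in-progress requests with a target of 4 needs 3 workers:
print(workers_request_count(7, 2, 4))  # -> 3  (ceil(9 / 4))
```

Queue Delay favors steady, latency-tolerant workloads (workers only scale when requests actually wait), while Request Count scales proactively with load.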
Within the selected GPU size category, you can further choose which GPU models you would like your endpoint workers to run on.