serverless - Runpod Documentation

Manage Serverless endpoints, including creating, listing, updating, and deleting endpoints.

runpodctl serverless <subcommand> [flags]

Alias

You can use sls as a shorthand for serverless:

runpodctl sls list

Subcommands

List endpoints

List all your Serverless endpoints:

runpodctl serverless list

List flags

--include-template

bool

Include template information in the output.

--include-workers

bool

Include workers information in the output.

Get endpoint details

Get detailed information about a specific endpoint:

runpodctl serverless get <endpoint-id>

Get flags

--include-template

bool

Include template information in the output.

--include-workers

bool

Include workers information in the output.

Create an endpoint

Create a new Serverless endpoint from a template or from a Hub repo:

# Create from a template
runpodctl serverless create --name "my-endpoint" --template-id "tpl_abc123"

# Create from a Hub repo
runpodctl hub search vllm                                         # Find the hub ID
runpodctl serverless create --hub-id cm8h09d9n000008jvh2rqdsmb --name "my-vllm"

# Create from a Hub repo with custom environment variables
runpodctl serverless create --hub-id cm8h09d9n000008jvh2rqdsmb --name "my-vllm" \
  --env MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  --env MAX_TOKENS=4096

When using --hub-id, GPU IDs and container disk size are automatically pulled from the Hub release config. You can override the GPU type with --gpu-id. Environment variables from the Hub release are included automatically, and you can override or add to them with --env.
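The override-or-add behavior of --env can be pictured as a dictionary merge. The sketch below is only a model of the behavior described above, not the CLI's actual implementation. The Hub release defaults shown (including the EXAMPLE_SETTING key) are hypothetical values invented for illustration.

```python
def effective_env(hub_defaults, cli_env):
    """Model how --env values combine with Hub release defaults."""
    merged = dict(hub_defaults)  # start from the Hub release values
    merged.update(cli_env)       # --env wins on conflicts and adds new keys
    return merged


# Hypothetical defaults, standing in for whatever the Hub release defines.
hub_defaults = {"MODEL_NAME": "example/default-model", "EXAMPLE_SETTING": "on"}

# Values passed on the command line with --env (from the example above).
cli_env = {"MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct", "MAX_TOKENS": "4096"}

merged = effective_env(hub_defaults, cli_env)
print(merged)
```

In this model, MODEL_NAME is overridden, MAX_TOKENS is added, and EXAMPLE_SETTING is carried over unchanged from the release.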

Serverless templates vs Pod templates: Serverless endpoints require a Serverless-specific template. Pod templates (like runpod-torch-v21) cannot be used because they include configuration that Serverless does not support. When creating a template with runpodctl template create, use the --serverless flag to create a Serverless template.

Each Serverless template can be bound to only one endpoint at a time. To create multiple endpoints with the same configuration, create a separate template for each.

Create flags

--name

string

Name for the endpoint.

--template-id

string

Template ID to use (required if --hub-id is not specified). Use runpodctl template search to find templates.

--hub-id

string

Hub listing ID to deploy from (alternative to --template-id). Use runpodctl hub search to find repos.

--gpu-id

string

GPU type for workers. Use runpodctl gpu list to see available GPUs.

--gpu-count

int

Default: 1

Number of GPUs per worker.

--compute-type

string

Default: GPU

Compute type (GPU or CPU).

--workers-min

int

Default: 0

Minimum number of workers.

--workers-max

int

Default: 3

Maximum number of workers.

--data-center-ids

string

Comma-separated list of preferred datacenter IDs. Use runpodctl datacenter list to see available datacenters.

--network-volume-id

string

Network volume ID to attach. Use runpodctl network-volume list to see available network volumes.

--network-volume-ids

string

Comma-separated list of network volume IDs to attach. Use this when attaching multiple network volumes to an endpoint.
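When scripting endpoint creation, the comma-separated flags (--data-center-ids and --network-volume-ids) are easy to build from ordinary lists. The following is a minimal sketch: csv_flag is a hypothetical helper, and the datacenter and volume IDs are placeholders rather than real identifiers. It only constructs and prints the command; it does not run it.

```python
import shlex


def csv_flag(flag, values):
    """Return [flag, "a,b,c"] for a flag that takes a comma-separated list."""
    cleaned = [v.strip() for v in values if v.strip()]
    if not cleaned:
        return []  # omit the flag entirely when there is nothing to pass
    return [flag, ",".join(cleaned)]


# Placeholder IDs; real values come from `runpodctl datacenter list`
# and `runpodctl network-volume list`.
argv = ["runpodctl", "serverless", "create",
        "--name", "my-endpoint", "--template-id", "tpl_abc123"]
argv += csv_flag("--data-center-ids", ["DC_ID_1", "DC_ID_2"])
argv += csv_flag("--network-volume-ids", ["VOLUME_ID_1", "VOLUME_ID_2"])

print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting issues.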

--min-cuda-version

string

Minimum CUDA version required for workers (e.g., 12.4). Workers will only be scheduled on machines that meet this CUDA version requirement.

--scaler-type

string

Default: QUEUE_DELAY

Autoscaler type (QUEUE_DELAY or REQUEST_COUNT). QUEUE_DELAY scales based on queue wait time; REQUEST_COUNT scales based on concurrent requests.

--scaler-value

int

Scaler threshold value. For QUEUE_DELAY, this is the target delay in seconds. For REQUEST_COUNT, this is the number of concurrent requests per worker before scaling.
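Because --scaler-type and --scaler-value are interpreted together, it can help to validate them as a pair before calling the CLI. The sketch below uses a hypothetical helper, scaler_flags; the positive-integer check is an assumption made for this example, not a rule stated by the CLI.

```python
VALID_SCALER_TYPES = {"QUEUE_DELAY", "REQUEST_COUNT"}


def scaler_flags(scaler_type, scaler_value):
    """Build the --scaler-type / --scaler-value pair as CLI arguments."""
    if scaler_type not in VALID_SCALER_TYPES:
        raise ValueError(f"unknown scaler type: {scaler_type!r}")
    # Assumption for this sketch: a threshold below 1 is never intended.
    if not isinstance(scaler_value, int) or scaler_value < 1:
        raise ValueError(f"scaler value must be a positive integer, got {scaler_value!r}")
    return ["--scaler-type", scaler_type, "--scaler-value", str(scaler_value)]


# QUEUE_DELAY: the value is a target queue delay in seconds.
print(scaler_flags("QUEUE_DELAY", 4))

# REQUEST_COUNT: the value is concurrent requests per worker before scaling.
print(scaler_flags("REQUEST_COUNT", 10))
```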

--idle-timeout

int

Idle timeout in seconds. Workers shut down after being idle for this duration. Valid range: 5-3600 seconds.
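A script that sets --idle-timeout can check the documented 5-3600 second range up front instead of waiting for the CLI to reject the value. This is a minimal sketch; idle_timeout_flag is a hypothetical helper, and how the CLI itself reports an out-of-range value is not covered here.

```python
IDLE_TIMEOUT_MIN = 5     # seconds, lower bound of the documented range
IDLE_TIMEOUT_MAX = 3600  # seconds, upper bound of the documented range


def idle_timeout_flag(seconds):
    """Return the --idle-timeout arguments, rejecting out-of-range values."""
    if not IDLE_TIMEOUT_MIN <= seconds <= IDLE_TIMEOUT_MAX:
        raise ValueError(
            f"idle timeout must be {IDLE_TIMEOUT_MIN}-{IDLE_TIMEOUT_MAX} "
            f"seconds, got {seconds}"
        )
    return ["--idle-timeout", str(seconds)]


print(idle_timeout_flag(60))
```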

--flash-boot

bool

Enable or disable flash boot for faster worker startup. When enabled, workers start from cached container images.

--execution-timeout

int

Execution timeout in seconds. Jobs that exceed this duration are terminated. The CLI accepts the value in seconds and converts it to milliseconds internally.
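The seconds-to-milliseconds conversion matters mainly when you compare a value passed to the CLI with one expressed in milliseconds elsewhere. The arithmetic is sketched below; the function names are hypothetical, and where a millisecond value might surface outside the CLI is an assumption of this example.

```python
def seconds_to_ms(seconds):
    """Convert a CLI-style timeout (seconds) to milliseconds."""
    return seconds * 1000


def ms_to_seconds(milliseconds):
    """Convert a millisecond timeout back to whole seconds for the CLI."""
    return milliseconds // 1000


# A 10-minute limit: pass 600 to --execution-timeout.
cli_value = 600
print(seconds_to_ms(cli_value))
print(ms_to_seconds(600_000))
```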

--env

string

Environment variable in KEY=VALUE format. Use multiple --env flags to set multiple variables. When deploying from --hub-id, these values override the Hub release defaults.
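Since each variable needs its own --env flag, a script holding variables in a dictionary has to expand them into repeated flags. The sketch below shows one way to do that. env_flags is a hypothetical helper, and rejecting names that contain "=" is a precaution chosen for this example rather than documented CLI behavior. The command is printed, not executed.

```python
import shlex


def env_flags(env):
    """Expand a dict into repeated `--env KEY=VALUE` arguments."""
    flags = []
    for key, value in env.items():
        # Precaution for this sketch: a name containing "=" would make
        # KEY=VALUE ambiguous, so refuse it along with empty names.
        if not key or "=" in key:
            raise ValueError(f"invalid variable name: {key!r}")
        flags += ["--env", f"{key}={value}"]
    return flags


argv = ["runpodctl", "serverless", "create",
        "--hub-id", "cm8h09d9n000008jvh2rqdsmb", "--name", "my-vllm"]
argv += env_flags({
    "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",
    "MAX_TOKENS": 4096,
})

print(shlex.join(argv))
```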

Update an endpoint

Update endpoint configuration:

runpodctl serverless update <endpoint-id> --workers-max 5

Update flags

--name

string

New name for the endpoint.

--template-id

string

New template ID to swap to. Use this to change the template attached to an existing endpoint without recreating it.

--workers-min

int

New minimum number of workers.

--workers-max

int

New maximum number of workers.

--idle-timeout

int

New idle timeout in seconds.

--scaler-type

string

New autoscaler type (QUEUE_DELAY or REQUEST_COUNT).

--scaler-value

int

New scaler threshold value.

--flash-boot

bool

Enable or disable flash boot for faster worker startup.

--execution-timeout

int

Execution timeout in seconds. Jobs that exceed this duration are terminated.
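For automation, the update flags above can be mapped from keyword arguments so that a script changes only the settings it names. The following is a minimal sketch under stated assumptions: update_command and run_update are hypothetical helpers, ENDPOINT_ID is a placeholder, and --flash-boot is left out because the exact syntax for passing a boolean value is not shown here.

```python
import shlex
import subprocess

# Python keyword -> CLI flag, limited to value-taking update flags listed above.
UPDATE_FLAGS = {
    "name": "--name",
    "template_id": "--template-id",
    "workers_min": "--workers-min",
    "workers_max": "--workers-max",
    "idle_timeout": "--idle-timeout",
    "scaler_type": "--scaler-type",
    "scaler_value": "--scaler-value",
    "execution_timeout": "--execution-timeout",
}


def update_command(endpoint_id, **changes):
    """Build the argv for `runpodctl serverless update <endpoint-id> ...`."""
    argv = ["runpodctl", "serverless", "update", endpoint_id]
    for key, value in changes.items():
        if key not in UPDATE_FLAGS:
            raise ValueError(f"unsupported update option: {key}")
        argv += [UPDATE_FLAGS[key], str(value)]
    return argv


def run_update(endpoint_id, **changes):
    """Run the update; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(update_command(endpoint_id, **changes), check=True)


# Preview the command without executing it.
print(shlex.join(update_command("ENDPOINT_ID", workers_max=5, idle_timeout=30)))
```

Calling run_update("ENDPOINT_ID", workers_max=5) would execute the same command that the earlier example runs by hand.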

Delete an endpoint

Delete an endpoint:

runpodctl serverless delete <endpoint-id>

Serverless URLs

Access your Serverless endpoint using these URL patterns:

Operation	URL
Async request	`https://api.runpod.ai/v2/<endpoint-id>/run`
Sync request	`https://api.runpod.ai/v2/<endpoint-id>/runsync`
Health check	`https://api.runpod.ai/v2/<endpoint-id>/health`
Job status	`https://api.runpod.ai/v2/<endpoint-id>/status/<job-id>`
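The URL patterns in the table can be generated from an endpoint ID, which keeps client code free of hand-typed paths. The sketch below builds the URLs and prepares a request without sending it. Two details are assumptions not stated on this page: that the API expects an Authorization: Bearer header, and that the job payload is a JSON object with an "input" key. The helper names and the placeholder key and IDs are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api.runpod.ai/v2"


def endpoint_url(endpoint_id, operation, job_id=None):
    """Return the URL for run, runsync, health, or status."""
    if operation == "status":
        if job_id is None:
            raise ValueError("status requires a job ID")
        return f"{BASE_URL}/{endpoint_id}/status/{job_id}"
    if operation not in {"run", "runsync", "health"}:
        raise ValueError(f"unknown operation: {operation!r}")
    return f"{BASE_URL}/{endpoint_id}/{operation}"


def build_run_request(endpoint_id, payload, api_key, sync=False):
    """Build (but do not send) a POST request for /run or /runsync."""
    url = endpoint_url(endpoint_id, "runsync" if sync else "run")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; the payload shape is an assumption of this sketch.
req = build_run_request("ENDPOINT_ID", {"input": {"prompt": "hello"}}, "API_KEY")
print(req.get_method(), req.full_url)
print(endpoint_url("ENDPOINT_ID", "status", job_id="JOB_ID"))
```

Passing req to urllib.request.urlopen would send the request; that step is omitted so the example runs without network access or credentials.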
