This guide is for queue-based (i.e. traditional) Serverless endpoints. If you're using load balancing endpoints, the request structure and available routes depend on how you define your HTTP servers.
Serverless endpoints provide synchronous and asynchronous job processing with automatic worker scaling based on demand. This page covers everything from basic input structure and job submission, to monitoring, troubleshooting, and advanced options for queue-based endpoints.

How requests work

After creating a Serverless endpoint, you can start sending it requests to submit jobs and retrieve results. A request can include parameters, payloads, and headers that define what the endpoint should process. For example, you can send a POST request to submit a job, or a GET request to check the status of a job, retrieve results, or check endpoint health.
A job is a unit of work containing the input data from the request, packaged for processing by your workers. If no worker is immediately available, the job is queued; once a worker becomes available, it processes the job using your handler function. When you submit a job request, it can be either synchronous or asynchronous depending on the operation you use:
  • /runsync submits a synchronous job. A response is returned as soon as the job is complete.
  • /run submits an asynchronous job. The job is processed in the background, and you can retrieve the result by sending a GET request to the /status endpoint.
Queue-based endpoints provide a fixed set of operations for submitting and managing jobs. You can find a full list of operations and examples in the sections below.
If you need to create an endpoint that supports custom API paths, use load balancing endpoints.

Request input structure

When submitting a job with /runsync or /run, your request must include a JSON object with the key input, containing the parameters required by your worker's handler function. For example:
{
  "input": {
    "prompt": "Your input here"
  }
}
The exact parameters inside the input object depend on your specific worker implementation. Check your worker’s documentation for required and optional parameters.

Send requests from the console

The quickest way to test your endpoint is directly in the Runpod console. Navigate to the Serverless section, select your endpoint, and click the Requests tab.
You’ll see a default test request that you can modify as needed, then click Run to test your endpoint. On first execution, your workers will need to initialize, which may take a moment. The initial response will look something like this:
{
  "id": "6de99fd1-4474-4565-9243-694ffeb65218-u1",
  "status": "IN_QUEUE"
}
You’ll see the full response after the job completes. If there are any errors, the console will display error logs to help you troubleshoot.

Operation overview

Queue-based endpoints support comprehensive job lifecycle management through multiple operations that allow you to submit, monitor, manage, and retrieve results from jobs. Here’s a quick overview of the operations available for queue-based endpoints:
| Operation | HTTP method | Description |
| --- | --- | --- |
| /runsync | POST | Submit a synchronous job and wait for the complete result in a single response. |
| /run | POST | Submit an asynchronous job that processes in the background and immediately returns a job ID. |
| /status | GET | Check the current status, execution details, and results of a submitted job. |
| /stream | GET | Receive incremental results from a job as they become available. |
| /cancel | POST | Stop a job that is in progress or waiting in the queue. |
| /retry | POST | Requeue a failed or timed-out job using the same job ID and input parameters. |
| /purge-queue | POST | Clear all pending jobs from the queue without affecting jobs already in progress. |
| /health | GET | Monitor the operational status of your endpoint, including worker and job statistics. |

Operation reference

Below you’ll find detailed explanations and examples for each operation using cURL and the Runpod SDK.
You can also send requests using standard HTTP request APIs and libraries, such as fetch (for JavaScript) and requests (for Python).
Before running these examples, you’ll need to install the Runpod SDK:
# Python
python -m pip install runpod

# JavaScript
npm install --save runpod-sdk

# Go
go get github.com/runpod/go-sdk && go mod tidy
You should also set your API key and endpoint ID (found on the Overview tab for your endpoint in the Runpod console) as environment variables. Run the following commands in your local terminal, replacing YOUR_API_KEY and YOUR_ENDPOINT_ID with your actual API key and endpoint ID:
export RUNPOD_API_KEY="YOUR_API_KEY"
export ENDPOINT_ID="YOUR_ENDPOINT_ID"

/runsync

Synchronous jobs wait for completion and return the complete result in a single response. This approach works best for shorter tasks where you need immediate results, interactive applications, and simpler client code without status polling.
  • Payload limit: 20 MB
  • Job availability: Results are available for 60 seconds after completion
curl --request POST \
     --url https://api.runpod.ai/v2/$ENDPOINT_ID/runsync \
     -H "accept: application/json" \
     -H "authorization: $RUNPOD_API_KEY" \
     -H "content-type: application/json" \
     -d '{ "input": {  "prompt": "Hello, world!" }}'

/run

Asynchronous jobs process in the background and return immediately with a job ID. This approach works best for longer-running tasks that don’t require immediate results, operations requiring significant processing time, and managing multiple concurrent jobs.
  • Payload limit: 10 MB
  • Job availability: Results are available for 30 minutes after completion
curl --request POST \
     --url https://api.runpod.ai/v2/$ENDPOINT_ID/run \
     -H "accept: application/json" \
     -H "authorization: $RUNPOD_API_KEY" \
     -H "content-type: application/json" \
    -d '{"input": {"prompt": "Hello, world!"}}'

/status

Check the current state, execution statistics, and results of previously submitted jobs. The status endpoint provides the current job state, execution statistics like queue delay and processing time, and job output if completed.
Replace YOUR_JOB_ID with the actual job ID you received in the response to the /run request.
curl --request GET \
     --url https://api.runpod.ai/v2/$ENDPOINT_ID/status/YOUR_JOB_ID \
     -H "authorization: $RUNPOD_API_KEY" \
You can configure a time-to-live (TTL) for individual jobs by appending a ttl query parameter. For example, https://api.runpod.ai/v2/$ENDPOINT_ID/status/YOUR_JOB_ID?ttl=6000 sets the TTL to 6 seconds (6,000 milliseconds).
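For programmatic status checks, here is a minimal polling sketch using the requests library; the job ID placeholder, poll intervals, and the set of terminal statuses shown are illustrative assumptions:
import os
import time
import requests

job_id = "YOUR_JOB_ID"  # Replace with the ID returned by /run.
url = f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/status/{job_id}"
headers = {"authorization": os.environ["RUNPOD_API_KEY"]}

delay = 1  # Initial poll interval in seconds.
while True:
    status = requests.get(url, headers=headers).json()
    # Assumed terminal states; anything else means the job is still queued or running.
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(delay)
    delay = min(delay * 2, 30)  # Exponential backoff, capped at 30 seconds.

print(status)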

/stream

Receive incremental results as they become available from jobs that generate output progressively. This works especially well for text generation tasks where you want to display output as it's created, long-running jobs where you want to show progress, and large outputs that benefit from incremental processing.
To enable streaming, pass the "return_aggregate_stream": True option to your handler's start method. Once enabled, use the stream method to receive data as it becomes available. For implementation details, see Streaming handlers.
Replace YOUR_JOB_ID with the actual job ID you received in the response to the /run request.
curl --request GET \
     --url https://api.runpod.ai/v2/$ENDPOINT_ID/stream/YOUR_JOB_ID \
     -H "accept: application/json" \
     -H "authorization: $RUNPOD_API_KEY" \
The maximum size for a single streamed payload chunk is 1 MB. Larger outputs will be split across multiple chunks.
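With the Python SDK, streaming is a loop over the job's stream() generator. A minimal sketch, assuming your handler yields output and was started with "return_aggregate_stream": True:
import os
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]
endpoint = runpod.Endpoint(os.environ["ENDPOINT_ID"])

job = endpoint.run({"input": {"prompt": "Hello, world!"}})

# Yields partial results as the worker produces them.
for chunk in job.stream():
    print(chunk)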

/cancel

Stop jobs that are no longer needed or taking too long to complete. This operation stops in-progress jobs, removes queued jobs before they start, and returns immediately with the canceled status.
Replace YOUR_JOB_ID with the actual job ID you received in the response to the /run request.
curl --request POST \
  --url https://api.runpod.ai/v2/$ENDPOINT_ID/cancel/YOUR_JOB_ID \
  -H "authorization: $RUNPOD_API_KEY" \

/retry

Requeue jobs that have failed or timed out without submitting a new request. This operation maintains the same job ID for tracking, requeues the job with its original input parameters, and removes previous output. It can only be used for jobs with FAILED or TIMED_OUT status.
Replace YOUR_JOB_ID with the actual job ID you received in the response to the /run request.
curl --request POST \
     --url https://api.runpod.ai/v2/$ENDPOINT_ID/retry/YOUR_JOB_ID \
     -H "authorization: $RUNPOD_API_KEY"
You’ll see the job status updated to IN_QUEUE when the job is retried:
{
  "id": "60902e6c-08a1-426e-9cb9-9eaec90f5e2b-u1",
  "status": "IN_QUEUE"
}
Job results expire after a set period. Asynchronous job (/run) results are available for 30 minutes, while synchronous job (/runsync) results are available for 60 seconds. Once results expire, jobs cannot be retried.
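In Python, a retry call is a plain POST mirroring the cURL example above; this sketch uses the requests library:
import os
import requests

job_id = "YOUR_JOB_ID"  # Must reference a job in FAILED or TIMED_OUT status.
url = f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/retry/{job_id}"
headers = {"authorization": os.environ["RUNPOD_API_KEY"]}

response = requests.post(url, headers=headers)
print(response.json())  # Expect {"id": ..., "status": "IN_QUEUE"} on success.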

/purge-queue

Remove all pending jobs from the queue when you need to reset or handle multiple cancellations at once. This is useful for error recovery, clearing outdated requests, resetting after configuration changes, and managing resource allocation.
curl --request POST \
     --url https://api.runpod.ai/v2/$ENDPOINT_ID/purge-queue \
     -H "authorization: $RUNPOD_API_KEY"
The /purge-queue operation only affects jobs waiting in the queue. Jobs already in progress will continue to run.
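A minimal Python SDK sketch (assuming your SDK version exposes the purge_queue helper; otherwise, POST to the URL above):
import os
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]
endpoint = runpod.Endpoint(os.environ["ENDPOINT_ID"])

# Drops every queued job; in-progress jobs keep running.
print(endpoint.purge_queue())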

/health

Get a quick overview of your endpoint’s operational status including worker availability, job queue status, potential bottlenecks, and scaling requirements.
curl --request GET \
     --url https://api.runpod.ai/v2/$ENDPOINT_ID/health \
     -H "authorization: $RUNPOD_API_KEY"

vLLM and OpenAI requests

vLLM workers are specialized containers designed to efficiently deploy and serve large language models (LLMs) on Runpod Serverless. vLLM requests use the standard format for endpoint operations, while providing additional flexibility and control over your requests compared to standard endpoints. vLLM workers also support OpenAI compatible requests, enabling you to use familiar OpenAI client libraries with your vLLM endpoints.

Advanced options

Beyond the required input object, you can include optional top-level parameters to enable additional functionality for your queue-based endpoints.

Webhook notifications

Receive notifications when jobs complete by specifying a webhook URL. When your job completes, Runpod will send a POST request to your webhook URL containing the same information as the /status/JOB_ID endpoint.
{
  "input": {
    "prompt": "Your input here"
  },
  "webhook": "https://your-webhook-url.com"
}
Your webhook should return a 200 status code to acknowledge receipt. If the call fails, Runpod will retry up to 2 more times with a 10-second delay between attempts.
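On the receiving end, your webhook only needs to accept a POST and return 200. Here is a minimal sketch using Flask; the framework, port, and route name are illustrative choices, not requirements:
from flask import Flask, request

app = Flask(__name__)

@app.route("/runpod-webhook", methods=["POST"])
def runpod_webhook():
    job = request.get_json()  # Same shape as the /status/JOB_ID response.
    print(job.get("id"), job.get("status"))
    return "", 200  # Acknowledge receipt so Runpod doesn't retry.

if __name__ == "__main__":
    app.run(port=8000)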

Execution policies

Control job execution behavior with custom policies. By default, jobs automatically terminate after 10 minutes without completion to prevent runaway costs.
{
  "input": {
    "prompt": "Your input here"
  },
  "policy": {
    "executionTimeout": 900000,
    "lowPriority": false,
    "ttl": 3600000
  }
}
Policy options:
| Option | Description | Default | Constraints |
| --- | --- | --- | --- |
| executionTimeout | Maximum job runtime in milliseconds | 600000 (10 minutes) | Must be > 5000 ms |
| lowPriority | When true, the job won't trigger worker scaling | false | - |
| ttl | Maximum job lifetime in milliseconds | 86400000 (24 hours) | Must be ≥ 10000 ms, max 1 week |
Setting executionTimeout in a request overrides the default endpoint setting for that specific job only.

S3-compatible storage integration

Configure S3-compatible storage for endpoints working with large files. This configuration is passed directly to your worker but not included in responses.
{
  "input": {
    "prompt": "Your input here"
  },
  "s3Config": {
    "accessId": "BUCKET_ACCESS_KEY_ID",
    "accessSecret": "BUCKET_SECRET_ACCESS_KEY",
    "bucketName": "BUCKET_NAME",
    "endpointUrl": "BUCKET_ENDPOINT_URL"
  }
}
Your worker must contain logic to use this information for storage operations.
S3 integration works with any S3-compatible provider including MinIO, Backblaze B2, DigitalOcean Spaces, and others.

Rate limits and quotas

Runpod enforces rate limits to ensure fair platform usage. These limits apply per endpoint and operation:
| Operation | Method | Rate limit | Concurrent limit |
| --- | --- | --- | --- |
| /runsync | POST | 2000 requests per 10 seconds | 400 concurrent |
| /run | POST | 1000 requests per 10 seconds | 200 concurrent |
| /status | GET | 2000 requests per 10 seconds | 400 concurrent |
| /stream | GET | 2000 requests per 10 seconds | 400 concurrent |
| /cancel | POST | 100 requests per 10 seconds | 20 concurrent |
| /purge-queue | POST | 2 requests per 10 seconds | N/A |
| /openai/* | POST | 2000 requests per 10 seconds | 400 concurrent |
| /requests | GET | 10 requests per 10 seconds | 2 concurrent |
Requests receive a 429 (Too Many Requests) status if the endpoint's queue size exceeds 50 jobs and also exceeds MAX_WORKERS * 500. Implement retry logic with exponential backoff to handle rate limiting gracefully; a sketch follows below.
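As a sketch of that retry logic (the payload and retry parameters here are illustrative):
import os
import time
import requests

url = f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/run"
headers = {"authorization": os.environ["RUNPOD_API_KEY"]}
payload = {"input": {"prompt": "Hello, world!"}}

for attempt in range(5):
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code != 429:
        break
    time.sleep(2 ** attempt)  # Rate limited: back off exponentially before retrying.

print(response.status_code, response.json())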

Best practices

Follow these practices to optimize your queue-based endpoint usage:
  • Use asynchronous requests for jobs that take more than a few seconds to complete.
  • Implement polling with backoff when checking status of asynchronous jobs.
  • Set appropriate timeouts in your client applications and monitor endpoint health regularly to detect issues early.
  • Implement comprehensive error handling for all API calls.
  • Use webhooks for notification-based workflows instead of polling to reduce API calls.
  • Cancel unneeded jobs to free up resources and reduce costs.
  • During development, use the console testing interface before implementing programmatic integration.

Error handling and troubleshooting

When sending requests, be prepared to handle these common errors:
| HTTP status | Meaning | Solution |
| --- | --- | --- |
| 400 | Bad Request | Check your request format and parameters |
| 401 | Unauthorized | Verify your API key is correct and has permission |
| 404 | Not Found | Check your endpoint ID |
| 429 | Too Many Requests | Implement backoff and retry logic |
| 500 | Internal Server Error | Check endpoint logs; worker may have crashed |
Here are some common issues and suggested solutions:
| Issue | Possible causes | Solutions |
| --- | --- | --- |
| Job stuck in queue | No available workers, max workers limit reached | Increase max workers, check endpoint health |
| Timeout errors | Job takes longer than execution timeout | Increase timeout in job policy, optimize job processing |
| Failed jobs | Worker errors, input validation issues | Check endpoint logs, verify input format, retry with fixed input |
| Rate limiting | Too many requests in short time | Implement backoff strategy, batch requests when possible |
| Missing results | Results expired | Retrieve results within expiration window (30 min for async, 1 min for sync) |
Implementing proper error handling and retry logic will make your integrations more robust and reliable.