This guide is for queue-based (i.e. traditional) Serverless endpoints. If you’re using load balancing endpoints, the request structure and endpoints will depend on how you define your HTTP servers.
How requests work
After creating a Serverless endpoint, you can start sending it requests to submit jobs and retrieve results. A request can include parameters, payloads, and headers that define what the endpoint should process. For example, you can send a POST request to submit a job, or a GET request to check the status of a job, retrieve results, or check endpoint health.
A job is a unit of work containing the input data from the request, packaged for processing by your workers. If no worker is immediately available, the job is queued. Once a worker is available, the job is processed by the worker using your handler function.
When you submit a job request, it can be either synchronous or asynchronous depending on the operation you use:
- /runsync submits a synchronous job. A response is returned as soon as the job is complete.
- /run submits an asynchronous job. The job is processed in the background, and you can retrieve the result by sending a GET request to the /status endpoint.
If you need to create an endpoint that supports custom API paths, use load balancing endpoints.
Request input structure
When submitting a job with /runsync or /run, your request must include a JSON object with the key input, containing the parameters required by your worker’s handler function.
The exact parameters inside the input object depend on your specific worker implementation. Check your worker’s documentation for required and optional parameters.
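As a minimal sketch of this structure, the snippet below wraps handler parameters in the required top-level input object. The prompt parameter and the build_request_body helper are hypothetical and used only for illustration; the keys your handler expects depend on your worker.

```python
import json


def build_request_body(handler_params: dict) -> str:
    """Wrap handler parameters in the required top-level "input" object."""
    if not isinstance(handler_params, dict):
        raise TypeError("handler parameters must be a JSON object (dict)")
    return json.dumps({"input": handler_params})


# "prompt" is a hypothetical parameter; use the ones your handler defines.
body = build_request_body({"prompt": "Your prompt"})
print(body)  # {"input": {"prompt": "Your prompt"}}
```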
Send requests from the console
The quickest way to test your endpoint is directly in the Runpod console. Navigate to the Serverless section, select your endpoint, and click the Requests tab.
Operation overview
Queue-based endpoints support comprehensive job lifecycle management through multiple operations that allow you to submit, monitor, manage, and retrieve results from jobs. Here’s a quick overview of the operations available for queue-based endpoints:
Operation | HTTP method | Description |
---|---|---|
/runsync | POST | Submit a synchronous job and wait for the complete results in a single response. |
/run | POST | Submit an asynchronous job that processes in the background, and returns an immediate job ID. |
/status | GET | Check the current status, execution details, and results of a submitted job. |
/stream | GET | Receive incremental results from a job as they become available. |
/cancel | POST | Stop a job that is in progress or waiting in the queue. |
/retry | POST | Requeue a failed or timed-out job using the same job ID and input parameters. |
/purge-queue | POST | Clear all pending jobs from the queue without affecting jobs already in progress. |
/health | GET | Monitor the operational status of your endpoint, including worker and job statistics. |
Operation reference
Below you’ll find detailed explanations and examples for each operation using cURL and the Runpod SDK.
You can also send requests using standard HTTP request APIs and libraries, such as fetch (for JavaScript) and requests (for Python).
In the examples, replace YOUR_API_KEY and YOUR_ENDPOINT_ID with your actual API key and endpoint ID.
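The following sketch shows one way to assemble operation URLs and request headers in Python. The base URL comes from the status example in this guide. Several things are assumptions for illustration rather than confirmed API details: the environment variable names RUNPOD_API_KEY and RUNPOD_ENDPOINT_ID, the helper names, the bearer-style Authorization header, and the idea that /stream, /cancel, and /retry take the job ID as a path suffix the same way /status does.

```python
import os

BASE_URL = "https://api.runpod.ai/v2"

# HTTP method for each operation, as listed in the operation overview table.
OPERATION_METHODS = {
    "runsync": "POST", "run": "POST", "status": "GET", "stream": "GET",
    "cancel": "POST", "retry": "POST", "purge-queue": "POST", "health": "GET",
}

# Assumption: these operations take the job ID as a path suffix, like /status.
JOB_OPERATIONS = {"status", "stream", "cancel", "retry"}


def operation_url(endpoint_id, operation, job_id=None):
    """Build the URL for an operation on a queue-based endpoint."""
    if operation not in OPERATION_METHODS:
        raise ValueError(f"unknown operation: {operation}")
    url = f"{BASE_URL}/{endpoint_id}/{operation}"
    if operation in JOB_OPERATIONS:
        if not job_id:
            raise ValueError(f"/{operation} requires a job ID")
        url = f"{url}/{job_id}"
    return url


def auth_headers(api_key):
    """Assumption: the API key is sent as a bearer token."""
    return {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}


# Hypothetical environment variable names, falling back to the placeholders.
endpoint_id = os.environ.get("RUNPOD_ENDPOINT_ID", "YOUR_ENDPOINT_ID")
api_key = os.environ.get("RUNPOD_API_KEY", "YOUR_API_KEY")
print(OPERATION_METHODS["status"], operation_url(endpoint_id, "status", "YOUR_JOB_ID"))
```

Reading credentials from the environment keeps the API key out of source files.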
/runsync
Synchronous jobs wait for completion and return the complete result in a single response. This approach works best for shorter tasks where you need immediate results, interactive applications, and simpler client code without status polling.
- Payload limit: 20 MB
- Job availability: Results are available for 60 seconds after completion
/run
Asynchronous jobs process in the background and return immediately with a job ID. This approach works best for longer-running tasks that don’t require immediate results, operations requiring significant processing time, and managing multiple concurrent jobs.
- Payload limit: 10 MB
- Job availability: Results are available for 30 minutes after completion
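To make the difference between the two submission modes concrete, here is a hedged sketch that builds (but does not send) a job request with Python's standard library and checks the payload limits listed above. The build_job_request helper, the bearer-style Authorization header, the prompt parameter, and the reading of "MB" as 1024 * 1024 bytes are assumptions for illustration.

```python
import json
import urllib.request

MB = 1024 * 1024  # assumption: the documented limits are counted in mebibytes
PAYLOAD_LIMITS = {"runsync": 20 * MB, "run": 10 * MB}


def build_job_request(endpoint_id, api_key, job_input, sync=False):
    """Build a POST request for /runsync (sync=True) or /run (sync=False)."""
    operation = "runsync" if sync else "run"
    body = json.dumps({"input": job_input}).encode("utf-8")
    if len(body) > PAYLOAD_LIMITS[operation]:
        raise ValueError(f"payload exceeds the /{operation} limit")
    return urllib.request.Request(
        url=f"https://api.runpod.ai/v2/{endpoint_id}/{operation}",
        data=body,
        # Assumption: the API key is sent as a bearer token.
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


# "prompt" is a hypothetical handler parameter.
request = build_job_request("YOUR_ENDPOINT_ID", "YOUR_API_KEY", {"prompt": "Your prompt"})
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would submit the job. With /run, the response should carry the job ID to use with the other operations.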
/status
Check the current state of a previously submitted job, its execution statistics (such as queue delay and processing time), and its output once the job has completed.
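A common client pattern is to poll /status with increasing delays until the job reaches a final state. The sketch below assumes a caller-supplied fetch_status function (hypothetical) that performs the GET request and returns the parsed JSON response as a dict with a status field. FAILED and TIMED_OUT appear in this guide; COMPLETED and CANCELLED are assumed names for the other final states.

```python
import time

# FAILED and TIMED_OUT are named in this guide; the other two are assumptions.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "TIMED_OUT", "CANCELLED"}


def poll_status(fetch_status, job_id, initial_delay=1.0, max_delay=30.0,
                max_attempts=20, sleep=time.sleep):
    """Call fetch_status(job_id) until the job reaches a terminal status.

    The wait doubles after every non-terminal response, capped at max_delay.
    """
    delay = initial_delay
    for _ in range(max_attempts):
        result = fetch_status(job_id)
        if result.get("status") in TERMINAL_STATUSES:
            return result
        sleep(delay)
        delay = min(delay * 2, max_delay)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} checks")


# Usage with a stand-in for the real HTTP call.
responses = iter([{"status": "IN_QUEUE"}, {"status": "IN_PROGRESS"},
                  {"status": "COMPLETED", "output": "done"}])
waits = []
final = poll_status(lambda job_id: next(responses), "YOUR_JOB_ID", sleep=waits.append)
print(final["status"], waits)
```

Injecting sleep keeps the function easy to test; in real use the default time.sleep applies.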
Replace YOUR_JOB_ID with the actual job ID you received in the response to the /run request.
You can configure time-to-live (TTL) for individual jobs by appending a ttl query parameter, specified in milliseconds. For example, https://api.runpod.ai/v2/$ENDPOINT_ID/status/YOUR_JOB_ID?ttl=6000 sets the TTL to 6 seconds.
/stream
Receive incremental results as they become available from jobs that generate output progressively. This works especially well for text generation tasks where you want to display output as it’s created, long-running jobs where you want to show progress, and large outputs that benefit from incremental processing.
To enable streaming, your handler must support the "return_aggregate_stream": True option on the start method of your handler. Once enabled, use the stream method to receive data as it becomes available.
For implementation details, see Streaming handlers.
Replace YOUR_JOB_ID with the actual job ID you received in the response to the /run request.
The maximum size for a single streamed payload chunk is 1 MB. Larger outputs will be split across multiple chunks.
/cancel
Stop jobs that are no longer needed or taking too long to complete. This operation stops in-progress jobs, removes queued jobs before they start, and returns immediately with the canceled status.
Replace YOUR_JOB_ID with the actual job ID you received in the response to the /run request.
/retry
Requeue jobs that have failed or timed out without submitting a new request. This operation maintains the same job ID for tracking, requeues with original input parameters, and removes previous output. It can only be used for jobs with FAILED or TIMED_OUT status.
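A client can check these conditions before calling /retry. The sketch below combines the status rule with the result retention windows given for /run and /runsync in this guide, since expired jobs cannot be retried. The can_retry helper and its parameters are hypothetical.

```python
RETRYABLE_STATUSES = {"FAILED", "TIMED_OUT"}

# Result retention per submission type, in seconds (30 minutes and 1 minute).
RESULT_RETENTION_SECONDS = {"run": 30 * 60, "runsync": 60}


def can_retry(status, operation, seconds_since_completion):
    """Return True if a job looks eligible for /retry.

    operation is "run" or "runsync", i.e. how the job was originally submitted.
    """
    if status not in RETRYABLE_STATUSES:
        return False
    return seconds_since_completion <= RESULT_RETENTION_SECONDS[operation]


print(can_retry("FAILED", "run", 120))         # True
print(can_retry("TIMED_OUT", "runsync", 120))  # False: past the 1-minute window
print(can_retry("COMPLETED", "run", 10))       # False: status is not retryable
```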
Replace YOUR_JOB_ID with the actual job ID you received in the response to the /run request.
The job status is updated to IN_QUEUE when the job is retried.
Job results expire after a set period. Results of asynchronous jobs (/run) are available for 30 minutes, while results of synchronous jobs (/runsync) are available for 1 minute. Once expired, jobs cannot be retried.
/purge-queue
Remove all pending jobs from the queue when you need to reset or handle multiple cancellations at once. This is useful for error recovery, clearing outdated requests, resetting after configuration changes, and managing resource allocation.
The /purge-queue operation only affects jobs waiting in the queue. Jobs already in progress will continue to run.
/health
Get a quick overview of your endpoint’s operational status including worker availability, job queue status, potential bottlenecks, and scaling requirements.
vLLM and OpenAI requests
vLLM workers are specialized containers designed to efficiently deploy and serve large language models (LLMs) on Runpod Serverless. vLLM requests use the standard format for endpoint operations, while providing additional flexibility and control over your requests compared to standard endpoints. vLLM workers also support OpenAI compatible requests, enabling you to use familiar OpenAI client libraries with your vLLM endpoints.
Advanced options
Beyond the required input object, you can include optional top-level parameters to enable additional functionality for your queue-based endpoints.
Webhook notifications
Receive notifications when jobs complete by specifying a webhook URL. When your job completes, Runpod will send a POST request to your webhook URL containing the same information as the /status/JOB_ID endpoint.
Your webhook should return a 200 status code to acknowledge receipt. If the call fails, Runpod will retry up to 2 more times with a 10-second delay between attempts.
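The sketch below illustrates both sides of a webhook flow: a request body that carries a webhook URL next to input, and a minimal receiver that acknowledges notifications with a 200 status. It makes several assumptions: that the top-level key is named webhook, that the notification body is a JSON object with id and status fields, and the example.com URL. The helper and class names are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler


def build_body_with_webhook(job_input, webhook_url):
    """Assumption: the webhook URL goes in a top-level "webhook" key."""
    return json.dumps({"input": job_input, "webhook": webhook_url})


class WebhookHandler(BaseHTTPRequestHandler):
    """Minimal receiver that acknowledges each notification with 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        if isinstance(payload, dict):
            # Assumption: the body mirrors the /status response fields.
            print("job update:", payload.get("id"), payload.get("status"))
        self.send_response(200)  # acknowledge receipt so the call is not retried
        self.end_headers()


body = build_body_with_webhook({"prompt": "Your prompt"}, "https://example.com/webhook")
print(body)
```

To try the receiver locally, it could be served with http.server.HTTPServer(("", 8000), WebhookHandler).serve_forever(). A production receiver would also need to be reachable from the public internet and should respond quickly.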
Execution policies
Control job execution behavior with custom policies. By default, jobs automatically terminate after 10 minutes without completion to prevent runaway costs.
Option | Description | Default | Constraints |
---|---|---|---|
executionTimeout | Maximum job runtime in milliseconds | 600000 (10 minutes) | Must be > 5000 ms |
lowPriority | When true, job won’t trigger worker scaling | false | - |
ttl | Maximum job lifetime in milliseconds | 86400000 (24 hours) | Must be ≥ 10000 ms, max 1 week |
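The constraints in this table can be validated on the client before a job is submitted. The following sketch assumes the options are grouped under a top-level policy key next to input. That key name and the build_policy helper are assumptions for illustration; the defaults and bounds are taken from the table.

```python
import json

MIN_EXECUTION_TIMEOUT_MS = 5_000       # executionTimeout must be greater than this
MIN_TTL_MS = 10_000                    # ttl must be at least this
MAX_TTL_MS = 7 * 24 * 60 * 60 * 1000   # one week


def build_policy(execution_timeout_ms=600_000, low_priority=False, ttl_ms=86_400_000):
    """Validate the documented constraints and return the policy options."""
    if execution_timeout_ms <= MIN_EXECUTION_TIMEOUT_MS:
        raise ValueError("executionTimeout must be greater than 5000 ms")
    if not MIN_TTL_MS <= ttl_ms <= MAX_TTL_MS:
        raise ValueError("ttl must be between 10000 ms and 1 week")
    return {
        "executionTimeout": execution_timeout_ms,
        "lowPriority": low_priority,
        "ttl": ttl_ms,
    }


# Assumption: the options are sent under a top-level "policy" key.
body = json.dumps({
    "input": {"prompt": "Your prompt"},  # hypothetical handler parameter
    "policy": build_policy(execution_timeout_ms=900_000),
})
print(body)
```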
Setting executionTimeout in a request overrides the default endpoint setting for that specific job only.
S3-compatible storage integration
Configure S3-compatible storage for endpoints working with large files. This configuration is passed directly to your worker but not included in responses.
S3 integration works with any S3-compatible provider including MinIO, Backblaze B2, DigitalOcean Spaces, and others.
Rate limits and quotas
Runpod enforces rate limits to ensure fair platform usage. These limits apply per endpoint and operation:
Operation | Method | Rate Limit | Concurrent Limit |
---|---|---|---|
/runsync | POST | 2000 requests per 10 seconds | 400 concurrent |
/run | POST | 1000 requests per 10 seconds | 200 concurrent |
/status | GET | 2000 requests per 10 seconds | 400 concurrent |
/stream | GET | 2000 requests per 10 seconds | 400 concurrent |
/cancel | POST | 100 requests per 10 seconds | 20 concurrent |
/purge-queue | POST | 2 requests per 10 seconds | N/A |
/openai/* | POST | 2000 requests per 10 seconds | 400 concurrent |
/requests | GET | 10 requests per 10 seconds | 2 concurrent |
Requests receive a 429 (Too Many Requests) status if the queue size exceeds 50 jobs AND the queue size exceeds MAX_WORKERS * 500. Implement appropriate retry logic with exponential backoff to handle rate limiting gracefully.
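The queue-size condition and a simple backoff schedule can be sketched as follows. Both helper names are hypothetical, and the base delay, growth factor, and cap are illustrative values rather than documented recommendations.

```python
def queue_limit_exceeded(queue_size, max_workers):
    """Mirror the stated condition: both thresholds must be exceeded."""
    return queue_size > 50 and queue_size > max_workers * 500


def backoff_delays(attempts, base=1.0, factor=2.0, cap=30.0):
    """Exponential backoff schedule in seconds: base, base*factor, ... capped."""
    return [min(base * factor ** i, cap) for i in range(attempts)]


print(queue_limit_exceeded(600, 1))  # True: 600 > 50 and 600 > 500
print(queue_limit_exceeded(60, 1))   # False: 60 does not exceed 1 * 500
print(backoff_delays(5, cap=10.0))   # [1.0, 2.0, 4.0, 8.0, 10.0]
```

In practice, adding a small random jitter to each delay helps keep many clients from retrying at the same moment.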
Best practices
Follow these practices to optimize your queue-based endpoint usage:
- Use asynchronous requests for jobs that take more than a few seconds to complete.
- Implement polling with backoff when checking status of asynchronous jobs.
- Set appropriate timeouts in your client applications and monitor endpoint health regularly to detect issues early.
- Implement comprehensive error handling for all API calls.
- Use webhooks for notification-based workflows instead of polling to reduce API calls.
- Cancel unneeded jobs to free up resources and reduce costs.
- During development, use the console testing interface before implementing programmatic integration.
Error handling and troubleshooting
When sending requests, be prepared to handle these common errors:
HTTP Status | Meaning | Solution |
---|---|---|
400 | Bad Request | Check your request format and parameters |
401 | Unauthorized | Verify your API key is correct and has permission |
404 | Not Found | Check your endpoint ID |
429 | Too Many Requests | Implement backoff and retry logic |
500 | Internal Server Error | Check endpoint logs; worker may have crashed |
Here are some common issues and suggested solutions:
Issue | Possible Causes | Solutions |
---|---|---|
Job stuck in queue | No available workers, max workers limit reached | Increase max workers, check endpoint health |
Timeout errors | Job takes longer than execution timeout | Increase timeout in job policy, optimize job processing |
Failed jobs | Worker errors, input validation issues | Check endpoint logs, verify input format, retry with fixed input |
Rate limiting | Too many requests in short time | Implement backoff strategy, batch requests when possible |
Missing results | Results expired | Retrieve results within expiration window (30 min for async, 1 min for sync) |