The @Endpoint decorator handles most use cases, letting you execute arbitrary Python code remotely without managing Docker images. For specialized environments that need a custom image, however, you can deploy your own with Endpoint(image=...).

When to use custom Docker images

Use custom Docker images when you need:
  • Pre-built inference servers: vLLM, TensorRT-LLM, or other specialized serving frameworks.
  • System-level dependencies: Custom CUDA versions, cuDNN, or system libraries not installable via pip.
  • Baked-in models: Large models pre-downloaded in the image to avoid runtime downloads.
  • Existing Serverless workers: You already have a working Runpod Serverless Docker image.
For most use cases, use @Endpoint with the dependencies parameter. It’s simpler, faster, and lets you execute arbitrary Python code remotely.

Available Docker images

Official Runpod workers

Runpod provides pre-built worker images for common frameworks:
Framework        Image name                      Documentation
vLLM             runpod/worker-vllm              vLLM docs
Automatic1111    runpod/worker-a1111:stable      Docker Hub
ComfyUI          runpod/worker-comfy             Docker Hub

Custom images

To create a custom Docker image:
  1. Build a handler function to process requests (a minimal handler sketch follows this list).
  2. Create a Dockerfile to build the image.
  3. Push the image to a registry.
  4. Reference the image with Endpoint(image=...).
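For step 1, here's a minimal sketch of a handler, assuming your image installs the standard runpod SDK (the file name and echo logic are illustrative placeholders):

# handler.py
import runpod

def handler(job):
    # job["input"] carries the payload you send via endpoint.run({"input": {...}})
    prompt = job["input"].get("prompt", "")
    # Replace this placeholder with your model or business logic.
    return {"echo": prompt}

# Start the Serverless worker loop when the container launches.
runpod.serverless.start({"handler": handler})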

Deploy a custom image

Step 1: Create an Endpoint with your image

from runpod_flash import Endpoint, GpuType

vllm = Endpoint(
    name="my-vllm-server",
    image="runpod/worker-vllm:stable-cuda12.1.0",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
    workers=3,
    env={
        "MODEL_NAME": "microsoft/Phi-3.5-mini-instruct",
        "MAX_MODEL_LEN": "4096"
    }
)
Step 2: Make requests

Use HTTP methods to call your deployed image:
import asyncio

async def main():
    # POST request
    result = await vllm.post("/v1/completions", {
        "prompt": "Explain quantum computing:",
        "max_tokens": 100
    })
    print(result)

    # GET request
    models = await vllm.get("/v1/models")
    print(models)

asyncio.run(main())
Or use queue-based calls:
import asyncio

async def main():
    # Submit job to queue
    job = await vllm.run({
        "input": {
            "prompt": "Explain quantum computing:",
            "max_tokens": 100
        }
    })

    # Wait for completion
    await job.wait()
    print(job.output)

asyncio.run(main())

Complete example: vLLM inference

This example deploys vLLM and makes inference requests:
import asyncio
from runpod_flash import Endpoint, GpuType

# Configure vLLM endpoint
vllm = Endpoint(
    name="vllm-phi",
    image="runpod/worker-vllm:stable-cuda12.1.0",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
    workers=3,
    env={
        "MODEL_NAME": "microsoft/Phi-3.5-mini-instruct",
        "MAX_MODEL_LEN": "4096",
        "GPU_MEMORY_UTILIZATION": "0.9",
        "MAX_CONCURRENCY": "30",
    }
)

async def main():
    # Generate text using queue-based call
    job = await vllm.run({
        "input": {
            "prompt": "Explain quantum computing in simple terms:",
            "max_tokens": 100,
            "temperature": 0.7
        }
    })

    await job.wait()

    # Extract the generated text
    text = job.output[0]['choices'][0]['tokens'][0]
    print(f"Generated text: {text}")

if __name__ == "__main__":
    asyncio.run(main())

Configuration options

All standard Endpoint parameters work with custom images:
from runpod_flash import Endpoint, GpuType, NetworkVolume, PodTemplate

vol = NetworkVolume(name="model-storage")

vllm = Endpoint(
    name="custom-vllm",
    image="your-registry/image:tag",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=(0, 5),
    idle_timeout=600,  # 10 minutes
    env={
        "MODEL_PATH": "/models/llama",
        "MAX_BATCH_SIZE": "32"
    },
    volume=vol,
    execution_timeout_ms=300000,  # 5 minutes
    template=PodTemplate(containerDiskInGb=100)
)

CPU endpoints

For CPU workloads, use the cpu parameter:
from runpod_flash import Endpoint

cpu_worker = Endpoint(
    name="cpu-worker",
    image="your-registry/cpu-worker:latest",
    cpu="cpu5c-4-8"  # 4 vCPU, 8GB RAM
)
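
You call a CPU endpoint the same way as a GPU endpoint. For example, a queue-based request (the payload keys here are placeholders):
import asyncio

async def main():
    # Submit a job to the CPU worker and wait for the result
    job = await cpu_worker.run({"input": {"task": "resize", "width": 512}})
    await job.wait()
    print(job.output)

asyncio.run(main())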

Request/response format

Queue-based requests

Use .run() with a dictionary payload in the format {"input": {...}}:
job = await endpoint.run({
    "input": {
        "param1": "value1",
        "param2": "value2"
    }
})

await job.wait()
print(job.output)  # Worker response
print(job.error)   # Error message if failed

HTTP requests

Use .get(), .post(), .put(), .delete() for direct HTTP calls:
# POST request
result = await endpoint.post("/v1/completions", {"prompt": "Hello"})

# GET request
models = await endpoint.get("/v1/models")

# With custom headers
result = await endpoint.post(
    "/v1/completions",
    {"prompt": "Hello"},
    headers={"X-Custom-Header": "value"}
)

EndpointJob reference

The .run() method returns an EndpointJob for async operations:
job = await endpoint.run({"input": {...}})

# Properties
job.id        # Job ID
job.output    # Result payload (after completion)
job.error     # Error message if failed
job.done      # True if completed/failed

# Methods
await job.status()           # Get current status
await job.wait(timeout=60)   # Wait for completion
await job.cancel()           # Cancel the job
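
For example, a small helper that combines these pieces, submitting a job, waiting for it, and surfacing any worker error (the helper name and payload shape are illustrative):
async def run_and_check(endpoint, payload):
    # Submit a queue-based job and block until it finishes (up to 120 seconds)
    job = await endpoint.run({"input": payload})
    await job.wait(timeout=120)
    if job.error:
        raise RuntimeError(f"Job {job.id} failed: {job.error}")
    return job.output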

Limitations

  • Input format: Queue-based calls require {"input": {...}} format.
  • Code execution: Cannot execute arbitrary Python code remotely. Your Docker image must include all logic.
  • @Endpoint decorator: The decorator pattern doesn’t work with image=. Use the instance pattern instead.
  • Handler required: Your Docker image must implement a Runpod Serverless handler function.

Troubleshooting

Endpoint fails to initialize

Problem: Workers fail to start or crash immediately.
Solutions:
  • Verify your Docker image is compatible with Runpod Serverless.
  • Check environment variables are correct.
  • Ensure the image includes a valid handler function.
  • Check worker logs in the Runpod console.

Out of memory errors

Problem: Workers crash with CUDA OOM or RAM errors.
Solutions:
  • Use a larger GPU: gpu=GpuType.NVIDIA_A100_80GB_PCIe
  • Reduce GPU_MEMORY_UTILIZATION for vLLM.
  • Lower MAX_MODEL_LEN or batch size.
  • Reduce workers to limit parallel execution.
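For example, a more conservative variant of the earlier vLLM configuration might look like this (the specific values are illustrative, not tuned recommendations):
from runpod_flash import Endpoint, GpuType

vllm = Endpoint(
    name="vllm-phi",
    image="runpod/worker-vllm:stable-cuda12.1.0",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,  # larger GPU
    workers=1,                          # fewer parallel workers
    env={
        "MODEL_NAME": "microsoft/Phi-3.5-mini-instruct",
        "MAX_MODEL_LEN": "2048",          # shorter context window
        "GPU_MEMORY_UTILIZATION": "0.8",  # leave more memory headroom
    }
)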

Authentication errors

Problem: Cannot download gated models or private images.
Solutions:
  • Add HF_TOKEN to env for Hugging Face gated models.
  • Configure Docker registry authentication in Runpod console for private images.
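For example, you can read the token from your local environment and pass it through env (the model name is just an example of a gated model):
import os
from runpod_flash import Endpoint, GpuType

vllm = Endpoint(
    name="vllm-gated",
    image="runpod/worker-vllm:stable-cuda12.1.0",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
    env={
        "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",  # gated model example
        "HF_TOKEN": os.environ["HF_TOKEN"],  # Hugging Face access token
    }
)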

Next steps