
Overview

Use the runpod/worker-v1-vllm:stable-cuda12.1.0 image to deploy a vLLM Worker. By setting the MODEL_NAME parameter, the vLLM Worker can serve most Hugging Face LLMs, and it is compatible with OpenAI's API. You can also use RunPod's input request format.
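For example, a minimal sketch of sending a request in RunPod's input request format from Python might look like the following. The endpoint ID is a placeholder, and the prompt and sampling_params fields are assumptions based on the vLLM Worker's input schema; see the vLLM Worker repository for the authoritative format.

```python
import os
import requests

# Placeholder endpoint ID; replace with your own Serverless Endpoint ID.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

# RunPod's input request format wraps the worker payload in an "input" object.
# The "prompt" and "sampling_params" fields are assumptions based on the
# vLLM Worker's documented input schema.
payload = {
    "input": {
        "prompt": "Explain what a Serverless Endpoint is in one sentence.",
        "sampling_params": {"max_tokens": 100, "temperature": 0.7},
    }
}

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
print(response.json())
```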

RunPod's vLLM Serverless Endpoint Worker is a highly optimized solution for leveraging the power of various LLMs.

For more information, see the vLLM Worker repository.

Key features

  • Ease of Use: Deploy any LLM using the pre-built Docker image without the hassle of building custom Docker images yourself, uploading heavy models, or waiting for lengthy downloads.
  • OpenAI Compatibility: Seamlessly integrate with OpenAI's API by changing two lines of code, with support for Chat Completions, Completions, and Models, in both streaming and non-streaming modes (see the sketch after this list).
  • Dynamic Batch Size: Get the rapid time-to-first-token of unbatched requests combined with the high throughput of larger batch sizes (this applies to token batching when streaming output).
  • Extensive Model Support: Deploy almost any LLM from Hugging Face, including your own.
  • Customization: Take full control over every aspect of your deployment, from model settings to tokenizer options to system configuration, all through environment variables.
  • Speed: Experience the speed of the vLLM Engine.
  • Serverless Scalability and Cost-Effectiveness: Scale your deployment to handle any number of requests and only pay for active usage.
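
As an illustration of the OpenAI compatibility above, the sketch below points the OpenAI Python SDK at a RunPod endpoint by changing only the api_key and base_url lines. The /openai/v1 base URL pattern, the endpoint ID, and the example model name are assumptions; verify them against your deployment.

```python
import os
from openai import OpenAI

ENDPOINT_ID = "your-endpoint-id"  # placeholder

# The two changed lines: api_key and base_url now point at your RunPod endpoint.
client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
)

# The rest is unchanged OpenAI client usage; streaming also works with stream=True.
response = client.chat.completions.create(
    model="openchat/openchat-3.5-0106",  # assumed example MODEL_NAME
    messages=[{"role": "user", "content": "Hello!"}],
    stream=False,
)
print(response.choices[0].message.content)
```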

Compatible models

You can deploy most models from Hugging Face. For a full list of supported model architectures, see Compatible model architectures.

Getting started

At a high level, you can set up the vLLM Worker by:

  • Selecting your deployment options
  • Configuring any necessary environment variables
  • Deploying your model

For detailed guidance on setting up, configuring, and deploying your vLLM Serverless Endpoint Worker, including compatibility details, environment variable settings, and usage examples, see Get started.
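
To make the environment variable step concrete, the sketch below lists the kind of settings you might configure on your endpoint. MODEL_NAME is documented above; the other variable names are assumptions drawn from the vLLM Worker repository, so check them against the Environment variables reference.

```python
# Illustrative only: these key/value pairs represent environment variables you
# would set on the Serverless Endpoint (via the Web UI or API), not a Python API.
env_vars = {
    "MODEL_NAME": "mistralai/Mistral-7B-Instruct-v0.2",  # Hugging Face model to serve
    "HF_TOKEN": "<your Hugging Face token>",              # assumed name; needed for gated models
    "MAX_MODEL_LEN": "8192",                              # assumed name; maximum context length
}
```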

Deployment options

  • Configurable Endpoints: (recommended) Use RunPod's Web UI to quickly deploy an OpenAI-compatible LLM with the vLLM Worker.

  • Pre-built Docker image: Leverage the pre-configured Docker image for hassle-free deployment, ideal for users seeking a quick and straightforward setup process.

  • Custom Docker image: For advanced users, customize and build your own Docker image with the model baked in, offering greater control over the deployment process.

For more information on creating a custom Docker image, see Build Docker Image with Model Inside.

Next steps

  • Get started: Learn how to deploy a vLLM Worker as a Serverless Endpoint, with detailed guides on configuration and sending requests.
  • Configurable Endpoints: Select your Hugging Face model and vLLM takes care of the low-level details of model loading, hardware configuration, and execution.
  • Environment variables: Explore the environment variables available for the vLLM Worker, including detailed documentation and examples.
  • Run Gemma 7b: Walk through deploying Google's Gemma model with RunPod's vLLM Worker, setting up a Serverless Endpoint with a gated large language model (LLM).