Overview

Use the runpod/worker-v1-vllm:stable-cuda12.1.0 image to deploy a vLLM Worker. The vLLM Worker can serve most Hugging Face LLMs (selected with the MODEL_NAME parameter) and is compatible with OpenAI's API. You can also use RunPod's native input request format.

RunPod's vLLM Serverless Endpoint Workers are a highly optimized solution for leveraging the power of a wide range of LLMs.

For more information, see the vLLM Worker repository.
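
As a quick orientation, here is a minimal sketch of calling a deployed endpoint with RunPod's native input request format. The endpoint ID and the exact fields inside `input` (the `prompt` and `sampling_params` shown here) are illustrative assumptions; check the vLLM Worker repository for the authoritative request schema.

```python
import os
import requests

# Hypothetical endpoint ID; replace with the ID shown on your Serverless endpoint.
ENDPOINT_ID = "abc123"
API_KEY = os.environ["RUNPOD_API_KEY"]

# RunPod's native format wraps the worker payload in an "input" object.
# The prompt/sampling_params fields below are illustrative assumptions.
payload = {
    "input": {
        "prompt": "Explain what a Serverless endpoint is in one sentence.",
        "sampling_params": {"max_tokens": 128, "temperature": 0.7},
    }
}

# /runsync waits for the result; /run would return a job ID to poll instead.
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
print(response.json())
```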

Key features

  • Ease of Use: Deploy any LLM using the pre-built Docker image without the hassle of building custom Docker images yourself, uploading heavy models, or waiting for lengthy downloads.
  • OpenAI Compatibility: Seamlessly integrate with OpenAI's API by changing two lines of code, supporting Chat Completions, Completions, and Models, with both streaming and non-streaming responses (see the sketch after this list).
  • Dynamic Batch Size: Get the rapid time-to-first-token of unbatched requests together with the high throughput of larger batch sizes (this controls how tokens are batched when streaming output).
  • Extensive Model Support: Deploy almost any LLM from Hugging Face, including your own.
  • Customization: Take full control of every aspect of your deployment, from model settings to tokenizer options to system configuration and much more, all through environment variables.
  • Speed: Experience the speed of the vLLM Engine.
  • Serverless Scalability and Cost-Effectiveness: Scale your deployment to handle any number of requests and only pay for active usage.
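
For example, here is a minimal sketch of pointing the official OpenAI Python client at a RunPod endpoint. The only changes from standard OpenAI usage are the api_key and base_url lines; the endpoint ID and model name are placeholder assumptions, so substitute your own values.

```python
import os
from openai import OpenAI

# Hypothetical endpoint ID and model name; replace with your own.
ENDPOINT_ID = "abc123"
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"

# The two lines that differ from standard OpenAI usage: api_key and base_url.
client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
)

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```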

Compatible models

You can deploy most models from Hugging Face. For a full list of supported model architectures, see Compatible model architectures.

Getting started

At a high level, you can set up the vLLM Worker by:

  • Selecting your deployment options
  • Configuring any necessary environment variables
  • Deploying your model

For detailed instructions, configuration options, and usage examples, see Get started.
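
To illustrate the environment-variable-driven configuration, the sketch below collects example settings into a .env file (usable locally with `docker run --env-file .env`). Only MODEL_NAME comes from this page; the other variable names and values are assumptions, so verify them against the Environment variables reference before relying on them.

```python
# Sketch: gather the environment variables you plan to set on the endpoint.
# Only MODEL_NAME is confirmed by this page; the other names are assumptions.
env_vars = {
    "MODEL_NAME": "mistralai/Mistral-7B-Instruct-v0.2",
    "HF_TOKEN": "<your Hugging Face token>",  # assumed name; needed for gated models
    "MAX_MODEL_LEN": "8192",                  # assumed name; caps the context length
}

# Write a .env file that can be passed to `docker run --env-file .env`.
with open(".env", "w") as handle:
    for key, value in env_vars.items():
        handle.write(f"{key}={value}\n")
```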

Deployment options

  • Configurable Endpoints: (recommended) Use RunPod's web UI to quickly deploy an OpenAI-compatible LLM with the vLLM Worker.

  • Pre-built Docker image: Leverage a pre-configured Docker image for hassle-free deployment. Ideal for users seeking a quick and straightforward setup process.

  • Custom Docker image: For advanced users, customize and build your own Docker image with the model baked in, offering greater control over the deployment process.

For more information on creating a custom Docker image, see Build Docker Image with Model Inside.

Next steps

  • Get started: Learn how to deploy a vLLM Worker as a Serverless Endpoint, with detailed guides on configuration and sending requests.
  • Configurable Endpoints: Select your Hugging Face model and vLLM takes care of the low-level details of model loading, hardware configuration, and execution.
  • Environment variables: Explore the environment variables available for the vLLM Worker, including detailed documentation and examples.
  • Run Gemma 7b: Walk through deploying Google's Gemma model with RunPod's vLLM Worker, setting up a Serverless Endpoint with a gated large language model (LLM).