Overview
Use the runpod/worker-v1-vllm:stable-cuda12.1.0
image to deploy a vLLM Worker.
The vLLM Worker can serve most Hugging Face LLMs (selected with the MODEL_NAME
parameter) and is compatible with OpenAI's API.
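Because the Worker is OpenAI-compatible, you can point the official OpenAI Python client at your deployed endpoint. The sketch below is a minimal example, not a definitive recipe: the endpoint ID, API key, and model name are placeholders, and it assumes the endpoint's OpenAI-compatible route at `/openai/v1`.

```python
# Minimal sketch: querying a deployed vLLM Worker through the OpenAI Python client.
# <RUNPOD_ENDPOINT_ID>, <RUNPOD_API_KEY>, and <MODEL_NAME> are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="<RUNPOD_API_KEY>",
    base_url="https://api.runpod.ai/v2/<RUNPOD_ENDPOINT_ID>/openai/v1",
)

response = client.chat.completions.create(
    model="<MODEL_NAME>",  # the same model set via the MODEL_NAME environment variable
    messages=[{"role": "user", "content": "Explain what a Serverless Endpoint is."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```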
You can also use RunPod's input
request format.
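As a sketch of the native format, the example below posts to the synchronous `/runsync` route with the `requests` library. The endpoint ID and API key are placeholders, and the `sampling_params` fields are an assumption; see the vLLM Worker repository for the full input schema.

```python
# Minimal sketch: RunPod's native input request format sent to /runsync.
# <RUNPOD_ENDPOINT_ID> and <RUNPOD_API_KEY> are placeholders; the
# sampling_params fields are assumptions -- check the vLLM Worker
# repository for the full input schema.
import requests

url = "https://api.runpod.ai/v2/<RUNPOD_ENDPOINT_ID>/runsync"
headers = {"Authorization": "Bearer <RUNPOD_API_KEY>"}
payload = {
    "input": {
        "prompt": "Explain what a Serverless Endpoint is.",
        "sampling_params": {"temperature": 0.7, "max_tokens": 128},
    }
}

response = requests.post(url, headers=headers, json=payload, timeout=120)
print(response.json())
```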
RunPod's vLLM Serverless Endpoint Workers are a highly optimized solution for leveraging the power of various LLMs.
For more information, see the vLLM Worker repository.
Key features
- Ease of Use: Deploy any LLM using the pre-built Docker image without the hassle of building custom Docker images yourself, uploading heavy models, or waiting for lengthy downloads.
- OpenAI Compatibility: Seamlessly integrate with OpenAI's API by changing just two lines of code, with support for Chat Completions, Completions, and Models in both streaming and non-streaming modes (see the streaming sketch after this list).
- Dynamic Batch Size: Get the rapid time-to-first-token of unbatched requests together with the high throughput of larger batch sizes (related to how tokens are batched when streaming output).
- Extensive Model Support: Deploy almost any LLM from Hugging Face, including your own.
- Customization: Have full control over the configuration of every aspect of your deployment, from the model settings, to tokenizer options, to system configurations, and much more, all done through environment variables.
- Speed: Experience the speed of the vLLM Engine.
- Serverless Scalability and Cost-Effectiveness: Scale your deployment to handle any number of requests and only pay for active usage.
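As a sketch of the streaming support mentioned above, the OpenAI client can consume tokens as the Worker generates them. The endpoint ID, API key, and model name are again placeholders.

```python
# Minimal streaming sketch with the OpenAI Python client.
# <RUNPOD_ENDPOINT_ID>, <RUNPOD_API_KEY>, and <MODEL_NAME> are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="<RUNPOD_API_KEY>",
    base_url="https://api.runpod.ai/v2/<RUNPOD_ENDPOINT_ID>/openai/v1",
)

stream = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,  # tokens arrive in chunks as they are generated
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```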
Compatible models
You can deploy most models from Hugging Face. For a full list of supported model architectures, see Compatible model architectures.
Getting started
At a high level, you can set up the vLLM Worker by:
- Selecting your deployment options
- Configuring any necessary environment variables
- Deploying your model
For detailed instructions, configuration options, and usage examples, see Get started.
Deployment options
- Configurable Endpoints: (Recommended) Use RunPod's web UI to quickly deploy an OpenAI-compatible LLM with the vLLM Worker.
- Pre-built Docker image: Leverage the pre-configured Docker image for hassle-free deployment. Ideal for users seeking a quick and straightforward setup process.
- Custom Docker image: For advanced users, customize and build your own Docker image with the model baked in, offering greater control over the deployment process.
For more information on creating a custom Docker image, see Build Docker Image with Model Inside.
Next steps
- Get started: Learn how to deploy a vLLM Worker as a Serverless Endpoint, with detailed guides on configuration and sending requests.
- Configurable Endpoints: Select your Hugging Face model, and the vLLM Worker takes care of the low-level details of model loading, hardware configuration, and execution.
- Environment variables: Explore the environment variables available for the vLLM Worker, including detailed documentation and examples.
- Run Gemma 7b: Walk through deploying Google's Gemma model with RunPod's vLLM Worker, including how to set up a Serverless Endpoint with a gated large language model (LLM).