Learn how to deploy a large language model (LLM) using Runpod’s preconfigured vLLM workers. By the end of this guide, you’ll have a fully functional API endpoint that you can use to handle LLM inference requests.
In this tutorial, you’ll learn how to:
- Select a compatible LLM from Hugging Face.
- Deploy a vLLM worker using Runpod’s Quick Deploy option.
- Send a test request to your endpoint.
- Customize your deployment with environment variables.
First, decide which LLM you want to deploy. The vLLM worker supports most Hugging Face models, including:
- `meta-llama/Llama-3.2-3B-Instruct`
- `mistralai/Ministral-8B-Instruct-2410`
- `Qwen/Qwen3-8B`
- `openchat/openchat-3.5-0106`
- `google/gemma-3-1b-it`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
- `microsoft/Phi-4-mini-instruct`

For this walkthrough, we’ll use `openchat/openchat-3.5-0106`, but you can substitute it with any compatible model.
The easiest way to deploy a vLLM worker is through the Runpod console:
1. Navigate to the Serverless page.
2. Under Quick Deploy, find Serverless vLLM and click Configure.
3. In the deployment modal, enter `openchat/openchat-3.5-0106` as the Hugging Face model.
4. In the vLLM settings modal, under LLM Settings, set the maximum model length to `8192` (or an appropriate context length for your model).
5. In the endpoint configuration modal, under Worker Configuration:
   - Set active workers to `0` for cost savings or `1` for faster response times.
   - Set max workers to `2` (or higher for more concurrent capacity).
   - Set GPUs per worker to `1` (increase for larger models).
   - Leave other settings at their defaults unless you have specific requirements.
6. Click Deploy.
Your endpoint will now begin initializing. This may take several minutes while Runpod provisions resources and downloads your model.
For more details on how to optimize your endpoint, see Endpoint configurations.
While your endpoint is initializing, keep in mind what you’re building: a scalable LLM inference API that’s compatible with both Runpod’s native API and the OpenAI client.
Once deployment is complete, make a note of your Endpoint ID; you’ll need it to make API requests.
To test your worker, click the Requests tab in the endpoint detail page:
On the left, you should see the default test request. Leave the default input as is and click Run. The system may take a few minutes to initialize your workers.
When the workers finish processing your request, the generated output appears on the right side of the page.
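You can also send the same kind of request from your own code using your Endpoint ID and a Runpod API key. The following is a minimal sketch using Python’s `requests` library; it assumes the standard serverless `/runsync` route and the vLLM worker’s `prompt` input format, and the placeholder values are for illustration only.

```python
# Minimal sketch: send a test request from Python instead of the console.
# Replace the placeholders with your own Endpoint ID and Runpod API key.
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"   # shown on the endpoint detail page
API_KEY = "YOUR_RUNPOD_API_KEY"

url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
payload = {
    "input": {
        "prompt": "Write a haiku about GPUs."
        # Additional generation parameters can be added here;
        # see "Send vLLM requests" for the supported fields.
    }
}

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=300,  # cold starts can take a while
)
print(response.json())
```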
If you need to customize your model deployment, you can edit your endpoint settings to add environment variables. Here are some useful environment variables you might want to set:
- `MAX_MODEL_LEN`: Maximum context length (e.g., `16384`).
- `DTYPE`: Data type for model weights (`float16`, `bfloat16`, or `float32`).
- `GPU_MEMORY_UTILIZATION`: Controls VRAM usage (e.g., `0.95` for 95%).
- `CUSTOM_CHAT_TEMPLATE`: For models that need a custom chat template.
- `OPENAI_SERVED_MODEL_NAME_OVERRIDE`: Changes the model name to use in OpenAI requests.

To add or modify environment variables, edit your endpoint settings in the Runpod console.
You can find a full list of available environment variables in the vLLM worker GitHub README.
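As an illustration, a deployment tuned for longer prompts might combine values like these (illustrative values drawn from the examples above; adjust them for your model and GPU):

```
MAX_MODEL_LEN=16384
DTYPE=bfloat16
GPU_MEMORY_UTILIZATION=0.95
```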
You may also wish to adjust the input parameters for your request. For example, use the `max_tokens` parameter to increase the maximum number of tokens generated per response. To learn more, see Send vLLM requests.
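Because the endpoint is compatible with the OpenAI client, you can also pass parameters like `max_tokens` through it directly. Here is a minimal sketch, assuming the vLLM worker’s `/openai/v1` base URL; the API key, Endpoint ID, and prompt are placeholders.

```python
# Minimal sketch: call the endpoint through the OpenAI-compatible route.
# The model name matches the Hugging Face ID you deployed (or the value of
# OPENAI_SERVED_MODEL_NAME_OVERRIDE, if you set that environment variable).
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)

completion = client.chat.completions.create(
    model="openchat/openchat-3.5-0106",
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    max_tokens=256,   # raise this if responses are getting cut off
    temperature=0.7,
)
print(completion.choices[0].message.content)
```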
If you encounter issues with your deployment, such as out-of-memory errors, try a GPU with more VRAM or reduce `MAX_MODEL_LEN`.

Congratulations! You’ve successfully deployed a vLLM worker on Runpod’s Serverless platform. You now have a powerful, scalable LLM inference API that’s compatible with both the OpenAI client and Runpod’s native API.
Next, you can try sending requests from your own applications (see Send vLLM requests) or tuning your endpoint further (see Endpoint configurations).