dstack is an open-source tool that simplifies the orchestration of Pods for AI and ML workloads. You define your application and resource requirements in YAML configuration files, and dstack automates the provisioning and management of cloud resources on Runpod, letting you focus on your application logic rather than on infrastructure.
In this guide, we’ll walk through setting up dstack with Runpod to deploy vLLM. We’ll serve the meta-llama/Llama-3.1-8B-Instruct model from Hugging Face using a Python environment.
On your local machine, you will need:

- pip (or pip3 on macOS)
- curl

These instructions apply to macOS, Linux, and Windows systems.
Windows Users: Where commands differ between platforms, Command Prompt and PowerShell variants are provided below.
Prepare Your Workspace
Open a terminal or command prompt and create a new directory for this tutorial:
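For example (the directory name is up to you; this guide assumes runpod-dstack-tutorial):

```bash
mkdir runpod-dstack-tutorial
cd runpod-dstack-tutorial
```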
Set Up a Python Virtual Environment
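On macOS or Linux, for example (the environment name .venv is just a convention):

```bash
python3 -m venv .venv
source .venv/bin/activate
```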
Command Prompt:
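For example:

```cmd
python -m venv .venv
.venv\Scripts\activate.bat
```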
PowerShell:
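For example:

```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
```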
Install dstack
Use pip to install dstack:
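A typical installation looks like this (the [all] extra also installs the dstack server):

```bash
pip install "dstack[all]" -U
```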
Note: If pip3 is not available, you may need to install it or use pip.
Create the Global Configuration File
The following config.yml file is a global configuration used by dstack for all deployments on your computer. It’s essential to place it in the correct configuration directory.
Create the configuration directory:
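On macOS or Linux, dstack looks for its server configuration under ~/.dstack/server:

```bash
mkdir -p ~/.dstack/server
```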
Command Prompt or PowerShell:
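For example (in PowerShell, use $env:USERPROFILE in place of %USERPROFILE%):

```cmd
mkdir %USERPROFILE%\.dstack\server
```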
Navigate to the configuration directory:
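For example, on macOS or Linux (on Windows, cd into %USERPROFILE%\.dstack\server instead):

```bash
cd ~/.dstack/server
```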
Create the config.yml File
In the configuration directory, create a file named config.yml with the following content:
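The exact file isn’t reproduced here; a minimal example following dstack’s server configuration format for the runpod backend looks like this:

```yaml
projects:
  - name: main
    backends:
      - type: runpod
        creds:
          type: api_key
          api_key: YOUR_RUNPOD_API_KEY
```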
Replace YOUR_RUNPOD_API_KEY with the API key you obtained from Runpod.
Start the dstack Server
From the configuration directory, start the dstack server:
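The dstack CLI provides a server subcommand for this:

```bash
dstack server
```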
You should see output indicating that the server is running:
The ADMIN-TOKEN displayed is important for accessing the dstack web UI.
To access the web UI, open http://127.0.0.1:3000 in your browser and enter the ADMIN-TOKEN from the server output.
Open a new terminal or command prompt window.
Navigate to your tutorial directory:
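For example, if you used the directory name from earlier:

```bash
cd runpod-dstack-tutorial
```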
Activate the Python Virtual Environment
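On macOS or Linux:

```bash
source .venv/bin/activate
```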
Command Prompt:
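For example:

```cmd
.venv\Scripts\activate.bat
```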
PowerShell:
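For example:

```powershell
.venv\Scripts\Activate.ps1
```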
Create and navigate to a new directory for the deployment task:
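For example (the directory name is arbitrary):

```bash
mkdir task-vllm-llama31
cd task-vllm-llama31
```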
Create the .dstack.yml File
Create a file named .dstack.yml (or dstack.yml if your system doesn’t allow filenames starting with a dot) with the following content:
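The original file contents aren’t shown here; the sketch below outlines a dstack task that installs vLLM and serves the model. The task name, GPU size, and context length are assumptions you can adjust:

```yaml
type: task
name: vllm-llama31

python: "3.11"

# The Hugging Face token is required because the Llama 3.1 weights are gated.
env:
  - HF_TOKEN=YOUR_HUGGING_FACE_HUB_TOKEN
  - MODEL=meta-llama/Llama-3.1-8B-Instruct

commands:
  - pip install vllm
  - vllm serve $MODEL --max-model-len 8192

# Forward port 8000 from the pod to localhost.
ports:
  - 8000

resources:
  gpu: 24GB
```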
Replace YOUR_HUGGING_FACE_HUB_TOKEN with your actual Hugging Face access token (read access is enough), or define the token in your environment variables. Without this token, the model cannot be downloaded, as it is gated.
Run the following command in the directory where your .dstack.yml file is located:
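The command itself is missing above; in the standard dstack workflow this step is dstack init, which prepares the current directory for use with the dstack server:

```bash
dstack init
```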
Deploy the task by applying the configuration:
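For the example task above:

```bash
dstack apply -f .dstack.yml
```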
You will see an output summarizing the deployment configuration and available instances.
When prompted, type y and press Enter to confirm.
The ports configuration provides port forwarding from the deployed pod to localhost, allowing you to access the deployed vLLM server via localhost:8000.
After executing dstack apply, you’ll see all the steps that dstack performs:
The logs of vLLM will be displayed in the terminal.
To monitor the logs at any time, run:
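Assuming the task name used in the example configuration above:

```bash
dstack logs vllm-llama31
```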
Wait until you see logs indicating that vLLM is serving the model, such as:
Since the ports configuration forwards port 8000 from the deployed pod to localhost, you can access the vLLM server via http://localhost:8000.
Use the following curl command to test the deployed model:
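vLLM exposes an OpenAI-compatible API, so a chat-completions request against the forwarded port looks like this (the prompt and max_tokens are only examples):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50
  }'
```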
Command Prompt:
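For example (escaping the JSON quotes for cmd.exe):

```cmd
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}], \"max_tokens\": 50}"
```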
PowerShell:
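For example (using curl.exe, since curl is an alias for Invoke-WebRequest in PowerShell):

```powershell
curl.exe http://localhost:8000/v1/chat/completions `
  -H "Content-Type: application/json" `
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "What is the capital of France?"}], "max_tokens": 50}'
```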
You should receive a JSON response similar to the following:
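The exact output isn’t reproduced here; an OpenAI-style chat completion response generally has this shape (IDs and token counts are illustrative):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "completion_tokens": 8,
    "total_tokens": 25
  }
}
```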
This confirms that the model is running and responding as expected.
To avoid incurring additional costs, it’s important to stop the task when you’re finished.
In the terminal where you ran dstack apply, you can stop the task by pressing Ctrl + C.
You’ll be prompted to confirm; type y and press Enter to stop the task.
The instance will terminate automatically after stopping the task.
If you wish to ensure the instance is terminated immediately, you can run:
Check your Runpod dashboard or the dstack web UI to ensure that the instance has been terminated.
If you need to retain data between runs or cache models to reduce startup times, you can use volumes.
Create a separate dstack file named volume.dstack.yml with the following content:
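The original contents aren’t shown; a sketch of a Runpod volume in dstack’s volume configuration format (the name, region, and size below are examples):

```yaml
type: volume
name: llama31-volume

backend: runpod
# Example region; pick one that offers the GPUs you need.
region: EU-SE-1

# Size of the persistent volume.
size: 100GB
```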
The region setting ties your volume to a specific region, which then also ties your Pod to that same region.
Apply the volume configuration:
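For the example file above:

```bash
dstack apply -f volume.dstack.yml
```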
This will create the volume named llama31-volume.
Modify your .dstack.yml file to include the volume:
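Continuing the earlier task sketch, one way to attach the volume is a volumes section that maps it to a path inside the container, for example:

```yaml
volumes:
  - name: llama31-volume
    path: /data
```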
This configuration will mount the volume to the /data directory inside your container.
By doing this, you can store models and data persistently, which can be especially useful for large models that take time to download.
For more information on using volumes with Runpod, refer to the dstack blog on volumes.
By leveraging dstack on Runpod, you can efficiently deploy and manage Pods, accelerating your development workflow and reducing operational overhead.