This tutorial walks you through using dstack to deploy a vLLM server running the `meta-llama/Llama-3.1-8B-Instruct` model from Hugging Face using a Python environment.
Prerequisites
- A Runpod account with an API key
- On your local machine:
  - Python 3.8 or higher
  - `pip` (or `pip3` on macOS)
  - Basic utilities: `curl`
- These instructions are applicable to macOS, Linux, and Windows systems.
Windows Users
- It’s recommended to use WSL (Windows Subsystem for Linux) or a tool like Git Bash to follow along with the Unix-like commands used in this tutorial.
- Alternatively, you can use PowerShell or Command Prompt and adjust the commands accordingly.
Installation
Setting Up the dstack Server
- Prepare Your Workspace
  Open a terminal or command prompt and create a new directory for this tutorial:
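For example (the directory name `dstack-vllm` is an arbitrary choice):

```bash
mkdir dstack-vllm
cd dstack-vllm
```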
- Set Up a Python Virtual Environment
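A typical setup on macOS/Linux (Windows users should adjust the activation command, as noted above):

```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```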
- Install dstack
  Use `pip` to install dstack. Note: If `pip3` is not available, you may need to install it or use `pip`.
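For example, using the form commonly shown in the dstack docs (the `[all]` extra, which pulls in optional dependencies, is an assumption; a plain `pip install dstack` also works):

```bash
pip install -U "dstack[all]"
```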
Configuring dstack for Runpod
- Create the Global Configuration File
  The following `config.yml` file is a global configuration used by dstack for all deployments on your computer. It’s essential to place it in the correct configuration directory.
  - Create the configuration directory:
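dstack reads its global server configuration from `~/.dstack/server`, so create that directory (sketch for macOS/Linux):

```bash
mkdir -p ~/.dstack/server
```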
  - Navigate to the configuration directory:
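```bash
cd ~/.dstack/server
```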
  - Create the `config.yml` File
    In the configuration directory, create a file named `config.yml` with the content shown below, replacing `YOUR_RUNPOD_API_KEY` with the API key you obtained from Runpod.
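A minimal sketch of the Runpod backend configuration (`main` is dstack's default project name):

```yaml
projects:
  - name: main
    backends:
      - type: runpod
        creds:
          type: api_key
          api_key: YOUR_RUNPOD_API_KEY
```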
- Start the dstack Server
  From the configuration directory, start the dstack server:
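```bash
dstack server
```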
You should see output indicating that the server is running:
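The exact output depends on your dstack version, but it looks roughly like this (illustrative):

```
The admin token is ADMIN-TOKEN
The server is running at http://127.0.0.1:3000
```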
  The `ADMIN-TOKEN` displayed is important for accessing the dstack web UI.
- Access the dstack Web UI
  - Open your web browser and navigate to `http://127.0.0.1:3000`.
  - When prompted for an admin token, enter the `ADMIN-TOKEN` from the server output.
  - The web UI allows you to monitor and manage your deployments.

Deploying vLLM as a Task
Step 1: Configure the Deployment Task
- Prepare for Deployment
  - Open a new terminal or command prompt window.
  - Navigate to your tutorial directory:
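Assuming the directory name used earlier:

```bash
cd dstack-vllm
```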
  - Activate the Python Virtual Environment:
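```bash
source venv/bin/activate  # On Windows: venv\Scripts\activate
```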
  - Create a Directory for the Task:
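For example (again, the directory name is an arbitrary choice):

```bash
mkdir vllm-task
cd vllm-task
```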
- Create the dstack Configuration File
  - Create the `.dstack.yml` File
    Create a file named `.dstack.yml` (or `dstack.yml` if your system doesn’t allow filenames starting with a dot) with the following content:
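A sketch of a vLLM task configuration; the run name, Python version, GPU size, and vLLM flags are assumptions to adapt to your needs:

```yaml
type: task
name: llama31-vllm

python: "3.11"

# The token is required because the model is gated on Hugging Face.
env:
  - HUGGING_FACE_HUB_TOKEN=YOUR_HUGGING_FACE_HUB_TOKEN
  - MODEL=meta-llama/Llama-3.1-8B-Instruct

commands:
  - pip install vllm
  - vllm serve $MODEL --max-model-len 8192

# Forward port 8000 of the pod to localhost:8000.
ports:
  - 8000

resources:
  gpu: 24GB
```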
    Replace `YOUR_HUGGING_FACE_HUB_TOKEN` with your actual Hugging Face access token (read access is enough) or define the token in your environment variables. Without this token, the model cannot be downloaded, as it is gated.
Step 2: Initialize and Deploy the Task
- Initialize dstack
  Run the following command in the directory where the `.dstack.yml` file is located:
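```bash
dstack init
```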
- Apply the Configuration
  Run the `dstack apply` command shown below. You will see output summarizing the deployment configuration and available instances.
  - When prompted, type `y` and press `Enter` to confirm.
  - The `ports` configuration provides port forwarding from the deployed pod to `localhost`, allowing you to access the deployed vLLM via `localhost:8000`.
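```bash
dstack apply -f .dstack.yml
```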
- Monitor the Deployment
  - After executing `dstack apply`, you’ll see all the steps that dstack performs:
    - Provisioning the pod on Runpod.
    - Downloading the Docker image.
    - Installing required packages.
    - Downloading the model from Hugging Face.
    - Starting the vLLM server.
  - The logs of vLLM will be displayed in the terminal.
  - To monitor the logs at any time, run:
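Assuming the run name from the task file above:

```bash
dstack logs llama31-vllm
```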
  - Wait until you see logs indicating that vLLM is serving the model, such as:
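For example, vLLM's OpenAI-compatible server prints Uvicorn startup lines similar to these (illustrative):

```
INFO:     Started server process [1]
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```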
Step 3: Test the Model Server
- Access the Service
  Since the `ports` configuration forwards port `8000` from the deployed pod to `localhost`, you can access the vLLM server via `http://localhost:8000`.
- Test the Service Using `curl`
  Use the following `curl` command to test the deployed model:
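A sketch against vLLM's OpenAI-compatible chat completions endpoint; the prompt and `max_tokens` are arbitrary:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50
  }'
```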
- Verify the Response
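You should receive an OpenAI-style JSON completion, roughly like this (illustrative; ids, content, and usage numbers will differ):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ]
}
```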
Step 4: Clean Up
To avoid incurring additional costs, it’s important to stop the task when you’re finished.
- Stop the Task
  In the terminal where you ran `dstack apply`, you can stop the task by pressing `Ctrl + C`.
  You’ll be prompted to confirm; type `y` and press `Enter` to confirm stopping the task.
- Terminate the Instance
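If the instance is still running after `Ctrl + C`, you can stop the run explicitly (the run name below matches the task file above):

```bash
dstack stop llama31-vllm
```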
- Verify Termination
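You can list runs with dstack to confirm nothing is still active, and double-check in the Runpod console that the pod is gone:

```bash
dstack ps
```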
Additional Tips: Using Volumes for Persistent Storage
If you need to retain data between runs or cache models to reduce startup times, you can use volumes.
Creating a Volume
Create a separate dstack file named `volume.dstack.yml` with the following content:
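A sketch; the region and size are example values to adapt (Runpod region names look like `EU-SE-1`):

```yaml
type: volume
name: llama31-volume

backend: runpod
region: EU-SE-1

# Size of the volume to provision
size: 100GB
```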
The `region` ties your volume to a specific region, which then also ties your Pod to that same region. Applying this file creates a volume named `llama31-volume`.
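Create the volume by applying the file:

```bash
dstack apply -f volume.dstack.yml
```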
Using the Volume in Your Task
Modify your `.dstack.yml` file to include the volume:
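A sketch of the additions (the mount path `/data` matches the text below):

```yaml
volumes:
  - name: llama31-volume
    path: /data
```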
This mounts the volume at the `/data` directory inside your container.
By doing this, you can store models and data persistently, which can be especially useful for large models that take time to download.
For more information on using volumes with Runpod, refer to the dstack blog on volumes.