Manage Pods with dstack on RunPod
dstack is an open-source tool that simplifies the orchestration of Pods for AI and ML workloads. By defining your application and resource requirements in YAML configuration files, it automates the provisioning and management of cloud resources on RunPod, allowing you to focus on your application logic rather than the infrastructure.
In this guide, we'll walk through setting up dstack with RunPod to deploy vLLM. We'll serve the meta-llama/Llama-3.1-8B-Instruct model from Hugging Face using a Python environment.
Prerequisites
- A RunPod account with an API key
- On your local machine:
  - Python 3.8 or higher
  - pip (or pip3 on macOS)
  - Basic utilities: curl
- These instructions apply to macOS, Linux, and Windows systems.
Windows Users
- It's recommended to use WSL (Windows Subsystem for Linux) or tools like Git Bash to follow along with the Unix-like commands used in this tutorial
- Alternatively, Windows users can use PowerShell or Command Prompt and adjust commands accordingly
Installation
Setting Up the dstack Server
- Prepare Your Workspace
Open a terminal or command prompt and create a new directory for this tutorial:
mkdir runpod-dstack-tutorial
cd runpod-dstack-tutorial
- Set Up a Python Virtual Environment
macOS / Linux:
python3 -m venv .venv
source .venv/bin/activate
Windows (Command Prompt):
python -m venv .venv
.venv\Scripts\activate
Windows (PowerShell):
python -m venv .venv
.venv\Scripts\Activate.ps1
- Install dstack
Use pip to install dstack:
macOS:
pip3 install -U "dstack[all]"
Note: If pip3 is not available, you may need to install it or use pip.
Linux / Windows:
pip install -U "dstack[all]"
Configuring dstack for RunPod
- Create the Global Configuration File
The following config.yml file is a global configuration used by dstack for all deployments on your computer. It's essential to place it in the correct configuration directory.
- Create the configuration directory:
macOS / Linux:
mkdir -p ~/.dstack/server
Windows (Command Prompt or PowerShell):
mkdir %USERPROFILE%\.dstack\server
- Navigate to the configuration directory:
macOS / Linux:
cd ~/.dstack/server
Windows:
cd %USERPROFILE%\.dstack\server
- Create the config.yml File
In the configuration directory, create a file named config.yml with the following content:
projects:
  - name: main
    backends:
      - type: runpod
        creds:
          type: api_key
          api_key: YOUR_RUNPOD_API_KEY
Replace YOUR_RUNPOD_API_KEY with the API key you obtained from RunPod.
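If you'd rather not paste the key into the file by hand, you can generate it from an environment variable (a minimal sketch; it assumes you have already exported RUNPOD_API_KEY in your shell and reproduces the config.yml layout shown above):
# write_config.py -- writes ~/.dstack/server/config.yml from an env variable
import os
from pathlib import Path

api_key = os.environ["RUNPOD_API_KEY"]  # assumed to be exported beforehand
config_dir = Path.home() / ".dstack" / "server"
config_dir.mkdir(parents=True, exist_ok=True)
(config_dir / "config.yml").write_text(
    "projects:\n"
    "  - name: main\n"
    "    backends:\n"
    "      - type: runpod\n"
    "        creds:\n"
    "          type: api_key\n"
    f"          api_key: {api_key}\n"
)
print("Wrote", config_dir / "config.yml")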
- Start the dstack Server
From the configuration directory, start the dstack server:
dstack server
You should see output indicating that the server is running:
[INFO] Applying ~/.dstack/server/config.yml...
[INFO] The admin token is ADMIN-TOKEN
[INFO] The dstack server is running at http://127.0.0.1:3000
The ADMIN-TOKEN displayed is important for accessing the dstack web UI.
- Access the dstack Web UI
- Open your web browser and navigate to http://127.0.0.1:3000.
- When prompted for an admin token, enter the ADMIN-TOKEN from the server output.
- The web UI allows you to monitor and manage your deployments.
Deploying vLLM as a Task
Step 1: Configure the Deployment Task
- Prepare for Deployment
- Open a new terminal or command prompt window.
- Navigate to your tutorial directory:
cd runpod-dstack-tutorial
- Activate the Python Virtual Environment
macOS / Linux:
source .venv/bin/activate
Windows (Command Prompt):
.venv\Scripts\activate
Windows (PowerShell):
.venv\Scripts\Activate.ps1
- Create a Directory for the Task
Create and navigate to a new directory for the deployment task:
mkdir task-vllm-llama
cd task-vllm-llama
- Create the dstack Configuration File
- Create the .dstack.yml File
Create a file named .dstack.yml (or dstack.yml if your system doesn't allow filenames starting with a dot) with the following content:
type: task
name: vllm-llama-3.1-8b-instruct
python: "3.10"
env:
  - HUGGING_FACE_HUB_TOKEN=YOUR_HUGGING_FACE_HUB_TOKEN
  - MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=8192
commands:
  - pip install vllm
  - vllm serve $MODEL_NAME --port 8000 --max-model-len $MAX_MODEL_LEN
ports:
  - 8000
spot_policy: on-demand
resources:
  gpu:
    name: "RTX4090"
    memory: "24GB"
  cpu: 16..
Replace YOUR_HUGGING_FACE_HUB_TOKEN with your actual Hugging Face access token (read access is enough) or define the token in your environment variables. Without this token, the model cannot be downloaded, as it is gated.
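Before deploying, it can save time to confirm the token actually has access to the gated repository. Here's a small optional check (a sketch); it assumes pip install huggingface_hub and that HUGGING_FACE_HUB_TOKEN is exported locally:
# check_hf_access.py -- verifies the token and access to the gated model
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HUGGING_FACE_HUB_TOKEN"])
print("Authenticated as:", api.whoami()["name"])
# Raises an error if the token cannot see the gated repository
api.model_info("meta-llama/Llama-3.1-8B-Instruct")
print("Token can access meta-llama/Llama-3.1-8B-Instruct")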
Step 2: Initialize and Deploy the Task
- Initialize dstack
Run the following command in the directory where your .dstack.yml file is located:
dstack init
- Apply the Configuration
Deploy the task by applying the configuration:
dstack apply
- You will see an output summarizing the deployment configuration and available instances.
- When prompted:
Submit the run vllm-llama-3.1-8b-instruct? [y/n]:
Type y and press Enter to confirm.
- The ports configuration provides port forwarding from the deployed pod to localhost, allowing you to access the deployed vLLM via localhost:8000.
- Monitor the Deployment
- After executing dstack apply, you'll see all the steps that dstack performs:
  - Provisioning the pod on RunPod.
  - Downloading the Docker image.
  - Installing required packages.
  - Downloading the model from Hugging Face.
  - Starting the vLLM server.
- The logs of vLLM will be displayed in the terminal.
- To monitor the logs at any time, run:
dstack logs vllm-llama-3.1-8b-instruct
- Wait until you see logs indicating that vLLM is serving the model, such as:
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
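Instead of watching the logs by eye, you can poll the forwarded port from a small script until the server answers (a minimal sketch; it assumes port 8000 is forwarded to localhost and uses the OpenAI-compatible /v1/models endpoint that vLLM exposes once the server is up):
# wait_for_vllm.py -- polls the forwarded port until the server responds
import time
import urllib.error
import urllib.request

URL = "http://localhost:8000/v1/models"

for attempt in range(60):  # up to ~10 minutes
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            if resp.status == 200:
                print("vLLM is ready:", resp.read().decode())
                break
    except (urllib.error.URLError, OSError):
        pass  # server not up yet; keep waiting
    time.sleep(10)
else:
    print("vLLM did not become ready in time")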
Step 3: Test the Model Server
- Access the Service
Since the ports configuration forwards port 8000 from the deployed pod to localhost, you can access the vLLM server via http://localhost:8000.
- Test the Service Using curl
Use the following curl command to test the deployed model:
macOS / Linux:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are Poddy, a helpful assistant."},
{"role": "user", "content": "What is your name?"}
],
"temperature": 0,
"max_tokens": 150
}'
Windows (Command Prompt):
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{ \"model\": \"meta-llama/Llama-3.1-8B-Instruct\", \"messages\": [ {\"role\": \"system\", \"content\": \"You are Poddy, a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What is your name?\"} ], \"temperature\": 0, \"max_tokens\": 150 }"
Windows (PowerShell):
Invoke-RestMethod -Method Post -Uri http://localhost:8000/v1/chat/completions `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{ "model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [ {"role": "system", "content": "You are Poddy, a helpful assistant."}, {"role": "user", "content": "What is your name?"} ], "temperature": 0, "max_tokens": 150 }'
- Verify the Response
You should receive a JSON response similar to the following:
{
  "id": "chat-f0566a5143244d34a0c64c968f03f80c",
  "object": "chat.completion",
  "created": 1727902323,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "My name is Poddy, and I'm here to assist you with any questions or information you may need.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 49,
    "total_tokens": 199,
    "completion_tokens": 150
  },
  "prompt_logprobs": null
}
This confirms that the model is running and responding as expected.
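You can also make the same request from Python with the openai client, since vLLM serves an OpenAI-compatible API (a minimal sketch; it assumes pip install openai and that port 8000 is still forwarded to localhost; the api_key is a placeholder because no key was configured when starting vLLM):
# query_vllm.py -- sends the same chat completion request as the curl example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are Poddy, a helpful assistant."},
        {"role": "user", "content": "What is your name?"},
    ],
    temperature=0,
    max_tokens=150,
)
print(response.choices[0].message.content)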
Step 4: Clean Up
To avoid incurring additional costs, it's important to stop the task when you're finished.
- Stop the Task
In the terminal where you ran dstack apply, you can stop the task by pressing Ctrl + C.
You'll be prompted:
Stop the run vllm-llama-3.1-8b-instruct before detaching? [y/n]:
Type y and press Enter to confirm stopping the task.
- Terminate the Instance
The instance will terminate automatically after stopping the task.
If you wish to ensure the instance is terminated immediately, you can run:
dstack stop vllm-llama-3.1-8b-instruct
- Verify Termination
Check your RunPod dashboard or the dstack web UI to ensure that the instance has been terminated.
Additional Tips: Using Volumes for Persistent Storage
If you need to retain data between runs or cache models to reduce startup times, you can use volumes.
Creating a Volume
Create a separate dstack file named volume.dstack.yml with the following content:
type: volume
name: llama31-volume
backend: runpod
region: EUR-IS-1
# Required size
size: 100GB
The region ties your volume to a specific region, which then also ties your Pod to that same region.
Apply the volume configuration:
dstack apply -f volume.dstack.yml
This will create the volume named llama31-volume.
Using the Volume in Your Task
Modify your .dstack.yml file to include the volume:
volumes:
  - name: llama31-volume
    path: /data
This configuration will mount the volume to the /data directory inside your container.
By doing this, you can store models and data persistently, which can be especially useful for large models that take time to download.
For more information on using volumes with RunPod, refer to the dstack blog on volumes.
Conclusion
By leveraging dstack on RunPod, you can efficiently deploy and manage Pods, accelerating your development workflow and reducing operational overhead.