Deploy an Instant Cluster with SLURM
This tutorial demonstrates how to use Instant Clusters with SLURM (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging SLURM on RunPod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.
Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently.
Requirements
- You've created a RunPod account and funded it with sufficient credits.
- You have basic familiarity with Linux command line.
- You're comfortable working with Pods and understand the basics of SLURM.
Step 1: Deploy an Instant Cluster
- Open the Instant Clusters page on the RunPod web interface.
- Click Create Cluster.
- Use the UI to name and configure your cluster. For this walkthrough, keep Pod Count at 2 and select the option for 16x H100 SXM GPUs. Keep the Pod Template at its default setting (RunPod PyTorch).
- Click Deploy Cluster. You should be redirected to the Instant Clusters page after a few seconds.
Step 2: Clone demo and install SLURM on each Pod
To connect to a Pod:
- On the Instant Clusters page, click on the cluster you created to expand the list of Pods.
- Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.
On each Pod:
- Click Connect, then click Web Terminal.
- In the terminal that opens, run this command to clone the SLURM demo files into the Pod's main directory:

```bash
git clone https://github.com/pandyamarut/slurm_example.git && cd slurm_example
```

- Run this command to install SLURM:

```bash
apt update && apt install -y slurm-wlm slurm-client munge
```
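To confirm that the packages installed correctly, you can check the SLURM version reported on each Pod (any recent version string indicates a successful install):

```bash
sinfo --version
```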
Step 3: Overview of SLURM demo scripts
The repository contains several essential scripts for setting up SLURM. Let's examine what each script does:
- `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node.
- `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup.
- `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment.
- `test_batch.sh`: A sample SLURM job script for testing cluster functionality.
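For context, here's a minimal sketch of the kind of entries these scripts generate for a two-node cluster. The exact values (node names, IP addresses, GPU counts, partition name) are assumptions based on this walkthrough's 2 x 8 H100 layout; the demo scripts may produce different settings:

```
# gres.conf (sketch): one line per node, assuming 8 GPUs each
NodeName=node-0 Name=gpu File=/dev/nvidia[0-7]
NodeName=node-1 Name=gpu File=/dev/nvidia[0-7]

# slurm.conf (sketch): node and partition definitions
NodeName=node-0 NodeAddr=10.65.0.2 Gres=gpu:8 State=UNKNOWN
NodeName=node-1 NodeAddr=10.65.0.3 Gres=gpu:8 State=UNKNOWN
PartitionName=main Nodes=node-0,node-1 Default=YES State=UP
```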
Step 4: Install SLURM on each Pod
Now run the installation script on each Pod, replacing `[MUNGE_SECRET_KEY]` with a secure random string (like a password). The secret key is used for authentication between nodes and must be identical across all Pods in your cluster.

```bash
./install.sh "[MUNGE_SECRET_KEY]" node-0 node-1 10.65.0.2 10.65.0.3
```
This script automates the process of configuring a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It performs the necessary setup for both the primary (i.e., master/control) and secondary (i.e., compute/worker) nodes.
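If you need a way to generate a suitable secret, one option is `openssl` (any sufficiently long random string works just as well):

```bash
# Generate a 64-character hex string to use as the MUNGE secret key
openssl rand -hex 32
```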
Step 5: Start SLURM services
If you're not sure which Pod is the primary node, run `echo $HOSTNAME` in the web terminal of each Pod and look for `node-0`.
- On the primary node (`node-0`), start the SLURM controller service:

```bash
slurmctld -D
```

- Use the web interface to open a second terminal on the primary node and start the worker service:

```bash
slurmd -D
```

- On the secondary node (`node-1`), run:

```bash
slurmd -D
```

After running these commands, you should see output indicating that the services have started successfully. The `-D` flag keeps each service running in the foreground, so each command needs its own terminal.
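If a service fails to start, one thing worth checking is MUNGE authentication. You can sanity-check the local MUNGE setup on each Pod by encoding and decoding a credential (this verifies the local `munged` daemon, not cross-node trust):

```bash
# Create a MUNGE credential and immediately decode it
munge -n | unmunge
```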
Step 6: Test your SLURM cluster
- Run this command on the primary node (`node-0`) to check the status of your nodes:

```bash
sinfo
```
You should see output showing both nodes in your cluster, with a state of "idle" if everything is working correctly.
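For reference, the output looks something like this (the partition name below is a placeholder; the actual name depends on the configuration generated by `create_slurm_conf.sh`):

```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      2   idle node-[0-1]
```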
- Run this command to test GPU availability across both nodes:

```bash
srun --nodes=2 --gres=gpu:1 nvidia-smi -L
```
This command should list all GPUs across both nodes.
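You can also request every GPU on each node explicitly. Assuming 8 GPUs per Pod, as in this walkthrough's 2 x 8 H100 layout:

```bash
# Allocate all 8 GPUs on each of the 2 nodes and list them
srun --nodes=2 --gres=gpu:8 nvidia-smi -L
```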
Step 7: Submit the SLURM job script
Run the following command on the primary node (`node-0`) to submit the test job script and confirm that your cluster is working properly:

```bash
sbatch test_batch.sh
```

Check the output file created by the test (`test_simple_[JOBID].out`) and look for the hostnames of both nodes. This confirms that the job ran successfully across the cluster.
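If you're curious what such a script looks like, here's a minimal sketch of a SLURM batch script that would produce similar output. This is an illustration of the format, not the repository's actual `test_batch.sh`:

```bash
#!/bin/bash
#SBATCH --job-name=test_simple       # Job name, reused in the output filename
#SBATCH --output=test_simple_%j.out  # %j expands to the job ID
#SBATCH --nodes=2                    # Span both nodes in the cluster
#SBATCH --ntasks-per-node=1          # One task per node

# Print each allocated node's hostname to verify the job
# really ran across the whole cluster
srun hostname
```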
Step 8: Clean up
If you no longer need your cluster, make sure you return to the Instant Clusters page and delete your cluster to avoid incurring extra charges.
You can monitor your cluster's usage and spending with the Billing Explorer at the bottom of the Billing page, under the Cluster tab.
Next steps
Now that you've successfully deployed and tested a SLURM cluster on RunPod, you can:
- Adapt your own distributed workloads to run using SLURM job scripts (see the sketch after this list).
- Scale your cluster by adjusting the number of Pods to handle larger models or datasets.
- Try different frameworks like Axolotl for fine-tuning large language models.
- Optimize performance by experimenting with different distributed training strategies.
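As a starting point for adapting your own workloads, here's a sketch of a multi-node training job script. It assumes the RunPod PyTorch template provides `torchrun`, that each Pod exposes 8 GPUs, and that `train.py` stands in for your own training script:

```bash
#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2                 # Use both Pods in the cluster
#SBATCH --ntasks-per-node=1       # One torchrun launcher per node
#SBATCH --gres=gpu:8              # All 8 GPUs on each node
#SBATCH --output=ddp_train_%j.out

# Use the first allocated node as the rendezvous endpoint
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500

# srun starts one torchrun per node; torchrun spawns 8 workers per node
srun torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
  train.py
```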