Slurm Clusters
Deploy Slurm Clusters on Runpod with zero configuration
Slurm Clusters are currently in beta. If you’d like to provide feedback, please join our Discord.
Runpod Slurm Clusters provide a managed high-performance computing and scheduling solution that enables you to rapidly create and manage Slurm Clusters with minimal setup.
For more information on working with Slurm, refer to the Slurm documentation.
Key features
Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing:
- Zero configuration setup: Slurm and munge are pre-installed and fully configured.
- Instant provisioning: Clusters deploy rapidly with minimal setup.
- Automatic role assignment: Runpod automatically designates controller and agent nodes.
- Built-in optimizations: Pre-configured for optimal NCCL performance.
- Full Slurm compatibility: All standard Slurm commands work out-of-the-box.
If you prefer to manually configure your Slurm deployment, see Deploy an Instant Cluster with Slurm (unmanaged) for a step-by-step guide.
Deploy a Slurm Cluster
- Open the Instant Clusters page on the Runpod console.
- Click Create Cluster.
- Select Slurm Cluster from the cluster type dropdown menu.
- Configure your cluster specifications:
  - Cluster name: Enter a descriptive name for your cluster.
  - Pod count: Choose the number of Pods in your cluster.
  - GPU type: Select your preferred GPU type.
  - Region: Choose your deployment region.
  - Network volume (optional): Add a network volume for persistent/shared storage. If using a network volume, ensure the region matches your cluster region.
  - Pod template: Select a Pod template or click Edit Template to customize start commands, environment variables, ports, or container/volume disk capacity.
Slurm Clusters currently support only official Runpod PyTorch images. If you deploy using a different image, the Slurm process will not start.
- Click Deploy Cluster.
Connect to a Slurm Cluster
Once deployment completes, you can access your cluster from the Instant Clusters page.
From this page, you can select a cluster to view its component nodes, including labels indicating the Slurm controller (primary node) and Slurm agents (secondary nodes). Expand a node to view details such as availability, GPU/storage utilization, and options for connection and management.
Connect to a node using the Connect button, or using any of the connection methods supported by Pods.
Submit and manage jobs
All standard Slurm commands are available without configuration. For example, you can:
Check cluster status and available resources:
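```bash
# List partitions, node states, and available node counts
sinfo
```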
Submit a job to the cluster from the Slurm controller node:
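The following is a minimal sketch of a batch script; the file name, job name, and resource requests are placeholders to adapt to your workload.

```bash
# Write a minimal batch script (file name and resource requests are placeholders)
cat <<'EOF' > example_job.sh
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --nodes=2
#SBATCH --gpus-per-node=1
#SBATCH --output=example_%j.out

# Print the hostname of each allocated node as a simple smoke test
srun hostname
EOF

# Submit the script to the scheduler
sbatch example_job.sh
```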
Monitor job queue and status:
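```bash
# Show queued and running jobs, their state, and assigned nodes
squeue
```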
View detailed job information from the Slurm controller node:
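```bash
# Replace <job_id> with the ID reported by sbatch or squeue
scontrol show job <job_id>
```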
You can find the output of Slurm agents in their individual container logs.
Advanced configuration
While Runpod’s Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm controller node using the web terminal or SSH.
Access Slurm configuration files in their standard locations:
- `/etc/slurm/slurm.conf`: Main configuration file.
- `/etc/slurm/gres.conf`: Generic resource configuration.
Modify these files as needed for your specific requirements.
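For example, after editing the configuration on the controller node, you can ask Slurm to re-read it. This is a minimal sketch assuming you are connected to the controller node; adapt the editor and the settings you change to your requirements.

```bash
# Edit the main configuration file with your preferred editor
vi /etc/slurm/slurm.conf

# Ask the Slurm controller to re-read its configuration
scontrol reconfigure

# Confirm that nodes still report the expected state
sinfo
```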
Troubleshooting
If you encounter issues with your Slurm Cluster, try the following:
- Jobs stuck in pending state: Check resource availability with `sinfo` and ensure the requested resources are available. If you need more resources, you can add more nodes to your cluster.
- Authentication errors: Munge is pre-configured, but if issues arise, verify that the munge service is running on all nodes (see the sketch below).
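As a quick check on any node, you can confirm that the munge daemon is running and that credentials can be encoded and decoded locally. This is a sketch assuming the standard munge client tools are available on the node (they are part of the pre-installed munge setup).

```bash
# Confirm the munge daemon (munged) is running on this node
pgrep -a munged

# Encode and decode a test credential to verify the local munge key
munge -n | unmunge
```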
For additional support, contact Runpod support with your cluster ID and specific error messages.