Slurm Clusters
Deploy Slurm Clusters on Runpod with zero configuration
Slurm Clusters are currently in beta. If you’d like to provide feedback, please join our Discord.
Runpod Slurm Clusters provide a managed high-performance computing and scheduling solution that enables you to rapidly create and manage Slurm Clusters with minimal setup.
For more information on working with Slurm, refer to the Slurm documentation.
Key features
Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing:
- Zero configuration setup: Slurm and munge are pre-installed and fully configured.
- Instant provisioning: Clusters deploy rapidly with minimal setup.
- Automatic role assignment: Runpod automatically designates controller and agent nodes.
- Built-in optimizations: Pre-configured for optimal NCCL performance.
- Full Slurm compatibility: All standard Slurm commands work out-of-the-box.
If you prefer to manually configure your Slurm deployment, see Deploy an Instant Cluster with Slurm (unmanaged) for a step-by-step guide.
Deploy a Slurm Cluster
- Open the Instant Clusters page on the Runpod console.
- Click Create Cluster.
- Select Slurm Cluster from the cluster type dropdown menu.
- Configure your cluster specifications:
  - Cluster name: Enter a descriptive name for your cluster.
  - Pod count: Choose the number of Pods in your cluster.
  - GPU type: Select your preferred GPU type.
  - Region: Choose your deployment region.
  - Network volume (optional): Add a network volume for persistent/shared storage. If using a network volume, ensure the region matches your cluster region.
  - Pod template: Select a Pod template or click Edit Template to customize start commands, environment variables, ports, or container/volume disk capacity.
Slurm Clusters currently support only official Runpod PyTorch images. If you deploy using a different image, the Slurm process will not start.
- Click Deploy Cluster.
Connect to a Slurm Cluster
Once deployment completes, you can access your cluster from the Instant Clusters page.
From this page, you can select a cluster to view its component nodes, including labels indicating the Slurm controller (primary node) and Slurm agents (secondary nodes). Expand a node to view details such as availability, GPU/storage utilization, and options for connection and management.
Connect to a node using the Connect button, or using any of the connection methods supported by Pods.
Submit and manage jobs
All standard Slurm commands are available without configuration. For example, you can:
Check cluster status and available resources:
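```bash
# List partitions, node states, and available node counts
sinfo
```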
Submit a job to the cluster from the Slurm controller node:
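The following is a minimal sketch of a batch script; the file name, job name, and resource requests are placeholders to adapt to your workload.

```bash
# Write a minimal batch script (file name and resource requests are placeholders)
cat <<'EOF' > example_job.sh
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --nodes=2
#SBATCH --gpus-per-node=1
#SBATCH --output=example_%j.out

# Print the hostname of each allocated node as a simple smoke test
srun hostname
EOF

# Submit the script to the scheduler
sbatch example_job.sh
```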
Monitor job queue and status:
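```bash
# Show queued and running jobs, their state, and assigned nodes
squeue
```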
View detailed job information from the Slurm controller node:
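```bash
# Replace <job_id> with the ID reported by sbatch or squeue
scontrol show job <job_id>
```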
You can find the output of Slurm agents in their individual container logs.
Advanced configuration
While Runpod’s Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm controller node using the web terminal or SSH.
Access Slurm configuration files in their standard locations:
- `/etc/slurm/slurm.conf`: Main configuration file.
- `/etc/slurm/gres.conf`: Generic resource configuration.
Modify these files as needed for your specific requirements.
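For example, after editing the configuration on the controller node, you can ask Slurm to re-read it. This is a minimal sketch assuming you are connected to the controller node; adapt the editor and the settings you change to your requirements.

```bash
# Edit the main configuration file with your preferred editor
vi /etc/slurm/slurm.conf

# Ask the Slurm controller to re-read its configuration
scontrol reconfigure

# Confirm that nodes still report the expected state
sinfo
```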
Troubleshooting
If you encounter issues with your Slurm Cluster, try the following:
- Jobs stuck in pending state: Check resource availability with `sinfo` and ensure the requested resources are available. If you need more resources, you can add more nodes to your cluster.
- Authentication errors: Munge is pre-configured, but if issues arise, verify that the munge service is running on all nodes (see the sketch below).
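As a quick check on any node, you can confirm that the munge daemon is running and that credentials can be encoded and decoded locally. This is a sketch assuming the standard munge client tools are available on the node (they are part of the pre-installed munge setup).

```bash
# Confirm the munge daemon (munged) is running on this node
pgrep -a munged

# Encode and decode a test credential to verify the local munge key
munge -n | unmunge
```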
For additional support, contact Runpod support with your cluster ID and specific error messages.