This page provides reference information for configuring and troubleshooting Instant Clusters.
Environment variables
The following environment variables are automatically set on all nodes in an Instant Cluster:
| Environment Variable | Description |
|---|---|
| PRIMARY_ADDR / MASTER_ADDR | The address of the primary node. |
| PRIMARY_PORT / MASTER_PORT | The port of the primary node. All ports are available. |
| NODE_ADDR | The static IP of this node within the cluster network. |
| NODE_RANK | The cluster rank (i.e., global rank) assigned to this node. The primary node has NODE_RANK = 0. |
| NUM_NODES | The number of nodes in the cluster. |
| NUM_TRAINERS | The number of GPUs per node. |
| HOST_NODE_ADDR | A convenience variable, defined as PRIMARY_ADDR:PRIMARY_PORT. |
| WORLD_SIZE | The total number of GPUs in the cluster (NUM_NODES * NUM_TRAINERS). |
Each node receives a static IP address (NODE_ADDR) on the overlay network. When a cluster is deployed, the system designates one node as the primary node by setting the PRIMARY_ADDR and PRIMARY_PORT environment variables. This simplifies working with multiprocessing libraries that require a primary node.
The following variables are equivalent:
- MASTER_ADDR and PRIMARY_ADDR
- MASTER_PORT and PRIMARY_PORT
The MASTER_* variables are provided for compatibility with tools that expect these legacy names.
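As an illustration of how these variables fit together, the sketch below derives a process's global rank from NODE_RANK and NUM_TRAINERS. It assumes a LOCAL_RANK variable set by whatever tool spawns one process per GPU (for example torchrun); that variable is not set by the cluster itself.

```python
import os

# Values provided by the Instant Cluster environment.
node_rank = int(os.environ["NODE_RANK"])        # 0 on the primary node
num_trainers = int(os.environ["NUM_TRAINERS"])  # GPUs per node
world_size = int(os.environ["WORLD_SIZE"])      # NUM_NODES * NUM_TRAINERS

# Assumed to be set by the per-GPU process launcher (e.g. torchrun), not by the cluster.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Global rank of this process across the whole cluster.
global_rank = node_rank * num_trainers + local_rank

print(f"rank {global_rank} of {world_size} (node {node_rank}, local GPU {local_rank})")
```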
Network interfaces
Instant Clusters use dedicated high-bandwidth network interfaces for inter-node communication, separate from the management interface used for external traffic.
| Interface | Purpose |
|---|---|
| ens1 - ens8 | High-bandwidth interfaces for inter-node communication. Each interface provides a private network connection. |
| eth0 | Management interface for external traffic (internet connectivity). |
Instant Clusters support up to 8 high-bandwidth interfaces per node. The PRIMARY_ADDR environment variable corresponds to the primary node's address on ens1, which is used to launch and bootstrap distributed processes.
Do not use eth0 for inter-node communication. The 172.xxx IP addresses on eth0 are reserved for internet connectivity only; routing distributed traffic over them results in connection timeouts.
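If you want to confirm which interface NODE_ADDR is bound to, a check along the following lines can help. This is a sketch only; it relies on the third-party psutil package, which may need to be installed separately (pip install psutil).

```python
import os
import socket

import psutil  # third-party package, assumed to be installed

node_addr = os.environ["NODE_ADDR"]

# Walk every interface and report the one that carries NODE_ADDR.
for ifname, addrs in psutil.net_if_addrs().items():
    for addr in addrs:
        if addr.family == socket.AF_INET and addr.address == node_addr:
            print(f"NODE_ADDR {node_addr} is bound to interface {ifname}")
```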
NCCL configuration
NCCL (NVIDIA Collective Communications Library) handles GPU-to-GPU communication in distributed training. You must configure NCCL to use the internal network interfaces.
Required configuration
Set the NCCL_SOCKET_IFNAME environment variable to use the internal network:
export NCCL_SOCKET_IFNAME=ens1
If NCCL_SOCKET_IFNAME is not set, NCCL may pick from any available interface. Explicitly setting it to ens1 ensures NCCL uses the high-bandwidth internal network.
Without this configuration, nodes may attempt to communicate using external IP addresses in the 172.xxx range, which are reserved for internet connectivity only. This will result in connection timeouts and failed distributed training jobs.
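Exporting the variable in your shell or launch script is the documented approach. If you would rather set it from inside the training code, a sketch like the following should work, with the caveat that it must run before the NCCL process group is created; this is an assumption about your launch flow rather than a documented alternative.

```python
import os

# NCCL reads this variable when it initializes, so this must run before
# torch.distributed.init_process_group() or any other call that sets up NCCL.
# setdefault() keeps any value already exported in the shell.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens1")
```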
Debugging NCCL
To troubleshoot multi-node communication issues, enable NCCL debug logging:
export NCCL_DEBUG=INFO
This outputs detailed information about NCCL's network discovery and communication attempts, helping identify configuration issues.
Troubleshooting
Connection timeouts during distributed training
Symptom: Training jobs fail with connection timeout errors between nodes.
Cause: NCCL is attempting to communicate over the external network interface (eth0) instead of the internal interfaces (ens1-ens8).
Solution: Set the NCCL_SOCKET_IFNAME environment variable:
export NCCL_SOCKET_IFNAME=ens1
Nodes cannot find the primary node
Symptom: Worker nodes fail to connect to the primary node during initialization.
Cause: The PRIMARY_ADDR or MASTER_ADDR environment variable is not being used correctly in your distributed training script.
Solution: Verify your script uses the PRIMARY_ADDR environment variable for the rendezvous address. For PyTorch distributed training:
import os
master_addr = os.environ.get("PRIMARY_ADDR", "localhost")
master_port = os.environ.get("PRIMARY_PORT", "29500")
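For context, a fuller sketch of the rendezvous is shown below, assuming PyTorch with the NCCL backend and a per-GPU launcher that sets LOCAL_RANK (for example torchrun). The rank arithmetic mirrors the environment variables described above; adapt it to your own launcher.

```python
import os

import torch
import torch.distributed as dist  # assumes PyTorch built with NCCL support

# Rendezvous address and port provided by the cluster environment.
master_addr = os.environ.get("PRIMARY_ADDR", "localhost")
master_port = os.environ.get("PRIMARY_PORT", "29500")

node_rank = int(os.environ.get("NODE_RANK", 0))
num_trainers = int(os.environ.get("NUM_TRAINERS", 1))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Assumed to be set by the per-GPU launcher (e.g. torchrun), not by the cluster.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

torch.cuda.set_device(local_rank)

dist.init_process_group(
    backend="nccl",
    init_method=f"tcp://{master_addr}:{master_port}",
    rank=node_rank * num_trainers + local_rank,
    world_size=world_size,
)
```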
Slow or inconsistent training performance
Symptom: Training speed varies significantly between runs or degrades over time.
Cause: Network congestion or suboptimal NCCL configuration.
Solution:
- Ensure all nodes are in the same data center (Runpod handles this automatically).
- Enable NCCL debugging to identify bottlenecks (see also the bandwidth check sketch after this list):
export NCCL_DEBUG=INFO
- Verify your batch sizes are appropriate for the cluster size to maintain efficient GPU utilization.
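To judge whether inter-node bandwidth is the bottleneck, a timed all-reduce can be run on every rank. The sketch below is a rough indicator rather than a calibrated benchmark, and it assumes the process group has already been initialized with the NCCL backend as in the earlier example.

```python
import time

import torch
import torch.distributed as dist


def measure_allreduce(num_elements: int = 64 * 1024 * 1024, iters: int = 10) -> float:
    """Return an approximate all-reduce throughput in GB/s for float32 tensors."""
    tensor = torch.randn(num_elements, device="cuda")

    # Warm up so NCCL sets up its communicators before timing.
    dist.all_reduce(tensor)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    # Payload size per iteration, not the exact on-the-wire volume, so treat
    # the result as a relative indicator only.
    bytes_moved = tensor.element_size() * tensor.numel() * iters
    return bytes_moved / elapsed / 1e9
```

Comparing the reported figure across runs, or against a run on a known-healthy cluster, helps separate network congestion from other causes such as data loading or batch-size effects.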