Instant Clusters

Instant Clusters enable high-performance computing across multiple GPU Pods connected by high-speed networking.

Instant Clusters provide:

  • Fast local networking between Pods, with bandwidths from 100 Gbps to 3200 Gbps within a single data center.
  • Static IP assignment for each Pod in the cluster.
  • Automatic assignment of environment variables for seamless coordination between Pods.
Note: All accounts have a default spending limit. To deploy a larger cluster, submit a support ticket at help@runpod.io.

Get started

Get started with Instant Clusters by following a step-by-step tutorial for your preferred framework:

Use cases for Instant Clusters

Instant Clusters provide powerful computing capabilities that benefit a wide range of applications:

Deep learning & AI

  • Large Language Model training: Distribute training of models across multiple GPUs for significantly faster convergence.
  • Federated Learning: Train models across distributed systems while preserving data privacy and security.

High-performance computing

  • Scientific simulations: Use multi-GPU acceleration to run complex simulations for weather forecasting, molecular dynamics, and climate modeling.
  • Computational physics: Solve large-scale physics problems requiring massive parallel computing power.
  • Fluid dynamics & engineering: Perform fluid dynamics computations for use in aerospace, automotive, and energy sectors.

Graphics computing & rendering

  • Large-scale rendering: Generate high-fidelity images and animations for film, gaming, and visualization.
  • Real-time graphics processing: Power complex visual effects and simulations requiring multiple GPUs.
  • Game development & testing: Render game environments, test AI-driven behaviors, and generate procedural content.
  • Virtual reality & augmented reality: Deliver real-time multi-view rendering for immersive AR/VR experiences.

Large-scale data analytics

  • Big data processing: Analyze large-scale datasets with distributed computing frameworks.
  • Social media analysis: Detect real-time trends, analyze sentiment, and identify misinformation.

Network interfaces

High-bandwidth interfaces (eth1, eth2, etc.) carry communication between Pods, while the management interface (eth0) handles external traffic. By default, the NCCL environment variable NCCL_SOCKET_IFNAME is set so that NCCL uses all available interfaces. PRIMARY_ADDR corresponds to the primary Pod's address on eth1, which is used to launch and bootstrap distributed processes.

Instant Clusters support up to 8 high-bandwidth interfaces per Pod. Each interface (eth1 through eth8) provides a private network connection for inter-node communication and is made available to distributed backends such as NCCL or GLOO.
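If you ever need to restrict a backend to the high-bandwidth interfaces only, the selection can be sketched as below. This is a minimal illustration, not part of the platform: the helper names are hypothetical, and it assumes the interfaces follow the eth1 through eth8 naming described above. NCCL accepts a comma-separated list of interface names in NCCL_SOCKET_IFNAME.

```python
import re
import socket

# Matches eth1..eth8 only; eth0 (management) and anything else is ignored.
_HIGH_BANDWIDTH = re.compile(r"^eth([1-8])$")


def high_bandwidth_interfaces(names):
    """Return the eth1..eth8 names found in `names`, ordered by index."""
    matched = []
    for name in names:
        m = _HIGH_BANDWIDTH.match(name)
        if m:
            matched.append((int(m.group(1)), name))
    return [name for _, name in sorted(matched)]


def nccl_socket_ifname(names):
    """Build a comma-separated value suitable for NCCL_SOCKET_IFNAME."""
    return ",".join(high_bandwidth_interfaces(names))


def local_interface_names():
    """List the interface names visible inside this Pod."""
    return [name for _, name in socket.if_nameindex()]
```

For example, `nccl_socket_ifname(local_interface_names())` could be exported as NCCL_SOCKET_IFNAME before launching a job. Since the default already uses all available interfaces, this is only needed when you want to exclude some of them.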

Environment variables

The following environment variables are available in all Pods:

Environment Variable          Description
PRIMARY_ADDR / MASTER_ADDR    The address of the primary Pod.
PRIMARY_PORT / MASTER_PORT    The port of the primary Pod (all ports are available).
NODE_ADDR                     The static IP of this Pod within the cluster network.
NODE_RANK                     The Cluster (i.e., global) rank assigned to this Pod (0 for the primary Pod).
NUM_NODES                     The number of Pods in the Cluster.
NUM_TRAINERS                  The number of GPUs per Pod.
HOST_NODE_ADDR                Defined as PRIMARY_ADDR:PRIMARY_PORT for convenience.
WORLD_SIZE                    The total number of GPUs in the Cluster (NUM_NODES * NUM_TRAINERS).
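As a rough sketch of how a script might consume these variables, the snippet below reads them into a small structure and checks the documented relationship WORLD_SIZE = NUM_NODES * NUM_TRAINERS. The `ClusterEnv` class and `read_cluster_env` function are hypothetical names introduced for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterEnv:
    primary_addr: str
    primary_port: int
    node_rank: int
    num_nodes: int
    num_trainers: int

    @property
    def world_size(self):
        # Total GPUs in the Cluster: Pods times GPUs per Pod.
        return self.num_nodes * self.num_trainers

    @property
    def host_node_addr(self):
        # Mirrors HOST_NODE_ADDR: PRIMARY_ADDR:PRIMARY_PORT.
        return f"{self.primary_addr}:{self.primary_port}"

    @property
    def is_primary(self):
        # The primary Pod is assigned rank 0.
        return self.node_rank == 0


def read_cluster_env(env=None):
    """Parse the cluster variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    cluster = ClusterEnv(
        primary_addr=env["PRIMARY_ADDR"],
        primary_port=int(env["PRIMARY_PORT"]),
        node_rank=int(env["NODE_RANK"]),
        num_nodes=int(env["NUM_NODES"]),
        num_trainers=int(env["NUM_TRAINERS"]),
    )
    declared = env.get("WORLD_SIZE")
    if declared is not None and int(declared) != cluster.world_size:
        raise ValueError("WORLD_SIZE does not equal NUM_NODES * NUM_TRAINERS")
    return cluster
```

Inside a Pod, calling `read_cluster_env()` with no arguments reads from the process environment. Passing a dictionary instead makes the parsing easy to test outside a Cluster.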

Each Pod receives a static IP (NODE_ADDR) on the overlay network. When a Cluster is deployed, the system designates one Pod as the primary node by setting the PRIMARY_ADDR and PRIMARY_PORT environment variables. This simplifies working with multiprocessing libraries that require a primary node.
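To illustrate how the primary node settings feed a launcher, here is a hedged sketch that assembles a command line for PyTorch's `torchrun` from the variables above. It assumes PyTorch is installed in the Pod. The function name and the script name `train.py` are placeholders, not something the platform provides.

```python
import os


def build_torchrun_command(env=None, script="train.py"):
    """Return a torchrun argv list built from the cluster variables.

    `script` is a placeholder for your own training entry point.
    """
    env = os.environ if env is None else env
    return [
        "torchrun",
        f"--nnodes={env['NUM_NODES']}",             # number of Pods
        f"--nproc_per_node={env['NUM_TRAINERS']}",  # GPUs per Pod
        f"--node_rank={env['NODE_RANK']}",          # 0 on the primary Pod
        f"--master_addr={env['PRIMARY_ADDR']}",
        f"--master_port={env['PRIMARY_PORT']}",
        script,
    ]
```

The same command would be run on every Pod. Only NODE_RANK differs between them, so the resulting list can be passed to `subprocess.run` unchanged on each node.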

The variables MASTER_ADDR/PRIMARY_ADDR and MASTER_PORT/PRIMARY_PORT are equivalent. The MASTER_* variables provide compatibility with tools that expect these legacy names.
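Because the two naming schemes are equivalent, code that may also run outside an Instant Cluster can accept either one. The following is a minimal sketch under that assumption. `resolve_primary` is a hypothetical helper that prefers the PRIMARY_* names, falls back to MASTER_*, and rejects conflicting values.

```python
import os


def _pick(env, preferred, legacy):
    """Return the value of `preferred`, or `legacy` if `preferred` is unset."""
    first, second = env.get(preferred), env.get(legacy)
    if first is not None and second is not None and first != second:
        raise ValueError(f"{preferred} and {legacy} disagree")
    value = first if first is not None else second
    if value is None:
        raise KeyError(f"neither {preferred} nor {legacy} is set")
    return value


def resolve_primary(env=None):
    """Return (address, port) of the primary Pod from either naming scheme."""
    env = os.environ if env is None else env
    addr = _pick(env, "PRIMARY_ADDR", "MASTER_ADDR")
    port = int(_pick(env, "PRIMARY_PORT", "MASTER_PORT"))
    return addr, port
```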