# Deploy an Instant Cluster with Axolotl

export const InferenceTooltip = () => {
  return <Tooltip headline="AI inference" tip="The execution phase where a trained model makes predictions on new data. When you prompt a model and it responds, that's inference.">inference</Tooltip>;
};

export const TrainingTooltip = () => {
  return <Tooltip headline="AI training" tip="The initial phase of AI model development, in which a model analyzes a dataset to learn patterns and relationships.">training</Tooltip>;
};

This tutorial shows how to use Instant Clusters with [Axolotl](https://axolotl.ai/) to fine-tune large language models (LLMs) across multiple GPUs. Combining PyTorch's distributed <TrainingTooltip /> capabilities with Runpod's high-speed networking lets you spread a job across every GPU in the cluster, which can shorten training time considerably compared to a single-GPU setup.

Follow the steps below to deploy a cluster and start training your models efficiently.

## Step 1: Deploy an Instant Cluster

1. Open the [Instant Clusters page](https://www.console.runpod.io/cluster) on the Runpod web interface.
2. Click **Create Cluster**.
3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (Runpod PyTorch).
4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.

## Step 2: Set up Axolotl on each Pod

1. Click your cluster to expand the list of Pods.

2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.

3. Click **Connect**, then click **Web Terminal**.

4. In the terminal that opens, run this command to clone the Axolotl repository into the Pod's main directory:

   ```sh theme={"theme":{"light":"github-light","dark":"github-dark"}}
   git clone https://github.com/axolotl-ai-cloud/axolotl
   ```

5. Navigate to the `axolotl` directory:

   ```sh theme={"theme":{"light":"github-light","dark":"github-dark"}}
   cd axolotl
   ```

6. Install the required packages:

   ```sh theme={"theme":{"light":"github-light","dark":"github-dark"}}
   pip3 install -U packaging setuptools wheel ninja
   pip3 install --no-build-isolation -e '.[flash-attn,deepspeed]'
   ```

7. Navigate to the `examples/llama-3` directory:

   ```sh theme={"theme":{"light":"github-light","dark":"github-dark"}}
   cd examples/llama-3
   ```

Repeat these steps for **each Pod** in your cluster.
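
Before moving on, it can help to confirm on each Pod that the setup left everything in place for the next step. The following is a minimal sketch rather than part of the official workflow: `check_setup` is a hypothetical helper name, and it only checks that a command is on the `PATH` and that a config file exists in the current directory.

```sh
# Hypothetical helper (not part of Axolotl or Runpod): checks that a command
# is on PATH and that a config file exists in the current directory.
check_setup() {
    cmd="$1"      # command expected on PATH, e.g. torchrun
    config="$2"   # config file expected here, e.g. lora-1b.yml

    if ! command -v "$cmd" >/dev/null 2>&1; then
        echo "missing command: $cmd" >&2
        return 1
    fi
    if [ ! -f "$config" ]; then
        echo "missing config file: $config" >&2
        return 1
    fi
    echo "ok: $cmd and $config found"
}
```

For example, running `check_setup torchrun lora-1b.yml` from the `examples/llama-3` directory should print an `ok` line on a Pod that is ready to train.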

## Step 3: Start the training process on each Pod
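
The launch command in this step reads five environment variables: `NUM_NODES`, `NODE_RANK`, `NUM_TRAINERS`, `PRIMARY_ADDR`, and `PRIMARY_PORT`. As a minimal sketch, assuming these variables are already set in each Pod's environment (the helper name `check_cluster_env` is hypothetical), you could confirm that none of them is empty before launching:

```sh
# Hypothetical helper: reports any of the variables used by the torchrun
# command that are unset or empty in the current shell.
check_cluster_env() {
    missing=0
    for name in NUM_NODES NODE_RANK NUM_TRAINERS PRIMARY_ADDR PRIMARY_PORT; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset or empty: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_cluster_env && echo ready` in a Pod's terminal prints `ready` only when all five variables have values.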

Run this command in the web terminal of **each Pod**:

```sh theme={"theme":{"light":"github-light","dark":"github-dark"}}
torchrun \
    --nnodes $NUM_NODES \
    --node_rank $NODE_RANK \
    --nproc_per_node $NUM_TRAINERS \
    --rdzv_id "myjob" \
    --rdzv_backend static \
    --rdzv_endpoint "$PRIMARY_ADDR:$PRIMARY_PORT" \
    -m axolotl.cli.train lora-1b.yml
```

<Info>
  Currently, the dynamic `c10d` backend is not supported. Please keep the `rdzv_backend` flag set to `static`.
</Info>
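
`torchrun` starts `--nproc_per_node` worker processes on each node, so the total number of workers in the job is the node count multiplied by the per-node count. Here is a small sketch of that arithmetic, assuming `NUM_TRAINERS` holds the number of GPUs on each Pod (the helper name `world_size` is hypothetical):

```sh
# Hypothetical helper: total worker processes = nodes * processes per node.
world_size() {
    nodes="$1"       # value passed to --nnodes
    per_node="$2"    # value passed to --nproc_per_node
    echo $((nodes * per_node))
}
```

Assuming 8 GPUs per Pod, `world_size 2 8` prints `16`, which matches the 16 GPUs selected in Step 1.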

Training starts once the `torchrun` command is running on every Pod. When training completes, you should see output similar to this:

```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
...
{'loss': 1.2569, 'grad_norm': 0.11112671345472336, 'learning_rate': 5.418275829936537e-06, 'epoch': 0.9}
{'loss': 1.2091, 'grad_norm': 0.11100614815950394, 'learning_rate': 3.7731999690749585e-06, 'epoch': 0.92}
{'loss': 1.2216, 'grad_norm': 0.10450132936239243, 'learning_rate': 2.420361737256438e-06, 'epoch': 0.93}
{'loss': 1.223, 'grad_norm': 0.10873789340257645, 'learning_rate': 1.3638696597277679e-06, 'epoch': 0.95}
{'loss': 1.2529, 'grad_norm': 0.1063728854060173, 'learning_rate': 6.069322682050516e-07, 'epoch': 0.96}
{'loss': 1.2304, 'grad_norm': 0.10996092110872269, 'learning_rate': 1.518483566683826e-07, 'epoch': 0.98}
{'loss': 1.2334, 'grad_norm': 0.10642101615667343, 'learning_rate': 0.0, 'epoch': 0.99}
{'train_runtime': 61.7602, 'train_samples_per_second': 795.189, 'train_steps_per_second': 1.085, 'train_loss': 1.255359119443751, 'epoch': 0.99}

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [01:00<00:00,  1.11it/s]
[2025-04-01 19:24:22,603] [INFO] [axolotl.train.save_trained_model:211] [PID:1009] [RANK:0] Training completed! Saving pre-trained model to ./outputs/lora-out.
```

<Check>
  Congratulations! You've successfully trained a model using Axolotl on an Instant Cluster. Your fine-tuned model has been saved to the `./outputs/lora-out` directory. You can now use this model for <InferenceTooltip /> or continue training with different parameters.
</Check>
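
If you saved the training output to a file, you can pull the final average loss out of the summary line instead of scrolling for it. This sketch assumes a log file that you created yourself, for example by appending `2>&1 | tee train.log` to the launch command, and a summary line shaped like the one in the sample output above. The file name `train.log` and the helper name `final_train_loss` are hypothetical.

```sh
# Hypothetical helper: prints the value that follows 'train_loss': in the
# last matching line of the given log file.
final_train_loss() {
    logfile="$1"
    sed -n "s/.*'train_loss': \([0-9.eE+-]*\).*/\1/p" "$logfile" | tail -n 1
}
```

Running `final_train_loss train.log` prints only the number, which makes it easier to compare runs after you change the configuration.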

## Step 4: Clean up

If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.console.runpod.io/cluster) and delete your cluster to avoid incurring extra charges.

<Info>
  You can monitor your cluster usage and spending in the **Billing Explorer**, located at the bottom of the [Billing page](https://www.console.runpod.io/user/billing) under the **Cluster** tab.
</Info>

## Next steps

Now that you've successfully deployed and tested an Axolotl distributed training job on an Instant Cluster, you can:

* **Fine-tune your own models** by modifying the configuration files in Axolotl to suit your specific requirements.
* **Scale your training** by adjusting the number of Pods in your cluster (and the size of their containers and volumes) to handle larger models or datasets.
* **Try different optimization techniques** such as DeepSpeed, FSDP (Fully Sharded Data Parallel), or other distributed training strategies.

For more information on fine-tuning with Axolotl, refer to the [Axolotl repository](https://github.com/axolotl-ai-cloud/axolotl).
