Step 1: Deploy an Instant Cluster
- Open the Instant Clusters page on the Runpod web interface.
- Click Create Cluster.
- Use the UI to name and configure your Cluster. For this walkthrough, keep Pod Count at 2 and select the option for 16x H100 SXM GPUs. Keep the Pod Template at its default setting (Runpod PyTorch).
- Click Deploy Cluster. You should be redirected to the Instant Clusters page after a few seconds.
Step 2: Set up Axolotl on each Pod
- Click your cluster to expand the list of Pods.
- Click a Pod, for example `CLUSTERNAME-pod-0`, to expand it.
- Click Connect, then click Web Terminal.
- In the terminal that opens, run a command to clone the Axolotl repository into the Pod’s main directory (a sketch of these commands appears after this list).
- Navigate to the `axolotl` directory.
- Install the required packages.
- Navigate to the `examples/llama-3` directory.
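The commands themselves are not reproduced above; here is a minimal sketch of the four steps, assuming Axolotl’s public GitHub repository and the install extras from its README:

```bash
# Clone the Axolotl repository into the Pod’s main directory
# (URL assumes the public axolotl-ai-cloud GitHub repository)
git clone https://github.com/axolotl-ai-cloud/axolotl.git

# Navigate to the axolotl directory
cd axolotl

# Install the required packages (extras follow Axolotl's README;
# adjust if your environment differs)
pip3 install -e '.[flash-attn,deepspeed]'

# Navigate to the examples/llama-3 directory
cd examples/llama-3
```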
Step 3: Start the training process on each Pod
Run the training command in the web terminal of each Pod.

Note: Currently, the dynamic `c10d` backend is not supported. Please keep the `rdzv_backend` flag set to `static`.
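The full command is not reproduced here; a representative `torchrun` invocation, assuming the cluster exposes the environment variables named below and that one of the example LoRA configs in `examples/llama-3` is used, might look like this:

```bash
# Launch one training worker per GPU on this Pod and rendezvous with
# the other Pods. $NUM_NODES, $NODE_RANK, $NUM_TRAINERS, $PRIMARY_ADDR,
# and $PRIMARY_PORT are assumed names for the cluster's environment
# variables; substitute your own values if they differ.
torchrun \
  --nnodes=$NUM_NODES \
  --node_rank=$NODE_RANK \
  --nproc_per_node=$NUM_TRAINERS \
  --rdzv_id=axolotl-job \
  --rdzv_backend=static \
  --rdzv_endpoint="$PRIMARY_ADDR:$PRIMARY_PORT" \
  -m axolotl.cli.train lora-1b.yml  # config filename assumed
```

With a static rendezvous backend, every Pod must use the same rendezvous ID and endpoint; only the node rank differs between Pods.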
When the training process completes, you’ll find the trained model in the `../outputs/lora-out` directory. You can now use this model for inference or continue training with different parameters.
Step 4: Clean up
If you no longer need your cluster, make sure you return to the Instant Clusters page and delete it to avoid incurring extra charges.

You can monitor your cluster usage and spending using the Billing Explorer at the bottom of the Billing page, under the Cluster tab.
Next steps
Now that you’ve successfully deployed and tested an Axolotl distributed training job on an Instant Cluster, you can:

- Fine-tune your own models by modifying the configuration files in Axolotl to suit your specific requirements.
- Scale your training by adjusting the number of Pods in your cluster (and the size of their containers and volumes) to handle larger models or datasets.
- Try different optimization techniques such as DeepSpeed, FSDP (Fully Sharded Data Parallel), or other distributed training strategies.
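As an example of the last two points, Axolotl exposes FSDP settings directly in its YAML config; a minimal sketch follows (the key names come from Axolotl’s `fsdp` options, while the specific values are illustrative assumptions to adapt to your model):

```yaml
# Illustrative FSDP settings for an Axolotl config file.
# The values below (sharding strategy, layer class to wrap) are
# assumptions for a Llama-style model; adapt them to your own setup.
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```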