Requirements
The following are minimum requirements and are subject to change.
Software specifications
- Ubuntu Server 22.04 LTS:
  - Basic Linux proficiency.
  - Ability to remotely connect via SSH.
Operating system
- Ubuntu Server 22.04 LTS
- Use the standard 22.04 installer image, but select the HWE (hardware enablement) kernel during installation.
- This installs kernel 6.5.0-15 (use a more recent production version if one is available).
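The kernel requirement above can be checked after installation. The following is a minimal sketch, assuming the running kernel reports a release string such as `6.5.0-15-generic`. The helper names `parse_kernel` and `kernel_at_least` are hypothetical and exist only for this illustration.

```python
import platform
import re


def parse_kernel(release: str) -> tuple:
    """Turn a release string such as '6.5.0-15-generic' into (6, 5, 0, 15)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?(?:-(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # Missing components (for example no ABI number) count as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def kernel_at_least(release: str, minimum: str = "6.5.0-15") -> bool:
    """Return True if `release` is the same as or newer than `minimum`."""
    return parse_kernel(release) >= parse_kernel(minimum)


if __name__ == "__main__":
    release = platform.release()
    status = "OK" if kernel_at_least(release) else "older than expected"
    print(f"{release}: {status}")
```

The default minimum mirrors the version named above. Pass a different `minimum` if a more recent production kernel is the target.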
BIOS
- For non-VM systems, make sure IOMMU is disabled in the BIOS.
- If you run into compatibility issues, updating the server BIOS to the latest stable version is good practice.
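One way to confirm the IOMMU setting from the running system is to look at the IOMMU groups the kernel exposes. This sketch assumes that on Linux an active IOMMU populates `/sys/kernel/iommu_groups`, and that an empty or missing directory suggests it is disabled. Treat the result as a hint only; the BIOS setup screen remains the authoritative source. The function name `iommu_groups` is hypothetical.

```python
from pathlib import Path


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> list:
    """Return the sorted IOMMU group names under `root` (empty if none)."""
    path = Path(root)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir())


if __name__ == "__main__":
    groups = iommu_groups()
    if groups:
        print(f"IOMMU appears active: {len(groups)} group(s) found")
    else:
        print("No IOMMU groups found; IOMMU appears disabled")
```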
Drivers
- Nvidia drivers 550.54.15 (use a more recent production version if one is available).
- Never use beta or new feature branch drivers unless you have been instructed otherwise.
- CUDA 12.4 (use a more recent production version if one is available).
- Nvidia persistence mode should be enabled for GPUs with 48 GB of memory or more.
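The driver and persistence items above can be checked with the `nvidia-smi` query interface. The sketch below parses its CSV output and lists large-memory GPUs that do not have persistence mode enabled. The function names are hypothetical. The 45000 MiB default threshold is also an assumption: `nvidia-smi` reports memory in MiB, and the reported figure for a 48 GB card is typically somewhat below the marketed capacity, so adjust it for your hardware.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY = "index,name,driver_version,memory.total,persistence_mode"


def read_gpu_rows() -> str:
    """Run nvidia-smi and return its CSV output (requires the driver)."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout


def gpus_missing_persistence(csv_text: str, min_mib: int = 45000) -> list:
    """List GPUs with at least `min_mib` MiB whose persistence mode is off."""
    flagged = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        index, name, _driver, memory_mib, persistence = (c.strip() for c in row)
        if int(memory_mib) >= min_mib and persistence != "Enabled":
            flagged.append(f"GPU {index} ({name})")
    return flagged
```

On a machine with the driver installed, `gpus_missing_persistence(read_gpu_rows())` returns an empty list when every large GPU has persistence mode enabled. Persistence mode can usually be turned on with `sudo nvidia-smi -pm 1` or by running the `nvidia-persistenced` service.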
HGX SXM Systems
- Nvidia Fabric Manager needs to be installed, activated, running, and tested.
- Mandatory: the Fabric Manager version, the Nvidia driver version, and the kernel driver headers version must be identical.
- A peer-to-peer (p2p) bandwidth test should pass.
- CUDA Toolkit, Nvidia NSCQ and Nvidia DCGM need to be installed.
- Verify that the NVLink switch topology is correct using nvidia-smi and dcgmi.
- Verify that the SXM GPUs are performing well using the dcgmi diagnostic tool.
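The HGX checks above can be partly scripted. The sketch below does two things: it compares a driver version string with a Fabric Manager package version string, and it runs a short list of commands commonly used to inspect the Fabric Manager service, the NVLink topology, and DCGM diagnostics. Several points are assumptions: the choice of commands, the diagnostic level `-r 3`, and the service name `nvidia-fabricmanager`. The helper names are hypothetical. The p2p bandwidth test is not covered here; one commonly used option is `p2pBandwidthLatencyTest` from the Nvidia CUDA samples.

```python
import re
import shutil
import subprocess

# Assumed set of inspection commands; adapt to your environment.
CHECKS = [
    ["systemctl", "is-active", "nvidia-fabricmanager"],
    ["nvidia-smi", "topo", "-m"],
    ["nvidia-smi", "nvlink", "--status"],
    ["dcgmi", "discovery", "-l"],
    ["dcgmi", "diag", "-r", "3"],
]


def base_version(text: str) -> str:
    """Extract the leading 'NNN.NN[.NN]' version, e.g. from '550.54.15-1'."""
    match = re.search(r"\d+\.\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return match.group(0)


def versions_match(driver: str, fabric_manager: str) -> bool:
    """Return True if both strings carry the same base version."""
    return base_version(driver) == base_version(fabric_manager)


def run_checks(commands=CHECKS) -> dict:
    """Run each command; map it to its exit code, or None if not installed."""
    results = {}
    for cmd in commands:
        label = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            results[label] = None  # tool is not on PATH
            continue
        results[label] = subprocess.run(cmd, check=False).returncode
    return results
```

For example, `versions_match("550.54.15", "550.54.15-1")` is true, while a 535-branch Fabric Manager paired with a 550-branch driver is reported as a mismatch. A non-zero or `None` entry in the `run_checks()` result points at the item that needs attention.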