Applies to HGX A800 / Tesla series NVIDIA data center GPUs.

1. Official Resources

NVIDIA Driver / CUDA

NVIDIA Driver and CUDA Compatibility Matrix
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
Driver download
https://www.nvidia.cn/Download/index.aspx?lang=cn
- Product Type: Data Center / Tesla
- Product Series: HGX-Series
- Product: HGX-A800
- OS: Linux 64-bit
Offline packages for CUDA releases
https://developer.nvidia.com/cuda-toolkit-archive
NVIDIA Ubuntu repository (FabricManager / driver packages; the Ubuntu 20.04 path is shown, use ubuntu2204 for Ubuntu 22.04)
https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/

InfiniBand Driver

MLNX OFED official download
https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed

Deployment Reference Docs

Volcano Engine A800 GPU cluster deployment reference
https://www.volcengine.com/docs/6419/1123374

2. Basic System Configuration

2.1 Update Package Mirrors

bash <(curl -sSL https://linuxmirrors.cn/main.sh)

2.2 Set Hostname

sudo hostnamectl set-hostname ubuntu-2204

2.3 Disable Nouveau

sudo bash -c "echo blacklist nouveau >> /etc/modprobe.d/blacklist.conf"
sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist.conf"

sudo update-initramfs -u
sudo reboot

Verify (no output means success):

lsmod | grep nouveau
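
For provisioning scripts, the same check can be wrapped in a function that returns an exit status instead of relying on a human reading empty output. This is a minimal sketch; the name `nouveau_loaded` is hypothetical and not part of any NVIDIA tool. It reads `lsmod`-style text on stdin, so it can be tried on saved output as well.

```shell
# Hypothetical helper: exit 0 if a module named exactly "nouveau"
# appears in lsmod-style input on stdin, exit 1 otherwise.
nouveau_loaded() {
  awk '$1 == "nouveau" { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage on a real host:
#   if lsmod | nouveau_loaded; then
#     echo "nouveau is still loaded; check the blacklist and reboot"
#   fi
```

Matching on the first column avoids false positives from other modules that merely list nouveau in their "Used by" column.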

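3. Desktop Environment and Dependencies

(Optional: the lightdm desktop can be skipped on headless servers; gcc and make are typically needed by the .run driver installer to build its kernel modules.)

sudo apt update
sudo apt install lightdm
sudo apt install gcc make dkms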

List available drivers:

sudo apt install ubuntu-drivers-common
ubuntu-drivers devices

4. Install NVIDIA Driver

Example: HGX A800 with driver 550.54.15

wget https://cn.download.nvidia.cn/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run
chmod +x NVIDIA-Linux-x86_64-550.54.15.run

sudo ./NVIDIA-Linux-x86_64-550.54.15.run \
  -no-x-check -no-nouveau-check -no-opengl-files

Verify:

nvidia-smi
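
Beyond eyeballing the `nvidia-smi` table, a script can confirm that every GPU is visible. The sketch below counts the entries printed by `nvidia-smi -L`, which lists one GPU per line. The function name `count_gpus` is hypothetical, and the expected count of 8 is an assumption based on the 8-GPU benchmark later in this guide.

```shell
# Hypothetical helper: count GPUs in `nvidia-smi -L` output read from stdin.
# Each GPU appears on a line that starts with "GPU <index>:".
count_gpus() {
  awk '/^GPU [0-9]+:/ { n++ } END { print n + 0 }'
}

# Usage on a real host (8 is an assumed GPU count for an HGX A800 node):
#   [ "$(nvidia-smi -L | count_gpus)" -eq 8 ] || echo "not all GPUs are visible"
```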

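5. Uninstall NVIDIA Driver

# Remove driver packages installed through apt (quote the pattern so the shell does not expand it)
sudo apt-get --purge remove "*nvidia*"
sudo apt autoremove
# Remove a driver that was installed from the .run file
sudo nvidia-uninstall
# Check what is left, then reboot
dpkg -l | grep nvidia
sudo reboot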
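
`dpkg -l | grep nvidia` also lists packages in the `rc` state (removed, configuration files remaining), which can look like a failed uninstall. As a rough sketch, the filter below keeps only packages that are still fully installed (`ii`). The function name is hypothetical.

```shell
# Hypothetical helper: from `dpkg -l` output on stdin, print the names of
# packages whose status is "ii" (installed) and whose name contains "nvidia".
installed_nvidia_pkgs() {
  awk '$1 == "ii" && $2 ~ /nvidia/ { print $2 }'
}

# Usage on a real host:
#   dpkg -l | installed_nvidia_pkgs
```

Empty output suggests that no NVIDIA packages remain installed through apt; it says nothing about a driver installed from the .run file.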

6. Install CUDA Toolkit

Example (CUDA 12.4):

wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run

sudo sh cuda_12.4.0_550.54.14_linux.run

During installation:

- Accept the EULA.
- Uncheck the driver component, since the driver is already installed (section 4).
- Install the CUDA Toolkit.

7. Configure CUDA Environment Variables

echo 'export PATH=$PATH:/usr/local/cuda-12.4/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64' >> ~/.bashrc
source ~/.bashrc

Verify:

nvcc -V
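
Appending with `>>` adds a duplicate line to `~/.bashrc` every time the step is repeated. One possible way to make it repeatable is a helper that appends a line only when it is not already present. This is a sketch; `append_once` is a hypothetical name.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already there, so re-running the setup does not duplicate it.
append_once() {
  line="$1"
  file="$2"
  touch "$file"
  # -x: match whole lines, -F: treat the pattern as a fixed string
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage with the exports from this section:
#   append_once 'export PATH=$PATH:/usr/local/cuda-12.4/bin' ~/.bashrc
#   append_once 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64' ~/.bashrc
```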

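8. Uninstall CUDA

# Recent CUDA runfile installs, including 12.4, ship cuda-uninstaller;
# the uninstall_cuda_<version>.pl script belongs to older releases.
sudo /usr/local/cuda-12.4/bin/cuda-uninstaller
sudo rm -rf /usr/local/cuda-12.4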

9. Install IB Driver (MLNX OFED)

Example for Ubuntu 22.04:

wget https://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
tar -zxvf MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
cd MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64

sudo ./mlnxofedinstall
sudo /etc/init.d/openibd restart

Verify:

ibdev2netdev
ibstatus
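
To check link state from a script, the `ibstatus` report can be reduced to a count of ports whose logical state is ACTIVE. The sketch below assumes the usual `ibstatus` layout, in which each port has a `state:` line such as `state: 4: ACTIVE` next to a separate `phys state:` line. The function name is hypothetical.

```shell
# Hypothetical helper: count ports whose logical state is ACTIVE in
# ibstatus output read from stdin. "phys state:" lines are ignored
# because their first field is "phys", not "state:".
count_active_ports() {
  awk '$1 == "state:" && $NF == "ACTIVE" { n++ } END { print n + 0 }'
}

# Usage on a real host:
#   ibstatus | count_active_ports
```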

If ibstatus is not installed:

sudo apt install infiniband-diags

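10. Critical Module Checks

# Load the GPUDirect RDMA peer-memory module
sudo modprobe nvidia_peermem
# If loading fails, install nvidia-modprobe and retry
sudo apt-get install nvidia-modprobe
sudo modprobe nvidia_peermem

lsmod | grep nvidia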

Example of expected output (module names only):

nvidia_peermem
nvidia_drm
nvidia_modeset
nvidia
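
The module list above can also be checked automatically. The following sketch prints every required module that is absent from `lsmod` output, so empty output means all of them are loaded. The function name `missing_modules` is hypothetical.

```shell
# Hypothetical helper: print each module named on the command line that
# does not appear in lsmod-style input on stdin. Prints nothing when all
# requested modules are loaded.
missing_modules() {
  awk -v want="$*" '
    { loaded[$1] = 1 }
    END {
      n = split(want, names, " ")
      for (i = 1; i <= n; i++)
        if (!(names[i] in loaded)) print names[i]
    }'
}

# Usage on a real host:
#   lsmod | missing_modules nvidia nvidia_modeset nvidia_drm nvidia_peermem
```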

11. Install OpenMPI

sudo apt install -y gcc g++ make hwloc hwloc-nox libevent-dev

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
tar zxf openmpi-4.1.5.tar.gz
cd openmpi-4.1.5

./configure --prefix=/usr/local/openmpi
make -j$(nproc)
sudo make install

Environment variables:

echo 'export PATH=$PATH:/usr/local/openmpi/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/openmpi/lib' >> ~/.bashrc
source ~/.bashrc

Verify:

mpiexec --version

12. Configure IB NICs (Netplan)

Example (multiple IB NICs), placed in a YAML file under /etc/netplan/:

network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
      match:
        macaddress: 52:54:00:13:3f:6d
      set-name: eth0
    ib0:
      dhcp4: no
      addresses: [10.1.12.1/24]
    ib1:
      dhcp4: no
      addresses: [10.1.12.2/24]
    ib2:
      dhcp4: no
      addresses: [10.1.12.3/24]
    ib3:
      dhcp4: no
      addresses: [10.1.12.4/24]

Apply:

sudo netplan apply
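
On nodes with many IB ports, the repeated stanzas can be generated rather than typed. The sketch below prints stanzas in the same shape as the example, with interfaces named ib0, ib1, and so on and consecutive host addresses in one /24. The function name and its argument order are hypothetical, and real interface names may differ from ibN.

```shell
# Hypothetical helper: print netplan stanzas for ib0..ib(N-1).
# $1 = number of IB interfaces, $2 = first three octets (e.g. 10.1.12).
gen_ib_stanzas() {
  count="$1"
  prefix="$2"
  i=0
  while [ "$i" -lt "$count" ]; do
    printf '    ib%d:\n' "$i"
    printf '      dhcp4: no\n'
    printf '      addresses: [%s.%d/24]\n' "$prefix" "$((i + 1))"
    i=$((i + 1))
  done
}

# Usage: paste the output under "ethernets:" in the netplan file.
#   gen_ib_stanzas 4 10.1.12
```

When working over a remote session, `sudo netplan try` applies the configuration and rolls it back unless it is confirmed, which can be safer than `netplan apply`.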

13. Install NCCL

13.1 NVIDIA Repository Keyring

Ubuntu 22.04:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update

13.2 Install a Specific NCCL Version

sudo apt install libnccl2=2.20.5-1+cuda12.4 \
                 libnccl-dev=2.20.5-1+cuda12.4

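13.3 Environment Variables

sudo vim /etc/profile

Add the following lines. The export keyword is required so that child processes inherit the variables; adjust NCCL_IB_HCA to the device names that ibdev2netdev reports on your hosts.

export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_6:1,mlx5_7:1
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_LEVEL=2
export NCCL_DEBUG=INFO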

Activate:

source /etc/profile
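
The NCCL_IB_HCA value is host specific, so it can help to derive it from `ibdev2netdev` instead of copying it. The sketch below assumes the usual output format, `mlx5_0 port 1 ==> ib0 (Up)`, and keeps only ports that are Up and mapped to an ib* interface. That filter is an assumption, not a rule from NCCL, and the function name is hypothetical.

```shell
# Hypothetical helper: turn ibdev2netdev-style lines on stdin, e.g.
#   mlx5_0 port 1 ==> ib0 (Up)
# into a comma-separated device:port list of Up ports on ib* interfaces.
ib_hca_list() {
  awk '$2 == "port" && $5 ~ /^ib/ && $6 == "(Up)" {
         printf "%s%s:%s", sep, $1, $3
         sep = ","
       }
       END { print "" }'
}

# Usage on a real host:
#   export NCCL_IB_HCA="$(ibdev2netdev | ib_hca_list)"
```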

Verify that the NCCL library is known to the dynamic linker:

ldconfig -p | grep "libnccl.so"

14. Install NCCL Tests

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/openmpi

15. Install FabricManager

Ubuntu 22.04 Example

The FabricManager version must match the installed driver version (550.54.15 here).

wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-dev-550_550.54.15-1_amd64.deb
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-550_550.54.15-1_amd64.deb

sudo dpkg -i nvidia-fabricmanager-550_550.54.15-1_amd64.deb
sudo dpkg -i nvidia-fabricmanager-dev-550_550.54.15-1_amd64.deb

sudo systemctl start nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager
systemctl status nvidia-fabricmanager
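
If the service fails to start, a version mismatch between the driver and FabricManager is worth ruling out first. The sketch below compares a driver version string with a Debian package version by stripping the package revision suffix. The function name is hypothetical; the package name `nvidia-fabricmanager-550` is taken from the .deb file name above.

```shell
# Hypothetical helper: succeed if the driver version equals the upstream
# part of the FabricManager package version.
# $1 = driver version, e.g. 550.54.15
# $2 = package version, e.g. 550.54.15-1 (Debian revision after the dash)
fm_matches_driver() {
  driver="$1"
  pkg="${2%%-*}"   # drop everything from the first "-" onward
  [ -n "$driver" ] && [ "$driver" = "$pkg" ]
}

# Usage on a real host:
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   fm=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-550)
#   fm_matches_driver "$drv" "$fm" || echo "driver $drv != FabricManager $fm"
```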

16. Single-Node NCCL Benchmark

Run from the nccl-tests directory:

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
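
For repeated runs it can be convenient to extract just the summary figure. nccl-tests prints a line of the form `# Avg bus bandwidth    : <value>` near the end of its output; the sketch below pulls out that value. The exact line format is assumed from typical nccl-tests output and may differ between versions; the function name is hypothetical.

```shell
# Hypothetical helper: print the value from the "Avg bus bandwidth" summary
# line of nccl-tests output read from stdin.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage on a real host, from the nccl-tests directory:
#   ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 | avg_busbw
```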

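17. Multi-Node Benchmark

17.1 Passwordless SSH

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
# Replace USER@REMOTE_HOST with the login and address of each other node.
# ssh-copy-id appends the key rather than overwriting the remote authorized_keys file.
ssh-copy-id -i ~/.ssh/id_rsa.pub USER@REMOTE_HOST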

17.2 NCCL + MPI Multi-Node Benchmark Example

mpirun --oversubscribe --allow-run-as-root \
  -mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" \
  -n 16 -N 8 \
  -H node1,node2 \
  -bind-to socket -map-by slot \
  -mca pml ob1 -mca btl ^openib \
  -mca orte_base_help_aggregate 0 \
  -mca btl_tcp_if_include eth0 \
  -mca coll_hcoll_enable 0 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_IB_DISABLE=0 \
  -x NCCL_NET_GDR_LEVEL=2 \
  -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1 \
  -x NCCL_IB_GID_INDEX=3 \
  ~/nccl-tests/build/all_reduce_perf \
  -b 256M -e 8G -f 2 -g 1 -c 1 -n 100
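
With `-H node1,node2`, Open MPI 4.x counts one slot per listed host, which is likely why the command above needs `--oversubscribe` to place 8 ranks per node. An alternative is a hostfile that declares the slot count explicitly. The sketch below generates one; the function name and the file name `hosts.txt` are hypothetical.

```shell
# Hypothetical helper: print an Open MPI hostfile with a fixed slot count.
# $1 = slots per node, remaining arguments = host names.
make_hostfile() {
  slots="$1"
  shift
  for host in "$@"; do
    printf '%s slots=%s\n' "$host" "$slots"
  done
}

# Usage:
#   make_hostfile 8 node1 node2 > hosts.txt
#   mpirun --hostfile hosts.txt -n 16 -N 8 ...   # remaining options as above
```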

18. Common Component Overview

NVIDIA Components

- GPU Driver: drives the GPU hardware
- CUDA Toolkit: general-purpose GPU computing toolchain
- cuDNN: deep learning GPU acceleration library

OpenMPI

MPI implementation for multi-node communication.

NCCL

NVIDIA Collective Communications Library for efficient multi-GPU / multi-node communication.

NCCL Tests

NCCL performance and correctness benchmark suite.

FabricManager

Manages the NVSwitch / NVLink fabric; required on NVSwitch-based HGX servers.

Complete Installation Guide: Ubuntu NVIDIA GPU + CUDA + NCCL + OFED + OpenMPI High‑Performance Computing Environment
