
Complete Installation Guide: Ubuntu NVIDIA GPU + CUDA + NCCL + OFED + OpenMPI High‑Performance Computing Environment

Applies to HGX A800 / Tesla series NVIDIA data center GPUs.


1. Official Resources#

NVIDIA Driver / CUDA#

Infiniband Driver#

Deployment Reference Docs#


2. Basic System Configuration#

2.1 Update Package Mirrors#

Terminal window
bash <(curl -sSL https://linuxmirrors.cn/main.sh)

2.2 Set Hostname#

Terminal window
hostnamectl set-hostname ubuntu-2204

2.3 Disable Nouveau#

Terminal window
sudo bash -c "echo blacklist nouveau >> /etc/modprobe.d/blacklist.conf"
sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist.conf"
sudo update-initramfs -u
sudo reboot

Verify (no output means success):

Terminal window
lsmod | grep nouveau
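
A plain `grep nouveau` matches the string anywhere on a line, including the dependency column of other modules. A stricter check, sketched here, compares only the first column of the `lsmod` output. The helper name `module_in_list` is made up for this example.

```shell
# Hypothetical helper: succeeds if module "$1" is listed in lsmod-style
# output read from stdin (the first column is the module name).
module_in_list() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

if lsmod 2>/dev/null | module_in_list nouveau; then
    echo "nouveau is still loaded; re-check the blacklist and initramfs"
else
    echo "nouveau is not loaded"
fi
```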

3. Desktop Environment and Dependencies#

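The display manager is optional on a headless server. The compiler toolchain (`gcc`, `make`) is needed by the `.run` driver installer to build its kernel module.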

Terminal window
sudo apt update
sudo apt install lightdm
sudo apt install gcc make dkms

List available drivers:

Terminal window
sudo apt install ubuntu-drivers-common
ubuntu-drivers devices

4. Install NVIDIA Driver#

Example: HGX A800 + 550 driver

Terminal window
wget https://cn.download.nvidia.cn/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run
chmod +x NVIDIA-Linux-x86_64-550.54.15.run
sudo ./NVIDIA-Linux-x86_64-550.54.15.run \
    --no-x-check --no-nouveau-check --no-opengl-files

Verify:

Terminal window
nvidia-smi
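
Beyond eyeballing the table, the GPU count can be checked in a script. The sketch below uses the `--query-gpu` and `--format` options of `nvidia-smi`. The expected count of 8 is an assumption taken from the `-g 8` benchmark later in this guide, and `check_gpu_count` is a hypothetical helper.

```shell
# Assumption: 8 GPUs per node, matching the "-g 8" benchmark in this guide.
EXPECTED_GPUS=8

# Hypothetical helper: counts non-empty lines on stdin and compares to "$1".
check_gpu_count() {
    found=$(grep -c . || true)
    if [ "$found" -eq "$1" ]; then
        echo "OK: $found GPUs visible"
    else
        echo "MISMATCH: expected $1 GPUs, found $found"
        return 1
    fi
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader \
        | check_gpu_count "$EXPECTED_GPUS" || true
else
    echo "nvidia-smi not found; is the driver installed?"
fi
```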

5. Uninstall NVIDIA Driver#

Terminal window
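# Remove packages installed through apt (quote the glob so the shell does not expand it)
sudo apt-get --purge remove "*nvidia*"
sudo apt autoremove
# Remove a driver installed from the .run file
sudo nvidia-uninstall
# Confirm nothing is left
dpkg -l | grep nvidia
sudo reboot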

6. Install CUDA Toolkit#

Example (CUDA 12.4):

Terminal window
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run

During installation:

  • Accept EULA
  • Uncheck driver installation
  • Install CUDA

7. Configure CUDA Environment Variables#

Terminal window
echo 'export PATH=$PATH:/usr/local/cuda-12.4/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64' >> ~/.bashrc
source ~/.bashrc
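
Running the `echo ... >> ~/.bashrc` lines more than once leaves duplicate entries in the file. One way to avoid that is an idempotent append, sketched below. `append_once` is a hypothetical helper, and the demo writes to a scratch file rather than the real `~/.bashrc`.

```shell
# Hypothetical helper: append line "$1" to file "$2" only if it is not
# already present as an exact, whole-line match.
append_once() {
    touch "$2"
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Demonstrated on a scratch file; point it at ~/.bashrc for real use.
rc_file=$(mktemp)
append_once 'export PATH=$PATH:/usr/local/cuda-12.4/bin' "$rc_file"
append_once 'export PATH=$PATH:/usr/local/cuda-12.4/bin' "$rc_file"   # second call is a no-op
cat "$rc_file"
```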

Verify:

Terminal window
nvcc -V
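
To confirm that the `nvcc` on `PATH` is the 12.4 toolkit and not an older install, the release field can be extracted from the `nvcc -V` output. This is a minimal sketch; `nvcc_release` is a hypothetical helper that relies on the "release X.Y," text in the version banner.

```shell
# Hypothetical helper: extract the "release X.Y" field from `nvcc -V`
# output read from stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

want="12.4"
if command -v nvcc >/dev/null 2>&1; then
    have=$(nvcc -V | nvcc_release)
    if [ "$have" = "$want" ]; then
        echo "nvcc reports CUDA $have"
    else
        echo "nvcc reports CUDA ${have:-unknown}, expected $want"
    fi
else
    echo "nvcc not on PATH; check the exports in ~/.bashrc"
fi
```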

8. Uninstall CUDA#

Terminal window
# Runfile installs of recent CUDA releases ship cuda-uninstaller instead of the old Perl script
sudo /usr/local/cuda-12.4/bin/cuda-uninstaller
sudo rm -rf /usr/local/cuda-12.4

9. Install IB Driver (MLNX OFED)#

Example for Ubuntu 22.04:

Terminal window
wget https://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
tar -zxvf MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
cd MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64
sudo ./mlnxofedinstall
sudo /etc/init.d/openibd restart

Verify:

Terminal window
ibdev2netdev
ibstatus
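
The same port state that `ibstatus` prints is also exposed through sysfs, which is convenient for scripted checks. The sketch below walks `/sys/class/infiniband` and prints one line per port. `ib_port_states` is a hypothetical helper, and the optional argument exists only so the function can be pointed at a different directory.

```shell
# Hypothetical helper: print "<device> port <n>: <state>" for every port
# found under the given sysfs root (default /sys/class/infiniband).
ib_port_states() {
    root=${1:-/sys/class/infiniband}
    for f in "$root"/*/ports/*/state; do
        [ -r "$f" ] || continue
        port_dir=${f%/state}             # .../<device>/ports/<n>
        dev_dir=${port_dir%/ports/*}     # .../<device>
        echo "${dev_dir##*/} port ${port_dir##*/}: $(cat "$f")"
    done
}

ib_port_states
```

A healthy link shows a state of `4: ACTIVE`; ports stuck in `INIT` usually indicate that no subnet manager is running on the fabric.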

If `ibstatus` is not found, install the diagnostics package:

Terminal window
sudo apt install infiniband-diags

10. Critical Module Checks#

Terminal window
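# Try loading the GPUDirect RDMA peer-memory module
sudo modprobe nvidia_peermem
# If that fails, install nvidia-modprobe and retry
sudo apt-get install nvidia-modprobe
sudo modprobe nvidia_peermem
lsmod | grep nvidia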

Example of expected output:

nvidia_peermem
nvidia_drm
nvidia_modeset
nvidia

11. Install OpenMPI#

Terminal window
sudo apt install -y gcc g++ make hwloc hwloc-nox libevent-dev
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
tar zxf openmpi-4.1.5.tar.gz
cd openmpi-4.1.5
./configure --prefix=/usr/local/openmpi
make -j$(nproc)
sudo make install

Environment variables:

Terminal window
echo 'export PATH=$PATH:/usr/local/openmpi/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/openmpi/lib' >> ~/.bashrc
source ~/.bashrc

Verify:

Terminal window
mpiexec --version

12. Configure IB NICs (Netplan)#

Example (multiple IB NICs):

network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
      match:
        macaddress: 52:54:00:13:3f:6d
      set-name: eth0
    ib0:
      dhcp4: no
      addresses: [10.1.12.1/24]
    ib1:
      dhcp4: no
      addresses: [10.1.12.2/24]
    ib2:
      dhcp4: no
      addresses: [10.1.12.3/24]
    ib3:
      dhcp4: no
      addresses: [10.1.12.4/24]

Apply:

Terminal window
sudo netplan apply
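
The per-interface entries follow a regular pattern, so on nodes with many IB NICs they can be generated instead of typed by hand. This is a sketch only: `gen_ib_stanzas` is a hypothetical helper, the address range is copied from the example above, and the output is indented to sit under `ethernets:` in a two-space-indented netplan file.

```shell
# Hypothetical helper: print netplan entries for ib0..ib(N-1), assigning
# consecutive host addresses in the 10.1.12.0/24 range from the example.
gen_ib_stanzas() {
    i=0
    while [ "$i" -lt "$1" ]; do
        printf '    ib%d:\n      dhcp4: no\n      addresses: [10.1.12.%d/24]\n' \
            "$i" "$((i + 1))"
        i=$((i + 1))
    done
}

gen_ib_stanzas 4
```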

13. Install NCCL#

13.1 NVIDIA Repository Keyring#

Ubuntu 22.04:

Terminal window
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update

13.2 Install a Specific NCCL Version#

Terminal window
sudo apt install libnccl2=2.20.5-1+cuda12.4 \
libnccl-dev=2.20.5-1+cuda12.4

13.3 Environment Variables#

Terminal window
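sudo vim /etc/profile

Add (each variable must be exported so that child processes inherit it):

export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_6:1,mlx5_7:1
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_LEVEL=2
export NCCL_DEBUG=INFO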

Activate:

Terminal window
source /etc/profile
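
Only exported variables reach child processes such as the benchmark binaries, so it is worth confirming what is actually in the environment after sourcing the file. The sketch below lists any of the variables above that are missing; `missing_vars` is a hypothetical helper.

```shell
# Hypothetical helper: print the name of every listed variable that is
# not present in the exported environment.
missing_vars() {
    for name in "$@"; do
        env | grep -q "^${name}=" || echo "$name"
    done
}

missing_vars NCCL_IB_HCA NCCL_IB_DISABLE NCCL_SOCKET_IFNAME \
             NCCL_IB_GID_INDEX NCCL_NET_GDR_LEVEL NCCL_DEBUG
```

No output means all six variables are exported.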

Verify:

Terminal window
ldconfig -v | grep "libnccl.so"

14. Install NCCL Tests#

Terminal window
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/openmpi

15. Install FabricManager#

Ubuntu 22.04 Example#

Terminal window
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-dev-550_550.54.15-1_amd64.deb
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-550_550.54.15-1_amd64.deb
sudo dpkg -i nvidia-fabricmanager-550_550.54.15-1_amd64.deb
sudo dpkg -i nvidia-fabricmanager-dev-550_550.54.15-1_amd64.deb
sudo systemctl start nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager
systemctl status nvidia-fabricmanager

16. Single-Node NCCL Benchmark#

Terminal window
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
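
The benchmark ends with an "Avg bus bandwidth" summary line, which is the number usually compared between runs. A minimal sketch for capturing it is shown below. `avg_busbw` is a hypothetical helper, and the log file name `all_reduce.log` is an assumption.

```shell
# Hypothetical helper: extract the number from the "Avg bus bandwidth"
# summary line read from stdin.
avg_busbw() {
    sed -n 's/.*Avg bus bandwidth *: *\([0-9][0-9.eE+-]*\).*/\1/p'
}

# Run from the nccl-tests directory; keeps the full log and prints the summary value.
if [ -x ./build/all_reduce_perf ]; then
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 | tee all_reduce.log | avg_busbw
fi
```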

17. Multi-Node Benchmark#

17.1 Passwordless SSH#

Terminal window
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
scp ~/.ssh/id_rsa.pub [email protected]:~/.ssh/authorized_keys

17.2 NCCL + MPI Multi-Node Benchmark Example#

Terminal window
mpirun --oversubscribe --allow-run-as-root \
-mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" \
-n 16 -N 8 \
-H node1,node2 \
-bind-to socket -map-by slot \
-mca pml ob1 -mca btl ^openib \
-mca orte_base_help_aggregate 0 \
-mca btl_tcp_if_include eth0 \
-mca coll_hcoll_enable 0 \
-x NCCL_DEBUG=INFO \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_NET_GDR_LEVEL=2 \
-x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1 \
-x NCCL_IB_GID_INDEX=3 \
~/nccl-tests/build/all_reduce_perf \
-b 256M -e 8G -f 2 -g 1 -c 1 -n 100
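
In the command above each rank drives one GPU (`-g 1`), so the total rank count `-n` is the number of nodes times the GPUs per node (`-N`). The sketch below derives it from the host list; the host names and the per-node count are the values from the example and should be adjusted to the actual cluster.

```shell
# Values taken from the example command; adjust to the actual cluster.
hosts="node1,node2"
gpus_per_node=8

# Count the comma-separated host names.
num_nodes=$(printf '%s\n' "$hosts" | tr ',' '\n' | grep -c .)
total_ranks=$((num_nodes * gpus_per_node))

echo "mpirun -n $total_ranks -N $gpus_per_node -H $hosts ..."
```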

18. Common Component Overview#

NVIDIA Components#

  • GPU Driver: Drives the GPU hardware
  • CUDA Toolkit: General-purpose GPU computing toolchain
  • cuDNN: Deep learning GPU acceleration library

OpenMPI#

MPI implementation for multi-node communication.

NCCL#

NVIDIA Collective Communications Library for efficient multi-GPU / multi-node communication.

NCCL Tests#

NCCL performance and correctness benchmark suite.

FabricManager#

Manages NVSwitch / NVLink topology; required for HGX servers.

Source: https://catcat.blog/en/2025/12/ubuntu-nvidia-gpu-cuda-nccl-ofed-openmpi-installation-guide.html
Author: 猫猫博客
Published: 2025-12-08
License: CC BY-NC-SA 4.0