Complete Installation Guide: Ubuntu NVIDIA GPU + CUDA + NCCL + OFED + OpenMPI High‑Performance Computing Environment
Applies to HGX A800 / Tesla series NVIDIA data center GPUs.
1. Official Resources
NVIDIA Driver / CUDA
- NVIDIA Driver and CUDA Compatibility Matrix
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
- Driver download
https://www.nvidia.cn/Download/index.aspx?lang=cn
  - Product Type: Data Center / Tesla
  - Product Series: HGX-Series
  - Product: HGX-A800
  - OS: Linux 64-bit
- Offline packages for CUDA releases
https://developer.nvidia.com/cuda-toolkit-archive
- NVIDIA Ubuntu Repository (FabricManager/driver packages)
https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/
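The compatibility matrix pairs each CUDA release with a minimum driver version. As a rough sketch, the check can be scripted once a driver is installed. The required version below is an assumption taken from the CUDA 12.4.0 runfile name used later in this guide (550.54.14); confirm the real minimum in the release notes. `version_ge` and `REQUIRED_DRIVER` are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare the installed driver version with the minimum a CUDA release needs.
# ASSUMPTION: 550.54.14 comes from the CUDA 12.4.0 runfile name in this guide;
# check the release-notes matrix for the authoritative value.
REQUIRED_DRIVER="550.54.14"

# version_ge A B: succeeds when version A >= version B (relies on sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Ask the driver for its own version; take the first GPU's answer.
  installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
  if version_ge "$installed" "$REQUIRED_DRIVER"; then
    echo "driver $installed satisfies >= $REQUIRED_DRIVER"
  else
    echo "driver $installed is older than $REQUIRED_DRIVER"
  fi
else
  echo "nvidia-smi not found: install the driver first"
fi
```

Run it again after the driver installation step; before that, it only reports that `nvidia-smi` is missing.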
InfiniBand Driver
- MLNX OFED official download
https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed
Deployment Reference Docs
- Volcano Engine A800 GPU cluster deployment reference
https://www.volcengine.com/docs/6419/1123374
2. Basic System Configuration
2.1 Update Package Mirrors
bash <(curl -sSL https://linuxmirrors.cn/main.sh)
2.2 Set Hostname
hostnamectl set-hostname ubuntu-2204
2.3 Disable Nouveau
sudo bash -c "echo blacklist nouveau >> /etc/modprobe.d/blacklist.conf"
sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist.conf"
sudo update-initramfs -u
reboot
Verify (no output means success):
lsmod | grep nouveau
3. Desktop Environment and Dependencies
(Optional)
sudo apt update
sudo apt install lightdm
sudo apt install gcc make dkms
List available drivers:
apt install ubuntu-drivers-common
ubuntu-drivers devices
4. Install NVIDIA Driver
Example: HGX A800 + 550 driver
wget https://cn.download.nvidia.cn/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run
chmod +x NVIDIA-Linux-x86_64-550.54.15.run
sudo ./NVIDIA-Linux-x86_64-550.54.15.run \
  --no-x-check --no-nouveau-check --no-opengl-files
Verify:
nvidia-smi
5. Uninstall NVIDIA Driver
sudo apt-get --purge remove "nvidia*"
sudo apt autoremove
sudo apt-get --purge remove "*nvidia*"
nvidia-uninstall
dpkg -l | grep nvidia
reboot
6. Install CUDA Toolkit
Example (CUDA 12.4):
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run
During installation:
- Accept EULA
- Uncheck driver installation
- Install CUDA
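For scripted or repeated installs, the same choices can be expressed with runfile options instead of the interactive menu. The sketch below only prints the command so that it can be reviewed first; `cuda_silent_cmd` is a hypothetical helper name, and the runfile name is the one from the example above.

```shell
#!/usr/bin/env bash
# Sketch: non-interactive equivalent of "accept EULA, skip driver, install CUDA".
RUNFILE="cuda_12.4.0_550.54.14_linux.run"

# cuda_silent_cmd FILE: print the install command; nothing is installed here.
cuda_silent_cmd() {
  # --silent  : no interactive UI; using it implies accepting the EULA
  # --toolkit : install only the CUDA Toolkit, leaving the existing driver alone
  printf 'sudo sh %s --silent --toolkit\n' "$1"
}

cuda_silent_cmd "$RUNFILE"
```

Copy the printed line into a terminal, or a provisioning script, once it matches the runfile you downloaded.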
7. Configure CUDA Environment Variables
echo 'export PATH=$PATH:/usr/local/cuda-12.4/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64' >> ~/.bashrc
source ~/.bashrc
Verify:
nvcc -V
8. Uninstall CUDA
cd /usr/local/cuda-12.4/bin
sudo ./cuda-uninstaller
sudo rm -rf /usr/local/cuda-12.4
9. Install IB Driver (MLNX OFED)
Example for Ubuntu 22.04:
wget https://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
tar -zxvf MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
cd MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64
sudo ./mlnxofedinstall
sudo /etc/init.d/openibd restart
Verify:
ibdev2netdev
ibstatus
If not installed:
apt install infiniband-diags
10. Critical Module Checks
modprobe nvidia_peermem
apt-get install nvidia-modprobe
modprobe nvidia_peermem
lsmod | grep nvidia
Example of expected output:
nvidia_peermem
nvidia_drm
nvidia_modeset
nvidia
11. Install OpenMPI
apt install -y gcc g++ make hwloc hwloc-nox libevent-dev
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
tar zxf openmpi-4.1.5.tar.gz
cd openmpi-4.1.5
./configure --prefix=/usr/local/openmpi
make -j$(nproc)
make install
Environment variables:
echo 'export PATH=$PATH:/usr/local/openmpi/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/openmpi/lib' >> ~/.bashrc
source ~/.bashrc
Verify:
mpiexec --version
12. Configure IB NICs (Netplan)
Example (multiple IB NICs):
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
      match:
        macaddress: 52:54:00:13:3f:6d
      set-name: eth0
    ib0:
      dhcp4: no
      addresses: [10.1.12.1/24]
    ib1:
      dhcp4: no
      addresses: [10.1.12.2/24]
    ib2:
      dhcp4: no
      addresses: [10.1.12.3/24]
    ib3:
      dhcp4: no
      addresses: [10.1.12.4/24]
Apply:
netplan apply
13. Install NCCL
13.1 NVIDIA Repository Keyring
Ubuntu 22.04:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update
13.2 Install a Specific NCCL Version
sudo apt install libnccl2=2.20.5-1+cuda12.4 \
  libnccl-dev=2.20.5-1+cuda12.4
13.3 Environment Variables
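The NCCL_IB_HCA value used in this section should name HCA ports that are actually up on the host. As a sketch, a candidate list can be derived from `ibdev2netdev` output rather than typed by hand. Two assumptions apply: each output line looks like `mlx5_0 port 1 ==> ib0 (Up)`, and the IB network devices are named `ib*` as in the Netplan example. `hca_list` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: build a comma-separated "device:port" list from ibdev2netdev output.
# ASSUMPTION: lines look like "mlx5_0 port 1 ==> ib0 (Up)".

# hca_list: read ibdev2netdev output on stdin, keep ports that are Up and
# mapped to an ib* netdev, and print them as dev:port,dev:port,...
hca_list() {
  awk '$NF == "(Up)" && $5 ~ /^ib/ {
         printf "%s%s:%s", sep, $1, $3   # e.g. mlx5_0:1
         sep = ","
       }
       END { if (sep) print "" }'
}

if command -v ibdev2netdev >/dev/null 2>&1; then
  echo "NCCL_IB_HCA=$(ibdev2netdev | hca_list)"
else
  echo "ibdev2netdev not found: install MLNX OFED first"
fi
```

Treat the printed value as a starting point; ports reserved for storage or management traffic may still need to be removed by hand.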
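vim /etc/profile
Add (each variable must be exported so that child processes such as mpirun inherit it):
export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_6:1,mlx5_7:1
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_LEVEL=2
export NCCL_DEBUG=INFO
Activate:
source /etc/profile
Verify:
ldconfig -v | grep "libnccl.so"
14. Install NCCL Tests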
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/openmpi
15. Install FabricManager
Ubuntu 22.04 Example
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-dev-550_550.54.15-1_amd64.deb
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-550_550.54.15-1_amd64.deb
sudo dpkg -i nvidia-fabricmanager-550_550.54.15-1_amd64.deb
sudo dpkg -i nvidia-fabricmanager-dev-550_550.54.15-1_amd64.deb
systemctl start nvidia-fabricmanager
systemctl enable nvidia-fabricmanager
systemctl status nvidia-fabricmanager
16. Single-Node NCCL Benchmark
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
17. Multi-Node Benchmark
17.1 Passwordless SSH
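ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
Then append ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on every node (for example with ssh-copy-id), so that mpirun can start remote processes without a password prompt.
17.2 NCCL + MPI Multi-Node Benchmark Example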
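Before launching the benchmark, it can help to confirm that every host accepts key-based login without a prompt, since mpirun fails in less obvious ways when one node still asks for a password. The following is a minimal sketch; `split_hosts` and `check_host` are hypothetical helper names, and the host list is passed in the same comma-separated form as the `-H` option of mpirun.

```shell
#!/usr/bin/env bash
# Sketch: verify passwordless SSH to each node.
# Usage: ./ssh_preflight.sh node1,node2

# split_hosts "a,b": print one host per line.
split_hosts() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# check_host HOST: succeed only if login works with no interactive prompt.
check_host() {
  # BatchMode=yes makes ssh fail instead of asking for a password.
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Only probe hosts when a list was given on the command line.
if [ -n "${1:-}" ]; then
  for h in $(split_hosts "$1"); do
    if check_host "$h" 2>/dev/null; then
      echo "ok      $h"
    else
      echo "FAILED  $h"
    fi
  done
fi
```

Any host reported as FAILED needs its key setup from 17.1 revisited before the mpirun command is attempted.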
mpirun --oversubscribe --allow-run-as-root \
  -mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" \
  -n 16 -N 8 \
  -H node1,node2 \
  -bind-to socket -map-by slot \
  -mca pml ob1 -mca btl ^openib \
  -mca orte_base_help_aggregate 0 \
  -mca btl_tcp_if_include eth0 \
  -mca coll_hcoll_enable 0 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_IB_DISABLE=0 \
  -x NCCL_NET_GDR_LEVEL=2 \
  -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1 \
  -x NCCL_IB_GID_INDEX=3 \
  ~/nccl-tests/build/all_reduce_perf \
  -b 256M -e 8G -f 2 -g 1 -c 1 -n 100
18. Common Component Overview
NVIDIA Components
- GPU Driver: Drives the GPU hardware
- CUDA Toolkit: General-purpose GPU computing toolchain
- cuDNN: Deep learning GPU acceleration library
OpenMPI
MPI implementation for multi-node communication.
NCCL
NVIDIA Collective Communications Library for efficient multi-GPU / multi-node communication.
NCCL Tests
NCCL performance and correctness benchmark suite.
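When comparing runs, the figure most often quoted from an NCCL Tests run is the average bus bandwidth in its summary. As a sketch, it can be pulled out of a saved log. This assumes the summary contains a line of the form `# Avg bus bandwidth : <value>`; `avg_busbw` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: extract the average bus bandwidth from a saved nccl-tests log.
# Usage: ./busbw.sh allreduce.log
# ASSUMPTION: the log contains a line like "# Avg bus bandwidth    : <value>".

# avg_busbw: read a log on stdin and print only the numeric value.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}

LOG="${1:-}"
if [ -n "$LOG" ] && [ -r "$LOG" ]; then
  echo "avg bus bandwidth (GB/s): $(avg_busbw < "$LOG")"
fi
```

A log can be captured by piping the benchmark through `tee`, for example `./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 | tee allreduce.log`.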
FabricManager
Manages NVSwitch / NVLink topology; required for HGX servers.
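To tie the components together, a quick presence check can be run on each node after installation. This is only a sketch: it tests whether the commands from the earlier verification steps are on PATH and whether the FabricManager service is active, not whether each component is configured correctly. `have` and `report` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: report which components from this guide are visible on this node.

# have CMD: succeed when CMD is found on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# report CMD: print a one-line status for CMD.
report() {
  if have "$1"; then
    echo "found    $1"
  else
    echo "missing  $1"
  fi
}

# Driver, CUDA Toolkit, InfiniBand tools, OpenMPI.
for tool in nvidia-smi nvcc ibstatus mpiexec; do
  report "$tool"
done

# FabricManager is a systemd service rather than a command on PATH.
if have systemctl; then
  if systemctl is-active --quiet nvidia-fabricmanager; then
    echo "active   nvidia-fabricmanager"
  else
    echo "inactive nvidia-fabricmanager"
  fi
fi
```

Running the same script on every node before a multi-node benchmark makes it easier to spot a host where one step was skipped.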