
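Ubuntu NVIDIA GPU + CUDA + NCCL + OFED + OpenMPI: Complete Installation Guide for a High-Performance Computing Environment

This guide applies to NVIDIA data center GPUs such as the HGX A800 and the Tesla series.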


1. Official Resource Links

NVIDIA Driver / CUDA

InfiniBand Driver

Deployment Reference Documentation


2. Basic System Configuration

2.1 Update the Package Sources

Terminal window
bash <(curl -sSL https://linuxmirrors.cn/main.sh)

2.2 Set the Hostname

Terminal window
sudo hostnamectl set-hostname ubuntu-2204

2.3 Disable Nouveau

Terminal window
sudo bash -c "echo blacklist nouveau >> /etc/modprobe.d/blacklist.conf"
sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist.conf"
sudo update-initramfs -u
sudo reboot

Verify after the reboot (no output means Nouveau is disabled):

Terminal window
lsmod | grep nouveau
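
For provisioning scripts, the same check can be wrapped in a small helper. This is a minimal sketch: the function name module_loaded is a hypothetical choice, and it reads lsmod-style text from stdin so it can also be tried on sample input.

```shell
# module_loaded NAME: succeeds if NAME appears in the first column of
# lsmod-style text read from stdin. (Hypothetical helper name.)
module_loaded() {
  awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Usage on a live system:
#   if lsmod | module_loaded nouveau; then
#     echo "nouveau is still loaded" >&2
#   fi
```

Matching on the exact module name avoids false positives from modules whose names merely contain the string.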

3. Graphical Environment and Dependencies

(Optional)

Terminal window
sudo apt update
sudo apt install lightdm
sudo apt install gcc make dkms

List the available drivers:

Terminal window
sudo apt install ubuntu-drivers-common
ubuntu-drivers devices

4. Install the NVIDIA Driver

Using the HGX A800 with the 550 driver as an example:

Terminal window
wget https://cn.download.nvidia.cn/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run
chmod +x NVIDIA-Linux-x86_64-550.54.15.run
sudo ./NVIDIA-Linux-x86_64-550.54.15.run \
  --no-x-check --no-nouveau-check --no-opengl-files

Verify:

Terminal window
nvidia-smi
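
For a scripted check, the nvidia-smi query interface can print one driver version per GPU. The sketch below assumes the 550.54.15 version from the example above; the helper name all_lines_equal is hypothetical.

```shell
# all_lines_equal EXPECTED: succeeds if stdin is non-empty and every line
# (ignoring trailing whitespace) equals EXPECTED. (Hypothetical helper name.)
all_lines_equal() {
  awk -v want="$1" '
    { gsub(/[ \t\r]+$/, ""); n++; if ($0 != want) bad = 1 }
    END { exit (n == 0 || bad) }'
}

# Only runs the live check when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  if nvidia-smi --query-gpu=driver_version --format=csv,noheader \
      | all_lines_equal "550.54.15"; then
    echo "driver version OK"
  else
    echo "driver version mismatch or no GPU reported" >&2
  fi
fi
```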

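5. Uninstall the NVIDIA Driver

Terminal window
# Remove driver packages installed through apt (quote the glob so the shell does not expand it)
sudo apt-get --purge remove "*nvidia*"
sudo apt autoremove
# Remove a driver installed from the .run file
sudo nvidia-uninstall
# Confirm that no NVIDIA packages remain
dpkg -l | grep nvidia
sudo reboot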

6. Install the CUDA Toolkit

Example (CUDA 12.4):

Terminal window
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run

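During installation, choose:

  • Accept the license agreement
  • Deselect the driver (it was already installed in step 4)
  • Install the CUDA Toolkit

7. Configure CUDA Environment Variables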

Terminal window
echo 'export PATH=$PATH:/usr/local/cuda-12.4/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64' >> ~/.bashrc
source ~/.bashrc

Verify:

Terminal window
nvcc -V
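
To confirm in a script that the shell picks up the intended toolkit, the release number can be extracted from the nvcc output. A minimal sketch, assuming the output contains a "release X.Y" field; the helper name cuda_release is hypothetical.

```shell
# cuda_release: print the X.Y number that follows "release" in
# `nvcc -V` output read from stdin. (Hypothetical helper name.)
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Only runs the live check when nvcc is on PATH.
if command -v nvcc >/dev/null 2>&1; then
  found=$(nvcc -V | cuda_release)
  if [ "$found" = "12.4" ]; then
    echo "CUDA $found is active"
  else
    echo "expected CUDA 12.4, found: ${found:-none}" >&2
  fi
fi
```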

8. Uninstall CUDA

Terminal window
# Runfile installations of CUDA 12.x ship cuda-uninstaller in the toolkit's bin directory
sudo /usr/local/cuda-12.4/bin/cuda-uninstaller
sudo rm -rf /usr/local/cuda-12.4

9. Install the IB Driver (MLNX_OFED)

Example for Ubuntu 22.04:

Terminal window
wget https://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
tar -zxvf MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
cd MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64
sudo ./mlnxofedinstall
sudo /etc/init.d/openibd restart

Verify:

Terminal window
ibdev2netdev
ibstatus

If the diagnostic tools are not installed:

Terminal window
sudo apt install infiniband-diags
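
On nodes with several adapters it is easy to miss a port that is down. The sketch below lists such ports, assuming ibdev2netdev prints lines of the form "mlx5_0 port 1 ==> ib0 (Up)"; the helper name ib_ports_not_up is hypothetical.

```shell
# ib_ports_not_up: read ibdev2netdev output from stdin and print
# DEVICE:PORT for every line whose state is not "(Up)".
# (Hypothetical helper name; assumes the line format described above.)
ib_ports_not_up() {
  awk 'NF >= 6 && $NF != "(Up)" { print $1 ":" $3 }'
}

# Usage on a live system:
#   ibdev2netdev | ib_ports_not_up
```

Empty output means every listed port is up.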

10. Check Key Kernel Modules

Terminal window
sudo modprobe nvidia_peermem
sudo apt-get install nvidia-modprobe
sudo modprobe nvidia_peermem
lsmod | grep nvidia

Example of expected output:

nvidia_peermem
nvidia_drm
nvidia_modeset
nvidia
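
The four modules listed above can also be checked in one pass. A minimal sketch, with the hypothetical helper name missing_modules; it prints every required module that does not appear in lsmod-style input.

```shell
# missing_modules MOD...: read lsmod-style text from stdin and print each
# requested module that is absent from the first column.
# (Hypothetical helper name.)
missing_modules() {
  loaded=$(awk '{ print $1 }')
  for mod in "$@"; do
    printf '%s\n' "$loaded" | grep -qx -- "$mod" || echo "$mod"
  done
}

# Usage on a live system:
#   lsmod | missing_modules nvidia nvidia_modeset nvidia_drm nvidia_peermem
```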

11. Install OpenMPI

Terminal window
sudo apt install -y gcc g++ make hwloc hwloc-nox libevent-dev
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
tar zxf openmpi-4.1.5.tar.gz
cd openmpi-4.1.5
./configure --prefix=/usr/local/openmpi
make -j$(nproc)
sudo make install

Environment variables:

Terminal window
echo 'export PATH=$PATH:/usr/local/openmpi/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/openmpi/lib' >> ~/.bashrc
source ~/.bashrc

Verify:

Terminal window
mpiexec --version

12. Configure the IB Interfaces (Netplan)

Example (multiple IB adapters):

network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
      match:
        macaddress: 52:54:00:13:3f:6d
      set-name: eth0
    ib0:
      dhcp4: no
      addresses: [10.1.12.1/24]
    ib1:
      dhcp4: no
      addresses: [10.1.12.2/24]
    ib2:
      dhcp4: no
      addresses: [10.1.12.3/24]
    ib3:
      dhcp4: no
      addresses: [10.1.12.4/24]

Apply:

Terminal window
sudo netplan apply
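
When many nodes need the same layout, the IB stanzas can be generated rather than typed by hand. The sketch below only illustrates the pattern used in the example above (interfaces ib0, ib1, ... with consecutive /24 addresses); the function name gen_ib_netplan and its arguments are hypothetical, and the eth0 stanza is deliberately left out.

```shell
# gen_ib_netplan PREFIX COUNT: print a Netplan document with COUNT IPoIB
# interfaces, where ibN gets the address PREFIX.(N+1)/24.
# (Hypothetical helper; does not emit the eth0 stanza.)
gen_ib_netplan() {
  prefix=$1
  count=$2
  printf 'network:\n  version: 2\n  ethernets:\n'
  i=0
  while [ "$i" -lt "$count" ]; do
    printf '    ib%d:\n      dhcp4: no\n      addresses: [%s.%d/24]\n' \
      "$i" "$prefix" "$((i + 1))"
    i=$((i + 1))
  done
}

# Usage: review the output before writing it under /etc/netplan/
#   gen_ib_netplan 10.1.12 4
```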

13. Install NCCL

13.1 NVIDIA Repository Keyring

Ubuntu 22.04:

Terminal window
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update

13.2 Install a Specific NCCL Version

Terminal window
sudo apt install libnccl2=2.20.5-1+cuda12.4 \
  libnccl-dev=2.20.5-1+cuda12.4

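13.3 Environment Variables

Terminal window
sudo vim /etc/profile

Add the following lines. The export keyword is needed so that child processes inherit the variables, and the NCCL_IB_HCA device list should match the devices reported by ibdev2netdev on your system:

export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_6:1,mlx5_7:1
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_LEVEL=2
export NCCL_DEBUG=INFO

Load the settings:

Terminal window
source /etc/profile

Verify that the NCCL library is registered with the dynamic linker:

Terminal window
ldconfig -p | grep libnccl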
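
The NCCL_IB_HCA value differs from machine to machine, so it can help to derive it from the ibdev2netdev output instead of copying it. A minimal sketch, assuming lines of the form "mlx5_0 port 1 ==> ib0 (Up)" and assuming that only ports mapped to ib* interfaces in the Up state should be used; the function name is hypothetical.

```shell
# nccl_ib_hca_from_ibdev2netdev: read ibdev2netdev output from stdin and
# print a comma-separated DEVICE:PORT list of Up ports bound to ib* interfaces.
# (Hypothetical helper name; the selection rule is an assumption.)
nccl_ib_hca_from_ibdev2netdev() {
  awk '$5 ~ /^ib/ && $NF == "(Up)" {
         printf "%s%s:%s", sep, $1, $3
         sep = ","
       }
       END { print "" }'
}

# Usage on a live system:
#   export NCCL_IB_HCA=$(ibdev2netdev | nccl_ib_hca_from_ibdev2netdev)
```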

14. Install NCCL Tests

Terminal window
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/openmpi

15. Install FabricManager

Example for Ubuntu 22.04

Terminal window
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-dev-550_550.54.15-1_amd64.deb
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-550_550.54.15-1_amd64.deb
sudo dpkg -i nvidia-fabricmanager-550_550.54.15-1_amd64.deb
sudo dpkg -i nvidia-fabricmanager-dev-550_550.54.15-1_amd64.deb
sudo systemctl start nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager
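
The example above uses the same 550.54.15 version for FabricManager and the driver, and the two are expected to match. A minimal sketch of that comparison follows; the helper name fm_matches_driver is hypothetical, and it assumes the package version is the driver version followed by a "-N" Debian revision.

```shell
# fm_matches_driver DRIVER_VERSION PACKAGE_VERSION: succeeds if the package
# version, with its "-N" revision suffix removed, equals the driver version.
# (Hypothetical helper name.)
fm_matches_driver() {
  [ "$1" = "${2%%-*}" ]
}

# Usage on a live system:
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   pkg=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-550)
#   fm_matches_driver "$drv" "$pkg" || echo "FabricManager/driver mismatch" >&2
```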

16. Single-Node NCCL Test

Terminal window
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
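
nccl-tests ends each run with a summary line that starts with "# Avg bus bandwidth". For comparing runs, that number can be pulled out of the output. A minimal sketch with the hypothetical helper name avg_busbw:

```shell
# avg_busbw: read nccl-tests output from stdin and print the value from the
# "# Avg bus bandwidth : N" summary line. (Hypothetical helper name.)
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage on a live system, keeping the full log as well:
#   ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 | tee allreduce.log | avg_busbw
```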

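17. Multi-Node Test

17.1 Passwordless SSH

Terminal window
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
# ssh-copy-id appends the public key; copying it over authorized_keys with scp would overwrite existing keys
ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]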

17.2 NCCL + MPI Multi-Node Test Example

Terminal window
mpirun --oversubscribe --allow-run-as-root \
  -mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" \
  -n 16 -N 8 \
  -H node1,node2 \
  -bind-to socket -map-by slot \
  -mca pml ob1 -mca btl ^openib \
  -mca orte_base_help_aggregate 0 \
  -mca btl_tcp_if_include eth0 \
  -mca coll_hcoll_enable 0 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_IB_DISABLE=0 \
  -x NCCL_NET_GDR_LEVEL=2 \
  -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1 \
  -x NCCL_IB_GID_INDEX=3 \
  ~/nccl-tests/build/all_reduce_perf \
  -b 256M -e 8G -f 2 -g 1 -c 1 -n 100
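
When the node list grows, the host argument and the process count can be computed instead of edited by hand. The sketch below builds a host:slots list, a form that Open MPI's -H option accepts; the helper name host_list is hypothetical, and 8 slots per node is assumed to mirror the 8 processes per node in the command above.

```shell
# host_list SLOTS HOST...: print "HOST:SLOTS,HOST:SLOTS,..." for use with
# mpirun -H. (Hypothetical helper name.)
host_list() {
  slots=$1
  shift
  out=""
  for h in "$@"; do
    out="${out:+$out,}$h:$slots"
  done
  printf '%s\n' "$out"
}

# Usage:
#   hosts=$(host_list 8 node1 node2)     # node1:8,node2:8
#   mpirun -n 16 -H "$hosts" ... ~/nccl-tests/build/all_reduce_perf ...
```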

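18. Common Components

NVIDIA Components

  • GPU Driver: the kernel driver that lets the operating system use the GPU
  • CUDA Toolkit: the toolchain for general-purpose GPU computing
  • cuDNN: a GPU-accelerated library for deep learning

OpenMPI

An MPI implementation used for multi-node communication.

NCCL

NVIDIA's collective communication library for efficient multi-GPU and multi-node communication.

NCCL Tests

Tools for testing NCCL performance and correctness.

FabricManager

Manages the NVSwitch / NVLink topology; required on HGX servers.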


Source: https://catcat.blog/2025/12/ubuntu-nvidia-gpu-cuda-nccl-ofed-openmpi-installation-guide.html
Author: 猫猫博客
Published: 2025-12-08
License: CC BY-NC-SA 4.0