Complete Installation Guide for an Ubuntu NVIDIA GPU + CUDA + NCCL + OFED + OpenMPI High-Performance Computing Environment
This guide applies to HGX A800 / Tesla series NVIDIA data center GPUs.
1. Official resource links
NVIDIA driver / CUDA
- NVIDIA driver and CUDA compatibility matrix: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
- Driver download: https://www.nvidia.cn/Download/index.aspx?lang=cn
  - Product type: Data Center / Tesla
  - Series: HGX-Series
  - Product: HGX-A800
  - OS: Linux 64-bit
- Offline installers for each CUDA version: https://developer.nvidia.com/cuda-toolkit-archive
- NVIDIA Ubuntu repository (FabricManager / driver packages): https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/
InfiniBand driver
Deployment reference
- Volcengine A800 GPU cluster deployment reference: https://www.volcengine.com/docs/6419/1123374
2. Basic system configuration
2.1 Update the software sources
    bash <(curl -sSL https://linuxmirrors.cn/main.sh)
2.2 Set the hostname
    hostnamectl set-hostname ubuntu-2204
2.3 Disable Nouveau
    sudo bash -c "echo blacklist nouveau >> /etc/modprobe.d/blacklist.conf"
    sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist.conf"
    sudo update-initramfs -u
    reboot
Verify after the reboot (no output means Nouveau is disabled):
    lsmod | grep nouveau
3. Graphical environment and dependencies
(Optional)
    sudo apt update
    sudo apt install lightdm
    sudo apt install gcc make dkms
List the available drivers:
    apt install ubuntu-drivers-common
    ubuntu-drivers devices
4. Install the NVIDIA driver
Example for HGX A800 with the 550 driver:
    wget https://cn.download.nvidia.cn/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run
    chmod +x NVIDIA-Linux-x86_64-550.54.15.run
    sudo ./NVIDIA-Linux-x86_64-550.54.15.run \
        -no-x-check -no-nouveau-check -no-opengl-files
Verify:
    nvidia-smi
5. Uninstall the NVIDIA driver
    sudo apt-get --purge remove "nvidia*"
    sudo apt autoremove
    sudo apt-get --purge remove "*nvidia*"
    nvidia-uninstall
    dpkg -l | grep nvidia
    reboot
6. Install CUDA Toolkit
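Example (CUDA 12.4):
    wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
    sudo sh cuda_12.4.0_550.54.14_linux.run
In the installer, choose:
- Accept the license agreement
- Deselect the driver (it was already installed in section 4)
- Install CUDA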
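Because the driver is deselected here, the toolkit relies on the driver installed in section 4. The runfile name embeds the driver version it bundles (550.54.14 in this example), so a rough sanity check is to confirm that the installed driver is at least that new. The following is a minimal sketch, assuming GNU `sort -V` is available; `bundled_driver_version` and `version_ge` are hypothetical helper names, not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of any NVIDIA tool).

# Print the driver version embedded in a CUDA runfile name,
# e.g. cuda_12.4.0_550.54.14_linux.run -> 550.54.14
bundled_driver_version() {
    local name="${1##*/}"        # drop any directory part
    name="${name#cuda_*_}"       # drop "cuda_<toolkit version>_"
    printf '%s\n' "${name%_linux.run}"
}

# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a machine with the driver installed:
#   installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   needed=$(bundled_driver_version cuda_12.4.0_550.54.14_linux.run)
#   version_ge "$installed" "$needed" && echo "driver OK" || echo "driver older than bundled"
```

With the 550.54.15 driver from section 4, the comparison succeeds. The compatibility matrix linked in section 1 remains the authoritative reference.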
7. Configure the CUDA environment variables
    echo 'export PATH=$PATH:/usr/local/cuda-12.4/bin' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64' >> ~/.bashrc
    source ~/.bashrc
Verify:
    nvcc -V
8. Uninstall CUDA
    cd /usr/local/cuda-12.4/bin
    sudo ./cuda-uninstaller
    sudo rm -rf /usr/local/cuda-12.4
9. Install the IB driver (MLNX OFED)
Example for Ubuntu 22.04:
    wget https://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
    tar -zxvf MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
    cd MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64
    sudo ./mlnxofedinstall
    sudo /etc/init.d/openibd restart
Verify:
    ibdev2netdev
    ibstatus
If these tools are not installed:
    apt install infiniband-diags
10. Check the key kernel modules
    modprobe nvidia_peermem
    apt-get install nvidia-modprobe
    modprobe nvidia_peermem
    lsmod | grep nvidia
Modules expected in the output:
    nvidia_peermem
    nvidia_drm
    nvidia_modeset
    nvidia
11. Install OpenMPI
    apt install -y gcc g++ make hwloc hwloc-nox libevent-dev
    wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
    tar zxf openmpi-4.1.5.tar.gz
    cd openmpi-4.1.5
    ./configure --prefix=/usr/local/openmpi
    make -j$(nproc)
    make install
Environment variables:
    echo 'export PATH=$PATH:/usr/local/openmpi/bin' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/openmpi/lib' >> ~/.bashrc
    source ~/.bashrc
Verify:
    mpiexec --version
12. Configure the IB NICs (Netplan)
Example (multiple IB cards):
    network:
      version: 2
      ethernets:
        eth0:
          dhcp4: true
          match:
            macaddress: 52:54:00:13:3f:6d
          set-name: eth0
        ib0:
          dhcp4: no
          addresses: [10.1.12.1/24]
        ib1:
          dhcp4: no
          addresses: [10.1.12.2/24]
        ib2:
          dhcp4: no
          addresses: [10.1.12.3/24]
        ib3:
          dhcp4: no
          addresses: [10.1.12.4/24]
Apply:
    netplan apply
13. Install NCCL
13.1 NVIDIA repository keyring
Ubuntu 22.04:
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
    sudo dpkg -i cuda-keyring_1.0-1_all.deb
    sudo apt update
13.2 Install a specific NCCL version
    sudo apt install libnccl2=2.20.5-1+cuda12.4 \
        libnccl-dev=2.20.5-1+cuda12.4
13.3 Environment variables
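The NCCL_IB_HCA value set in this section is a comma-separated list of HCA names, each followed by a `:port` suffix. Device names differ between machines, so it can help to derive the list from what the host actually exposes before editing the profile. The following is a minimal sketch, assuming the HCAs appear under /sys/class/infiniband and that port 1 is the one in use; `build_ib_hca` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn HCA names (one per line on stdin) into an
# NCCL_IB_HCA value, appending ":<port>" to each name (port 1 by default).
build_ib_hca() {
    local port="${1:-1}"
    awk -v port="$port" '
        NF  { printf "%s%s:%s", sep, $1, port; sep = "," }
        END { print "" }'
}

# Usage on a host with the OFED stack installed:
#   ls /sys/class/infiniband | build_ib_hca
# Review the result before exporting it: not every HCA on a node is
# necessarily meant to carry NCCL traffic.
```

The output is only a starting point. This guide's own values pick a subset of the devices, which suggests the remaining HCAs serve other networks.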
    vim /etc/profile
Add:
    export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_6:1,mlx5_7:1
    export NCCL_IB_DISABLE=0
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_IB_GID_INDEX=3
    export NCCL_NET_GDR_LEVEL=2
    export NCCL_DEBUG=INFO
Load the settings:
    source /etc/profile
Verify that the library is installed:
    ldconfig -v | grep "libnccl.so"
14. Install NCCL Tests
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd nccl-tests
    make MPI=1 MPI_HOME=/usr/local/openmpi
15. Install FabricManager
Example for Ubuntu 22.04:
    wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-dev-550_550.54.15-1_amd64.deb
    wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-550_550.54.15-1_amd64.deb
    sudo dpkg -i nvidia-fabricmanager-550_550.54.15-1_amd64.deb
    sudo dpkg -i nvidia-fabricmanager-dev-550_550.54.15-1_amd64.deb
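FabricManager is released in lockstep with the driver, and the service generally does not start when its version differs from the loaded driver (both are 550.54.15 in this guide). A quick comparison before starting the service can save a debugging round. The following is a sketch only; `fm_deb_version` is a hypothetical helper that parses the package file name and does not query dpkg.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the upstream version from a FabricManager
# .deb file name, e.g.
#   nvidia-fabricmanager-550_550.54.15-1_amd64.deb -> 550.54.15
fm_deb_version() {
    local name="${1##*/}"          # drop any directory part
    name="${name#*_}"              # drop the package name up to the first "_"
    printf '%s\n' "${name%%-*}"    # drop "-<revision>_<arch>.deb"
}

# Usage on a machine with the driver installed:
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   fm=$(fm_deb_version nvidia-fabricmanager-550_550.54.15-1_amd64.deb)
#   [ "$driver" = "$fm" ] && echo "versions match" || echo "mismatch: driver=$driver fm=$fm"
```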
    systemctl start nvidia-fabricmanager
    systemctl enable nvidia-fabricmanager
    systemctl status nvidia-fabricmanager
16. Single-node NCCL test
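    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
17. Multi-node testing
17.1 Passwordless SSH
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    chmod 600 ~/.ssh/id_rsa
Then append ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on every node, so that the node running mpirun can log in to the others without a password.
17.2 NCCL + MPI multi-node test example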
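In the mpirun command for this section, -N 8 starts one rank per GPU on each node (the test itself runs with -g 1), and -n 16 is the total across the two hosts passed to -H. The following small sketch derives these three arguments from a host list, assuming every node has the same number of GPUs; `mpi_rank_args` is a hypothetical helper, and node1 and node2 are the placeholder host names used in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the -n/-N/-H arguments for mpirun, given the
# number of GPUs per node followed by the node names.
mpi_rank_args() {
    local per_node="$1"
    shift
    local total=$(( per_node * $# ))     # ranks = GPUs per node x node count
    local hosts
    hosts=$(IFS=,; printf '%s' "$*")     # join node names with commas
    printf -- '-n %d -N %d -H %s\n' "$total" "$per_node" "$hosts"
}

# Usage:
#   mpi_rank_args 8 node1 node2
#   prints: -n 16 -N 8 -H node1,node2
```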
    mpirun --oversubscribe --allow-run-as-root \
        -mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" \
        -n 16 -N 8 \
        -H node1,node2 \
        -bind-to socket -map-by slot \
        -mca pml ob1 -mca btl ^openib \
        -mca orte_base_help_aggregate 0 \
        -mca btl_tcp_if_include eth0 \
        -mca coll_hcoll_enable 0 \
        -x NCCL_DEBUG=INFO \
        -x NCCL_SOCKET_IFNAME=eth0 \
        -x NCCL_IB_DISABLE=0 \
        -x NCCL_NET_GDR_LEVEL=2 \
        -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1 \
        -x NCCL_IB_GID_INDEX=3 \
        ~/nccl-tests/build/all_reduce_perf \
        -b 256M -e 8G -f 2 -g 1 -c 1 -n 100
18. Common components
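NVIDIA components
- GPU Driver: lets the operating system drive the GPU
- CUDA Toolkit: the toolchain for general-purpose GPU computing
- cuDNN: GPU-accelerated library for deep learning
OpenMPI
An MPI implementation used for multi-node communication.
NCCL
NVIDIA's collective communication library, providing efficient multi-GPU and multi-node communication.
NCCL Tests
A tool for testing NCCL performance and correctness.
FabricManager
Manages the NVSwitch / NVLink topology; required on HGX servers.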
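To use the NCCL Tests runs from sections 16 and 17 as a pass/fail check on the whole stack, the summary line that all_reduce_perf prints at the end (`# Avg bus bandwidth : <value>`, in GB/s) can be compared against a threshold. The following is a minimal sketch, assuming that summary format; `avg_busbw` and `busbw_at_least` are hypothetical helpers, and the threshold in the usage comment is a placeholder to be chosen per cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking nccl-tests output.

# Print the value from the "# Avg bus bandwidth" summary line read on stdin.
avg_busbw() {
    awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Succeed when the measured value $1 is at least the threshold $2.
busbw_at_least() {
    awk -v got="$1" -v want="$2" 'BEGIN { exit !(got + 0 >= want + 0) }'
}

# Usage (100 is a placeholder threshold, not a recommended value):
#   ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 | tee run.log
#   busbw_at_least "$(avg_busbw < run.log)" 100 && echo PASS || echo FAIL
```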
https://catcat.blog/2025/12/ubuntu-nvidia-gpu-cuda-nccl-ofed-openmpi-installation-guide.html