1078 字

5 分钟

NVIDIA A800-SXM4-80GB 8卡基准测试

2025-10-28

GPU服务器

背景介绍#

NVIDIA A800-SXM4-80GB 是 NVIDIA 推出的一款高性能 GPU 服务器，采用 NVIDIA Ampere 架构，拥有 80GB 的显存，适用于高性能计算、人工智能、机器学习等场景。最近有个项目，需要测试一下A800-SXM4-80GB的集群性能，所以就有了这个测试。

测试环境#

服务器：NVIDIA A800-SXM4-80GB
操作系统：Ubuntu 22.04
显卡：NVIDIA A800-SXM4-80GB * 8
显存：80GB * 8 = 640GB
处理器：2 * Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz
内存：956GB DDR4-3200MT/s

测试项目#

GPU Memory Bandwidth: Measure memory allocation and bandwidth across multiple GPUs.
GPU to CPU Transfer: Test PCIe transfer speeds between GPU and CPU.
GPU to GPU Transfer: Evaluate data transfer rates between GPUs.
Disk I/O: Benchmark read/write performance of the system’s storage.
Computationally Intensive Tasks: Run deep learning models and synthetic tasks to test compute performance.
Model Inference: Benchmark common AI models like ResNet, BERT, GPT-2 for inference throughput and latency.
CPU Performance: Evaluate both single-threaded and multi-threaded CPU performance.
Memory Bandwidth: Measure system memory performance.
Tensor Core Performance: Benchmark GPU Tensor Core capabilities.
System Overview Snapshot: Capture OS, CPU, GPU telemetry, storage, and environment metadata for reproducible benchmarking.

要求#

系统要求#

操作系统：Ubuntu 22.04/24.04 或 Rocky/Alma Linux 9
磁盘空间：至少 10GB 的可用磁盘空间用于基准测试操作。
fio：灵活的 I/O 测试器，用于磁盘 I/O 基准测试。
nvidia-smi：NVIDIA 系统管理接口，用于 GPU 监控（通常与 CUDA 一起安装）。
CUDA 库：GPU 操作所需（与 CUDA 工具包一起安装）。

Python 依赖项#

torch：用于深度学习操作的 PyTorch 框架。
numpy：用于数值运算。
psutil：用于系统和进程实用程序。
GPUtil：监控 GPU 使用情况。
tabulate：用于将输出格式化为表格。
transformers：适用于 BERT 和 GPT 推理等 Transformer 模型。
torchvision：用于ResNet和其他图像相关任务。

命令行选项#

常规选项#

--json：以 JSON 格式输出结果。
--detailed-output：显示详细的基准测试结果，并打印扩展的系统概览（磁盘分区、网络链接、环境变量）。
--num-iterations N：运行基准测试的次数（默认值：1）。
--log-gpu：在基准测试期间启用 GPU 日志记录。
--gpu-log-file FILE：指定 GPU 日志文件名（默认值：gpu_log.csv）。
--gpu-log-metrics METRICS：要记录的 GPU 指标的逗号分隔列表。
--gpus GPU_IDS：要使用的 GPU ID 的逗号分隔列表（例如 0,1,2,3）。
--precision {fp16,fp32,fp64,bf16}：用于计算的精度（默认值：fp16）。

GPU 基准测试#

--gpu-data-gen：运行 GPU 数据生成基准测试。
--gpu-to-cpu-transfer：运行 GPU 到 CPU 传输基准测试。
--gpu-to-gpu-transfer：运行 GPU 到 GPU 传输基准测试。
--gpu-memory-bandwidth：运行 GPU 内存带宽基准测试。
--gpu-tensor：运行 GPU Tensor Core 性能基准测试。
--gpu-compute：运行 GPU 计算任务基准测试。
--gpu-data-size-gb N：GPU 基准测试的数据大小（GB）（默认值：5.0）。
--gpu-memory-size-gb N：GPU 内存带宽基准的内存大小（GB）（默认值：5.0）。
--gpu-tensor-matrix-size N：GPU Tensor Core 基准的矩阵大小（默认值：4096）。
--gpu-tensor-iterations N：GPU Tensor Core 基准测试迭代次数（默认值：1000）。
--gpu-comp-epochs N：GPU 计算任务的周期数（默认值：200）。
--gpu-comp-batch-size N：GPU 计算任务的批量大小（默认值：2048）。
--gpu-comp-input-size N：GPU 计算任务的输入大小（默认值：4096）。
--gpu-comp-hidden-size N：GPU 计算任务的隐藏层大小（默认值：4096）。
--gpu-comp-output-size N：GPU 计算任务的输出大小（默认值：2000）。

GPU 推理基准#

--gpu-inference：运行 GPU 推理吞吐量和延迟基准测试。
--gpu-inference-model {custom,resnet50,bert,gpt2}：选择基准测试的模型（默认值：custom）。
--model-size N：自定义推理模型的深度（默认值：5）。
--batch-size N：推理基准的批量大小（默认值：256）。
--input-size N：推理基准的输入特征大小（默认值：224）。
--output-size N：推理基准的输出维度（默认值：1000）。
--iterations N：执行的推理迭代次数（默认值：100）。

CPU 基准测试#

--cpu-single-thread：运行 CPU 单线程性能基准测试。
--cpu-multi-thread：运行 CPU 多线程性能基准测试。
--cpu-to-disk-write：运行 CPU 到磁盘写入吞吐量基准测试。
--memory-bandwidth：运行内存带宽基准测试。
--cpu-num-threads N：用于多线程 CPU 基准测试的线程数（默认值：所有逻辑核心）。
--data-size-gb-cpu N：CPU 到磁盘写入基准的数据大小（GB）（默认值：5.0）。
--memory-size-mb-cpu N：CPU 内存带宽基准的内存大小（MB）（默认值：1024）。

磁盘 I/O 基准测试#

--disk-io：运行磁盘 I/O 基准测试（使用 fio）。
--disk-data-size N：磁盘 I/O 基准的数据大小（GB）（默认值：2.0）。
--disk-block-size N：磁盘 I/O 基准的块大小（KB）（默认值：4）。
--disk-io-depth N：磁盘 I/O 基准的队列深度（默认值：16）。
--disk-num-jobs N：运行的并发作业数（默认值：8）。
--disk-path PATH：磁盘基准临时文件的目标目录（默认值：当前目录）。

测试结果#

64310c2cdb2bc6c16c3e91092dfdabf6

8bc4761afad960954555a2a0527261ee

NVIDIA A800-SXM4-80GB 8卡基准测试

https://catcat.blog/nvidia-a800-sxm4-80gb-benchmark.html

作者

猫猫博客

发布于

2025-10-28

许可协议

CC BY-NC-SA 4.0

ByteCook · 字节小炒 —— 为开发者和自托管爱好者准备的社区

GoMami 香港 AMD-Turin-9575F 测试