
NVIDIA A800-SXM4-80GB 8-GPU Benchmark

Background#

The NVIDIA A800-SXM4-80GB is a data-center GPU based on the NVIDIA Ampere architecture, equipped with 80GB of HBM2e memory per GPU. It is the export variant of the A100, with NVLink bandwidth reduced from 600GB/s to 400GB/s, and targets high-performance computing, AI, and machine-learning workloads. I recently had a project that required benchmarking an 8-GPU A800-SXM4-80GB server, which led to the tests summarized here.

Benchmark Environment#

  • Server: 8-GPU NVIDIA A800-SXM4-80GB system
  • Operating System: Ubuntu 22.04
  • GPUs: NVIDIA A800-SXM4-80GB × 8
  • GPU Memory: 80GB × 8 = 640GB
  • CPU: 2 × Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz
  • System Memory: 956GB DDR4-3200MT/s

Benchmarks#

  • GPU Memory Bandwidth: Measure memory allocation performance and bandwidth across multiple GPUs.
  • GPU to CPU Transfer: Test PCIe transfer speeds between GPU and CPU.
  • GPU to GPU Transfer: Evaluate inter-GPU data transfer rates (a sketch of this measurement follows the list).
  • Disk I/O: Benchmark read/write performance of the system storage.
  • Computationally Intensive Tasks: Run deep learning models and synthetic workloads to test compute performance.
  • Model Inference: Benchmark common AI models such as ResNet, BERT, and GPT-2 for inference throughput and latency.
  • CPU Performance: Evaluate both single-threaded and multi-threaded CPU performance.
  • Memory Bandwidth: Measure system memory performance.
  • Tensor Core Performance: Benchmark GPU Tensor Core capabilities.
  • System Overview Snapshot: Capture OS, CPU, GPU telemetry, storage, and environment metadata for reproducible benchmarking.
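
The post doesn't ship the benchmark source, but the transfer tests above are straightforward to sketch in PyTorch. Below is a minimal, illustrative version of the GPU-to-GPU measurement; the function name is my own, and the 5GB default mirrors the --gpu-data-size-gb default described later:

```python
import time

import torch

def p2p_bandwidth_gb_s(src: int, dst: int, size_gb: float = 5.0, repeats: int = 5) -> float:
    """Copy a buffer from GPU `src` to GPU `dst` and return effective GB/s."""
    n = int(size_gb * 1024**3) // 4                    # number of float32 elements
    x = torch.empty(n, dtype=torch.float32, device=f"cuda:{src}")
    _ = x.to(f"cuda:{dst}")                            # warm-up copy
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    t0 = time.perf_counter()
    for _ in range(repeats):
        _ = x.to(f"cuda:{dst}", non_blocking=True)
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    return size_gb * repeats / (time.perf_counter() - t0)

if torch.cuda.device_count() >= 2:
    print(f"GPU0 -> GPU1: {p2p_bandwidth_gb_s(0, 1):.1f} GB/s")
```

On an SXM4 system this copy should go over NVLink, so on the A800 the result is bounded by its 400GB/s NVLink cap rather than the A100's 600GB/s.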

Requirements#

System Requirements#

  • OS: Ubuntu 22.04/24.04 or Rocky/Alma Linux 9
  • Disk Space: At least 10GB of free disk space for benchmarking operations.
  • fio: Flexible I/O tester used for disk I/O benchmarking.
  • nvidia-smi: NVIDIA System Management Interface for GPU monitoring (typically installed with CUDA).
  • CUDA Libraries: Required for GPU operations (installed with the CUDA Toolkit).

Python Dependencies#

  • torch: PyTorch framework for deep learning workloads (an import sanity check follows this list).
  • numpy: For numerical computation.
  • psutil: System and process utilities.
  • GPUtil: Monitor GPU utilization.
  • tabulate: Format output as tables.
  • transformers: For Transformer models such as BERT and GPT used in inference benchmarks.
  • torchvision: For ResNet and other image-related tasks.

Command-Line Options#

General Options#

  • --json: Output results in JSON format.
  • --detailed-output: Show detailed benchmark results and print an extended system overview (disk partitions, network links, environment variables).
  • --num-iterations N: Number of times to run each benchmark (default: 1).
  • --log-gpu: Enable GPU logging during benchmarks (see the nvidia-smi snippet after this list).
  • --gpu-log-file FILE: Specify GPU log filename (default: gpu_log.csv).
  • --gpu-log-metrics METRICS: Comma-separated list of GPU metrics to log.
  • --gpus GPU_IDS: Comma-separated list of GPU IDs to use (e.g., 0,1,2,3).
  • --precision {fp16,fp32,fp64,bf16}: Precision to use for computations (default: fp16).
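
The post doesn't show how --log-gpu is implemented, but equivalent CSV telemetry can be captured directly through nvidia-smi's query interface. The fields below are standard --query-gpu metrics and a plausible choice for --gpu-log-metrics:

```python
import subprocess

# Sample GPU telemetry once per second into gpu_log.csv (Ctrl-C to stop).
# All query fields below are standard `nvidia-smi --query-gpu` metrics.
metrics = "timestamp,index,utilization.gpu,utilization.memory,memory.used,power.draw,temperature.gpu"
with open("gpu_log.csv", "w") as f:
    subprocess.run(
        ["nvidia-smi", f"--query-gpu={metrics}", "--format=csv", "-l", "1"],
        stdout=f,
    )
```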

GPU Benchmarks#

  • --gpu-data-gen: Run GPU data generation benchmark.
  • --gpu-to-cpu-transfer: Run GPU-to-CPU transfer benchmark.
  • --gpu-to-gpu-transfer: Run GPU-to-GPU transfer benchmark.
  • --gpu-memory-bandwidth: Run GPU memory bandwidth benchmark.
  • --gpu-tensor: Run GPU Tensor Core performance benchmark (see the sketch after this list).
  • --gpu-compute: Run GPU compute workload benchmark.
  • --gpu-data-size-gb N: Data size (in GB) for GPU benchmarks (default: 5.0).
  • --gpu-memory-size-gb N: Memory size (in GB) for GPU memory bandwidth benchmark (default: 5.0).
  • --gpu-tensor-matrix-size N: Matrix size for GPU Tensor Core benchmark (default: 4096).
  • --gpu-tensor-iterations N: Number of iterations for GPU Tensor Core benchmark (default: 1000).
  • --gpu-comp-epochs N: Number of epochs for GPU compute workload (default: 200).
  • --gpu-comp-batch-size N: Batch size for GPU compute workload (default: 2048).
  • --gpu-comp-input-size N: Input size for GPU compute workload (default: 4096).
  • --gpu-comp-hidden-size N: Hidden size for GPU compute workload (default: 4096).
  • --gpu-comp-output-size N: Output size for GPU compute workload (default: 2000).
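
As a reference for the Tensor Core flags, here is a minimal sketch of what such a benchmark typically does, using the defaults above (4096×4096 matrices, 1000 iterations, fp16 so the matmul maps onto Ampere Tensor Cores). It is illustrative, not the tool's actual code:

```python
import time

import torch

def tensor_core_tflops(n: int = 4096, iters: int = 1000, device: str = "cuda:0") -> float:
    """Time repeated n x n fp16 matmuls and return achieved TFLOPS."""
    a = torch.randn(n, n, dtype=torch.float16, device=device)
    b = torch.randn(n, n, dtype=torch.float16, device=device)
    for _ in range(10):                              # warm-up
        _ = a @ b
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - t0
    return 2 * n**3 * iters / elapsed / 1e12         # 2*n^3 FLOPs per matmul

print(f"{tensor_core_tflops():.1f} TFLOPS (fp16)")
```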

GPU Inference Benchmarks#

  • --gpu-inference: Run GPU inference throughput and latency benchmarks (see the sketch after this list).
  • --gpu-inference-model {custom,resnet50,bert,gpt2}: Select the model for inference benchmarking (default: custom).
  • --model-size N: Depth of the custom inference model (default: 5).
  • --batch-size N: Batch size for inference benchmarks (default: 256).
  • --input-size N: Input feature size for inference benchmarks (default: 224).
  • --output-size N: Output dimension for inference benchmarks (default: 1000).
  • --iterations N: Number of inference iterations to execute (default: 100).
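
For the resnet50 option, the measurement boils down to timing repeated forward passes. A minimal sketch with the defaults above (batch 256, 224×224 inputs, 100 iterations), using randomly initialized weights since accuracy is irrelevant for a throughput test:

```python
import time

import torch
from torchvision.models import resnet50

device = "cuda:0"
model = resnet50().to(device).eval().half()          # random weights; fp16 inference
x = torch.randn(256, 3, 224, 224, dtype=torch.float16, device=device)

with torch.inference_mode():
    for _ in range(10):                              # warm-up
        model(x)
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    for _ in range(100):                             # --iterations 100
        model(x)
    torch.cuda.synchronize(device)

elapsed = time.perf_counter() - t0
print(f"latency: {elapsed / 100 * 1000:.2f} ms/batch, "
      f"throughput: {256 * 100 / elapsed:.0f} images/s")
```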

CPU Benchmarks#

  • --cpu-single-thread: Run single-threaded CPU performance benchmark.
  • --cpu-multi-thread: Run multi-threaded CPU performance benchmark.
  • --cpu-to-disk-write: Run CPU-to-disk write throughput benchmark.
  • --memory-bandwidth: Run memory bandwidth benchmark (see the sketch after this list).
  • --cpu-num-threads N: Number of threads for multi-threaded CPU benchmarks (default: all logical cores).
  • --data-size-gb-cpu N: Data size (in GB) for CPU-to-disk write benchmark (default: 5.0).
  • --memory-size-mb-cpu N: Memory size (in MB) for CPU memory bandwidth benchmark (default: 1024).
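
The memory-bandwidth test is essentially a timed large copy. A minimal NumPy sketch using the 1024MB default, counting one read plus one write of the buffer per copy:

```python
import time

import numpy as np

size_mb = 1024                                       # --memory-size-mb-cpu default
src = np.random.rand(size_mb * 1024 * 1024 // 8)     # float64 elements -> 1024 MB
dst = np.empty_like(src)

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    np.copyto(dst, src)
elapsed = time.perf_counter() - t0

# Each copy reads the source and writes the destination once.
print(f"{2 * (size_mb / 1024) * reps / elapsed:.1f} GB/s")
```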

Disk I/O Benchmarks#

  • --disk-io: Run disk I/O benchmark using fio (see the wrapper sketch after this list).
  • --disk-data-size N: Data size (in GB) for disk I/O benchmark (default: 2.0).
  • --disk-block-size N: Block size (in KB) for disk I/O benchmark (default: 4).
  • --disk-io-depth N: Queue depth for disk I/O benchmark (default: 16).
  • --disk-num-jobs N: Number of concurrent jobs to run (default: 8).
  • --disk-path PATH: Target directory for temporary files used in the disk benchmark (default: current directory).
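
fio does the heavy lifting here; the tool presumably builds a command like the one below from the flags above. This random-read example parses fio's JSON output; the rw mode and ioengine are my assumptions, while block size, queue depth, job count, and data size mirror the listed defaults:

```python
import json
import subprocess

# Note: fio's --size is per job, so 8 jobs write up to 2GB each.
cmd = [
    "fio", "--name=randread", "--rw=randread",
    "--bs=4k", "--iodepth=16", "--numjobs=8", "--size=2G",
    "--ioengine=libaio", "--direct=1", "--group_reporting",
    "--directory=.", "--output-format=json",
]
result = json.loads(subprocess.run(cmd, capture_output=True, text=True).stdout)
job = result["jobs"][0]
print(f"read IOPS: {job['read']['iops']:.0f}, "
      f"bandwidth: {job['read']['bw'] / 1024:.1f} MB/s")   # fio reports bw in KiB/s
```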

Benchmark Results#

[Benchmark result screenshots]
