
NVIDIA A800-SXM4-80GB 8-GPU Benchmark

Background#

The NVIDIA A800-SXM4-80GB is a data-center GPU based on the NVIDIA Ampere architecture, equipped with 80GB of HBM2e memory per GPU. It is the export variant of the A100, with NVLink bandwidth reduced from 600GB/s to 400GB/s, and targets high-performance computing, AI, and machine-learning workloads. I recently had a project that required benchmarking an 8-GPU A800-SXM4-80GB server, which led to the tests summarized here.

Benchmark Environment#

  • Server: 8-GPU NVIDIA A800-SXM4-80GB system
  • Operating System: Ubuntu 22.04
  • GPUs: NVIDIA A800-SXM4-80GB × 8
  • GPU Memory: 80GB × 8 = 640GB
  • CPU: 2 × Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz
  • System Memory: 956GB DDR4-3200MT/s

Benchmarks#

  • GPU Memory Bandwidth: Measure memory allocation performance and bandwidth across multiple GPUs.
  • GPU to CPU Transfer: Test PCIe transfer speeds between GPU and CPU.
  • GPU to GPU Transfer: Evaluate inter-GPU data transfer rates (a sketch of this measurement follows the list).
  • Disk I/O: Benchmark read/write performance of the system storage.
  • Computationally Intensive Tasks: Run deep learning models and synthetic workloads to test compute performance.
  • Model Inference: Benchmark common AI models such as ResNet, BERT, and GPT-2 for inference throughput and latency.
  • CPU Performance: Evaluate both single-threaded and multi-threaded CPU performance.
  • Memory Bandwidth: Measure system memory performance.
  • Tensor Core Performance: Benchmark GPU Tensor Core capabilities.
  • System Overview Snapshot: Capture OS, CPU, GPU telemetry, storage, and environment metadata for reproducible benchmarking.
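
The post doesn't ship the benchmark source, but the transfer tests above are straightforward to sketch in PyTorch. Below is a minimal, illustrative version of the GPU-to-GPU measurement; the function name is my own, and the 5GB default mirrors the --gpu-data-size-gb default described later:

```python
import time

import torch

def p2p_bandwidth_gb_s(src: int, dst: int, size_gb: float = 5.0, repeats: int = 5) -> float:
    """Copy a buffer from GPU `src` to GPU `dst` and return effective GB/s."""
    n = int(size_gb * 1024**3) // 4                    # number of float32 elements
    x = torch.empty(n, dtype=torch.float32, device=f"cuda:{src}")
    _ = x.to(f"cuda:{dst}")                            # warm-up copy
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    t0 = time.perf_counter()
    for _ in range(repeats):
        _ = x.to(f"cuda:{dst}", non_blocking=True)
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    return size_gb * repeats / (time.perf_counter() - t0)

if torch.cuda.device_count() >= 2:
    print(f"GPU0 -> GPU1: {p2p_bandwidth_gb_s(0, 1):.1f} GB/s")
```

On an SXM4 system this copy should go over NVLink, so on the A800 the result is bounded by its 400GB/s NVLink cap rather than the A100's 600GB/s.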

Requirements#

System Requirements#

  • OS: Ubuntu 22.04/24.04 or Rocky/Alma Linux 9
  • Disk Space: At least 10GB of free disk space for benchmarking operations.
  • fio: Flexible I/O tester used for disk I/O benchmarking.
  • nvidia-smi: NVIDIA System Management Interface for GPU monitoring (typically installed with CUDA).
  • CUDA Libraries: Required for GPU operations (installed with the CUDA Toolkit).

Python Dependencies#

  • torch: PyTorch framework for deep learning workloads (an import sanity check follows this list).
  • numpy: For numerical computation.
  • psutil: System and process utilities.
  • GPUtil: Monitor GPU utilization.
  • tabulate: Format output as tables.
  • transformers: For Transformer models such as BERT and GPT used in inference benchmarks.
  • torchvision: For ResNet and other image-related tasks.

Command-Line Options#

General Options#

  • --json: Output results in JSON format.
  • --detailed-output: Show detailed benchmark results and print an extended system overview (disk partitions, network links, environment variables).
  • --num-iterations N: Number of times to run each benchmark (default: 1).
  • --log-gpu: Enable GPU logging during benchmarks (see the nvidia-smi snippet after this list).
  • --gpu-log-file FILE: Specify GPU log filename (default: gpu_log.csv).
  • --gpu-log-metrics METRICS: Comma-separated list of GPU metrics to log.
  • --gpus GPU_IDS: Comma-separated list of GPU IDs to use (e.g., 0,1,2,3).
  • --precision {fp16,fp32,fp64,bf16}: Precision to use for computations (default: fp16).
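
The post doesn't show how --log-gpu is implemented, but equivalent CSV telemetry can be captured directly through nvidia-smi's query interface. The fields below are standard --query-gpu metrics and a plausible choice for --gpu-log-metrics:

```python
import subprocess

# Sample GPU telemetry once per second into gpu_log.csv (Ctrl-C to stop).
# All query fields below are standard `nvidia-smi --query-gpu` metrics.
metrics = "timestamp,index,utilization.gpu,utilization.memory,memory.used,power.draw,temperature.gpu"
with open("gpu_log.csv", "w") as f:
    subprocess.run(
        ["nvidia-smi", f"--query-gpu={metrics}", "--format=csv", "-l", "1"],
        stdout=f,
    )
```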

GPU Benchmarks#

  • --gpu-data-gen: Run GPU data generation benchmark.
  • --gpu-to-cpu-transfer: Run GPU-to-CPU transfer benchmark.
  • --gpu-to-gpu-transfer: Run GPU-to-GPU transfer benchmark.
  • --gpu-memory-bandwidth: Run GPU memory bandwidth benchmark.
  • --gpu-tensor: Run GPU Tensor Core performance benchmark (see the sketch after this list).
  • --gpu-compute: Run GPU compute workload benchmark.
  • --gpu-data-size-gb N: Data size (in GB) for GPU benchmarks (default: 5.0).
  • --gpu-memory-size-gb N: Memory size (in GB) for GPU memory bandwidth benchmark (default: 5.0).
  • --gpu-tensor-matrix-size N: Matrix size for GPU Tensor Core benchmark (default: 4096).
  • --gpu-tensor-iterations N: Number of iterations for GPU Tensor Core benchmark (default: 1000).
  • --gpu-comp-epochs N: Number of epochs for GPU compute workload (default: 200).
  • --gpu-comp-batch-size N: Batch size for GPU compute workload (default: 2048).
  • --gpu-comp-input-size N: Input size for GPU compute workload (default: 4096).
  • --gpu-comp-hidden-size N: Hidden size for GPU compute workload (default: 4096).
  • --gpu-comp-output-size N: Output size for GPU compute workload (default: 2000).
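
As a reference for the Tensor Core flags, here is a minimal sketch of what such a benchmark typically does, using the defaults above (4096×4096 matrices, 1000 iterations, fp16 so the matmul maps onto Ampere Tensor Cores). It is illustrative, not the tool's actual code:

```python
import time

import torch

def tensor_core_tflops(n: int = 4096, iters: int = 1000, device: str = "cuda:0") -> float:
    """Time repeated n x n fp16 matmuls and return achieved TFLOPS."""
    a = torch.randn(n, n, dtype=torch.float16, device=device)
    b = torch.randn(n, n, dtype=torch.float16, device=device)
    for _ in range(10):                              # warm-up
        _ = a @ b
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - t0
    return 2 * n**3 * iters / elapsed / 1e12         # 2*n^3 FLOPs per matmul

print(f"{tensor_core_tflops():.1f} TFLOPS (fp16)")
```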

GPU Inference Benchmarks#

  • --gpu-inference: Run GPU inference throughput and latency benchmarks (see the sketch after this list).
  • --gpu-inference-model {custom,resnet50,bert,gpt2}: Select the model for inference benchmarking (default: custom).
  • --model-size N: Depth of the custom inference model (default: 5).
  • --batch-size N: Batch size for inference benchmarks (default: 256).
  • --input-size N: Input feature size for inference benchmarks (default: 224).
  • --output-size N: Output dimension for inference benchmarks (default: 1000).
  • --iterations N: Number of inference iterations to execute (default: 100).
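
For the resnet50 option, the measurement boils down to timing repeated forward passes. A minimal sketch with the defaults above (batch 256, 224×224 inputs, 100 iterations), using randomly initialized weights since accuracy is irrelevant for a throughput test:

```python
import time

import torch
from torchvision.models import resnet50

device = "cuda:0"
model = resnet50().to(device).eval().half()          # random weights; fp16 inference
x = torch.randn(256, 3, 224, 224, dtype=torch.float16, device=device)

with torch.inference_mode():
    for _ in range(10):                              # warm-up
        model(x)
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    for _ in range(100):                             # --iterations 100
        model(x)
    torch.cuda.synchronize(device)

elapsed = time.perf_counter() - t0
print(f"latency: {elapsed / 100 * 1000:.2f} ms/batch, "
      f"throughput: {256 * 100 / elapsed:.0f} images/s")
```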

CPU Benchmarks#

  • --cpu-single-thread: Run single-threaded CPU performance benchmark.
  • --cpu-multi-thread: Run multi-threaded CPU performance benchmark.
  • --cpu-to-disk-write: Run CPU-to-disk write throughput benchmark.
  • --memory-bandwidth: Run memory bandwidth benchmark (see the sketch after this list).
  • --cpu-num-threads N: Number of threads for multi-threaded CPU benchmarks (default: all logical cores).
  • --data-size-gb-cpu N: Data size (in GB) for CPU-to-disk write benchmark (default: 5.0).
  • --memory-size-mb-cpu N: Memory size (in MB) for CPU memory bandwidth benchmark (default: 1024).
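
The memory-bandwidth test is essentially a timed large copy. A minimal NumPy sketch using the 1024MB default, counting one read plus one write of the buffer per copy:

```python
import time

import numpy as np

size_mb = 1024                                       # --memory-size-mb-cpu default
src = np.random.rand(size_mb * 1024 * 1024 // 8)     # float64 elements -> 1024 MB
dst = np.empty_like(src)

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    np.copyto(dst, src)
elapsed = time.perf_counter() - t0

# Each copy reads the source and writes the destination once.
print(f"{2 * (size_mb / 1024) * reps / elapsed:.1f} GB/s")
```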

Disk I/O Benchmarks#

  • --disk-io: Run disk I/O benchmark using fio (see the wrapper sketch after this list).
  • --disk-data-size N: Data size (in GB) for disk I/O benchmark (default: 2.0).
  • --disk-block-size N: Block size (in KB) for disk I/O benchmark (default: 4).
  • --disk-io-depth N: Queue depth for disk I/O benchmark (default: 16).
  • --disk-num-jobs N: Number of concurrent jobs to run (default: 8).
  • --disk-path PATH: Target directory for temporary files used in the disk benchmark (default: current directory).
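
fio does the heavy lifting here; the tool presumably builds a command like the one below from the flags above. This random-read example parses fio's JSON output; the rw mode and ioengine are my assumptions, while block size, queue depth, job count, and data size mirror the listed defaults:

```python
import json
import subprocess

# Note: fio's --size is per job, so 8 jobs write up to 2GB each.
cmd = [
    "fio", "--name=randread", "--rw=randread",
    "--bs=4k", "--iodepth=16", "--numjobs=8", "--size=2G",
    "--ioengine=libaio", "--direct=1", "--group_reporting",
    "--directory=.", "--output-format=json",
]
result = json.loads(subprocess.run(cmd, capture_output=True, text=True).stdout)
job = result["jobs"][0]
print(f"read IOPS: {job['read']['iops']:.0f}, "
      f"bandwidth: {job['read']['bw'] / 1024:.1f} MB/s")   # fio reports bw in KiB/s
```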

Benchmark Results#

[Benchmark result screenshots]
