How Can We Break Through the AI Computing Power Bottleneck? Starting with the Optimization of GPU Server Clusters
Since 2023, generative AI represented by large language models has swept across the globe. From ChatGPT to Sora, from text generation to video generation, the boundaries of AI capability continue to expand.
What many people do not realize, however, is that every leap in AI capability is backed by exponential growth in computing power investment.
Research institutions estimate that training GPT-3 required thousands of GPU cards running continuously for weeks. GPT-4, with an even larger parameter count, demanded several times more computing power.
Today, computing power has become the core bottleneck restricting AI development.
So how can we break through this bottleneck through GPU server cluster optimization? Let’s explore.
To solve the bottleneck problem, we first need to understand where the bottlenecks actually exist. In AI infrastructure, they typically appear on three levels.
The first and most intuitive bottleneck is the computing power of a single GPU.
AI models are becoming increasingly large, and the computational capability of a single GPU is no longer sufficient.
Take the NVIDIA H100 as an example. Its peak throughput approaches 2,000 TFLOPS (at FP8 precision, or at FP16 with sparsity), yet even this is inadequate for training trillion-parameter models on a single card.
The obvious solution is:
Use multiple GPUs in parallel.
This is why 8-GPU, 16-GPU, and even 32-GPU servers have become increasingly common.
Multi-GPU parallelism introduces a new challenge:
GPUs must constantly communicate with each other.
During training, after every iteration, all GPUs exchange gradient data and synchronize model parameters. If communication speed cannot keep up with computation speed, GPUs end up waiting idle, wasting valuable computing resources.
It’s similar to a team collaboration problem:
If communication is inefficient, even highly capable individuals cannot achieve maximum productivity.
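The compute-then-synchronize cycle described above can be illustrated with a toy simulation. This is only a sketch: the "GPUs" are plain Python lists, the one-parameter model, the data set and the names `local_gradient` and `all_reduce_mean` are invented for illustration, and the averaging function stands in for a real collective operation.

```python
# Toy data-parallel step: each simulated "GPU" computes a gradient on its own
# data shard, then an all-reduce (here: a plain average) synchronizes them.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for a collective all-reduce: every worker gets the average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

# Hypothetical data set generated from y = 3x, split evenly across 4 workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for step in range(200):
    grads = [local_gradient(w, shard) for shard in shards]  # compute phase
    synced = all_reduce_mean(grads)                          # communication phase
    w -= 0.01 * synced[0]                                    # same update on every worker

print(f"learned w = {w:.4f}")
```

In a real cluster no worker can apply its update until the communication phase finishes, which is why a slow interconnect translates directly into idle GPUs.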
Even after solving communication inside a single server, a bigger challenge emerges:
When models become too large for a single machine, multiple servers must work together.
At this point, communication between servers becomes the new bottleneck.
Inside a data center, servers communicate over Ethernet or InfiniBand networks, whose latency and bandwidth are far inferior to intra-server GPU interconnects such as NVLink.
Maintaining high efficiency across multi-node clusters is now one of the core challenges in AI infrastructure.
To overcome communication bottlenecks, hardware optimization is critical.
NVIDIA addresses this challenge with:
NVLink
NVSwitch
NVLink is a high-speed GPU interconnect technology with significantly higher bandwidth than PCIe.
NVSwitch acts as an internal switching fabric, enabling full-mesh GPU interconnection so any GPU can communicate directly with any other GPU at high speed.
In an 8x H100 server, NVSwitch enables:
Up to 900 GB/s of total bidirectional NVLink bandwidth per GPU
Roughly 7x the bandwidth of a PCIe 5.0 x16 link (about 128 GB/s bidirectional)
As a result, gradient synchronization time can drop from seconds to milliseconds, dramatically reducing GPU idle time.
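A rough way to see how link bandwidth affects synchronization time is the standard bandwidth-only estimate for a ring all-reduce, in which each GPU transfers 2(N - 1)/N times the payload. The sketch below applies it under assumed values: a 10 GB gradient payload, 450 GB/s per direction for NVLink (half of the 900 GB/s total) and about 64 GB/s per direction for a PCIe 5.0 x16 link. It ignores latency, protocol overhead and overlap with computation, so the results are illustrative rather than measured.

```python
def ring_allreduce_seconds(num_gpus, payload_gb, link_gb_per_s):
    """Bandwidth-only estimate for a ring all-reduce.

    Each GPU sends and receives 2 * (N - 1) / N times the payload;
    latency and compute overlap are ignored.
    """
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / link_gb_per_s

GRADIENT_GB = 10.0          # assumed gradient size per iteration
NVLINK_GB_PER_S = 450.0     # assumed: one direction of the 900 GB/s total
PCIE5_X16_GB_PER_S = 64.0   # assumed: one direction of a PCIe 5.0 x16 link

nvlink_s = ring_allreduce_seconds(8, GRADIENT_GB, NVLINK_GB_PER_S)
pcie_s = ring_allreduce_seconds(8, GRADIENT_GB, PCIE5_X16_GB_PER_S)
print(f"NVLink: {nvlink_s * 1000:.0f} ms, PCIe: {pcie_s * 1000:.0f} ms")
```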
In systems without NVLink, PCIe topology becomes extremely important.
A common optimization is:
CPU direct connection instead of PCH bridge routing.
CPU-direct PCIe lanes offer:
Lower latency
Higher bandwidth
When designing AI servers, our engineers carefully plan GPU slot placement to ensure all GPUs connect directly to the CPU, avoiding unnecessary bottlenecks introduced by PCH routing.
Another frequently overlooked issue is memory consistency.
In multi-GPU systems, memory spaces accessed by different GPUs must remain synchronized. Poor design can lead to data inconsistency and computational errors.
Our solution combines:
Hardware-level consistency protocol support
Driver-level synchronization mechanisms
This architecture has already been validated across multiple AI server projects.
Hardware provides the foundation, but software unlocks the true performance potential.
With proper optimization, identical hardware can deliver over 30% better performance.
Large-model training generally uses three major parallelization methods:
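Data parallelism: each GPU holds a complete copy of the model and processes a different batch of data.
Suitable when:
The model fits into a single GPU.
Model parallelism (tensor parallelism): the model itself, typically the weight matrices within each layer, is split across multiple GPUs.
Suitable when:
The model is too large for a single GPU.
This approach requires intensive inter-GPU communication during forward and backward propagation.
Pipeline parallelism: different GPUs hold different groups of layers and process data in sequence, like an assembly line.
Advantages:
Reduces communication overhead, since only the activations at stage boundaries are exchanged
Challenges:
Pipeline bubbles (idle time while a stage waits for work) may reduce efficiency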
In real-world training, these methods are often combined.
Finding the optimal combination is a highly specialized engineering task.
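One piece of this trade-off can be quantified. For a simple synchronous pipeline schedule (GPipe-style, which is an assumption here since the schedule is not specified above), the idle "bubble" fraction with p stages and m micro-batches is (p - 1) / (m + p - 1). A minimal sketch, with `bubble_fraction` as an illustrative helper name:

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of a synchronous pipeline with equal-cost stages."""
    return (stages - 1) / (microbatches + stages - 1)

# With 4 pipeline stages, more micro-batches shrink the bubble.
for m in (1, 4, 16, 64):
    print(f"{m:>3} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

Splitting each batch into more micro-batches reduces idle time, at the cost of smaller per-step workloads for each GPU.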
NVIDIA provides NCCL (NVIDIA Collective Communications Library) for multi-GPU communication.
However, NCCL parameter tuning significantly impacts performance.
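For example, NCCL can choose among several algorithms when executing collective operations such as AllReduce:
Ring
Tree
CollNet (in-network reduction, where the network fabric supports it)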
Different algorithms perform differently depending on cluster scale and topology.
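A tuning sweep of this kind might be set up as follows. NCCL reads the environment variables `NCCL_ALGO`, `NCCL_PROTO` and `NCCL_DEBUG`; the sketch only builds the environment for each candidate configuration. The helper name `nccl_sweep` is illustrative, and the benchmark command is left as a comment because it is site-specific.

```python
import itertools
import os

ALGOS = ["Ring", "Tree"]            # values accepted by NCCL_ALGO
PROTOS = ["LL", "LL128", "Simple"]  # values accepted by NCCL_PROTO

def nccl_sweep(base_env=None):
    """Yield one environment dict per (algorithm, protocol) combination."""
    base = dict(os.environ if base_env is None else base_env)
    for algo, proto in itertools.product(ALGOS, PROTOS):
        env = dict(base)
        env.update({"NCCL_ALGO": algo, "NCCL_PROTO": proto, "NCCL_DEBUG": "INFO"})
        yield env

configs = list(nccl_sweep(base_env={}))
for env in configs:
    print(env["NCCL_ALGO"], env["NCCL_PROTO"])
    # A real sweep would launch the benchmark here, for example with
    # subprocess.run(benchmark_cmd, env=env), where benchmark_cmd is site-specific.
```

Forcing an algorithm or protocol is mainly useful for benchmarking; left unset, NCCL selects them automatically based on the detected topology.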
Our performance optimization team benchmarks various configurations directly on customer clusters to identify the best-performing setup.
To reduce communication overhead, gradient compression can be highly effective.
By quantizing or sparsifying gradient data before transmission:
Bandwidth consumption is reduced
Model accuracy is largely preserved
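As an illustration of sparsification, the sketch below keeps only the k largest-magnitude gradient entries and transmits them as index/value pairs. It is a minimal example with made-up values and illustrative function names, not a production scheme; practical implementations usually also accumulate the dropped entries locally and add them back in later iterations.

```python
def top_k_sparsify(gradient, k):
    """Keep the k largest-magnitude entries; return (indices, values)."""
    order = sorted(range(len(gradient)), key=lambda i: abs(gradient[i]), reverse=True)
    kept = sorted(order[:k])
    return kept, [gradient[i] for i in kept]

def densify(indices, values, length):
    """Rebuild a dense gradient on the receiving side; dropped entries become 0."""
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Hypothetical gradient vector: a few large entries dominate.
grad = [0.01, -0.90, 0.02, 0.75, -0.03, 0.00, 0.40, -0.01]
idx, vals = top_k_sparsify(grad, k=3)
restored = densify(idx, vals, len(grad))
print(idx, vals)
```

Here three index/value pairs are sent instead of eight values; the saving grows as the gradient becomes larger and sparser.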
Mixed precision training is another key optimization.
It uses:
FP16 for computation
FP32 for master weights and gradient accumulation
This approach combines:
FP16’s performance advantages
FP32’s numerical stability
Modern AI frameworks now widely support automatic mixed precision training.
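The reason accumulation stays in FP32 can be shown with a small numerical experiment. The sketch below emulates FP16 and FP32 rounding with Python's `struct` module (format codes `e` and `f`) and adds the same small value 10,000 times; the value 1e-4 and the iteration count are arbitrary choices for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

small_grad = to_fp16(1e-4)  # a small gradient contribution, stored in FP16

fp16_sum = 0.0
fp32_sum = 0.0
for _ in range(10_000):
    fp16_sum = to_fp16(fp16_sum + small_grad)  # accumulator kept in FP16
    fp32_sum = to_fp32(fp32_sum + small_grad)  # accumulator kept in FP32

print(f"FP16 accumulator: {fp16_sum}, FP32 accumulator: {fp32_sum}")
```

The FP16 accumulator stops growing once each addition is smaller than half the spacing between adjacent representable values, while the FP32 accumulator stays close to the expected total of about 1.0.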
As AI clusters scale from dozens to thousands of GPUs, system-level optimization becomes increasingly critical.
Large AI clusters commonly use:
Fat-tree topologies
Non-blocking network architectures
This ensures sufficient bandwidth between any two servers.
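For a sense of scale, the classic three-tier k-ary fat-tree built from identical k-port switches supports k^3/4 hosts at full bisection bandwidth. The sketch below computes the switch and host counts for that textbook design; real deployments often deviate from it, so the numbers are indicative only.

```python
def fat_tree_capacity(k):
    """Component counts for a three-tier k-ary fat-tree of k-port switches."""
    if k % 2:
        raise ValueError("port count k must be even")
    return {
        "hosts": k ** 3 // 4,
        "edge_switches": k * k // 2,
        "aggregation_switches": k * k // 2,
        "core_switches": (k // 2) ** 2,
    }

for ports in (4, 32, 64):
    print(ports, fat_tree_capacity(ports))
```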
For example, one major internet company deployed:
400G RoCE (RDMA over Converged Ethernet)
Intelligent traffic scheduling algorithms
Results included:
Cross-server latency below 10 microseconds
Over 95% bandwidth utilization
In clusters containing thousands of GPUs:
Failures are normal, not exceptional.
Every day may involve:
GPU failures
Network interruptions
Process crashes
The challenge is minimizing disruption.
Our solution integrates intelligent fault prediction algorithms directly into the server BMC system, enabling early warning before failures occur.
This gives operations teams time to intervene proactively.
Different AI workloads require different resources:
Some require massive GPU capacity
Others demand high memory
Others are network-sensitive
Efficient scheduling is essential to maximize cluster utilization.
Together with one customer, we developed an AI resource scheduling system capable of:
Dynamically allocating resources
Automatically reclaiming idle resources
Reassigning unused capacity to other tasks
This increased average GPU utilization from 45% to 72%.
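The allocate, reclaim and reassign cycle can be sketched in a few lines. This is a hypothetical toy model, not the system described above: the `Cluster` class, the job names and the usage figures are all invented for illustration, and a real scheduler would also handle priorities, preemption and GPU topology.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    total_gpus: int
    allocations: dict = field(default_factory=dict)  # job name -> GPUs held

    def free(self):
        return self.total_gpus - sum(self.allocations.values())

    def allocate(self, job, gpus):
        """Grant the request only if enough GPUs are free."""
        if gpus > self.free():
            return False
        self.allocations[job] = self.allocations.get(job, 0) + gpus
        return True

    def reclaim_idle(self, busy_gpus):
        """Shrink each job to the GPUs it is actually using; return GPUs freed."""
        freed = 0
        for job, held in list(self.allocations.items()):
            used = min(held, busy_gpus.get(job, 0))
            freed += held - used
            if used:
                self.allocations[job] = used
            else:
                del self.allocations[job]
        return freed

cluster = Cluster(total_gpus=16)
cluster.allocate("train-a", 8)
cluster.allocate("train-b", 8)
queued = cluster.allocate("infer-c", 4)                      # rejected: cluster is full
freed = cluster.reclaim_idle({"train-a": 8, "train-b": 2})   # train-b uses only 2 GPUs
placed = cluster.allocate("infer-c", 4)                      # succeeds with reclaimed GPUs
print(queued, freed, placed, cluster.free())
```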
Last year, we provided GPU cluster optimization services for an AI startup.
Their challenge was straightforward:
They needed to train a 100-billion-parameter model on their existing cluster, but training was projected to take three months.
They wanted us to accelerate the process.
We discovered that several servers had poorly configured PCIe layouts that restricted GPU communication.
After redesigning the PCIe topology:
Communication bandwidth improved by 40%
The customer originally used pure data parallelism.
The model size caused memory overflow issues.
We redesigned the training architecture using:
Pipeline parallelism
Combined with data parallelism
This solved the memory problem while maintaining high computational efficiency.
Through extensive testing, we identified the optimal NCCL configuration for the customer’s network environment.
Results:
Gradient synchronization time reduced by 25%