Alibaba Cloud ditches Nvidia’s interconnect in favor of Ethernet — tech giant uses own High Performance Network to connect 15,000 GPUs inside data center


Alibaba Cloud engineer and researcher Ennan Zhai shared a research paper via GitHub revealing the cloud provider’s design for the data centers it uses for LLM training. The paper, titled “Alibaba HPN: A Data Center Network for Large Language Model Training,” outlines how Alibaba used Ethernet to allow its 15,000 GPUs to communicate with each other.

General-purpose cloud computing generates many consistent but small data flows, each below 10 Gbps. LLM training, by contrast, produces periodic bursts of data that can hit up to 400 Gbps. According to the paper, “This characteristic of LLM training predisposes Equal-Cost Multi-Path (ECMP), the commonly used load-balancing scheme in traditional data centers, to hash polarization, causing issues such as uneven traffic distribution.”
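The effect is easy to reproduce in miniature. The sketch below (with illustrative flow tuples and uplink count, not the paper’s actual hashing scheme) places a handful of large flows the way a plain ECMP switch would, by hashing their headers onto equal-cost uplinks; with so few flows in flight, nothing prevents several 400 Gbps bursts from landing on the same link while others sit idle.

```python
# Minimal sketch of why a few "elephant" flows sit badly with ECMP hashing.
# Flow tuples and uplink count are illustrative, not taken from the paper.
import hashlib
from collections import Counter

NUM_UPLINKS = 8                     # hypothetical number of equal-cost uplinks
elephant_flows = [                  # a few bursty LLM-training flows (made up)
    ("10.0.0.1", "10.0.1.1", 4791),
    ("10.0.0.2", "10.0.1.2", 4791),
    ("10.0.0.3", "10.0.1.3", 4791),
    ("10.0.0.4", "10.0.1.4", 4791),
]

def ecmp_pick(src, dst, port, n_links):
    """Hash the flow key onto one of n_links, as a plain ECMP switch would."""
    key = f"{src}-{dst}-{port}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % n_links

load = Counter(ecmp_pick(s, d, p, NUM_UPLINKS) for s, d, p in elephant_flows)
print("flows per uplink:", dict(load))
# The hash ignores flow size entirely, so two 400 Gbps bursts can end up on
# the same uplink -- the uneven distribution the paper calls hash polarization.
```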

To avoid this, Zhai and his team developed the High-Performance Network (HPN), which uses a “2-tier, dual-plane architecture” that reduces the number of possible ECMP choices while letting the system “precisely select network paths capable of holding elephant flows.” The HPN also pairs top-of-rack (ToR) switches so that one can back up the other; a single ToR switch is the most common single point of failure in LLM training, which requires all GPUs to complete each iteration in sync.
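The paper’s actual path-selection logic is more involved; the toy sketch below only illustrates the general idea of explicitly placing large flows on the path with the most headroom instead of hashing them, using made-up path counts and flow sizes.

```python
# Illustrative only: explicit placement of elephant flows on the least-loaded
# path, instead of relying on a hash. This is not Alibaba's actual algorithm.
def place_flow(path_load_gbps, flow_gbps):
    """Pick the path with the most headroom and account for the new flow."""
    best = min(range(len(path_load_gbps)), key=lambda i: path_load_gbps[i])
    path_load_gbps[best] += flow_gbps
    return best

paths = [0.0] * 8                  # eight hypothetical equal-cost paths
for _ in range(4):                 # four 400 Gbps bursts
    chosen = place_flow(paths, 400.0)
    print(f"flow placed on path {chosen}, loads now {paths}")
# Unlike hash-based placement, each burst lands on an empty path here.
```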

Eight GPUs per host, 1,875 hosts per data center

Alibaba Cloud divided its data centers into hosts, with each host equipped with eight GPUs. Each GPU has its own network interface card (NIC) with two ports, and each GPU-NIC pair is called a ‘rail.’ The host also gets an extra NIC to connect to the backend network. Each rail connects to two different ToR switches, ensuring that the entire host isn’t cut off even if one switch fails.
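For a concrete picture, here is a toy model of the wiring just described, with hypothetical switch names; it only checks that every GPU keeps a live uplink when one ToR switch fails.

```python
# Toy model of one HPN host as described above: 8 GPUs, one dual-port NIC per
# GPU ("rail"), each rail homed to two ToR switches, plus one extra NIC for
# the backend network. Switch and NIC names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Rail:
    gpu_id: int
    tor_switches: tuple            # the two ToR switches this rail's ports reach

@dataclass
class Host:
    rails: list = field(default_factory=list)
    backend_nic: str = "backend-nic"   # extra NIC to the backend network

def build_host(tor_a="ToR-A", tor_b="ToR-B"):
    return Host(rails=[Rail(gpu_id=i, tor_switches=(tor_a, tor_b)) for i in range(8)])

def still_connected(host, failed_tor):
    """Every GPU keeps an uplink as long as only one ToR switch has failed."""
    return all(any(t != failed_tor for t in rail.tor_switches) for rail in host.rails)

host = build_host()
print(still_connected(host, "ToR-A"))   # True: the second ToR backs up the first
```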

Despite ditching NVLink for inter-host communication, Alibaba Cloud still uses Nvidia’s proprietary technology for the intra-host network, as communication between GPUs inside a host requires more bandwidth. However, since inter-host traffic demands much less, the “dedicated 400 Gbps of RDMA network throughput, resulting in a total bandwidth of 3.2 Tbps” per host is more than enough to maximize the bandwidth of the PCIe Gen5 x16 cards.
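A quick back-of-the-envelope check of those figures (the PCIe Gen5 x16 rate below is the generic ~64 GB/s per-direction spec value, not a number from the paper):

```python
# Back-of-the-envelope check of the per-host figures quoted above.
# The PCIe Gen5 x16 rate is the generic ~64 GB/s spec value, not from the paper.
gpus_per_host = 8
rdma_per_gpu_gbps = 400                       # dedicated RDMA throughput per rail
host_total_tbps = gpus_per_host * rdma_per_gpu_gbps / 1000
pcie_gen5_x16_gbps = 64 * 8                   # ~64 GB/s per direction -> ~512 Gb/s

print(f"per-host RDMA bandwidth: {host_total_tbps} Tbps")            # 3.2 Tbps
print(f"PCIe Gen5 x16 ceiling per GPU: ~{pcie_gen5_x16_gbps} Gb/s")  # ~512 Gb/s
# Each GPU's 400 Gbps of network throughput sits just under the ~512 Gb/s its
# PCIe Gen5 x16 slot can move, so the NICs keep the PCIe link nearly saturated.
```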

Alibaba Cloud also uses 51.2 Tb/s single-chip Ethernet ToR switches, as multi-chip solutions are more prone to instability, with a failure rate four times higher than single-chip switches. However, these switches run hot, and no readily available heat sink on the market could keep them from shutting down due to overheating. So the company devised its own solution: a vapor chamber heat sink with more pillars concentrated at the center to carry thermal energy away more efficiently.

Ennan Zhai and his team will present their work at the SIGCOMM (Special Interest Group on Data Communications) conference in Sydney, Australia, this August. Many companies, including AMD, Intel, Google, and Microsoft, will likely take an interest in this work, primarily because they have banded together to create Ultra Accelerator Link, an open-standard interconnect meant to compete with NVLink. This is especially true as Alibaba Cloud has been running the HPN for over eight months, meaning the technology has already been tried and tested in production.

However, HPN still has some drawbacks, the biggest being its complicated wiring structure. With each host having nine NICs and each NIC connected to two different ToR switches, there are plenty of chances to mix up which cable goes to which port. Nevertheless, the technology is presumably more affordable than NVLink, allowing any institution setting up a data center to save a great deal on set-up costs (and perhaps even to avoid Nvidia technology altogether, especially for companies sanctioned by the U.S. in the ongoing chip war with China).


