Recently, Huang Zhonghuang's team at Shenwan Hongyuan released a research report arguing that, as large-model parameter counts explode, computing-power requirements are rapidly shifting from single-point hardware to system-level integration.
Under this trend, scale-up and scale-out have become the two core dimensions of computing-power expansion. Supernodes are not only driving progress in cabinet-level interconnect and cross-cabinet networking technology; they will also reshape the division of labor in the computing-power industry chain, creating investment opportunities such as server integration, greater demand for optical communication, and higher liquid-cooling penetration.
To use a freighter analogy: when demand for total capacity grows, scale-up builds a bigger freighter, while scale-out adds more freighters. Scale-up pursues tight hardware coupling; scale-out pursues elastic scaling to support loosely coupled tasks (such as data parallelism). The two differ fundamentally in protocol stacks, hardware, and fault-tolerance mechanisms, and their communication efficiency differs accordingly.
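The coupling difference above can be made concrete with a toy calculation. The sketch below is illustrative only: the model sizes, GPU counts, and layer dimensions are assumptions, not figures from the report. It contrasts scale-out-style data parallelism (one gradient all-reduce per training step, tolerant of slower Ethernet links) with scale-up-style tensor parallelism (activation exchanges inside every layer, which demand the low-latency fabric of a supernode), using the standard ring all-reduce cost of 2*(n-1)/n times the payload per participant.

```python
# Toy comparison of the communication patterns behind scale-out (data
# parallelism) and scale-up (tensor parallelism). All numbers are
# illustrative assumptions, not taken from the report.

def ring_allreduce_bytes_per_gpu(payload_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU transfers in one ring all-reduce: 2*(n-1)/n * payload."""
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

# Scale-out / data parallelism: the full gradient is all-reduced, but only
# once per step, so higher-latency cross-node Ethernet is acceptable.
model_params = 70e9                 # hypothetical 70B-parameter model
grad_bytes = model_params * 2       # fp16 gradients
dp_traffic = ring_allreduce_bytes_per_gpu(grad_bytes, n_gpus=1024)

# Scale-up / tensor parallelism: activations are exchanged inside every
# transformer layer, many times per step, so the GPUs must sit on a
# tightly coupled, high-bandwidth fabric (the supernode's domain).
batch, seq, hidden, layers = 8, 4096, 8192, 80   # hypothetical shapes
act_bytes = batch * seq * hidden * 2             # fp16 activations per layer
tp_traffic = 2 * layers * ring_allreduce_bytes_per_gpu(act_bytes, n_gpus=8)

print(f"data-parallel all-reduce per GPU per step: {dp_traffic / 1e9:.1f} GB")
print(f"tensor-parallel traffic per GPU per step:  {tp_traffic / 1e9:.1f} GB")
```

The point is not the absolute volumes but the frequency: the tensor-parallel traffic recurs at every layer of every step, which is why scale-up fabrics prioritize latency and bandwidth while scale-out networks prioritize elasticity.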
Currently, scale-up has broken through the traditional limits of a single server and a single cabinet and entered the "supernode" era. Scale-up can be understood as increasing the number of GPUs within a single node (traditionally a single server, e.g. from 2 cards to 8 cards); its essence, however, is full interconnection among the GPUs in a node, not their physical residence in one server or one cabinet. As interconnect technology evolves, scale-up is leaving those boundaries behind, and supernodes can span servers and cabinets.
A supernode is essentially a scale-up of the computing-power network at the level of one or more cabinets. Within the node, the mainstream communication scheme is copper connections carrying electrical signals, while cross-cabinet links are considering optical communication. The hardware boundary with scale-out is the NIC: beyond it, devices such as optical modules and Ethernet switches take over. The two differ fundamentally in architecture design, hardware, and protocol standards.
For now, scale-up and scale-out have not merged or converged. Chip makers such as Nvidia, Broadcom, Huawei, and Haiguang are expected to be deeply involved in scale-up, while Ethernet players (such as Broadcom network chips, HiSilicon network chips, and Senko Communications) will focus on scale-out.
Under the cabinet/supernode trend, vertical integration by AI chip makers to strengthen their communication, storage, and software capabilities is a clear direction, and chip giants are deepening their layout across the computing-power network: Nvidia, AMD, and, domestically, Haiguang Information with its absorption-merger of Zhongke Shuguang are all taking action.
Nvidia has made eight acquisitions in the past six years, focused on integrating the entire computing-power chain. By acquiring networking technology (Mellanox), software-defined networking (Cumulus), industry applications (Parabricks), cloud services (Lepton AI), and AI development tools (Run.ai, Deci), it is building a closed-loop ecosystem from chip to application, both to counter competition from cloud giants and to penetrate emerging markets.
Domestically, Haiguang Information's announced plan to absorb and merge Zhongke Shuguang confirms the same industry trend. From an industrial-synergy perspective, Haiguang Information's main revenue comes from CPU+DCU, while Zhongke Shuguang's comes from servers and cloud infrastructure. Once the merger is completed, Haiguang+Shuguang will cover the full hardware chain from chip to cloud, with obvious synergies.
So, is server vendors' room for survival being squeezed?
First, AI chip makers are unlikely to enter the OEM business themselves. After acquiring ZT Systems, AMD divested its server-manufacturing (OEM) business to avoid competing with OEM/ODM partners; Haiguang's merger with Shuguang likewise aims at strengthening collaboration and building up capabilities such as liquid cooling and software.
Even so, the division of labor in the computing-power chain may be further refined. Under the supernode trend, most interconnects between AI chips, and between AI chips and switch chips, must be routed through board cards (especially for electrical-signal interconnection). Take Nvidia as an example: at product launch it designs the board card itself, then opens the design to OEM partners once the product matures. Board-card design capability thus becomes the core differentiator for capturing more value, and the OEM industry chain may split further into board-card design suppliers and cabinet-integration suppliers.
In terms of industrial opportunities, the report suggests seizing the industry-chain opportunities arising from the evolution of technology paths, with a dual-line layout focused on hardware interconnection and scenario adaptation. Recommended AI chip and server suppliers to watch: Haiguang Information, Zhongke Shuguang, Inspur Information, Ziguang Co., Ltd., Shenzhou Digital, Lenovo Group, Huaqin Technology, etc.