2026-04-30 08:28:11 EE Times

According to TrendForce's latest AI server research, as large cloud service providers (CSPs) step up development of their own chips, Nvidia shifted its focus at GTC 2026 from its earlier emphasis on the cloud AI training market to the practical application of AI inference across a range of fields. The company is driving demand for both AI training and inference through a diversified product line spanning GPUs, CPUs, and LPUs, and is leveraging rack-level integration solutions to boost supply chain growth.
TrendForce indicates that as the trend toward self-developed chips, led by CSPs such as Google and Amazon, expands, ASIC AI servers are expected to rise from 27.8% of overall AI server shipments in 2026 to nearly 40% in 2030.
To solidify its leadership in the AI market, one of Nvidia's strategies is to actively promote rack-scale solutions that integrate CPUs and GPUs, such as the GB300 and VR200, emphasizing their scalability into AI inference applications. The Vera Rubin platform unveiled at GTC is positioned as a highly vertically integrated complete system encompassing seven chipsets and five racks.
Looking at the progress of the Rubin supply chain, memory manufacturers are expected to be able to supply HBM4 chips for Rubin GPUs in the second quarter of 2026, allowing Nvidia to begin ramping Rubin chip shipments around the third quarter. As for the shipment progress of Nvidia's GB300 and VR200 rack systems, the former replaced the GB200 as the shipment mainstay in the fourth quarter of 2025, and its shipment share is expected to reach nearly 80% in 2026. The VR200 rack is expected to begin volume shipments around the end of the third quarter of 2026, with its subsequent ramp depending on the actual progress of the ODMs.
Furthermore, as AI moves from the generative era to the agentic era, it faces significant latency and memory bandwidth bottlenecks during the token decoding stage. To address this, Nvidia integrated technology from the Groq team to launch the Groq 3 LPU, designed specifically for low-latency inference, with 500MB of SRAM per chip and up to 128GB per rack.
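To see why the decoding stage is bandwidth-bound rather than compute-bound, a rough back-of-envelope calculation helps: each generated token must stream the model weights (plus the growing key-value cache) through memory at least once, so per-stream decode speed is capped by memory bandwidth. The Python sketch below illustrates this with purely hypothetical model sizes and bandwidth figures; none of the numbers come from Nvidia, Groq, or TrendForce.

```python
# Back-of-envelope sketch of why token decoding is bandwidth-bound.
# All numbers below are illustrative assumptions, not published figures.

def decode_tokens_per_second(weight_bytes: float, kv_cache_bytes: float,
                             mem_bandwidth_bytes_s: float) -> float:
    """Upper bound on single-stream decode speed.

    Each decoded token must read the model weights and the key-value cache
    from memory, so the achievable rate is roughly
    bandwidth / bytes-moved-per-token.
    """
    bytes_per_token = weight_bytes + kv_cache_bytes
    return mem_bandwidth_bytes_s / bytes_per_token

# Hypothetical 70B-parameter model at 1 byte per parameter, with a 10 GB KV cache.
weights = 70e9
kv_cache = 10e9

# Assumed bandwidth tiers: HBM-class accelerator vs. on-chip-SRAM LPU rack.
for name, bw in [("HBM GPU (assumed 8 TB/s)", 8e12),
                 ("SRAM LPU rack (assumed 80 TB/s)", 80e12)]:
    rate = decode_tokens_per_second(weights, kv_cache, bw)
    print(f"{name}: ~{rate:.0f} tokens/s per stream")
```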
However, the LPU's memory capacity is insufficient to hold the massive parameters and key-value cache that Vera Rubin accommodates. Nvidia therefore proposed a "disaggregated inference" architecture at GTC, using an AI factory operating system called Dynamo to split the inference pipeline in two. The prefill and attention stages, which require heavy computation and store the large key-value cache needed for agent-type AI, are handled by Vera Rubin, with its extremely high throughput and massive memory. The decoding and token-generation stages, which are bandwidth-bound and highly latency-sensitive, are offloaded directly to LPU racks with significantly expanded memory.
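As a rough illustration of how such a prefill/decode split might look in software, the sketch below separates the two stages into distinct worker pools, with the key-value cache handed off between them. The class and method names are hypothetical placeholders for illustration only and do not represent the actual Dynamo API.

```python
# Conceptual sketch of a disaggregated-inference split: prefill on a
# high-throughput tier, decode on a low-latency tier. Hypothetical names.
from dataclasses import dataclass

@dataclass
class KVCache:
    # In a real system this would hold the attention key/value tensors for
    # the prompt; here the prompt length stands in for that state.
    prompt_tokens: int

class PrefillPool:
    """Compute-heavy stage: the Vera Rubin role described in the article."""
    def prefill(self, prompt: str) -> KVCache:
        # Process the whole prompt in parallel and materialize the KV cache.
        return KVCache(prompt_tokens=len(prompt.split()))

class DecodePool:
    """Bandwidth-bound, latency-sensitive stage: the LPU-rack role."""
    def decode(self, kv: KVCache, max_new_tokens: int) -> list[str]:
        # Generate tokens one at a time, reading the transferred KV cache.
        return [f"<token_{i}>" for i in range(max_new_tokens)]

def run_request(prompt: str) -> list[str]:
    kv = PrefillPool().prefill(prompt)   # stage 1: prefill on the GPU tier
    return DecodePool().decode(kv, 8)    # stage 2: decode on the LPU tier

print(run_request("Explain disaggregated inference in one sentence."))
```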
In terms of supply chain progress, the third-generation Groq LP30 is manufactured by Samsung and has entered full-scale mass production, with official shipments expected in the second half of 2026. A higher-performance LP40 chip is planned for the next-generation Feynman architecture.