Solving the Real Roadblock of Next-Generation AI

Toward the end of last year, many media outlets began to question whether AI development was finally slowing down. So far, this year has been more measured, DeepSeek-sized shocks aside, and AI hype is giving way to practical assessments of how and where to find and measure actual ROI.

A similar shift is happening within the AI industry itself. We are reaching the limits of our current array of performance-improving solutions, both the energy-efficiency limits of what is possible on a single die and the challenges of building the complex multi-die solutions that can deliver the necessary scale. These base-level roadblocks are what truly hinder the next generation of AI development.

The Brute-Force Era

So far, companies have been able to arrange CPUs, GPUs, neural network accelerators, and other chips into monoliths to crunch the vast datasets required to train AI models. Consequently, most major models rely on more than 16,000 chips. DeepSeek's ability to deliver strong performance on just 2,000 NVIDIA chips was a breakthrough: it showed that model innovation can match the accuracy of traditional LLMs at substantially lower compute.

However, even buying a relatively small quantity like 2,000 chips may become more difficult. Supermicro pushed NVIDIA for faster shipments again at Computex in late May, and ongoing economic turbulence due to tariffs and trade wars could stifle the global flow of chips again. Even if 2,000 chips were to become the new standard for a high-performance model, price points remain high enough to make this a prohibitively expensive solution to engineer, let alone to regularly upgrade or repair.
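For a rough sense of scale, here is a back-of-envelope sketch in Python; the $30,000 per-accelerator price is a hypothetical assumption, not a quoted figure, since actual prices vary widely by SKU and vendor.

```python
# Back-of-envelope cluster cost. UNIT_PRICE_USD is an assumption,
# not a quoted price; real figures vary widely by SKU and vendor.
UNIT_PRICE_USD = 30_000

for n_chips in (2_000, 16_000):
    cost_usd = n_chips * UNIT_PRICE_USD
    print(f"{n_chips:>6} chips -> ~${cost_usd / 1e6:,.0f}M in accelerators alone")

# Even the "small" 2,000-chip configuration lands around $60M before
# networking, power delivery, cooling, and facilities are counted.
```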

The New Challenge

As more GPUs participate in a single training run, the amount of data that must move both within and between chips grows dramatically. In the original CPU-based data centers running traditional workloads, memory and network bottlenecks led to very low utilization of "big-iron" CPUs, which in turn hurt power efficiency and therefore cost-effectiveness. Today's GPU behemoths are hitting the same challenge, with the added issue of latency, which is critical for these workloads.
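To see why, consider a hedged sketch of per-step gradient traffic in data-parallel training with a ring all-reduce, where each GPU moves roughly 2(N-1)/N times the gradient volume per step; the 70B-parameter model and precision below are illustrative assumptions, not measurements.

```python
# Illustrative per-step gradient traffic under data parallelism with a
# ring all-reduce. Model size and precision are assumptions, not specs.
PARAMS = 70e9        # hypothetical 70B-parameter model
BYTES_PER_GRAD = 2   # bf16/fp16 gradients (assumed)

grad_bytes = PARAMS * BYTES_PER_GRAD
for n_gpus in (8, 1_024, 16_000):
    per_gpu = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    total = n_gpus * per_gpu
    print(f"{n_gpus:>6} GPUs: ~{per_gpu / 1e9:.0f} GB per GPU, "
          f"~{total / 1e12:,.0f} TB cluster-wide, every step")

# Per-GPU volume plateaus near 2x the gradient size, but cluster-wide
# traffic grows linearly with N, and every step waits on the slowest link.
```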

Additionally, as Epoch.AI published late last year, LLM scaling appears to be fast approaching a "latency wall," the point at which additional computational power can no longer overcome the bottlenecks of data movement, at 58 TFLOPs. Their estimates suggest that at current rates we will hit that wall within just three years.
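The shape of that wall can be illustrated with a toy model (this is not Epoch.AI's calculation, and every constant below is an invented assumption): once the fixed serial latency of communication each step exceeds the compute time, faster chips stop helping.

```python
# Toy latency-wall model. All constants are invented for illustration.
PER_HOP_LATENCY_S = 5e-6     # assumed latency per sequential network hop
HOPS_PER_STEP = 100          # assumed sequential communication rounds
WORK_PER_GPU_FLOP = 1e12     # fixed per-GPU work each training step

latency_floor_s = PER_HOP_LATENCY_S * HOPS_PER_STEP  # 0.5 ms, irreducible
for gpu_tflops in (100, 1_000, 10_000):
    compute_s = WORK_PER_GPU_FLOP / (gpu_tflops * 1e12)
    step_s = compute_s + latency_floor_s
    print(f"{gpu_tflops:>6} TFLOP/s: step {step_s * 1e3:5.2f} ms, "
          f"utilization {compute_s / step_s:4.0%}")

# 100x more raw compute shrinks the step only ~17x; utilization collapses
# from 95% to 17% as the latency floor comes to dominate.
```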

These inefficiencies often arise from what amounts to a very simple problem: current hardware architectures were not designed with these extreme data movement requirements in mind. Traditional crossbar switching architectures are a good example; they struggle to keep pace with modern throughput demands because they scale quadratically, connecting every input to every output. With a small number of inputs and outputs, this is perfectly serviceable. At the scale needed for AI, current implementations are running out of steam. Even NVIDIA's NVLink solutions, the gold standard, are not keeping up with the expected growth in node count for scale-up switching.
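The quadratic scaling is easy to make concrete; the port counts below are arbitrary examples.

```python
# A full crossbar needs a crosspoint for every (input, output) pair,
# so crosspoints -- and, roughly, area and wiring -- grow as N^2.
for n_ports in (8, 64, 512, 4_096):
    print(f"{n_ports:>5} ports -> {n_ports * n_ports:>12,} crosspoints")

# 8 ports costs 64 crosspoints; 4,096 ports costs ~16.8 million, which
# is why a design that is serviceable at small radix runs out of steam
# at the scale modern AI clusters demand.
```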

Rethinking Infrastructure

I find it encouraging that open-source architectures like RISC-V and open industry standards such as UALink™ from the UALink Consortium continue to grow in popularity as the industry tries to innovate beyond current bottlenecks. Even NVIDIA seems to be opening up its walled garden as of late.

To break through the “latency wall,” we must fundamentally rethink how data moves through AI systems. This requires challenging long-standing design assumptions and standards at every level, from basic interconnects to system-on-chip (SoC) architectures themselves.

Switching solutions also need to move beyond the traditional crossbar. Chiplets offer one promising path forward by shifting to a modular approach. For traditional crossbars, however, the "shoreline" (the availability of the chiplet's edges for high-speed IO) is limited by the advanced, custom implementations the core crossbars require.

To use many discrete, smaller chiplets instead of the monolithic-chip approach, you need crossbar designs that are compact, scalable, and leave more shoreline for scaling. Designers can then build highly customized systems. The plug-and-play components can also vary in performance depending on the needs of the system in question, further reducing costs and adding flexibility for tailored use cases.
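A minimal sketch of the shoreline argument, assuming idealized square dies and an arbitrary 800 mm² total silicon budget: compute scales with die area, while IO shoreline scales with perimeter, so partitioning the same silicon into more chiplets multiplies the available edge.

```python
# Shoreline vs. silicon: perimeter grows linearly with edge length while
# area grows quadratically. Dies are idealized as squares; the 800 mm^2
# total budget is an arbitrary, reticle-scale assumption.
def total_shoreline_mm(total_area_mm2: float, n_chiplets: int) -> float:
    side_mm = (total_area_mm2 / n_chiplets) ** 0.5
    return n_chiplets * 4 * side_mm

TOTAL_AREA_MM2 = 800.0
for n in (1, 4, 16):
    print(f"{n:>2} chiplet(s): {total_shoreline_mm(TOTAL_AREA_MM2, n):6.0f} mm of shoreline")

# Sixteen chiplets expose 4x the IO shoreline of one monolithic die of
# equal total area -- provided the fabric design leaves those edges free.
```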

Players across the industry are waking up to the promise and potential of chiplets, and the next generation of AI will undoubtedly require advanced, high-performance multi-chiplet systems across data center and wireless infrastructure. But like any new technology, it will take time to build the kind of cohesive ecosystem that promotes innovation.

Building a More Open Future

Shifting to chiplets and, more fundamentally, addressing the data movement problem won't just improve performance; it will also help democratize access to cutting-edge AI. As DeepSeek demonstrated, it is possible to achieve strong results with fewer resources. That mindset should be, and can be, applied to hardware and systems as well. Arm, an IP pioneer, recently announced its Chiplet System Architecture (CSA), which opens up the ecosystem but remains tied to the Arm architecture. Not to be outdone, Tenstorrent, a high-profile entrant in the AI compute platform space and a leading RISC-V CPU vendor, announced its intent to build an Open Chiplet Architecture (OCA) to bring the same openness to the open-source world, alongside its open-source software stack for AI development.

Industry collaborations like UALink, RISC-V-based computation, and more open software stacks can accelerate how quickly we break new ground in AI innovation. The more minds we engage in rethinking fundamentals such as hardware and data movement, the better we can ensure that the next wave of AI breakthroughs is both broader in scope and more accessible to innovators around the globe.

About the Author

Dr. Sailesh Kumar is the CEO and Founder of Baya Systems.
