The artificial intelligence hardware landscape, once a monolithic empire ruled by the GPU, is fracturing. Nvidia’s surprise $20 billion licensing deal with Groq in late 2025 was not just a transaction; it was a market-wide admission that for AI inference (the task of running trained models) the GPU is no longer the only answer. This move has ignited the sector, validating a new class of specialized processors and forcing a crucial distinction between ambitious promises and production-ready reality. The central battleground is no longer just training, but delivering answers from massive models with near-instantaneous speed.
In this reshaped arena, three distinct challengers have emerged from the pack, each with a radically different philosophy. Cerebras Systems, with its audacious wafer-scale engine, aims to eliminate the memory bottleneck entirely. SambaNova Systems champions a reconfigurable dataflow architecture, designed to tame the largest and most complex models. And Groq, though now absorbed into the Nvidia collective, has proven the power of its deterministic Language Processing Unit (LPU), a technology that continues to set the benchmark for low-latency performance. These companies, along with high-stakes outsiders like Etched and Tenstorrent, are defining the next era of AI computation, where speed is the entire value proposition.
In brief:
- Since 2022, a new wave of startups including Cerebras, Groq, and SambaNova has developed specialized chips for large language model (LLM) inference, targeting the architectural weaknesses of GPUs for this task.
- The core challenge for GPUs in inference is the memory bandwidth bottleneck; generating tokens sequentially is a latency-sensitive task, not a parallel-throughput task like training.
- Nvidia’s acquisition of Groq in December 2025 for $20 billion validated the market for specialized inference hardware, absorbing the LPU technology into its own ecosystem.
- Cerebras has secured major validation with a $10 billion OpenAI contract and deployment on AWS Bedrock for its WSE-3 wafer-scale chip, which eliminates external memory.
- SambaNova’s SN40L excels with extremely large models, demonstrated by independent benchmarks on models exceeding 400 billion parameters, thanks to its unique tiered memory system.
- Independent data is scarce, but benchmarks from services like Artificial Analysis confirm Groq’s LPU achieved 877 tokens/s on Llama 3 8B, while SambaNova’s SN40L reached 114 tokens/s on a much larger Llama 3.1 405B model.
- The ecosystem is proving as critical as the chip itself. Survival now depends on integration with hyperscalers and securing large-scale customer deployments.
The Post-Groq Landscape: Why GPU Inference Isn’t Enough
To understand the rise of these challengers, one must first grasp the fundamental mismatch between GPUs and AI inference. A GPU is a marvel of parallel processing, designed to efficiently train models by performing trillions of calculations simultaneously across large batches of data. This is akin to a massive factory optimized for producing a thousand identical items at once. Inference, however, is a different problem. It’s a sequential, autoregressive process—generating one word, or token, at a time. This is like a customer asking for a single, custom-built item. The massive factory floor is of little help; the task is constrained by the speed of the single production line.
This creates a critical memory bandwidth bottleneck. For each token generated, the AI model must read its entire set of parameters from memory. An Nvidia H100 GPU, for example, has an HBM memory bandwidth of 3.35 TB/s. For a 70-billion-parameter model stored at 16-bit precision (about 140 GB of weights), this translates to a theoretical maximum of roughly 24 tokens per second for a single user, because the full weight set can be streamed at most about 24 times per second. This physical limit is what challengers are exploiting. It also leads to a confusion of metrics. Aggregate throughput (total tokens per second across all users) is a datacenter metric, while per-user latency (tokens per second for one user) is the metric that defines conversational AI. The two are not interchangeable, and the latter is where GPUs struggle.
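The ceiling described above can be reproduced with a few lines of arithmetic. The sketch below assumes that every generated token requires one full read of the weights and that weights are stored at 2 bytes per parameter (16-bit); the function name and the default precision are illustrative choices, not figures from any vendor.

```python
def max_tokens_per_second(bandwidth_tb_s: float,
                          params_billion: float,
                          bytes_per_param: float = 2.0) -> float:
    """Upper bound on single-user decode speed for a bandwidth-bound model.

    Assumes each token requires reading every parameter from memory once
    and ignores compute time, KV-cache traffic, and batching.
    """
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# H100 figure quoted in the text: 3.35 TB/s, 70B parameters, 16-bit weights.
h100_70b = max_tokens_per_second(3.35, 70)
print(f"{h100_70b:.1f} tokens/s")  # about 23.9, i.e. "roughly 24"
```

Halving the bytes per parameter (for example with 8-bit weights) doubles the bound, which is one reason quantization matters so much for single-user latency.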
Cerebras WSE-3: The Bet on Wafer-Scale Integration
Cerebras Systems chose the most direct solution to the memory problem: get rid of external memory altogether. Its Wafer-Scale Engine 3 (WSE-3) is a single chip occupying an entire 300mm silicon wafer, making it 57 times larger than an H100. This immense size allows it to integrate 900,000 AI-optimized cores and, crucially, 44 GB of on-chip SRAM. With this design, the memory bandwidth explodes to 21 petabytes per second, more than 6,000 times the 3.35 TB/s of an H100. It’s the difference between fetching a book from a library in another building versus having it on your desk.
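Taking the bandwidth figures quoted in this article at face value, the gap can be checked directly. The sketch below is illustrative only: the constants come from the text, while the helper names and the choice of a 44 GB weight set (one that would exactly fill the on-chip SRAM) are assumptions made for the example.

```python
H100_BANDWIDTH_TB_S = 3.35       # HBM bandwidth quoted in the text
WSE3_BANDWIDTH_TB_S = 21_000.0   # 21 PB/s expressed in TB/s
WSE3_SRAM_GB = 44.0              # on-chip SRAM quoted in the text


def bandwidth_ratio(fast_tb_s: float, slow_tb_s: float) -> float:
    """How many times more bandwidth the faster memory system offers."""
    return fast_tb_s / slow_tb_s


def weight_read_time_us(weights_gb: float, bandwidth_tb_s: float) -> float:
    """Microseconds needed to stream a weight set once at a given bandwidth."""
    seconds = (weights_gb * 1e9) / (bandwidth_tb_s * 1e12)
    return seconds * 1e6


ratio = bandwidth_ratio(WSE3_BANDWIDTH_TB_S, H100_BANDWIDTH_TB_S)
wse3_us = weight_read_time_us(WSE3_SRAM_GB, WSE3_BANDWIDTH_TB_S)
h100_us = weight_read_time_us(WSE3_SRAM_GB, H100_BANDWIDTH_TB_S)
print(f"ratio: {ratio:,.0f}x")
print(f"one pass over 44 GB: {wse3_us:.1f} us (WSE-3) vs {h100_us:,.0f} us (H100)")
```

These are peak-bandwidth numbers only; real token rates also depend on compute, interconnect, and how a model is split across devices.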
This radical architecture is no longer theoretical. In 2026, its potential was validated by two landmark deals. In January, OpenAI signed a multi-year, $10 billion contract for Cerebras’s computing capacity. By March, Amazon had deployed the WSE-3 in its datacenters via Amazon Bedrock, marking its first integration with a major hyperscaler. While Cerebras does not participate in standardized MLPerf benchmarks, its own claims of over 2,500 tokens per second on large models are partially corroborated by community tests. The strategy appears to be working; the company filed for an IPO in April 2026, signaling its transition from a promising startup to a major market player.
Groq’s Legacy: The Deterministic LPU Inside Nvidia
Before its acquisition, Groq pioneered the Language Processing Unit (LPU), an architecture built on a principle of determinism. Unlike a GPU, which makes real-time decisions about how to schedule tasks, the LPU’s entire execution path is mapped out by a compiler ahead of time. At runtime, the hardware simply executes the plan with no improvisation, eliminating the variability in response time known as “jitter.” This results in predictable, ultra-low latency, a critical factor for real-time applications. The LPU’s design relies exclusively on high-speed on-chip SRAM, providing over 80 TB/s of memory bandwidth.
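The contrast between compile-time and runtime scheduling can be illustrated with a toy model. This is not Groq’s compiler or hardware interface; the `Op` type, the cycle counts, and the random-stall model for the dynamic case are all hypothetical, chosen only to show why a precomputed plan yields identical latency on every run.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    cycles: int  # fixed cost assumed known at compile time (hypothetical)


def compile_schedule(ops):
    """Assign every op a fixed start cycle before anything runs."""
    schedule, clock = [], 0
    for op in ops:
        schedule.append((clock, op))
        clock += op.cycles
    return schedule


def run_static(schedule):
    """Replay the plan; the finish time is fully determined by the schedule."""
    start, last_op = schedule[-1]
    return start + last_op.cycles


def run_dynamic(ops, rng, max_stall=5):
    """Toy runtime scheduler: each op may stall a random number of cycles."""
    return sum(op.cycles + rng.randint(0, max_stall) for op in ops)


ops = [Op("load_weights", 4), Op("matmul", 10), Op("softmax", 3)]
schedule = compile_schedule(ops)

static_runs = {run_static(schedule) for _ in range(5)}   # always {17}
rng = random.Random(0)
dynamic_runs = [run_dynamic(ops, rng) for _ in range(5)]  # varies run to run
print(static_runs, dynamic_runs)
```

The static path collapses to a single latency value, while the dynamic path produces a spread; that spread is the jitter the paragraph above refers to.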
Independent benchmarks from 2024 confirmed the LPU’s prowess, showing it delivered a staggering 877 tokens/s on Llama 3 8B and 284 tokens/s on the 70B model, roughly 12 times the theoretical single-user ceiling of about 24 tokens/s for a single H100 on a model of that size. This performance made Groq too significant to ignore. The December 2025 deal saw its technology integrated into Nvidia’s Vera Rubin platform as the Groq 3 LPX, an inference co-processor. Groq is no longer an independent challenger, but its LPU technology now serves to reinforce Nvidia’s market dominance, proving that even revolutionary AI inference accelerators can be absorbed by the incumbent.
SambaNova’s RDU: Taming Trillion-Parameter Models
SambaNova’s approach centers on its Reconfigurable Dataflow Unit (RDU), which compiles entire graphs of operations into a continuous dataflow pipeline. Instead of running a sequence of separate computing tasks, the RDU fuses them into a single, efficient process. Its most distinctive feature is a three-tier memory system: a small amount of ultra-fast on-chip SRAM, a larger tier of co-packaged HBM, and up to 1.5 TiB of conventional DDR DRAM. This massive DRAM capacity allows multiple huge models to reside in memory simultaneously, enabling rapid switching between them.
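One way to picture the tiered design is as a large DDR pool holding every model, with a smaller HBM tier acting as a cache for the models currently in use. The sketch below is a simplified model of that idea, not SambaNova’s runtime: the class, the least-recently-used eviction policy, the 64 GiB HBM figure, and the model names and sizes are all assumptions made for illustration. Only the 1.5 TiB (1,536 GiB) DDR capacity comes from the text.

```python
from collections import OrderedDict


class TieredModelStore:
    """Toy DDR pool with a smaller HBM tier managed as an LRU cache."""

    def __init__(self, ddr_capacity_gib: float, hbm_capacity_gib: float):
        self.ddr_capacity_gib = ddr_capacity_gib
        self.hbm_capacity_gib = hbm_capacity_gib
        self.ddr = {}             # name -> size in GiB, every resident model
        self.hbm = OrderedDict()  # name -> size in GiB, hot subset in LRU order

    def load(self, name: str, size_gib: float) -> None:
        """Place a model in DDR so it can be activated later without reloading."""
        if sum(self.ddr.values()) + size_gib > self.ddr_capacity_gib:
            raise MemoryError(f"DDR tier cannot hold {name}")
        self.ddr[name] = size_gib

    def activate(self, name: str) -> str:
        """Make a model hot, evicting least-recently-used models from HBM."""
        if name in self.hbm:
            self.hbm.move_to_end(name)
            return "hbm_hit"
        size = self.ddr[name]  # KeyError if the model was never loaded
        while self.hbm and sum(self.hbm.values()) + size > self.hbm_capacity_gib:
            self.hbm.popitem(last=False)  # drop the least recently used model
        if size > self.hbm_capacity_gib:
            raise ValueError(f"{name} does not fit in the HBM tier")
        self.hbm[name] = size
        return "ddr_to_hbm"


# 1536 GiB = 1.5 TiB from the text; 64 GiB HBM is a placeholder assumption.
store = TieredModelStore(ddr_capacity_gib=1536, hbm_capacity_gib=64)
for model in ("legal", "code", "medical"):  # hypothetical 30 GiB experts
    store.load(model, 30)

events = [store.activate(m) for m in ("legal", "code", "legal", "medical")]
print(events, list(store.hbm))
```

In this toy run the third request is served straight from HBM, and the fourth evicts the least recently used model rather than reloading anything from disk or the network, which is the fast-switching property the paragraph describes.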
This is a decisive advantage for Composition of Experts (CoE) deployments, where different specialized models are called upon depending on the user’s request. Independent data from Artificial Analysis validates this approach, showing the SN40L achieving 114 tokens/s on the massive 405-billion-parameter Llama 3.1 model. Other reports cite even more impressive figures, such as 198 tokens/s on the 671B DeepSeek-R1 model with just 16 chips. With its next-generation SN50 chip announced in February 2026, SambaNova continues to carve out a crucial niche for enterprises deploying the largest and most complex AI systems.
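The routing step in a Composition of Experts setup can be sketched in a few lines. Production routers are typically learned models in their own right; the keyword-matching function below, along with the expert names and keyword lists, is a hypothetical stand-in meant only to show the shape of the dispatch.

```python
def route(prompt: str, experts: dict, default: str) -> str:
    """Pick the expert whose keywords appear most often in the prompt.

    `experts` maps an expert name to a tuple of lowercase keywords.
    Falls back to `default` when nothing matches.
    """
    if not experts:
        return default
    text = prompt.lower()
    scores = {
        name: sum(keyword in text for keyword in keywords)
        for name, keywords in experts.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


# Hypothetical experts; each would correspond to a separate resident model.
EXPERTS = {
    "code": ("python", "function", "bug"),
    "legal": ("contract", "liability", "clause"),
}

print(route("Fix the bug in this Python function", EXPERTS, "general"))
print(route("Summarize the liability clause", EXPERTS, "general"))
print(route("What is the weather like today?", EXPERTS, "general"))
```

Because each request may land on a different expert, the cost of switching models dominates, which is why keeping many models resident in memory matters for this pattern.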
The High-Stakes Outsiders: Etched and Tenstorrent
Beyond the top three, other startups are making high-risk, high-reward bets. Etched is pursuing the most radical path with its Sohu chip, a transformer-only ASIC. By hardwiring the transformer architecture directly into silicon, Etched claims its chip can achieve unparalleled compute density, projecting that a single server could replace 160 H100s. The existential risk, however, is that if the transformer architecture is ever superseded, the chip becomes obsolete. As of April 2026, despite raising over $620 million, Etched has yet to deliver a chip to an external customer, leaving its spectacular claims entirely unverified.
Tenstorrent, led by legendary chip architect Jim Keller, is taking a more flexible, open-source approach. Its strategy targets multiple markets, from datacenters to edge devices, and its software stack is fully open-source. Its Blackhole chip combines RISC-V CPU cores with powerful matrix math engines. However, independent reviews in late 2025 found that the software was not yet mature enough to fully exploit the hardware’s potential, with real-world performance on LLMs reaching only about 50% of its theoretical peak. While its open approach is compelling, Tenstorrent must still bridge the gap between hardware capability and software execution to truly challenge established GPU-as-a-service platforms.
A Practical Guide: Matching Challengers to Use Cases
Navigating this complex market requires a clear understanding of which technology fits which need. The impressive technical specifications of these new AI chip challengers translate into distinct real-world applications.
- For ultra-low conversational latency on models larger than 100 billion parameters, Cerebras is the leading choice. Its on-chip SRAM architecture is purpose-built to eliminate the memory bottleneck, and its production partnerships with OpenAI and AWS provide strong validation.
- For applications demanding deterministic latency and zero jitter, such as real-time financial trading or autonomous systems, Groq’s LPU technology (now via Nvidia) remains the gold standard. Its pre-compiled, no-improvisation design guarantees predictable performance.
- For deploying very large Mixture of Experts (MoE) or multilingual models that exceed 500 billion parameters, SambaNova’s SN40L is uniquely suited. Its massive 1.5 TiB DDR memory and fast model-switching capabilities are unmatched for this use case.
- For hardware developers, enterprises focused on edge computing, or those committed to an open-source ecosystem, Tenstorrent’s Blackhole offers a compelling option. Its accessible workstations and RISC-V licensing model provide flexibility, though with the caveat of a maturing software stack.
- For organizations willing to bet on the long-term dominance of the transformer architecture in exchange for maximum compute density, Etched’s Sohu is the one to watch. However, with no third-party data or customer deliveries, it remains a high-risk option, worth considering only for deployments planned beyond late 2026.
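The guide above can be condensed into a small decision helper. This is a rough sketch of the article’s recommendations, not a procurement tool: the field names, the order in which the rules are checked, and the fallback message are assumptions, while the 100-billion and 500-billion parameter thresholds come from the list.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    model_params_billion: float
    needs_deterministic_latency: bool = False
    needs_open_source_stack: bool = False
    accepts_unverified_hardware: bool = False


def suggest_accelerator(req: Requirements) -> str:
    """Map requirements to the option this guide associates with them.

    Rule order is an assumption: hard constraints (determinism, openness)
    are checked before model size.
    """
    if req.needs_deterministic_latency:
        return "Groq LPU (via Nvidia)"
    if req.needs_open_source_stack:
        return "Tenstorrent Blackhole"
    if req.model_params_billion > 500:
        return "SambaNova SN40L"
    if req.model_params_billion > 100:
        return "Cerebras WSE-3"
    if req.accepts_unverified_hardware:
        return "Etched Sohu (unverified)"
    return "No clear match in this guide"


print(suggest_accelerator(Requirements(model_params_billion=405)))
print(suggest_accelerator(Requirements(model_params_billion=671)))
print(suggest_accelerator(Requirements(70, needs_deterministic_latency=True)))
```

Real selection would also weigh price, availability, and software maturity, none of which this sketch attempts to capture.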


