$GOOGL $ARM $NVDA $LITE This is an outstanding interview. Lots of great visibility into Google's cloud and TPU business, as well as color on some of their key customers.
EXECUTIVE OVERVIEW
The Kurian interview is best understood as a systems-level disclosure, not a conventional model-layer discussion. The central message is that Google’s AI roadmap is being organized around full-stack control: proprietary TPU silicon, Nvidia GPU optionality, Axion Arm CPUs, Intel and AMD CPU support, custom networking, storage, data-center design, energy procurement, enterprise distribution, Gemini models, and agent orchestration. The most important inference is that Google is positioning itself as an integrated AI utility: it can compete at the model layer while also selling capacity to model competitors; it can use Search, YouTube, Cloud, and enterprise cash flows to fund infrastructure; and it can improve unit economics by owning meaningful portions of the hardware and software stack. Kurian’s clearest strategic point was that AI capacity scarcity is expected to persist for roughly 10 years, and that owning silicon in a constrained market creates structurally better unit economics than reselling someone else’s accelerators.
The most important ecosystem implication is that generative AI is moving from a model race into an infrastructure supply-chain race. Model quality remains critical, but the binding constraints are increasingly memory bandwidth, accelerator utilization, interconnect latency, storage throughput, VM orchestration, local disk, power availability, data-center deployment cycle time, and energy efficiency per token. Google’s disclosed 8th-generation TPU architecture validates this shift: TPU 8t is optimized for frontier training and TPU 8i is optimized for low-latency inference, reasoning, reinforcement learning, MoE routing, KV cache, and agent workflows. Google Cloud’s Next ’26 materials state that TPU 8t and TPU 8i were created because infrastructure requirements for pre-training, post-training, and real-time serving have diverged, while AI Hypercomputer now integrates accelerators, CPUs, storage, networking, software frameworks, orchestration, and GKE runtime improvements into a unified stack. (Google Cloud)
Google’s roadmap points toward 3 simultaneous monetization loops. The 1st loop is internal: Search, Gemini app, Workspace, YouTube, advertising, and DeepMind frontier models consume the infrastructure. The 2nd loop is external: enterprise customers, AI labs, capital markets, HPC users, and government labs rent or consume the infrastructure. The 3rd loop is ecosystem control: Google can attract external workloads onto TPUs, drive higher TPU volumes, amortize silicon R&D, improve supply-chain bargaining power, and reduce relative dependence on Nvidia economics. This does not eliminate Nvidia demand; Google continues to offer Nvidia GPUs and will support Vera Rubin NVL72 through A5X. It does, however, create a credible long-term substitution path for captive hyperscale workloads and model providers willing to optimize for TPUs. Google Cloud explicitly says Nvidia GPUs remain a core part of its accelerator portfolio, while Virgo Network will support both TPU 8t and A5X powered by Nvidia Vera Rubin NVL72. (Google Cloud)
GOOGLE’S STRATEGIC POSITION: AN AI FACTORY WITH MERCHANT CAPACITY
The interview makes clear that Google is not behaving like a pure model lab, a pure cloud distributor, or a pure chip vendor. It is behaving like a vertically integrated AI factory with merchant capacity. Kurian described a model in which Google monetizes tokens directly through Gemini, monetizes other parties’ models running on Google infrastructure, monetizes TPU capacity with external labs, and increasingly places TPU systems in customer venues for latency-sensitive workloads such as capital markets. The capital markets example is particularly notable because it expands TPUs beyond classic LLM training and inference into inference-style numerical workloads, low-latency quantitative research, and potentially venue-proximate AI infrastructure. This suggests Google is trying to convert TPU from an internal accelerator into a broader domain-specific infrastructure platform.
The key strategic distinction is ownership of IP. Kurian repeatedly contrasted Google’s position with resellers that must buy accelerators, package them into cloud offerings, and compete primarily through capacity allocation and software services. Google still buys and sells Nvidia capacity, but the TPU stack gives it an owned path to differentiated price-performance. This matters because AI infrastructure has entered a supply-constrained regime where the value pool can migrate upstream toward scarce components: accelerators, HBM, advanced packaging, networking, power, and data-center real estate. In that environment, a cloud vendor without its own accelerator is more exposed to gross-margin compression when component prices rise. Google’s TPU path reduces that exposure and provides a mechanism to monetize scarcity without fully surrendering economics to external silicon suppliers.
This does not imply Google should hoard every TPU for Gemini. Kurian’s answer was economically rational: the company needs recurring cash flow to fund training and infrastructure, and venture capital cannot indefinitely fund model providers if inference margins fail to cover training costs. That logic is a direct challenge to model-only companies with weak gross margins, subsidized usage, and rising compute obligations. The implication is that the AI market will increasingly favor companies that control at least 1 of 3 funding sources: a massive cash cow, merchant cloud infrastructure, or strategic hyperscaler balance-sheet support. Google has all 3 through Services, Cloud, and the ability to supply Anthropic and other labs.
FINANCIAL AND CAPEX READ-THROUGH
Alphabet’s financial disclosures support the Kurian thesis that Google can sustain an infrastructure arms race more credibly than most standalone labs. In Q4 2025, Google Cloud revenue grew 48% to $17.7B, Cloud operating income reached $5.3B, Cloud operating margin expanded to 30.1%, and Cloud backlog rose 55% sequentially to $240B. Alphabet generated $164.7B of operating cash flow and $73.3B of free cash flow for FY 2025, and ended the year with $126.8B of cash and marketable securities. The company guided to 2026 capex of $175B-$185B, explicitly tied to frontier model development, Google Services, Cloud customer demand, and strategic investments. (Alphabet Investor Relations)
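The ratios implied by these disclosures can be re-derived as a quick sanity check. The sketch below uses only the rounded figures quoted above, so small gaps versus reported values (for example the computed margin versus the reported 30.1%) reflect rounding of the inputs. The variable names are illustrative, not Alphabet's own line items.

```python
# Figures quoted in the text, in billions of USD (rounded as reported).
cloud_revenue_q4 = 17.7
cloud_operating_income_q4 = 5.3
cloud_backlog = 240.0
backlog_growth_qoq = 0.55          # "rose 55% sequentially"
operating_cash_flow_fy25 = 164.7
capex_guide_2026 = (175.0, 185.0)  # low and high end of the guide

# Operating margin recomputed from the rounded revenue and income figures.
cloud_margin = cloud_operating_income_q4 / cloud_revenue_q4

# Backlog one quarter earlier, implied by the sequential growth rate.
backlog_prior_quarter = cloud_backlog / (1 + backlog_growth_qoq)

# 2026 capex guide expressed as a multiple of FY 2025 operating cash flow.
capex_to_ocf = tuple(c / operating_cash_flow_fy25 for c in capex_guide_2026)

print(f"Cloud operating margin (rounded inputs): {cloud_margin:.1%}")
print(f"Implied prior-quarter backlog: ${backlog_prior_quarter:.0f}B")
print(f"Capex guide / FY25 OCF: {capex_to_ocf[0]:.2f}x to {capex_to_ocf[1]:.2f}x")
```

On these inputs the capex guide sits slightly above last year's operating cash flow, which is the arithmetic behind the depreciation and returns debate that follows.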
The investment implication is 2-sided. On the positive side, Google Cloud is no longer a low-margin strategic side project; it is a fast-growing, high-backlog, high-margin enterprise AI infrastructure business with meaningful operating leverage. On the negative side, the capex step-up is enormous, and the P&L will face pressure from depreciation and energy costs. Alphabet’s CFO specifically called out higher depreciation and data-center operating costs such as energy as consequences of technical infrastructure investment, with depreciation up 38% in 2025 and expected to accelerate in 2026. The central equity debate is therefore not whether demand exists; it is whether Google can convert backlog and AI usage into durable returns on $175B-$185B of annual capex without triggering pricing compression, utilization volatility, or material energy-cost inflation. (Alphabet Investor Relations)
Cloud is also becoming a larger share of Alphabet’s capital allocation. Kurian said Cloud accounts for about 50% of Alphabet’s capital, a share that is rising because Cloud is growing faster. Reuters also reported that Pichai reaffirmed $175B-$185B of 2026 capex and said just over 50% of Alphabet’s ML compute investment would be dedicated to the Cloud business. That means Alphabet’s infrastructure cycle is no longer only a defensive investment to support Search and YouTube; it is increasingly an external revenue engine. The positive scenario is that Google becomes a scaled AI infrastructure utility with better-than-peer cost structure. The negative scenario is that capex grows faster than monetizable demand, depreciation rises before revenue recognition, and competitive pressure from Microsoft, Amazon, CoreWeave, Oracle, xAI, and sovereign-cloud providers compresses returns. (Reuters)
WORKLOAD EVOLUTION: FROM CHAT TO MEDIA TO AGENTS
The most important technical roadmap signal in the interview is Kurian’s 3-phase framing of model workloads. Phase 1 was chatbot/search-style Q&A, where prompts were often long and output responses relatively shorter. Phase 2 was multimodal content generation, where simple prompts could generate long image, audio, or video outputs, increasing output-token intensity and stressing generation latency. Phase 3 is agents, where models interact with CRM, ERP, supply chain systems, browsers, code interpreters, APIs, databases, and computers over long-running workflows. This is a fundamentally different compute pattern because the model is no longer only producing a response; it is maintaining state, invoking tools, preserving memory, executing steps, reading and writing data, and coordinating other agents over potentially 6, 7, or 12 hours.
This workload shift explains the 8t/8i split. Training infrastructure wants the largest possible compute pools, massive memory, deterministic scaling, high bisection bandwidth, checkpoint resilience, and high goodput. Agentic inference wants low latency, high concurrency, KV cache residency, memory bandwidth, CPU orchestration, local storage, low-cost sandboxing, and geographically distributed inference capacity. Google Cloud’s official description of the agentic era is consistent with Kurian’s interview: a single intent can trigger a chain reaction in which a primary agent decomposes goals into tasks for specialized agents that collaborate, preserve state, and use reinforcement learning to deliver outcomes. (Google Cloud)
The adoption metrics indicate that agentic AI is moving from proof-of-concept into production. Google Cloud disclosed that nearly 75% of Cloud customers use its AI products, 330 customers each processed more than 1T tokens over the prior 12 months, 35 customers reached the 10T-token milestone, and first-party models process more than 16B tokens per minute via direct API usage, up from 10B last quarter. Gemini Enterprise paid monthly active users grew 40% quarter-over-quarter in Q1, and Google highlighted production deployments across GE Appliances, KPMG, Macquarie, Citi, Signal Iduna, ASCO, Virgin Voyages, Unilever, and others. (Google Cloud)
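To put the token disclosures on a common scale, the sketch below converts the per-minute figure into a quarter-over-quarter growth rate and an annualized run rate, and computes a floor for large-customer volume. Two assumptions are mine rather than Google's: the run rate holds the 16B-per-minute pace constant for a full year, and the 35 customers at 10T tokens are treated as a subset of the 330 customers above 1T.

```python
MINUTES_PER_YEAR = 60 * 24 * 365
ONE_TRILLION = 10**12

# First-party direct API throughput quoted in the text (tokens per minute).
tokens_per_minute_now = 16 * 10**9
tokens_per_minute_prior = 10 * 10**9

# Sequential growth in the per-minute rate.
qoq_growth = tokens_per_minute_now / tokens_per_minute_prior - 1

# Annualized run rate, ASSUMING the current pace is held flat for a year.
annualized_run_rate = tokens_per_minute_now * MINUTES_PER_YEAR

# Floor on trailing-12-month volume from large customers, ASSUMING the
# 35 customers at 10T are counted inside the 330 customers above 1T.
customers_over_1t = 330
customers_over_10t = 35
large_customer_floor = (
    (customers_over_1t - customers_over_10t) * ONE_TRILLION
    + customers_over_10t * 10 * ONE_TRILLION
)

print(f"QoQ growth in tokens per minute: {qoq_growth:.0%}")
print(f"Annualized run rate: {annualized_run_rate / 1e15:.2f} quadrillion tokens")
print(f"Large-customer floor: {large_customer_floor / ONE_TRILLION:.0f}T tokens")
```

The two totals measure different things (a forward run rate for first-party API traffic versus a trailing floor for named customer cohorts), so they are best read side by side rather than netted against each other.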
The key inference is that token growth will not be linear with human usage. Agents multiply compute intensity per human request because a single request can invoke tens, hundreds, or thousands of intermediate steps. A travel-planning agent, code-repair agent, procurement agent, or SOC remediation agent can use accelerators for reasoning, CPUs for tools and VMs, SSD for local state, object storage for retrieval, network bandwidth for API calls, and identity/security systems for authorization. This creates a broader semiconductor and infrastructure demand basket than the 2023-2024 “GPU = AI” framing implied.
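The fan-out argument can be made concrete with a toy model. Everything numeric below is hypothetical (the tokens per step and tool calls per step are placeholders, not disclosed figures); the point is only that accelerator tokens and CPU-side tool calls both scale with the number of intermediate steps, not with the number of human requests.

```python
from dataclasses import dataclass


@dataclass
class AgentRequest:
    """Toy model of one human request; all parameters are hypothetical."""
    steps: int                  # intermediate model calls per human request
    tokens_per_step: int        # assumed average tokens processed per step
    tool_calls_per_step: float  # assumed CPU-side tool invocations per step

    def total_tokens(self) -> int:
        return self.steps * self.tokens_per_step

    def total_tool_calls(self) -> float:
        return self.steps * self.tool_calls_per_step


# Baseline: a single-turn chat request with no tool use.
chat = AgentRequest(steps=1, tokens_per_step=2_000, tool_calls_per_step=0.0)

for steps in (10, 100, 1_000):
    agent = AgentRequest(steps=steps, tokens_per_step=2_000, tool_calls_per_step=1.5)
    multiple = agent.total_tokens() / chat.total_tokens()
    print(
        f"{steps:>5} steps: {agent.total_tokens():>9,} tokens "
        f"({multiple:.0f}x chat), {agent.total_tool_calls():,.0f} tool calls"
    )
```

With tokens per step held constant the token multiple equals the step count; real agents would likely deviate from that as context accumulates across steps.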
TPU ROADMAP: THE 8T/8I SPLIT VALIDATES INFRASTRUCTURE SPECIALIZATION
TPU 8t is Google’s training-focused system. Google states that a single TPU 8t superpod scales to 9,600 chips, provides 121 exaflops of compute, and includes 2 PB of shared high-bandwidth memory. Google also says TPU 8t delivers nearly 3x higher compute performance than prior generations, doubles ICI bandwidth, and can turn months of training into weeks with 1M+ TPU chips in a single logical cluster orchestrated by JAX and Pathways. The architecture is built for frontier model development, embedding-heavy workloads, large-scale pretraining, and near-linear scaling. (Google Cloud)
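The superpod figures can be broken down per chip and scaled up to the 1M-chip logical cluster. This is a rough sketch using only the numbers quoted above; it assumes compute adds linearly across superpods (the text says "near-linear") and that the 121 exaflops figure refers to one numeric precision, which the text does not specify.

```python
# Superpod figures quoted in the text.
SUPERPOD_CHIPS = 9_600
SUPERPOD_EXAFLOPS = 121
LOGICAL_CLUSTER_CHIPS = 1_000_000  # lower bound of "1M+"

# Per-chip compute implied by the superpod total (1 exaflop = 1,000 petaflops).
petaflops_per_chip = SUPERPOD_EXAFLOPS * 1_000 / SUPERPOD_CHIPS

# Superpod-equivalents inside a 1M-chip logical cluster.
superpods_in_cluster = LOGICAL_CLUSTER_CHIPS / SUPERPOD_CHIPS

# Aggregate compute, ASSUMING perfectly linear scaling across superpods.
cluster_exaflops = superpods_in_cluster * SUPERPOD_EXAFLOPS

print(f"Implied compute per chip: {petaflops_per_chip:.1f} PFLOPS")
print(f"Superpod-equivalents in a 1M-chip cluster: {superpods_in_cluster:.1f}")
print(f"Idealized cluster compute: {cluster_exaflops:,.0f} exaflops")
```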
TPU 8i is Google’s inference and reasoning system. Google states that TPU 8i includes 288 GB of HBM and 384 MB of on-chip SRAM, with the SRAM explicitly sized for KV cache footprints in reasoning models at production scale. It doubles ICI bandwidth to 19.2 Tb/s, reduces network diameter by more than 50% through Boardfly, uses a Collectives Acceleration Engine to reduce on-chip latency by up to 5x, and delivers 80% better inference performance per dollar versus the prior generation. This design targets high-concurrency reasoning, MoE models, chain-of-thought processing, reinforcement learning, and multi-agent workflows. (Google Cloud)
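Performance-per-dollar gains do not translate one-for-one into cost savings, so it helps to convert the quoted figures. The sketch below assumes the 80% figure means 1.8x the work per dollar and that "up to 5x" latency reduction means latency falls to one fifth at best; the helper names are illustrative.

```python
def relative_cost(perf_per_dollar_gain: float) -> float:
    """Cost per unit of work as a fraction of the prior generation.

    A gain of 0.80 means 1.8x the work per dollar, so the same work
    costs 1 / 1.8 of what it did before.
    """
    if perf_per_dollar_gain <= -1.0:
        raise ValueError("gain must be greater than -100%")
    return 1.0 / (1.0 + perf_per_dollar_gain)


def relative_latency(reduction_factor: float) -> float:
    """Latency as a fraction of baseline for an 'N x' reduction."""
    if reduction_factor <= 0:
        raise ValueError("reduction factor must be positive")
    return 1.0 / reduction_factor


cost = relative_cost(0.80)        # 80% better inference performance per dollar
latency = relative_latency(5.0)   # "up to 5x" lower on-chip latency

print(f"Cost per unit of inference: {cost:.1%} of prior generation")
print(f"Implied cost reduction: {1 - cost:.1%}")
print(f"Best-case on-chip latency: {latency:.0%} of baseline")
```

Read this way, an 80% performance-per-dollar gain is roughly a 44% cut in cost per unit of inference, not an 80% cut.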
The read-through is that Google no longer treats inference as a byproduct of training chips, but as a separate economic domain large enough to justify custom silicon. That is a significant semiconductor signal. Historically, accelerators were evaluated mainly on training FLOPs and memory capacity. Agentic inference shifts the metric set toward tokens per watt, time-to-first-token, tail latency, KV cache efficiency, SRAM capacity, HBM bandwidth, interconnect hop count, utilization, and system-level goodput. Google’s 8i architecture is effectively a statement that the AI inference market is becoming sufficiently large, latency-sensitive, and memory-bound to support distinct product families.
The 8t/8i split also implies that future AI capex will be less homogeneous. Training clusters will remain extremely dense, liquid-cooled, network-intensive, and site-concentrated. Inference clusters will be more distributed, more latency-sensitive, more utilization-sensitive, and more dependent on CPU and storage orchestration. Kurian’s comment that 8i can run in non-water-cooled mode is strategically important because it suggests Google wants inference deployability in a wider range of existing data centers. The ability to place inference capacity closer to users, exchanges, enterprises, and sovereign jurisdictions can reduce latency and expand addressable deployments beyond mega-campus training sites.
GPU IMPLICATIONS: NVIDIA DEMAND REMAINS STRONG, BUT CUSTOM ASIC SHARE GAINS ARE REAL
The interview should not be read as anti-Nvidia. Google has a dual-track strategy: maintain Nvidia access for customers and workloads that prefer CUDA, while expanding TPU adoption where Google can offer better cost, latency, or energy efficiency. Google Cloud’s Next ’26 announcements explicitly state that Nvidia GPUs are a core part of the AI accelerator portfolio and that Google will be among the 1st to offer Nvidia Vera Rubin NVL72, in addition to Blackwell and Hopper-based instances. Virgo Network will also support A5X powered by Nvidia Vera Rubin NVL72, with Google saying it can support up to 80,000 GPUs in a single data center and up to 960,000 GPUs across multiple sites. (Google Cloud)
The Nvidia implication is nuanced. Near-term GPU demand remains supported by frontier training, CUDA inertia, customer portability, open-source ecosystem maturity, Nvidia’s full rack-scale roadmap, and hyperscaler desire to offer GPU choice. However, the long-term risk is mix shift, not demand collapse. Captive hyperscaler workloads and partner labs can increasingly move toward custom ASICs when model architecture stabilizes, software layers mature, and economics justify porting. Google’s native PyTorch support for TPUs, optimized vLLM support across GPUs and TPUs, and bare metal access are strategically important because they directly address the biggest adoption friction: CUDA lock-in and developer workflow inertia. (Google Cloud)
The investment read-through is that Nvidia remains a core beneficiary of the AI capex cycle but may face relative share and margin pressure in workloads where hyperscalers can substitute internal ASICs. The more standardized inference becomes, the more attractive ASIC optimization becomes. The more frontier training shifts to novel architectures, dynamic kernels, and research-heavy experimentation, the more valuable Nvidia’s generality and software ecosystem remain. Google’s roadmap therefore creates a barbell: GPUs dominate flexible, broad, developer-led workloads; TPUs gain in high-scale, repeatable, cost-sensitive, Google-integrated workloads.
CPU IMPLICATIONS: AGENTS RE-ACCELERATE GENERAL-PURPOSE COMPUTE DEMAND
The interview contains a critical but underappreciated CPU point. Kurian said that agent computer use informed Google’s CPU strategy because an agent operating a computer is still using traditional compute. Agents need CPUs for tool calls, browser use, sandboxed VMs, API orchestration, data preprocessing, reward calculation, code execution, visualization, identity checks, logging, security scanning, and workflow state management. This means AI does not eliminate CPU demand; it changes its role. CPUs become the control plane and execution substrate around accelerator-driven reasoning.
Google’s official infrastructure announcement reinforces this. Google said GPUs and TPUs must be complemented by high-performance CPU services for complex logic, tool calls, and feedback loops around the core AI model. Axion-powered N4A CPU instances are positioned for agent runtimes, while 4th-generation Google Compute Engine VM families powered by Intel and AMD are optimized for RL reward calculation, agent orchestration, and nested virtualization. Google also said Axion N4A provides up to 30% better price-performance for agent workloads than other hyperscalers, while a separate Next ’26 keynote transcript says Axion N4A delivers 100% better price-performance than comparable x86 instances for sustained agent operation. (Google Cloud)
The key inference is that agents are likely to create a CPU shortage or at least a CPU demand renaissance in cloud. The accelerator is the expensive part of the system, but the agent runtime can bottleneck on VM spin-up, sandbox density, tool latency, CPU memory, local disk, and network egress. Kurian explicitly identified consumer VM economics as the next major bottleneck: consumers cannot afford VMs running indefinitely, so infrastructure needs rapid activation/deactivation, local storage, and oversubscription models. This points to a new category of infrastructure competition around serverless agents, secure sandboxes, fast cold starts, low-cost local disk, and CPU utilization management.
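The VM economics point can be illustrated with a simple oversubscription model. All inputs below are hypothetical (host cost, slots per host, and duty cycle are placeholders, not Google pricing), and the 1 / duty-cycle ceiling is an idealization that ignores correlated bursts and cold-start overhead.

```python
def max_oversubscription(duty_cycle: float) -> float:
    """Idealized ceiling on how many sandboxes can share one slot.

    If each sandbox is active only a fraction `duty_cycle` of the time and
    activity is uncorrelated, about 1 / duty_cycle sandboxes fit per slot.
    """
    if not 0 < duty_cycle <= 1:
        raise ValueError("duty_cycle must be in (0, 1]")
    return 1.0 / duty_cycle


def cost_per_sandbox(host_cost_per_month: float, slots_per_host: int,
                     oversubscription: float = 1.0) -> float:
    """Monthly host cost spread across all sandboxes placed on the host."""
    if oversubscription < 1:
        raise ValueError("oversubscription must be at least 1")
    return host_cost_per_month / (slots_per_host * oversubscription)


# HYPOTHETICAL inputs, for illustration only.
HOST_COST_PER_MONTH = 1_000.0  # USD
SLOTS_PER_HOST = 32            # always-on sandboxes a host can carry
DUTY_CYCLE = 0.05              # consumer agent active 5% of the time

dedicated = cost_per_sandbox(HOST_COST_PER_MONTH, SLOTS_PER_HOST)
ratio = max_oversubscription(DUTY_CYCLE)
shared = cost_per_sandbox(HOST_COST_PER_MONTH, SLOTS_PER_HOST, ratio)

print(f"Always-on sandbox: ${dedicated:.2f}/month")
print(f"Idealized oversubscription ratio: {ratio:.0f}x")
print(f"Oversubscribed sandbox: ${shared:.2f}/month")
```

The achievable ratio in practice depends on how fast a sandbox can be suspended and resumed with its local state, which is why cold-start time and local disk show up as the bottleneck.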
For Intel and AMD, this is a constructive but mixed signal. The positive is that agentic AI increases general-purpose compute consumption alongside accelerators. The negative is that Google is aggressively moving to Axion Arm CPUs for internal optimization and margin capture. Intel and AMD remain relevant for broad enterprise workloads, x86 compatibility, RL, orchestration, databases, and network-heavy instances, but Arm share gains inside hyperscalers are likely to continue wherever software portability and cost targets permit.
NETWORKING IMPLICATIONS: INTERCONNECT IS NOW A PRIMARY BOTTLENECK
Networking is moving from supporting infrastructure to strategic differentiation. TPU 8t requires massive training scale, while TPU 8i requires low-latency all-to-all communication for MoE and reasoning. Google’s Virgo Network is designed as a collapsed AI fabric with 4x the bandwidth of previous generations, and Google says Virgo can connect 134,000 TPU 8t chips into a single fabric in 1 data center and more than 1M TPUs across multiple data-center sites into a single training cluster. Virgo will also support Nvidia-based A5X, with up to 80,000 GPUs in 1 data center and up to 960,000 GPUs across sites. (Google Cloud)
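The Virgo scale figures can be related to each other with simple division. The sketch below uses the numbers quoted above plus the 9,600-chip TPU 8t superpod size cited earlier in this note; the site counts are implied minimums, not disclosed deployment plans.

```python
import math

# Figures quoted in the text.
TPU8T_PER_SUPERPOD = 9_600
TPU8T_PER_DATACENTER_FABRIC = 134_000
TPU_MULTISITE_CLUSTER = 1_000_000   # lower bound of "more than 1M"
GPU_PER_DATACENTER = 80_000
GPU_MULTISITE = 960_000

# Superpod-equivalents that one single-data-center fabric can join.
superpods_per_fabric = TPU8T_PER_DATACENTER_FABRIC / TPU8T_PER_SUPERPOD

# Minimum number of maxed-out data-center fabrics to reach 1M TPUs.
min_tpu_sites = math.ceil(TPU_MULTISITE_CLUSTER / TPU8T_PER_DATACENTER_FABRIC)

# Data-center equivalents implied by the multi-site GPU ceiling.
gpu_site_equivalents = GPU_MULTISITE / GPU_PER_DATACENTER

print(f"Superpod-equivalents per data-center fabric: {superpods_per_fabric:.2f}")
print(f"Minimum sites for a 1M-TPU cluster: {min_tpu_sites}")
print(f"Data-center equivalents at the GPU ceiling: {gpu_site_equivalents:.0f}")
```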
The key technical point is that AI networking requirements are diverging by workload. Training needs high bisection bandwidth, deterministic latency, resilient checkpointing, and cross-cluster scaling. MoE inference and reasoning need low-latency all-to-all communication, reduced hop count, fast collectives, and predictable tail latency. Agentic workloads add another network layer because agents call tools, APIs, databases, storage, other agents, and enterprise SaaS systems. A single human request can fan out into many networked tasks, making network topology, congestion control, routing, and gateway design major levers of cost and user experience.
This is structurally positive for optical components, high-radix switching, co-packaged optics, 800G/1.6T interconnect, NICs, DPUs, Ethernet fabrics, optical circuit switching, RDMA, retimers, and high-speed cabling. It is also strategically important for Google because network co-design can reduce dependence on merchant networking stacks. Google’s ability to optimize network layers for TPUs and its willingness to make Virgo available for Nvidia systems suggest that networking can become a cloud differentiation layer, not only a data-center cost item.
MEMORY, DRAM, AND HBM: THE MEMORY WALL IS THE CORE SEMICONDUCTOR CONSTRAINT
The 8t and 8i disclosures make clear that AI is increasingly memory constrained. TPU 8t’s 9,600-chip superpod with 2 PB of shared HBM is consistent with roughly 216 GB of HBM per chip (9,600 × 216 GB ≈ 2.07 PB; dividing a flat 2 PB by 9,600 gives about 208 GB). TPU 8i’s 288 GB of HBM and 384 MB of on-chip SRAM show that inference and reasoning are being optimized around memory bandwidth, KV cache residency, and reduced data movement. The phrase “memory wall” is not marketing; it is the bottleneck that determines whether expensive accelerator FLOPs sit idle. (https://t.co/AUBGFz9nBz)
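The per-chip HBM figure can be cross-checked against the superpod total in both directions. A minimal sketch, assuming decimal units (1 PB = 1,000,000 GB) and treating the 2 PB superpod figure as rounded:

```python
GB_PER_PB = 1_000_000  # decimal units assumed

# TPU 8t figures from the text.
superpod_chips = 9_600
superpod_hbm_pb = 2.0
hbm_gb_per_chip_8t = 216

# Direction 1: divide the (rounded) superpod total by the chip count.
implied_gb_per_chip = superpod_hbm_pb * GB_PER_PB / superpod_chips

# Direction 2: multiply the per-chip figure back up to a superpod.
implied_superpod_pb = hbm_gb_per_chip_8t * superpod_chips / GB_PER_PB

# TPU 8i: ratio of HBM capacity to on-chip SRAM (288 GB vs 384 MB).
hbm_to_sram_ratio_8i = 288 * 1_000 / 384

print(f"2 PB / 9,600 chips = {implied_gb_per_chip:.1f} GB per chip")
print(f"9,600 chips x 216 GB = {implied_superpod_pb:.4f} PB per superpod")
print(f"TPU 8i HBM-to-SRAM capacity ratio: {hbm_to_sram_ratio_8i:.0f}:1")
```

The two directions differ by under 4%, which is within what rounding the superpod total to 2 PB would produce.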
HBM demand is therefore a direct beneficiary of Google’s TPU roadmap. TrendForce estimates HBM demand grew more than 130% YoY based on 2025 AI chip shipments and expects HBM consumption to rise by more than 70% YoY in 2026, driven by Nvidia, AMD, Google TPU, and AWS Trainium moving toward HBM3e. Reuters reported that SK Hynix said client requests for HBM over the next 3 years already far exceed production capacity, while DRAM contract prices jumped nearly 83% sequentially in Q1 2026 and some NAND prices rose around 160%. (TrendForce)
The Anthropic-Google TPU deal illustrates the magnitude of memory demand. Anthropic announced access to up to 1M Google TPUs, worth tens of billions of dollars and expected to bring well over 1 GW of capacity online in 2026. If future TPU deployments carried memory content comparable to TPU 8t or TPU 8i, a 1M-chip fleet would imply 216 PB-288 PB of HBM-class memory content on an illustrative basis, before host DRAM, SSD, networking buffers, redundancy, and spares. The exact mix and generation are not disclosed, so this is a scale illustration rather than a contract specification. (Anthropic)
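The fleet-level range follows directly from chips multiplied by HBM per chip. The sketch below reproduces the 216 PB-288 PB illustration and adds smaller fleet sizes as a sensitivity; the 250k and 500k cases are my own illustrative points, since the actual deployment mix and generation are not disclosed.

```python
def fleet_hbm_pb(chips: int, hbm_gb_per_chip: float) -> float:
    """Total HBM in petabytes (decimal: 1 PB = 1,000,000 GB)."""
    return chips * hbm_gb_per_chip / 1_000_000


# Per-chip HBM figures used in the text for TPU 8t and TPU 8i.
HBM_GB_8T = 216
HBM_GB_8I = 288

# 1M is the headline "up to" figure; the smaller fleets are hypothetical.
for chips in (250_000, 500_000, 1_000_000):
    low = fleet_hbm_pb(chips, HBM_GB_8T)
    high = fleet_hbm_pb(chips, HBM_GB_8I)
    print(f"{chips:>9,} chips: {low:.0f} PB to {high:.0f} PB of HBM")
```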
The DRAM impact is broader than HBM. Agentic inference requires host DRAM for CPU runtimes, VM sandboxes, tool execution, databases, local caches, retrieval systems, security logs, and orchestration. Google’s TPUDirect RDMA and TPUDirect Storage reduce host CPU and DRAM bottlenecks by moving data directly between TPU HBM, NICs, and storage, but they do not eliminate system memory demand. They shift the high-value bottleneck toward HBM bandwidth and direct data paths while still expanding overall data-center DRAM needs. The likely beneficiaries include HBM suppliers, DRAM suppliers, advanced packaging, TSV, CoWoS-like capacity, HBM test, memory controllers, and packaging equipment. The risk is that memory inflation becomes a server BOM headwind for cloud providers and AI labs.
STORAGE: SSD, HDD, OBJECT STORAGE, LUSTRE, AND LOCAL DISK ALL MATTER
Storage is becoming a first-order AI bottleneck. Training workloads need data ingest, checkpointing, model restore, multimodal corpus access, and failure recovery. Inference workloads need model weight loading, retrieval, KV cache tiers, prompt context, vector search, logs, and state. Agents add local disk because VMs and sandboxes need fast read/write storage for browser use, code execution, files, intermediate artifacts, and tool outputs. The “next bottleneck” Kurian named, around consumer VMs and local disk, is therefore highly material for storage vendors.
Google’s storage announcements are directly aligned with this bottleneck. Managed Lustre now delivers 10 TB/s of bandwidth, a 10x improvement versus the prior year and up to 20x faster than comparable offerings from other hyperscalers, with capacity increased to 80 PB. Rapid Buckets on Google Cloud Storage offer sub-millisecond latency and 20M operations per second for checkpoints and recovery, with a target of maintaining 95%+ accelerator utilization. Z4M instances scale to 168 TiB of local SSD capacity and can be deployed in RDMA clusters of thousands of machines. TPUDirect Storage allows direct data movement between accelerators and high-speed managed storage, bypassing the host and reducing CPU bottlenecks. (Google Cloud)
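The link between storage bandwidth and the 95% utilization target can be sketched with back-of-envelope arithmetic. This is an idealized illustration under loud assumptions: transfers run at the full quoted 10 TB/s, the checkpoint is synchronous (accelerators idle while it is written, which production systems work to avoid), and the 2 PB checkpoint size is a hypothetical chosen to match one TPU 8t superpod's HBM rather than a disclosed figure.

```python
def transfer_seconds(size_pb: float, bandwidth_tb_per_s: float) -> float:
    """Idealized time to move `size_pb` petabytes (1 PB = 1,000 TB)."""
    return size_pb * 1_000 / bandwidth_tb_per_s


def min_interval_for_utilization(stall_seconds: float,
                                 target_utilization: float) -> float:
    """Shortest compute interval between stalls that still meets the target.

    utilization = interval / (interval + stall) >= target
    =>  interval >= stall * target / (1 - target)
    """
    if not 0 < target_utilization < 1:
        raise ValueError("target_utilization must be in (0, 1)")
    return stall_seconds * target_utilization / (1 - target_utilization)


LUSTRE_TB_PER_S = 10.0   # quoted Managed Lustre bandwidth
LUSTRE_CAPACITY_PB = 80  # quoted capacity
CHECKPOINT_PB = 2.0      # HYPOTHETICAL: one superpod's worth of HBM
TARGET_UTILIZATION = 0.95

full_scan = transfer_seconds(LUSTRE_CAPACITY_PB, LUSTRE_TB_PER_S)
stall = transfer_seconds(CHECKPOINT_PB, LUSTRE_TB_PER_S)
interval = min_interval_for_utilization(stall, TARGET_UTILIZATION)

print(f"Full 80 PB scan at 10 TB/s: {full_scan / 3600:.1f} hours")
print(f"Synchronous 2 PB checkpoint: {stall:.0f} s")
print(f"Minimum interval between checkpoints for 95%: {interval / 60:.0f} min")
```

The same formula shows why a 10x bandwidth improvement matters: cutting the stall by 10x cuts the minimum checkpoint interval by 10x at a fixed utilization target.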