Cisco has done significant work in the past year to upgrade its Nexus data center switching portfolio for the AI era. Cisco N9000 Series Switches now include the operational resiliency, security, and management features needed to sustain the high demands of today’s AI networking.
Recently I spoke with the Cisco team to learn about the company’s work with customers across many different market segments, including the enterprise, telco, neocloud, and sovereign cloud markets.
It’s clear that Cisco has put its foot on the gas to respond to rapidly emerging needs for AI networking, from back-end training networks to front-end inference. AI is changing entire network architectures. Customers are thinking about what networks are needed to support AI, whether that’s in the core, at the edge, or in between. They also need to consider what impact AI applications will have on corporate networks, datacenters, operations, and governance strategies.
A Shifting Conversation
You might ask what is driving this evolution. Quite simply, the AI infrastructure market is shifting as enterprises realize that data and applications are complex and widely distributed. That shift emphasizes the role of inference for AI and the need for end-to-end network connectivity and observability.
Surbhi Paul, Director, Data Center Networking at Cisco, told me that Cisco has quickly moved to match changes in the market over the past year.
“The conversation has really shifted,” said Surbhi in an interview. “Six months ago, people were asking for more bandwidth. Today it’s not just speed but it’s determinism. The network is part of the computer. GPUs can stall with jitter. You can burn millions of dollars of capital expense if GPUs sit idle for milliseconds.”
A Diverse N9000 Series Portfolio
Let’s dive in on some more details.
The N9000 Series, part of the Cisco AI Networking solution, has a flexible architecture that supports different forms of silicon and operating systems. Silicon options include Cisco’s own Silicon One as well as NVIDIA Spectrum-X technologies, and operating system options include Cisco ACI, NX-OS, or SONiC. The hallmark of the N9000 Series is flexibility and performance.
Cisco has also made significant commitments to AI-optimized networking, with guiding principles of open standards, simplified operations, and embedded security.
First and foremost is a focus on operational resiliency. Massive AI datacenters and clusters put unprecedented demands on the network, both on the back end, where clusters process training, and on the front-end and storage networks, where AI applications are accessed and processed. These new demands mean that AI datacenters require ultra-low latency, bandwidth optimization, and operational resilience.
In an ideal deployment, everything needs to be connected across any network, whether that’s front end, back end, or storage. It’s critical to have a centralized management platform. Cisco believes that integrating observability features, real-time applications, and job monitoring into its Nexus Dashboard management plane is part of what ensures operational resiliency, for both front-end and back-end networks.
“To maximize that ROI, you don’t treat the front-end and back-end networks as islands,” said Surbhi. “You need stability. You can’t have your management plane flake out. The secret sauce of ROI is having a unified management platform. You need to squeeze every performance out of the GPU. The unified operational model is how you keep the GPU idle time to zero.”
The N9000 Series includes crucial resiliency features such as Priority-based Flow Control (PFC) and Explicit Congestion Notification (ECN), which help AI training and inference jobs run to completion rather than being dropped. These PFC and ECN capabilities are part of Cisco Intelligent Packet Flow.
Cisco Intelligent Packet Flow is a solution designed to optimize traffic management in large-scale AI and high-performance computing environments. It addresses the challenges of AI workloads by providing advanced load balancing, congestion awareness, and fault recovery features. Key capabilities include Dynamic Load Balancing (DLB), Weighted Cost Multi-Path (WCMP), Per-Packet Load Balancing, Policy-Based Load Balancing, Hardware-Accelerated Telemetry, and Fault-Aware Recovery.
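As a rough illustration of one capability in that list, Weighted Cost Multi-Path spreads flows across available paths in proportion to assigned weights. The Python sketch below is a simplified model, not Cisco's implementation: the function name, the flow-key format, and the path names and weights are all hypothetical, and a real switch makes this decision in hardware.

```python
import hashlib


def wcmp_pick(flow_key: str, weighted_paths: dict[str, int]) -> str:
    """Pick a path for a flow in proportion to integer path weights.

    Hypothetical helper for illustration only.
    """
    total = sum(weighted_paths.values())
    if total <= 0:
        raise ValueError("at least one path needs a positive weight")

    # A stable hash keeps every packet of the same flow on the same path.
    digest = hashlib.sha256(flow_key.encode()).digest()
    slot = int.from_bytes(digest[:8], "big") % total

    # Walk the weight ranges; a path with weight 3 owns three slots.
    for path, weight in weighted_paths.items():
        if slot < weight:
            return path
        slot -= weight
    raise AssertionError("unreachable: slot is always below the total weight")


# Assumed example: two healthy uplinks at a 3:1 ratio and one drained uplink.
paths = {"spine-1": 3, "spine-2": 1, "spine-3": 0}
print(wcmp_pick("10.0.0.1:49152->10.0.1.9:4791", paths))
```

Setting a path's weight to zero removes it from selection, which is one simple way to picture how a fabric could steer traffic away from a degraded link.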
Surbhi points out that with Cisco NX-OS, the N9000 Series can use real-time telemetry from the ASIC to monitor at the nanosecond scale. This helps ensure that ECN signals congestion before the buffers fill up.
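To make the idea of signaling before buffers fill more concrete, here is a minimal Python sketch of a threshold-based ECN marking curve, in the style commonly used with WRED queues. It is an assumption-laden illustration rather than a description of how NX-OS or the N9000 ASIC works: the function name, parameter names, and threshold values are hypothetical.

```python
def ecn_mark_probability(
    queue_depth: int,
    min_threshold: int,
    max_threshold: int,
    max_probability: float = 1.0,
) -> float:
    """Return the chance that a packet is ECN-marked at a given queue depth.

    Hypothetical helper: below min_threshold nothing is marked, at or above
    max_threshold everything is marked, and the probability ramps linearly
    in between.
    """
    if not 0 <= min_threshold < max_threshold:
        raise ValueError("thresholds must satisfy 0 <= min < max")
    if queue_depth <= min_threshold:
        return 0.0
    if queue_depth >= max_threshold:
        return 1.0
    fill = (queue_depth - min_threshold) / (max_threshold - min_threshold)
    return max_probability * fill


# Assumed thresholds, in kilobytes of queue occupancy.
for depth in (100, 1575, 3000):
    print(depth, ecn_mark_probability(depth, 150, 3000, max_probability=0.1))
```

Because marking begins at the lower threshold, senders are told to slow down while there is still headroom in the queue, which is the behavior the paragraph above describes.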
In addition to operational resiliency, there are also security needs. You need security embedded in the distributed fabric. Nexus includes advanced security such as eBPF and Hypershield, which means the network fabric can be secured with distributed security down to the Linux kernel level. Integrated observability can monitor apps, infrastructure, and logs in real time.
Open Standards and Flexibility
Another key element of the N9000 Series is flexibility. These switches are based on widely adopted standard Ethernet technology for both front-end and back-end use cases. They are built into both the Cisco Cloud Reference Architecture (CRA) and the forthcoming products based on NVIDIA’s Cloud Partner Reference Architecture (NCP), meaning that customers can select either platform to suit their applications and needs. Through Cisco’s new partnership with NVIDIA, customers can choose the Cisco N9300 with NVIDIA BlueField NICs and Cisco Silicon One, or the latest Cisco N9100 with NVIDIA BlueField and NVIDIA’s Spectrum-X Ethernet switching silicon.
Cisco has also been at the forefront of guiding new standardized features, cooperating with standards organizations such as the IETF and the Ultra Ethernet Consortium (UEC). It has also updated API-based control for the N9000, so that it can be managed using Nexus fabric via a cloud-managed service, as well as through infrastructure-as-code models that interact with open APIs.
Key Reference Use Cases
Cisco has been backing this up with big customer wins. It has a full roster of customers using the data center portfolio for front-end, back-end, and storage applications.
In one example, a Fortune 500 retailer with 1,700 locations needed to run a hybrid AI model, with a heavy centralized training load and inference delivered at the edge across its stores. The company adopted the N9000 architecture and uses the Nexus Dashboard to manage all AI networking functions from the central AI factory out to the edge.
Surbhi points out that this is a good example of training and edge networks working in sync to deliver the best performance. Here, the N9000 Series uses real-time telemetry from the ASIC to monitor at the nanosecond scale, so that ECN signals congestion before packet buffers fill up.
“We are seeing customers that are spinning up inference clusters in days,” said Surbhi. “They need something that turns on immediately and delivers low latency.”
Closing Remarks
With substantial investment over the past year, Cisco has proven that the N9000 Series is a flexible and operationally sophisticated answer for datacenter and AI cluster networking applications. With the horsepower of 800G and a clear plan for 1.6T, along with Cisco’s new integrated and unified Nexus Dashboard, the N9000 Series can support broad AI or cloud datacenter operations, including back-end, front-end, and storage networks for AI.
