    Benchmarking scale-out AI fabrics with Cisco N9000 + AMD Pensando™ Pollara 400 NICs

    By Big Tee Tech Hub · May 11, 2026 · 8 min read


    The “AI paradox” is a growing hurdle for enterprise leaders: investing millions in powerful GPUs, only to watch them sit idle while waiting for data. As enterprises scale from pilot to production, the real bottleneck isn’t compute—it’s the hidden cost of an inefficient network. In scale-out architectures, tens of thousands of GPUs must synchronize to complete a single training iteration. When the network can’t keep pace with the bursty demands of modern AI training, GPUs stall and job completion time (JCT) spikes. We’ve partnered with AMD to deliver a validated, end-to-end AI infrastructure that eliminates these bottlenecks and transforms the network into a high-performance engine for innovation.

    Fabric as the foundation: The Cisco and AMD AI performance blueprint

    As AI workloads expand across distributed clusters, the network must scale linearly to prevent packet loss and retransmissions. This performance is only verifiable through rigorous, real-world benchmarking. At Cisco, we prioritize systemic, deterministic performance that goes beyond individual component specs.

    Our reference architecture features AMD Instinct™ MI300X GPUs, AMD Pensando™ Pollara 400 NICs, Cisco Silicon One G200-powered N9364E-SG2 switches, and Cisco 800G OSFP optics. Deploying is only half the challenge; operating at scale is the other. Cisco Nexus Dashboard provides the granular, real-time visibility needed for day-0 through day-N operations.

    Figure 1: Cisco N9000 Series Switches, with AMD Instinct™ GPU accelerators and AMD Pensando™ AI NICs

    By combining these technologies, we minimize JCT and maximize GPU utilization, ensuring AI infrastructure remains secure, compliant, and continuously optimized.

    Benchmarking the architecture

    We benchmarked two Clos topologies (2×2 & 4×2) with Cisco N9364E-SG2 switches (each with 51.2 Tbps throughput and 64 ports of 800 GbE), 128 AMD Instinct™ MI300X Series GPUs (16 servers x 8 GPUs), 128 AMD Pensando™ Pollara 400 AI NICs (16 servers x 8 NICs), and the AMD ROCm™ 6.3/7.0.3 software ecosystem.

    2×2 Clos topology

    This design fully subscribes each leaf switch, forcing the switch into high-congestion states to test fabric resilience:

    • 2x leaf and 2x spine (4x Cisco N9364E-SG2) switches
    • 8 servers (8x AMD Instinct™ MI300X Series GPUs) connected to each leaf switch
    • 8x AMD Pensando™ Pollara 400G NICs per server
    • Switch side: Cisco OSFP 800G DR8 optics
    Figure 2: 2×2 Clos topology

    4×2 Clos topology

    This design focuses on the efficacy of advanced load-balancing techniques for efficient load distribution during synchronous bursts in the GPU scale-out fabric:

    • 4x leaf and 2x spine (6x Cisco N9364E-SG2) switches
    • 4 servers (8x AMD Instinct™ MI300X Series GPUs) connected to each leaf switch
    • 8x AMD Pensando™ Pollara 400G NICs per server
    • Switch side: Cisco OSFP 800G DR8 optics
    Figure 3: 4×2 Clos topology

    Benchmarking tools

    We measured scale-out fabric performance using a comprehensive toolset, including:

    • IBPerf measures RDMA performance over scale-out fabric in varying congestive scenarios. We used this tool to test performance between GPUs connected across a single leaf and across leaf-spine.
    • MLPerf is an industry-standard benchmark used to measure actual workload performance. The performance output translates to ROI on fully validated designs from Cisco and AMD.

    Network fabric performance benchmarking results

    We evaluated scale-out fabric performance using comprehensive testing and standard KPIs.

    Single-hop IBPerf testing evaluates performance within a localized fabric domain, typically within a single leaf switch. This establishes a baseline for link utilization, buffer tuning effectiveness, and NIC-to-switch performance prior to introducing multi-hop variables.

    These tests measure the throughput of Remote Direct Memory Access (RDMA) sessions between two GPUs connected through a Cisco N9364E-SG2 leaf switch. The results capture P01 (1st percentile) and P99 (99th percentile) bandwidth while all sessions are simultaneously active. P01 bandwidth represents the throughput of the slowest session—a critical metric for synchronized AI/ML workload performance—while P99 represents the throughput of the fastest session. A minimal delta between P01 and P99, with both approaching the link bandwidth, demonstrates the efficacy of the GPU interconnect technology.
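The P01/P99 comparison described above can be sketched in a few lines of Python. This is an illustrative computation over hypothetical per-session throughput samples, not the actual IBPerf tooling, and the sample values are invented for the example:

```python
# Illustrative P01/P99 bandwidth computation over per-session RDMA
# throughput samples (Gbps). The sample values below are hypothetical;
# real numbers come from IBPerf runs with all sessions active at once.

def percentile(samples, p):
    """Nearest-rank percentile of a list of throughput samples."""
    s = sorted(samples)
    k = round(p / 100 * (len(s) - 1))
    return s[k]

# 64 hypothetical sessions, all close to the 400 Gbps link rate
session_bw = [392.0 + (i % 8) for i in range(64)]  # 392..399 Gbps

p01 = percentile(session_bw, 1)   # throughput of the slowest sessions
p99 = percentile(session_bw, 99)  # throughput of the fastest sessions
delta = p99 - p01

print(f"P01={p01} Gbps, P99={p99} Gbps, delta={delta} Gbps")
```

A tight delta with both values near the 400 Gbps link rate is the signature of an efficient fabric that the benchmarks look for.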

    In the 2-leaf/2-spine (2×2) topology, each leaf switch handles 32 bi-directional sessions, effectively saturating the leaf switch. The 4-leaf/2-spine (4×2) topology handles 16 bi-directional sessions per leaf. Across both topologies and both queue pair (QP) counts (4 QPs and 32 QPs), the P01 and P99 bandwidths track each other closely, with each approaching the link bandwidth of 400 Gbps.

    Figure 4: Single-hop RDMA bandwidth performance across varying leaf-spine topologies and queue pair counts

    This performance shows that the AMD Pensando™ Pollara NIC and Cisco N9364E-SG2 switches deliver a highly efficient solution for demanding workloads. The tight delta between P01 and P99 metrics across different scales and configurations demonstrates that this architecture maintains deterministic performance, regardless of cluster size or queue pair density.

    Bisectional IBPerf testing evaluates cross-fabric traffic traversing multiple tiers to measure bisection bandwidth, path symmetry, cross-spine load balancing, and congestion propagation.

    These tests measure RDMA session throughput between two GPUs connected through leaf and spine Cisco N9364E-SG2 switches. The results show P01 and P99 bandwidth measurements with all sessions simultaneously active. In the 2×2 topology, there are 32 bi-directional sessions per leaf, whereas the 4×2 topology has 16 bi-directional sessions per leaf. All of these sessions traverse the spine; traffic from each session crosses three hops (leaf-spine-leaf) to stress the entire fabric. This test validates the efficiency of the fabric’s load-balancing algorithm: any traffic polarization would leave some links underutilized while others become congested, ultimately degrading RDMA session performance. Tests were conducted using 4 and 32 QPs.
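The polarization risk mentioned above can be illustrated with a toy simulation. The flows and the hash are hypothetical (a generic ECMP-style hash, not Cisco's actual load-balancing algorithm); the point is that flow-level hashing over a small number of large flows need not split traffic evenly across spine uplinks:

```python
# Toy ECMP-style flow placement: hash each flow's identifying tuple and
# take it modulo the number of spine uplinks. Flow tuples are invented.
import hashlib
from collections import Counter

def ecmp_uplink(flow, n_uplinks):
    """Pick a spine uplink by hashing the flow tuple (ECMP-style)."""
    digest = hashlib.sha256(repr(flow).encode()).hexdigest()
    return int(digest, 16) % n_uplinks

# 16 hypothetical RDMA flows per leaf: (src GPU, dst GPU, UDP port)
flows = [(src, dst, 4791) for src in range(4) for dst in range(4)]

counts = Counter(ecmp_uplink(f, 2) for f in flows)
print(dict(counts))  # with few large flows, the split is rarely exactly even
```

When a handful of elephant flows hash onto the same uplink, that link congests while its peers sit idle; a tight P01/P99 delta in the bisection results indicates the fabric avoids this polarization.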

    Figure 5: Bisection RDMA bandwidth stability comparison for 2-leaf/2-spine and 4-leaf/2-spine architectures across varying queue pair counts

    The results demonstrate that the P01 and P99 bandwidths are similar, with each close to the link bandwidth of 400 Gbps, mirroring the performance observed in single-hop testing. This confirms that the Cisco N9364E-SG2 switches and AMD Pensando™ Pollara NIC provide a high-performance, resilient GPU interconnect capable of maintaining consistently deterministic performance under stress.

    Congestive IBPerf testing creates high-contention scenarios using a 31:1 communication pattern, where 31 GPUs communicate with a single GPU. It evaluates queue buildup, Explicit Congestion Notification (ECN) effectiveness, Data Center Quantized Congestion Notification (DCQCN) reaction curves, tail latency, and fabric stability under worst-case AI communication patterns.

    Incast conditions represent some of the most challenging scenarios for a scale-out AI fabric. These tests measure P01 and P99 bandwidths under incast conditions, which manifest during collective communications such as all-to-all. If the scale-out fabric hardware, design, and tuning are not optimal, the result is substantial degradation in JCT for training workloads. Because it is difficult to synchronize all sessions to start simultaneously, we use the Quantile Range Method, which analyzes only the bandwidth samples taken during incast congestion rather than all samples.
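The post names the Quantile Range Method without defining it. One plausible reading — restrict analysis to the window in which every session is active, so ramp-up and ramp-down samples don't dilute the congestion measurement — can be sketched as follows (this interpretation, and all the numbers, are assumptions for illustration):

```python
# Hedged sketch: since sessions cannot all start at exactly the same
# instant, keep only samples taken while every session is active.
# "Quantile Range Method" is named but not defined in the post; this
# overlap-window interpretation is an assumption.

def overlap_window(sessions):
    """sessions: list of (start_time, end_time) pairs. Return the
    interval during which all sessions are simultaneously active."""
    start = max(s for s, _ in sessions)
    end = min(e for _, e in sessions)
    return start, end

def congested_samples(samples, window):
    """samples: list of (timestamp, bandwidth). Keep those in window."""
    lo, hi = window
    return [bw for t, bw in samples if lo <= t <= hi]

sessions = [(0.0, 10.0), (0.4, 10.3), (0.9, 9.8)]   # hypothetical times
window = overlap_window(sessions)                    # (0.9, 9.8)
samples = [(0.1, 399.0), (1.0, 362.5), (5.0, 360.0), (10.1, 398.0)]
print(congested_samples(samples, window))            # [362.5, 360.0]
```

The P01/P99 quantiles are then computed over the filtered samples only, so they reflect behavior under genuine incast pressure.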

    Figure 6: RDMA incast 31:1 congestion performance. Comparison of P01 and P99 bandwidth during high-contention 31:1 incast traffic

    In this test, each of the 128 GPUs establishes 31 RDMA sessions to 31 other GPUs across the leaf-spine fabric, resulting in a total of 3,968 (31*128 = 3,968) simultaneously active sessions in the scale-out fabric. The delta between P01 and P99 bandwidth is very tight, and each bandwidth is close to the link bandwidth of 400 Gbps, which is a solid proof point of the Cisco N9364E-SG2 switches’ ability to handle extreme congestive conditions and a testament to the Cisco and AMD validated design.

    MLPerf Training and Inference Benchmarking tests establish standardized metrics to evaluate the performance of training and inference workloads. By enforcing strict guidelines regarding models, datasets, and allowable optimizations, these benchmarks provide a level playing field for fair comparison among competing AI infrastructure solutions.

    The MLPerf tests from MLCommons are designed to provide a common benchmarking methodology for measuring application-level KPIs, which are the primary indicators of performance for end users. For inference, the Llama 2 70B results demonstrate clear throughput scaling as the configuration expands from two to four nodes. The training benchmarks provide representative data for Llama 2 70B (on two nodes) and Llama 3.1 8B (on eight nodes).

    Figure 7: MLPerf training and inference key performance metrics for Llama 2 and Llama 3.1 models, detailing throughput and JCT across multi-node configurations

    These findings provide the foundation for our core claim: the Cisco validated architecture is not just theoretically sound; benchmarking shows it can handle the most demanding AI inference and training workloads.

    A real-world deployment of the Cisco and AMD AI solution architecture

    The Cisco-AMD partnership delivers real-world impact, notably powering G42’s large-scale AI clusters. This end-to-end solution—integrating AMD GPUs, Cisco UCS servers, N9000 800G switches, and Nexus Dashboard—provides the secure, scalable performance required for cutting-edge AI workloads.

    “As AI workloads scale, network performance becomes a critical enabler of cluster efficiency. The AMD Pensando™ Pollara 400 AI NIC, with its fully programmable, fault-resilient design, delivers consistent performance for GPU scale-out training. In collaboration with Cisco N9000 switching, we’re advancing Ethernet to the next level, helping maximize GPU utilization and accelerate job completion.”

    —Yousuf Khan, Corporate Vice President, Networking Technology and Solutions Group, AMD

    Operationalizing intelligence: A new standard for performance at scale

    In the age of massive-scale AI, an organization’s infrastructure is either its greatest competitive advantage or its most significant bottleneck. When the stakes involve mission-critical training, fine-tuning, and inferencing, a unified, fully validated ecosystem is a must. Cisco and AMD are changing the equation, delivering a deterministic, high-performance fabric that turns your network into a catalyst for innovation.

    Connect with a Cisco AI networking specialist today to design a deployment tailored to your specific workloads.

