For over 50 years now, egged on by the seeming inevitability of Moore’s Law, engineers have managed to double the number of transistors they can pack into the same area every two years. But while the industry was chasing logic density, an unwanted side effect became more prominent: heat.
In a system-on-chip (SoC) like today’s
CPUs and GPUs, temperature affects performance, power consumption, and energy efficiency. Over time, excessive heat can slow the propagation of critical signals in a processor and lead to a permanent degradation of a chip’s performance. It also causes transistors to leak more current and as a result waste power. In turn, the increased power consumption cripples the energy efficiency of the chip, as more and more energy is required to perform the exact same tasks.
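To see how quickly that waste compounds, consider a back-of-the-envelope model. The Python sketch below uses the common rule of thumb that subthreshold leakage roughly doubles for every 10 °C rise; the reference values are illustrative assumptions, not measurements from any particular process.

```python
def leakage_power(temp_c, p_leak_ref=1.0, temp_ref_c=25.0, doubling_c=10.0):
    """Illustrative subthreshold-leakage model: leakage power grows
    roughly exponentially with temperature, here doubling every
    `doubling_c` degrees. All constants are assumptions, not
    measurements from any particular process."""
    return p_leak_ref * 2 ** ((temp_c - temp_ref_c) / doubling_c)

for t in (25, 50, 75, 100):
    print(f"{t:>3} degC -> {leakage_power(t):6.2f}x reference leakage")
```

Under these assumptions, a chip running at 100 °C leaks more than 100 times as much power as one at 25 °C while doing the exact same work.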
The root of the problem lies with the end of another law:
Dennard scaling. This law states that as the linear dimensions of transistors shrink, voltage should decrease such that the total power consumption for a given area remains constant. Dennard scaling effectively ended in the mid-2000s at the point where any further reductions in voltage were not feasible without compromising the overall functionality of transistors. Consequently, while the density of logic circuits continued to grow, power density did as well, generating heat as a by-product.
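The arithmetic behind that shift is simple enough to sketch. In the toy model below, each quantity scales as classic constant-field scaling prescribes; the only difference between the two scenarios is whether voltage keeps shrinking.

```python
def power_density_ratio(k, voltage_scales=True):
    """Constant-field (Dennard) scaling with linear factor k > 1:
    capacitance ~ 1/k, frequency ~ k, area ~ 1/k**2, and voltage ~ 1/k
    only while Dennard scaling holds. Dynamic power per circuit is
    P = C * V**2 * f; dividing by area gives power density."""
    c, f, area = 1 / k, k, 1 / k ** 2
    v = 1 / k if voltage_scales else 1.0
    return (c * v ** 2 * f) / area

k = 1.4  # roughly one node step (about 2x logic density)
print(f"with voltage scaling:    {power_density_ratio(k, True):.2f}x power density")
print(f"without voltage scaling: {power_density_ratio(k, False):.2f}x power density")
```

With voltage scaling, power density stays flat from node to node; without it, power density grows by about k² per node, and that excess shows up as heat.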
As chips become increasingly compact and powerful, efficient heat dissipation will be crucial to maintaining their performance and longevity. To ensure this efficiency, we need a tool that can predict how new semiconductor technology—processes to make transistors, interconnects, and logic cells—changes the way heat is generated and removed. My research colleagues and I at
Imec have developed just that. Our simulation framework uses industry-standard and open-source electronic design automation (EDA) tools, augmented with our in-house tool set, to rapidly explore the interaction between semiconductor technology and the systems built with it.
The results so far are inescapable: The thermal challenge is growing with each new technology node, and we’ll need new solutions, including new ways of designing chips and systems, if there’s any hope that they’ll be able to handle the heat.
The Limits of Cooling
Traditionally, an SoC is cooled by blowing air over a heat sink attached to its package. Some data centers have begun using liquid instead because it can absorb more heat than gas. Liquid coolants—typically water or a water-based mixture—may work well enough for the latest generation of high-performance chips such as Nvidia’s new AI GPUs, which reportedly consume an astounding 1,000 watts. But neither fans nor liquid coolers will be a match for the smaller-node technologies coming down the pipeline.
Heat follows a complex path as it’s removed from a chip, but 95 percent of it exits through the heat sink. Imec
Take, for instance,
nanosheet transistors and complementary field-effect transistors (CFETs). Leading chip manufacturers are already shifting to nanosheet devices, which swap the fin in today’s fin field-effect transistors for a stack of horizontal sheets of semiconductor. CFETs take that architecture to the extreme, vertically stacking more sheets and dividing them into two devices, thus placing two transistors in about the same footprint as one. Experts expect the semiconductor industry to introduce CFETs in the 2030s.
In our work, we looked at an upcoming version of the nanosheet called A10 (referring to a node of 10 angstroms, or 1 nanometer) and a version of the CFET called A5, which Imec projects will appear two generations after the A10. Simulations of our test designs showed that the power density in the A5 node is 12 to 15 percent higher than in the A10 node. This increased density will, in turn, lead to a projected temperature rise of 9 °C for the same operating voltage.
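To first order, the link between power and temperature can be captured with a single thermal resistance: the junction temperature is the ambient temperature plus R_th times power. The sketch below picks illustrative values for R_th and chip power, not Imec’s actual test designs, that happen to reproduce a rise of about 9 °C from a roughly 13 percent power increase.

```python
def junction_temp(power_w, r_th_c_per_w, ambient_c=25.0):
    """First-order steady state: T_junction = T_ambient + R_th * P,
    where R_th lumps die, package, and heat-sink resistance."""
    return ambient_c + r_th_c_per_w * power_w

r_th = 0.35           # degC/W -- assumed, not a measured package value
p_a10 = 200.0         # W -- assumed power of an A10-class test design
p_a5 = p_a10 * 1.13   # ~13 percent higher power at the same voltage

t_a10 = junction_temp(p_a10, r_th)
t_a5 = junction_temp(p_a5, r_th)
print(f"A10: {t_a10:.0f} degC, A5: {t_a5:.0f} degC, rise: {t_a5 - t_a10:.1f} degC")
```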
Complementary field-effect transistors will stack nanosheet transistors atop each other, increasing density and temperature. To operate at the same temperature as nanosheet transistors (A10 node), CFETs (A5 node) will have to run at a reduced voltage. Imec
Nine degrees might not seem like much. But in a data center, where hundreds of thousands to millions of chips are packed together, it can mean the difference between stable operation and thermal runaway—that dreaded feedback loop in which rising temperature increases leakage power, which increases temperature, which increases leakage power, and so on until, eventually, safety mechanisms must shut down the hardware to avoid permanent damage.
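The runaway loop itself is easy to capture in a few lines. The toy simulation below iterates temperature, then leakage power, then total power, then a new temperature, until it either settles or blows past a shutdown threshold. All the constants are illustrative; the only change between the two runs is worse (higher) thermal resistance.

```python
def thermal_loop(p_dynamic, r_th, ambient_c=25.0, leak_ref=2.0,
                 doubling_c=10.0, t_shutdown=125.0, steps=50):
    """Iterate the feedback loop: temperature -> leakage power ->
    total power -> new temperature. Settles to a fixed point when the
    loop gain is below 1, otherwise runs away. Constants illustrative."""
    t = ambient_c
    for _ in range(steps):
        p_leak = leak_ref * 2 ** ((t - ambient_c) / doubling_c)
        t_new = ambient_c + r_th * (p_dynamic + p_leak)
        if t_new > t_shutdown:
            return t_new, "thermal runaway -- shutdown"
        if abs(t_new - t) < 0.01:
            break
        t = t_new
    return t_new, "stable"

print(thermal_loop(p_dynamic=100.0, r_th=0.25))  # settles near 54 degC
print(thermal_loop(p_dynamic=100.0, r_th=0.60))  # worse cooling: runaway
```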
Researchers are pursuing advanced alternatives to basic liquid and air cooling that may help mitigate this kind of extreme heat. Microfluidic cooling, for instance, uses tiny channels etched into a chip to circulate a liquid coolant inside the device. Other approaches include jet impingement, which involves spraying a gas or liquid at high velocity onto the chip’s surface, and immersion cooling, in which the entire printed circuit board is dunked in a coolant bath.
But even if these newer techniques come into play, relying solely on coolers to dispense with extra heat will likely be impractical. That’s especially true for mobile systems, which are limited by size, weight, battery power, and the need to not cook their users. Data centers, meanwhile, face a different constraint: Because cooling is a building-wide infrastructure expense, it would cost too much and be too disruptive to update the cooling setup every time a new chip arrives.
Performance Versus Heat
Luckily, cooling technology isn’t the only way to stop chips from frying. A variety of system-level solutions can keep heat in check by dynamically adapting to changing thermal conditions.
One approach places thermal sensors around a chip. When the sensors detect a worrying rise in temperature, they signal a reduction in operating voltage and frequency—and thus power consumption—to counteract heating. But while such a scheme solves thermal issues, it might noticeably affect the chip’s performance. For example, the chip might always work poorly in hot environments, as anyone who’s ever left their smartphone in the sun can attest.
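In practice, such schemes usually add hysteresis so the chip doesn’t oscillate rapidly between speed states. Here is a minimal sketch of that control logic; the trip points and the two operating points are assumptions for illustration.

```python
def throttle(temp_c, throttled, t_trip=95.0, t_clear=85.0):
    """Hysteretic thermal throttle: slow down above t_trip, return to
    full speed only after cooling below t_clear. Thresholds assumed."""
    if temp_c >= t_trip:
        return True
    if temp_c <= t_clear:
        return False
    return throttled  # inside the hysteresis band: keep current state

state = False
for reading in (80, 92, 97, 96, 88, 84):  # sensor readings in degC
    state = throttle(reading, state)
    mode = "0.7 V / 2.0 GHz" if state else "0.9 V / 3.5 GHz"  # assumed points
    print(f"{reading} degC -> {mode}")
```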
Another approach, called thermal sprinting, is especially useful for multicore data-center CPUs. It works by running a core until it nears its thermal limit and then shifting operations to a second core while the first one cools down. This process maximizes the performance of a single thread, but it can cause delays when work must migrate between many cores for longer tasks. Thermal sprinting also reduces a chip’s overall throughput, as some portion of the chip will always be disabled while it cools.
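A toy scheduler makes the trade-off visible. In the sketch below, a single thread sprints on one core until it approaches a temperature limit, then migrates to the coolest core; the heating and cooling rates are invented for illustration.

```python
def sprint(work_units, n_cores=4, heat_per_unit=4.0, cool_per_unit=3.0,
           ambient_c=40.0, t_max=95.0):
    """Toy thermal-sprinting scheduler: run the thread on one core
    until it nears t_max, then migrate to the coolest core while the
    hot one idles and cools. Rates and limits are invented."""
    temps = [ambient_c] * n_cores
    active, migrations = 0, 0
    for _ in range(work_units):
        if temps[active] + heat_per_unit > t_max:   # about to overheat
            active = min(range(n_cores), key=temps.__getitem__)
            migrations += 1                          # each migration costs time
        for core in range(n_cores):
            if core == active:
                temps[core] += heat_per_unit         # sprinting core heats up
            else:
                temps[core] = max(ambient_c, temps[core] - cool_per_unit)
    return migrations, [round(t) for t in temps]

print(sprint(work_units=60))  # (migrations, final core temperatures)
```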
System-level solutions thus require a careful balancing act between heat and performance. To apply them effectively, SoC designers must have a comprehensive understanding of how power is distributed on a chip and where hot spots occur, where sensors should be placed and when they should trigger a voltage or frequency reduction, and how long it takes parts of the chip to cool off. Even the best chip designers, though, will soon need even more creative ways of managing heat.
Making Use of a Chip’s Backside
A promising pursuit involves adding new functions to the underside, or backside, of a wafer. This strategy mainly aims to improve power delivery and computational performance. But it might also help resolve some heat problems.
New technologies can reduce the voltage that needs to be delivered to a multicore processor so that the chip maintains a minimum voltage while operating at an acceptable frequency. A backside power-delivery network does this by reducing resistance. Backside capacitors lower transient voltage losses. Backside integrated voltage regulators allow different cores to operate at different minimum voltages as needed. Imec
Imec foresees several backside technologies that may allow chips to operate at lower voltages, decreasing the amount of heat they generate. The first technology on the road map is the so-called backside power-delivery network (BSPDN), which does precisely what it sounds like: It moves power lines from the front of a chip to the back. All the advanced CMOS foundries plan to offer BSPDNs by the end of 2026. Early demonstrations show that they lessen resistance by bringing the power supply much closer to the transistors. Less resistance results in less voltage loss, which means the chip can run at a reduced input voltage. And when voltage is reduced, power density drops—and so, in turn, does temperature.
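The chain of reasoning, from resistance to voltage to power, reduces to Ohm’s law. In the sketch below, the supply must cover the transistors’ minimum voltage plus the IR drop across the delivery network; the resistances and current are assumptions, not measured foundry data.

```python
def required_supply(v_min, current_a, r_pdn_ohm):
    """The regulator must deliver the transistors' minimum voltage plus
    the IR drop across the power-delivery network (PDN)."""
    return v_min + current_a * r_pdn_ohm

v_min, current = 0.65, 100.0      # V, A -- illustrative core rail
r_front, r_back = 0.0010, 0.0004  # ohm -- assumed frontside vs. backside PDN

v_front = required_supply(v_min, current, r_front)
v_back = required_supply(v_min, current, r_back)
print(f"frontside PDN: {v_front:.2f} V, backside PDN: {v_back:.2f} V")
# Dynamic power scales with V**2, so even a modest supply cut helps:
print(f"dynamic-power ratio: {(v_back / v_front) ** 2:.2f}")
```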
By changing the materials within the path of heat removal, backside power-delivery technology could make hot spots on chips even hotter. Imec
After BSPDNs, manufacturers will likely begin adding capacitors with high energy-storage capacity to the backside as well. Large voltage swings caused by inductance in the printed circuit board and chip package can be particularly problematic in high-performance SoCs. Backside capacitors should help with this issue because their closer proximity to the transistors allows them to absorb voltage spikes and fluctuations more quickly. This arrangement would therefore enable chips to run at an even lower voltage—and temperature—than with BSPDNs alone.
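A first-order way to see the benefit: during a sudden load step, the decoupling capacitors must supply the charge until the regulator catches up, and the resulting droop is dV = I·dt/C. Moving capacitors to the backside shortens that response time. The numbers below are purely illustrative.

```python
def droop_v(current_step_a, response_time_s, capacitance_f):
    """Charge drawn from the decoupling capacitance before the supply
    responds: dV = I * dt / C (a first-order estimate)."""
    return current_step_a * response_time_s / capacitance_f

i_step = 20.0   # A -- assumed sudden load step
c_decap = 2e-6  # F -- assumed decoupling capacitance
for dt in (50e-9, 5e-9):  # distant caps respond slower than backside caps
    print(f"{droop_v(i_step, dt, c_decap) * 1000:.0f} mV droop at {dt * 1e9:.0f} ns")
```

A smaller droop means a smaller voltage guard band, which is what lets the chip run at a lower supply voltage overall.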
Finally, chipmakers will introduce backside integrated voltage-regulator (IVR) circuits. This technology aims to curtail a chip’s voltage requirements further still through finer voltage tuning. An SoC for a smartphone, for example, commonly has 8 or more compute cores, but there’s no space on the chip for each to have its own discrete voltage regulator. Instead, one off-chip regulator typically manages the voltage of four cores together, regardless of whether all four are facing the same computational load. IVRs, on the other hand, would manage each core individually through a dedicated circuit, thereby improving energy efficiency. Placing them on the backside would save valuable space on the frontside.
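The saving comes from not forcing lightly loaded cores up to the busiest core’s voltage. The comparison below is deliberately first-order (it holds each core’s current fixed and compares rail power), and the per-core demands are invented for illustration.

```python
def cluster_power(demands_v, currents_a, shared_rail):
    """Shared regulator: every core runs at the highest demanded
    voltage. Per-core IVRs: each core gets exactly what it needs.
    First-order comparison with fixed per-core currents."""
    if shared_rail:
        v = max(demands_v)
        return sum(v * i for i in currents_a)
    return sum(v * i for v, i in zip(demands_v, currents_a))

demands = [0.55, 0.60, 0.90, 0.65]  # V -- one busy core, three lighter ones
currents = [5.0, 6.0, 12.0, 7.0]    # A per core -- assumed

print(f"shared rail:  {cluster_power(demands, currents, True):.1f} W")
print(f"per-core IVR: {cluster_power(demands, currents, False):.1f} W")
```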
It is still unclear how backside technologies will affect heat management; demonstrations and simulations are needed to chart the effects. Adding new technology will often increase power density, and chip designers will need to consider the thermal consequences. In placing backside IVRs, for instance, will thermal issues improve if the IVRs are evenly distributed or if they are concentrated in specific areas, such as the center of each core and memory cache?
Recently, we showed that backside power delivery may introduce new thermal problems even as it solves old ones. The cause is the vanishingly thin layer of silicon that’s left when BSPDNs are created. In a frontside design, the silicon substrate can be as thick as 750 micrometers. Because silicon conducts heat well, this relatively bulky layer helps control hot spots by spreading heat from the transistors laterally. Adding backside technologies, however, requires thinning the substrate to about 1 micrometer to provide access to the transistors from the back. Sandwiched between two layers of wires and insulators, this slim silicon slice can no longer move heat effectively toward the sides. As a result, heat from hyperactive transistors can get trapped locally and forced upward toward the cooler, exacerbating hot spots.
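The geometry explains why. Lateral conduction through a slab scales directly with its thickness (G = k·A/L, where A is thickness times width), so thinning the substrate from 750 micrometers to about 1 micrometer cuts the sideways heat path by roughly that same factor. A quick sketch, using bulk silicon’s conductivity as an assumed value:

```python
def lateral_conductance(k_w_per_mk, thickness_m, width_m, length_m):
    """Lateral conduction through a slab: G = k * (thickness * width) / length.
    The substrate's sideways heat-spreading ability is directly
    proportional to its thickness."""
    return k_w_per_mk * thickness_m * width_m / length_m

k_si = 150.0                # W/(m*K) -- bulk silicon; thin films conduct less
width, length = 1e-3, 1e-3  # m -- a 1-mm-wide path, 1 mm from the hot spot

g_front = lateral_conductance(k_si, 750e-6, width, length)  # full substrate
g_back = lateral_conductance(k_si, 1e-6, width, length)     # thinned backside
print(f"750 um substrate: {g_front * 1e3:.1f} mW/K")
print(f"  1 um substrate: {g_back * 1e3:.3f} mW/K ({g_front / g_back:.0f}x less)")
```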
Our simulation of an 80-core server SoC found that BSPDNs can raise hot-spot temperatures by as much as 14 °C. Design and technology tweaks—such as increasing the density of the metal on the backside—can improve the situation, but we will need more mitigation strategies to avoid it completely.
Preparing for “CMOS 2.0”
BSPDNs are part of a new paradigm of silicon logic technology that Imec is calling CMOS 2.0. This emerging era will also see advanced transistor architectures and specialized logic layers. The main purpose of these technologies is optimizing chip performance and power efficiency, but they might also offer thermal advantages, including improved heat dissipation.
In today’s CMOS chips, a single transistor drives signals to both nearby and faraway components, leading to inefficiencies. But what if there were two drive layers? One layer would handle long wires and buffer these connections with specialized transistors; the other would deal only with connections under 10 micrometers. Because the transistors in this second layer would be optimized for short connections, they could operate at a lower voltage, which again would reduce power density. How much, though, is still uncertain.
In the future, parts of chips will be made on their own silicon wafers using the appropriate process technology for each. They will then be 3D stacked to form SoCs that function better than those built using only one process technology. But engineers will have to carefully consider how heat flows through these new 3D structures.
What is clear is that solving the industry’s heat problem will be an interdisciplinary effort. It’s unlikely that any one technology alone—whether that’s thermal-interface materials, transistors, system-control schemes, packaging, or coolers—will fix future chips’ thermal issues. We will need all of them. And with good simulation tools and analysis, we can begin to understand how much of each approach to apply and on what timeline. Although the thermal benefits of CMOS 2.0 technologies—specifically, backside functionalization and specialized logic—look promising, we will need to confirm these early projections and study the implications carefully. With backside technologies, for instance, we will need to know precisely how they alter heat generation and dissipation—and whether that creates more new problems than it solves.
Chip designers might be tempted to adopt new semiconductor technologies assuming that unforeseen heat issues can be handled later in software. That may be true, but only to an extent. Relying too heavily on software solutions would have a detrimental impact on a chip’s performance because these solutions are inherently imprecise. Fixing a single hot spot, for example, might require reducing the performance of a larger area that is otherwise not overheating. It will therefore be imperative that SoCs and the semiconductor technologies used to build them are designed hand in hand.
The good news is that more EDA products are adding features for advanced thermal analysis, including during early stages of chip design. Experts are also calling for a new method of chip development called
system technology co-optimization. STCO aims to dissolve the rigid abstraction boundaries between systems, physical design, and process technology by considering them holistically. Deep specialists will need to reach outside their comfort zone to work with experts in other chip-engineering domains. We may not yet know precisely how to resolve the industry’s mounting thermal challenge, but we are optimistic that, with the right tools and collaborations, it can be done.