
Reintroduction of liquid cooling for high performance computing

A closed loop carries liquid coolant from a heat exchanger in the bottom of each cabinet to heat sinks over the processors. The heat exchanger regulates the fluid temperature, keeping it low enough to cool the chips yet warm enough to avoid condensation inside the system. A separate chilled-water loop connects the heat exchangers in the cabinets with a large reservoir that is cooled by the computer facility's chilled-water system.

With the installation of the IBM Power 575 computer bluefire in FY2008, CISL re-embraced the technology of liquid-cooled supercomputer processors. The last liquid-cooled Cray system had been decommissioned and removed from the Mesa Lab computing facility a decade earlier. During FY2008, CISL staff and members of UCAR Physical Plant Services planned and installed the water cooling infrastructure for bluefire, which is the second phase of the ICESS procurement.

Liquid cooling was reintroduced because increases in the density and speed of supercomputing processors produce more heat than air cooling can remove from the cabinets. The IBM POWER systems (Power Optimization With Enhanced RISC) began with air cooling and now use liquid (in our case, water) cooling both at the surface of the processors and for heat capture at the back of each cabinet. The POWER3, POWER4, and POWER5 systems blackforest, bluesky, and bluevista were completely air cooled. In 1999, IBM's POWER3 processors in blackforest operated at 0.375 GHz; in 2002, POWER4 processors at NCAR ran at 1.3 GHz; and in 2005, bluevista's POWER5 processors ran at 1.9 GHz. In 2008, the POWER6 processors in the water-cooled system bluefire are running at 4.7 GHz. At full load, the bluefire system draws 65 kW of electricity.

Water cooling is significantly more efficient than air cooling, and it is identified as a state-of-the-art practice by EPA guidelines for data centers. In addition to air cooling, the bluefire system uses two liquid-cooling technologies. One system directs hot air expelled from the back of the computer across a large coolant coil built into the rear door of each computer cabinet; this captures nearly 60% of the waste heat before it enters the air in the computer room.

The second system uses a liquid-to-liquid heat exchanger in the bottom of each cabinet to regulate the temperature at the surface of each of the 384 processors using a 3,000-gallon chilled-water reservoir nearby. One closed loop carries chilled water between the reservoir and the heat exchangers, and another closed loop distributes coolant from the heat exchangers to the processors.

In this installation photo, two 1,500-gallon chilled-water tanks (background) are located near bluefire (to the left of the open floor panels). To mitigate condensation, a worker insulates the incoming chilled-water pipes from the reservoir and the outgoing warm-water pipes from bluefire. This chilled-water system terminates at heat exchangers in the bottom of each cabinet. A separate liquid cooling loop regulates the temperature inside each of bluefire's 11 cabinets.

While liquid cooling is much more efficient than air cooling, water inside the supercomputer room, and inside the supercomputer itself, presents challenges. Top-quality installation practices must be followed because the risk of leaks must be rigorously controlled. NCAR's in-house mechanical and electrical facilities staff distinguished themselves with their expert installation of these complex new systems.

One additional risk is introduced with this technology specifically because it is so efficient. If one of the two water chillers fails, the coolant temperature can rise so rapidly that the computers may overheat in the three minutes required for the redundant chiller to come on line. The thermal mass of the water in the 3,000-gallon reservoir slows the rise of the coolant temperature, giving the backup chiller and CISL staff time to react before the computers are compromised. The concept of thermal storage is analogous to Uninterruptible Power Supplies (UPS) for electrical systems. For mechanical systems, a reservoir of water, ice, or other liquid is stored in the system and works just like the batteries in a UPS system.
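The buffering effect of the reservoir can be illustrated with a rough energy balance. The sketch below is only an estimate under stated assumptions: it treats the entire 65 kW full-load draw as heat delivered to a perfectly mixed 3,000-gallon volume of water for the three-minute chiller changeover, and it uses textbook values for the density and specific heat of water. The function name and constants are illustrative and do not come from the facility's design documents.

```python
# Rough estimate of how much a water reservoir warms while a backup chiller starts.
# Assumptions (not from the facility design): the full electrical load becomes heat
# in the water, the reservoir is perfectly mixed, and no chiller is running.

LITERS_PER_US_GALLON = 3.785411784        # exact definition of the US gallon
WATER_DENSITY_KG_PER_L = 1.0              # approximate textbook value
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0   # approximate textbook value


def reservoir_temp_rise_k(heat_load_kw, outage_s, reservoir_gallons):
    """Temperature rise (kelvin) of a well-mixed reservoir absorbing a constant load."""
    mass_kg = reservoir_gallons * LITERS_PER_US_GALLON * WATER_DENSITY_KG_PER_L
    energy_j = heat_load_kw * 1000.0 * outage_s   # kW -> W, then W * s = J
    return energy_j / (mass_kg * WATER_SPECIFIC_HEAT_J_PER_KG_K)


if __name__ == "__main__":
    # Figures taken from the text: 65 kW load, 3 minutes, 3,000 gallons.
    rise = reservoir_temp_rise_k(heat_load_kw=65.0, outage_s=180.0,
                                 reservoir_gallons=3000.0)
    print(f"Estimated rise over 3 minutes: {rise:.2f} K")
```

Under these assumptions the reservoir warms by roughly a quarter of a degree during the changeover. Because the rise scales inversely with the water volume, a loop holding only a small fraction of that volume would warm proportionally faster, which is the risk the thermal storage is meant to absorb.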

In FY2009, the liquid cooling facility will be maintained to continue serving the production computing needs of bluefire. No additional liquid-cooling requirements will be added to NCAR's computing facility during the next fiscal year.

This advancement in computing facility technology fulfills NCAR's strategic goal to "Provide robust, accessible, and innovative information services and tools," and the related strategic priority of "Enhancing capability and capacity of NCAR supercomputing." This work is supported by NSF Core funding.