New data service speeds the progress of research

by Brian Bevirt

Copying enormous quantities of data between computers, filesystems, and tape archives is time-consuming, and this repetitive process slows scientific productivity. To speed the pace of research, NCAR’s Computational and Information Systems Laboratory (CISL) has initiated a paradigm shift in its high-performance data infrastructure by creating a large, centralized pool of data storage shared by supercomputers, data servers, and analysis and visualization systems. In the past, disk space was fragmented because each computer had its own dedicated disk system. CISL’s new central data storage pool, called the GLobally Accessible Data Environment (GLADE), reduces copying of model output between computer systems and allows online access to data collections – such as NCAR’s Research Data Archive – by both local and remote users. Reducing the number of times scientists have to copy their datasets also uses disk space more efficiently and reduces the amount of data written to the tape archive. In the past, much of this intermediate data had to be written to tape before it could be used on the next system. GLADE’s innovative design allows significantly faster and more efficient user workflows: the sequences of tasks that researchers perform during scientific computation and analysis.

One breakthrough in GLADE’s design is that it can mirror the needs of user workflows rather than requiring users to adapt to the system’s design, an adaptation that is typically inefficient. Another is that GLADE’s 1.6 petabytes of centralized disk storage space gives system managers and scientists a synoptic view of the available storage, enabling them to manage and allocate resources more effectively among modeling, analysis, and data management requirements. The result is more effective overall system performance in service to science. Additional consolidated services – like wide-area high-performance data transfer engines using grid middleware such as GridFTP – enhance the bandwidth available to transfer data between other sites and NCAR. This capability facilitates collaboration and improves the ability of scientists to use national high-performance computing infrastructure.

GLADE is supporting NCAR’s participation in the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report campaign through the Earth System Grid science gateway, with 400 TB of GLADE storage dedicated to this campaign. GLADE also enhances NCAR’s Community Data Portal and Research Data Archive (RDA). In particular, GLADE is sized to support anticipated growth in certain critical data products over its lifetime. For example, the RDA datasets online are expected to grow to 250 TB in 2011, six times the size of the online RDA in 2010. For the first time, users of NCAR computing resources now have high-speed direct access to the RDA. Internet users will retain their current workflows, and all users will benefit from increased speed, dataset availability, and data services. CISL’s data servers and GLADE central data storage infrastructure will also speed data access for a variety of other data-intensive projects, and these projects will be able to request dedicated space through an allocation process currently under development.

CISL engineers are studying GLADE’s performance in the Mesa Lab production environment, and they will use the knowledge gained from this operational experience to adapt the design for the much larger (greater than 10 petabytes) successor system to be deployed at the NCAR-Wyoming Supercomputing Center (NWSC). Researchers who use GLADE now also benefit because they will be prepared for a smooth transition to NWSC’s new resources. Continued enhancement of NCAR grid services and their usability sustains CISL’s commitment to the future of supercomputing.

The new GLobally Accessible Data Environment (GLADE) system currently offers 1.6 petabytes of shared disk storage. NCAR’s supercomputer users receive numerous benefits because GLADE provides computational, analysis, and visualization work spaces common to all CISL computing resources, and because storage can be allocated according to project needs. Immediately after deployment, GLADE significantly expanded online data available for the Research Data Archive, the Earth System Grid, climate model projections for the upcoming IPCC report, and the NCAR-UCAR Community Data Portal.