Data-intensive computing architecture

Traditionally, high-performance storage has been procured for, and tightly coupled to, individual computing systems. This conceptual model leads to a processor-centric design in which each computer must be surrounded by its own collection of services: online data, archived data, data management, data analysis, visualization, external interfaces, and networking. To complete multistep workflows, this design requires data to be copied between the separate filesystems of the various components supporting the supercomputer.

Recently, increases in computing power have been outpacing data system performance while dataset sizes have been growing exponentially. Together, these trends significantly increase the cost of duplicating data and maintaining multiple copies of it to support user workflows. As a result, the processor-centric model is becoming untenable, especially over the long term. In its place, CISL has adopted a data-centric model in which a common, high-speed, central filesystem holds all data for shared access by all the computing and support systems required to complete scientific workflows. This design improves scientific productivity and reduces costs by eliminating the need to move or maintain multiple copies of data.
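The difference between the two models can be illustrated with a small sketch. It is not a description of any CISL system: the system names, the 100 TB dataset size, and both helper functions are hypothetical, and it assumes that in the processor-centric case every system keeps its own full copy of the dataset.

```python
# Illustrative sketch only: system names and dataset size are hypothetical.

def storage_footprint_tb(dataset_tb, systems, shared):
    """Total disk space (TB) needed for every system to access the dataset."""
    if shared:
        # Data-centric: one copy on the central filesystem serves all systems.
        return dataset_tb
    # Processor-centric: assume each system keeps its own full copy.
    return dataset_tb * len(set(systems))


def copy_steps(workflow, shared):
    """Number of filesystem-to-filesystem copies a workflow requires."""
    if shared:
        return 0
    # One copy each time consecutive steps run on different systems.
    return sum(1 for a, b in zip(workflow, workflow[1:]) if a != b)


workflow = ["compute", "analysis", "visualization"]
dataset_tb = 100.0

processor_centric = (
    storage_footprint_tb(dataset_tb, workflow, shared=False),
    copy_steps(workflow, shared=False),
)
data_centric = (
    storage_footprint_tb(dataset_tb, workflow, shared=True),
    copy_steps(workflow, shared=True),
)

print("processor-centric (TB, copies):", processor_centric)
print("data-centric (TB, copies):", data_centric)
```

Under these assumptions, the three-step workflow needs three copies of the dataset and two transfers in the processor-centric case, and one copy with no transfers in the data-centric case.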

Data-centric computing architecture at NWSC
This diagram shows the foundational CISL architecture for the equipment to be deployed in the new NWSC facility. It shows the integration and relative sizes of systems for computing, data analysis and visualization, online and archived data, data management, external interfaces, and networks.

In FY2010, CISL deployed the GLADE filesystem at NCAR’s Mesa Lab Computing Facility (MLCF) to confirm the value of this design and to provide common, centralized, high-speed access to user data from CISL’s CI resources. GLADE streamlines user workflows and minimizes time-consuming data movement tasks. The GLADE architecture currently operating at MLCF is a precursor to the data-handling environment designed for NWSC. As the diagram above shows, the centerpiece of NWSC’s data-centric design is a large (15 PB), high-speed (>90 GB/s) central filesystem.
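To give a sense of scale for the two figures quoted above, the following back-of-the-envelope sketch estimates how long it would take to stream the entire 15 PB filesystem once at 90 GB/s. It assumes decimal units (1 PB = 1,000,000 GB) and a sustained aggregate rate equal to the quoted bandwidth, neither of which is stated in the report; the function name is hypothetical.

```python
# Back-of-the-envelope estimate; assumes decimal units and sustained bandwidth.
PB_IN_GB = 1_000_000  # assumption: 1 PB = 10^6 GB


def full_scan_hours(capacity_pb, bandwidth_gb_per_s):
    """Hours needed to read the whole filesystem once at the given rate."""
    seconds = capacity_pb * PB_IN_GB / bandwidth_gb_per_s
    return seconds / 3600


hours = full_scan_hours(15, 90)
print(f"{hours:.1f} hours")  # prints "46.3 hours"
```

Because the quoted bandwidth is a lower bound (>90 GB/s), this figure is an upper bound on the ideal full-scan time under the stated assumptions.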

CISL’s data-intensive computing strategy extends beyond the CI hardware to include a full suite of community data services. CISL is leading the community in developing data services that can address the future challenges of data growth, preservation, and management, and in supporting NSF’s new requirement for data management plans. Our disk- and tape-based storage systems provide an efficient, safe, and reliable environment for hosting datasets, and CISL anticipates that its data services will be further streamlined, improved, and expanded through its new data-centric CI design.

2011 CISL Annual Report