TeraGrid operations at NCAR
By integrating with the TeraGrid as a resource provider (RP), NCAR has gained operational expertise and confidence in Grid computing, stimulated new strategic thinking about the design of cyberinfrastructure systems, changed how atmospheric and related science gets done, and enabled new opportunities for collaboration.
Scientists are gathering unprecedented
quantities of data from sophisticated ground-based networks and space-based
missions designed to observe the global oscillations of the Sun and other
stars. This asteroseismology data can be used to deduce the age and internal
structure of stars. Supercomputers provide an answer to the problem of
analyzing this vast amount of information, and scientists are utilizing
TeraGrid resources at NCAR's Computational and Information Systems Laboratory
to delve deeper into stellar properties. The NCAR team is working to make its
modeling pipeline available to the asteroseismology community as a TeraGrid
Science Gateway, called the Asteroseismic Modeling Pipeline, and expects to
have a working prototype of the system by the end of FY2008.
Coupled with these benefits to NCAR, the long-range broader impacts for NSF include:
- Access to new resources:
These resources promise to accelerate climate and weather research on many fronts. These opportunities are already influencing NCAR's science priorities and objectives.
- High-speed, remote access to data:
Wide area parallel filesystems promise to change how science campaigns and workflows are constructed by enabling cross-site data sharing.
- Opportunities to form new collaborations:
The TeraGrid continues to bring CISL staff and NCAR researchers into contact with not only the staffs of other resource providers but also scientists in disciplines as varied as solar physics and seismology -- collaborative relationships that most likely would never have formed without the TeraGrid connection.
- Data preservation:
Safeguarding critical datasets, such as the ECMWF reanalysis archive at NCAR, is a high priority. To that end, NCAR and SDSC have duplicated critical data on each other's archives under a Memorandum of Understanding between the two organizations. To date, over 80 TB of NCAR research datasets have been archived at SDSC over a Storage Resource Broker (SRB) connection.
Resource statistics
While overall utilization of frost
hovered around 50% in FY2008, utilization by TeraGrid users has increased
to around 20% of the resource in recent months. This trend is expected to
continue in FY2009 with the award of a handful of mid-sized TeraGrid
resource allocations to NCAR. These allocations are effective on
1 October 2008.
While the 2,048-processor IBM Blue Gene/L system frost was available an
impressive 99.5% of the time, overall utilization of the system did not
approach that level. As shown in the figure at right, total utilization of
frost hovered around 50% throughout the year. This relatively low rate of
resource utilization is, in part, an artifact of the inability of the Cobalt
job scheduler to efficiently schedule parallel jobs on a 3D torus network
topology under the constraint that jobs must be allocated in contiguous
32-processor blocks. This constraint results in a kind of knapsack problem
that the job scheduler is currently unable to solve. Because of this
limitation, the Cobalt scheduler cannot avoid creating "unschedulable
islands" of computational nodes on an architecture like Blue Gene's.
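The fragmentation effect described above can be illustrated with a small sketch. It is a deliberate simplification: the machine is modeled as a one-dimensional row of 32-processor blocks rather than a 3D torus, and the function names (`find_contiguous`, `allocate`) and the toy occupancy pattern are assumptions made for illustration, not Cobalt's actual algorithm or data structures.

```python
from typing import List, Optional

BLOCK_SIZE = 32          # processors per allocatable block (from the text)
TOTAL_PROCESSORS = 2048  # size of frost (from the text)


def find_contiguous(free: List[bool], blocks_needed: int) -> Optional[int]:
    """Return the start index of the first run of free blocks that is
    at least `blocks_needed` long, or None if no such run exists."""
    run_start, run_len = 0, 0
    for i, is_free in enumerate(free):
        if is_free:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len == blocks_needed:
                return run_start
        else:
            run_len = 0
    return None


def allocate(free: List[bool], processors: int) -> Optional[int]:
    """Mark a contiguous run of blocks as busy for a job of the given size.
    Returns the starting block index, or None if the job cannot be placed."""
    blocks_needed = -(-processors // BLOCK_SIZE)  # ceiling division
    start = find_contiguous(free, blocks_needed)
    if start is None:
        return None
    for i in range(start, start + blocks_needed):
        free[i] = False
    return start


# Toy state: 128-processor jobs occupy every other group of four blocks,
# leaving four-block "islands" of idle processors between them.
free = [True] * (TOTAL_PROCESSORS // BLOCK_SIZE)
for group_start in range(0, len(free), 8):
    for i in range(group_start, group_start + 4):
        free[i] = False

idle_processors = sum(free) * BLOCK_SIZE
small_job_start = find_contiguous(free, 128 // BLOCK_SIZE)
large_job_start = find_contiguous(free, 256 // BLOCK_SIZE)
```

In this toy state half of the machine is idle, yet a 256-processor job cannot be placed because no idle island is large enough, while a 128-processor job still fits. On a real 3D torus the contiguity constraint is geometric rather than linear, which makes the packing problem harder still.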
Overall, TeraGrid users have used 24% of the cycles available to them on frost. As shown by the blue bars on the graph, TeraGrid usage was effectively zero during the first six months of frost's operation on the TeraGrid, but then began rising throughout the summer, presumably as TeraGrid users became aware of the existence of this resource and began applying for cycles via the TeraGrid's MRAC and LRAC allocation processes. Other factors that may explain this period of low utilization include:
- The excess computing cycles on the TeraGrid during this timeframe
- The infrequent TeraGrid resource allocation cycle (MRACs make allocations on a quarterly basis, and LRACs make allocations on a semi-annual basis)
- The reluctance of NCAR users to voluntarily spend time on the TeraGrid application and account activation processes
- The availability of significantly more BG/L cycles at SDSC
- For the full TeraGrid user community, possibly an unwillingness to use or lack of awareness of this new, small resource
- Because the per-processor memory of the system is small, perhaps a perceived difficulty in using the BG/L architecture
- The relative age and single-processor speed of the BG/L system
During the period of August through September 2008, utilization by TeraGrid users suddenly jumped above 20%. This jump resulted from actively migrating important users on frost to the TeraGrid. Since NCAR makes only 25% of the frost resource available to TeraGrid users, this represents a 90% utilization of the frost allocation by TeraGrid users since July. In the most recent allocation, 773,000 CPU-hours were allocated to TeraGrid users on frost.
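The relationship between the two percentages quoted above is a simple ratio, sketched here for clarity. The function names are hypothetical, and the 22.5% input is an inferred value rather than a figure from the report: it is the whole-machine fraction that would correspond to 90% of a 25% share, which is consistent with the statement that TeraGrid utilization "jumped above 20%".

```python
def share_of_allocation_used(fraction_of_machine: float, teragrid_share: float) -> float:
    """Fraction of the TeraGrid share that was consumed, given TeraGrid usage
    expressed as a fraction of the whole machine."""
    return fraction_of_machine / teragrid_share


def machine_fraction_for(allocation_utilization: float, teragrid_share: float) -> float:
    """Inverse relation: whole-machine fraction implied by a given
    utilization of the TeraGrid share."""
    return allocation_utilization * teragrid_share


TERAGRID_SHARE = 0.25  # 25% of frost is made available to TeraGrid users (from the text)

# Usage at exactly 20% of the machine would correspond to 80% of the share.
at_twenty_percent = share_of_allocation_used(0.20, TERAGRID_SHARE)

# A 90% utilization of the share implies about 22.5% of the whole machine.
implied_machine_fraction = machine_fraction_for(0.90, TERAGRID_SHARE)
```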
Cyberinfrastructure development
The resource manager and scheduler used on frost is Cobalt, an open-source project hosted at Argonne National Laboratory (ANL). We have been heavily involved in Cobalt development for several years, being one of the first sites outside ANL to use it. Contributions have included bug fixes, feature testing, and component development. NCAR and ANL also hold a weekly conference call in support of this collaboration, where discussion includes long-term planning in support of our TeraGrid initiatives. Our Cobalt development activities for the TeraGrid in CY2007 have focused mainly on the accounting and logging facilities. For example, we added the facility to run scripts both before jobs start running and after jobs complete. This gave us the ability to collect accounting information such as storage and network utilization on a per-job basis. All of our code modifications have been pushed back upstream to the Cobalt source repository, and are therefore made available to other sites that use Cobalt.
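The idea of running scripts before and after a job to gather per-job accounting data can be sketched generically as follows. This is not Cobalt's interface: `run_with_accounting`, `make_fake_counters`, and the counter names are hypothetical, and the counter source is a stand-in for whatever storage or network statistics a site actually samples.

```python
import subprocess
import sys
import time
from typing import Callable, Dict, List

Counters = Dict[str, float]


def run_with_accounting(command: List[str],
                        read_counters: Callable[[], Counters]) -> Dict[str, object]:
    """Sample counters before the job starts and after it completes, and
    attribute the difference to the job."""
    before = read_counters()                 # "before the job starts" step
    started = time.monotonic()
    result = subprocess.run(command)         # the job itself
    wall_seconds = time.monotonic() - started
    after = read_counters()                  # "after the job completes" step
    usage = {name: after[name] - before.get(name, 0.0) for name in after}
    return {
        "command": command,
        "exit_code": result.returncode,
        "wall_seconds": wall_seconds,
        "usage": usage,
    }


def make_fake_counters() -> Callable[[], Counters]:
    """Hypothetical counter source returning two fixed samples, used only
    so that the example is self-contained."""
    samples = iter([
        {"bytes_written": 100.0, "packets_sent": 10.0},
        {"bytes_written": 350.0, "packets_sent": 25.0},
    ])
    return lambda: next(samples)


# Run a trivial job (a Python process that exits immediately).
record = run_with_accounting([sys.executable, "-c", "pass"], make_fake_counters())
```

Sampling cumulative counters at both ends and recording the difference keeps the accounting logic independent of the job itself, which is one reason a before/after hook design suits per-job measurements of this kind.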
User support
NCAR prepared an announcement that appeared in TeraGrid News on August 1. We participated in the activities of the TeraGrid Services Working Group and the ARCH Working Group, including conducting surveys of users, reviewing new documents in the TeraGrid Knowledge Base, and testing new tools such as the User Portal and the TeraGrid News system. Support staff obtained login accounts on most TeraGrid platforms and familiarized themselves with the trouble ticket system. We wrote a web-based user guide for frost. We invited NCAR users of frost to obtain DAC accounts to familiarize themselves with the TeraGrid. Support staff attended TeraGrid 08, TeraGrid quarterly planning meetings, and Supercomputing 07 Birds-of-a-Feather sessions.
Security
The security posture of NCAR's TeraGrid systems has remained strong and is getting stronger. The NCAR/UCAR security group has been upgrading our network monitoring infrastructure. We are developing filtering strategies that allow us to process all of the traffic that enters or leaves the UCAR network, and we have improved our ability to scale up to handle a much greater volume of traffic.
We have also prepared a transition plan to shift our data acquisition from port mirroring in the production network switches to passive optical splitters. This will eliminate any possibility of a performance impact on the production network due to the monitoring infrastructure, and it will provide greater reliability if our network infrastructure is ever the target of an attack.
Frost itself has been diligently patched for each new vulnerability, and we have participated in the TeraGrid Security Working Group incident response process throughout the year. Our ReSET group implemented an account management system for frost, and we have verified that it works correctly for both new account creation and removal.
Asteroseismology Modeling Portal (AMP)
In February 2009, NASA is scheduled to launch the Kepler satellite -- a mission designed to discover habitable Earth-like planets around distant Sun-like stars. Hundreds of scientists from around the globe will be involved in the analysis of asteroseismic data from this mission, and interpreting the observations using state-of-the-art stellar models will present a significant data analysis challenge. NCAR has proposed to make these modeling capabilities available via a new Asteroseismology Science Gateway that utilizes TeraGrid resources.
TeraGrid build and test system
NCAR is evaluating and testing the TeraGrid Build and Test Service, based on the Metronome software package from the University of Wisconsin, by employing it to run the exhaustive regression test suite of a community software framework, the Earth System Modeling Framework (ESMF). So far the ESMF regression test suite has been installed on systems at NCSA and SDSC and is being installed on frost at NCAR. ESMF staff at NCAR are learning how to use Metronome.
Alignment with NCAR's strategic plan
This effort supports NCAR's strategic priorities of "Developing and providing advanced services and tools" and "Engaging a broader and more diverse community." NCAR's TeraGrid RP operations are funded by NSF Core funds and UCAR Communications Pool indirect funds. The asteroseismology and TeraGrid build and test system activities are receiving $144K of funding from the TeraGrid Grid Infrastructure Group (GIG) through the University of Chicago for project year 4 of the TeraGrid (Aug 2008 - Aug 2009).
