Blue Gene/L system uptime availability

This chart shows the percentage of time that frost was up in each month of FY2008. The scheduled semi-annual machine room power-down periods fell in October 2007 and April 2008. Overall system availability for FY2008, including these periods, was 99.46%.

The transition of the Blue Gene/L system frost from an experimental platform to a full production computational resource for the TeraGrid placed additional uptime requirements on the machine's hardware, software, and operational policies. Although frost had demonstrated a proportionately low hardware failure rate and a fairly robust management software suite since delivery, there were multiple opportunities to improve the original system design, software configuration, and procedures in order to provide better quality of service to the expanded frost user community.
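To put the 99.46% overall availability figure for FY2008 in more concrete terms, the short sketch below converts an availability percentage into approximate hours of downtime. It is an illustration only: the fiscal-year boundaries (1 October 2007 through 30 September 2008) are an assumption consistent with the power-down months mentioned above, and the function and variable names are hypothetical rather than part of any CISL tooling.

```python
from datetime import datetime

# Assumption: FY2008 runs from 1 Oct 2007 up to (not including) 1 Oct 2008.
FY_START = datetime(2007, 10, 1)
FY_END = datetime(2008, 10, 1)


def period_hours(start, end):
    """Total wall-clock hours between two datetimes."""
    return (end - start).total_seconds() / 3600


def downtime_hours(availability_pct, start, end):
    """Hours of downtime implied by an availability percentage over a period."""
    return period_hours(start, end) * (1 - availability_pct / 100)


total_hours = period_hours(FY_START, FY_END)          # 366 days, since 2008 is a leap year
hours_down = downtime_hours(99.46, FY_START, FY_END)  # reported FY2008 availability

print(f"Hours in period: {total_hours:.0f}")
print(f"Implied downtime: {hours_down:.1f} hours")
```

Under these assumptions the period contains 8,784 hours, so 99.46% availability corresponds to roughly 47 hours of total downtime, scheduled power-downs included.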

During FY2008, CISL staff planned and implemented multiple system upgrades and modifications to the system configuration that focused on providing increased system uptime. Though the system and operational staff had already provided a high level of system availability, these improvements addressed known issues in the system software and offered additional insulation against a variety of failures.

Throughout FY2008, these modifications reduced unanticipated downtime caused by software component failures and provided a more robust operational environment.

The following availability-related activities occurred in FY2008:

  • Careful planning allowed the full operating system, file system, scheduler, and control system software upgrades and reconfiguration to be implemented during the scheduled semi-annual machine room power-down periods (October 2007 and April 2008). Only one additional downtime, lasting two hours, was taken; it was used to apply a series of patches to the Cobalt scheduler that fixed issues identified following a major upgrade.
  • A new 17.2 TB file system was brought online using a custom high-availability design that eliminated all single points of failure. CISL engineers wrote custom software to support this design under the IBM GPFS file system, along with custom support scripts that improve management and provide a consistent interface for ongoing operation.
  • User home directory storage space was expanded.
  • Disaster recovery and continuity plans were developed for multiple critical components and services.

FY2009 plans include developing disaster recovery procedures for several remaining single points of failure, including further development of the process to completely replace the service node in case of a catastrophic failure of that hardware.

The acquisition and operation of frost were made possible through NSF MRI Grants CNS-0421498, CNS-0420873, and CNS-0420985; through the IBM Shared University Research (SUR) program with the University of Colorado; and through NSF Core funding.