Computer Science Colloquium
Stephen L. Scott, Christian Engelmann
Oak Ridge National Laboratory, USA
Advancing Reliability, Availability and Serviceability for High-Performance Computing
Tue 18.04.2006, 16:15, 60 minutes, SR/T642
Abstract
Today's high-performance computing systems have several reliability deficiencies that result in noticeable availability and serviceability issues. For example, head and service nodes represent a single point of failure and control for an entire system: if one fails, the system is inaccessible and unmanageable until it is repaired, causing significant downtime. Furthermore, current solutions for fault tolerance focus on dealing with the result of a failure; most are unable to transparently mask runtime system configuration changes caused by failures and require a complete restart of essential system services, such as MPI, when a failure occurs. High availability computing strives to avoid the problems of unexpected failures through preemptive measures. The overall goal of our research is to expand today's effort in high availability for high-performance computing, so that systems can be kept alive by an OS runtime environment that understands the concepts of dynamic system configuration and degraded operation mode. This talk will present an overview of recent research performed at Oak Ridge National Laboratory in collaboration with Louisiana Tech University, North Carolina State University and the University of Reading in developing core technologies and proof-of-concept prototypes that improve the overall reliability, availability and serviceability of high-performance computing systems.
Bio
Stephen L. Scott
Senior Research Scientist
Computer Science and Mathematics Division
Oak Ridge National Laboratory (ORNL)
Oak Ridge, Tennessee, USA
scottsl@ornl.gov
Stephen L. Scott is a Senior Research Scientist in the Network and Cluster Computing Group of the Computer Science and Mathematics Division at the Oak Ridge National Laboratory (ORNL), Oak Ridge, USA. Dr. Scott's research interest is in experimental systems with a focus on high-performance distributed, heterogeneous, and parallel computing. He is a founding member of the Open Cluster Group (OCG) and Open Source Cluster Application Resources (OSCAR). Stephen is presently the OCG steering committee chair and has served as the OSCAR release manager and working group chair. Dr. Scott is the lead principal investigator for the Modular Linux and Adaptive Runtime support for HEC OS/R (MOLAR) research team. This research effort, which spans multiple laboratories and educational institutions, concentrates on adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale scientific high-end computing (HEC) as part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS). He is also the ORNL lead for the Scalable Systems Software project, where his group is developing technologies to scale cluster resources to tens of thousands of processors. Stephen has published numerous papers on cluster and distributed computing and holds both a Ph.D. and an M.S. in computer science. He is also a member of ACM, the IEEE Computer Society, and the IEEE Task Force on Cluster Computing.
Christian Engelmann
Research and Development Staff Member
Computer Science and Mathematics Division
Oak Ridge National Laboratory (ORNL)
Oak Ridge, Tennessee, USA
engelmannc@ornl.gov
Research Assistant/PhD Student
Department of Computer Science
The University of Reading
Reading, UK
Christian Engelmann's primary research interests target high-level reliability, availability and serviceability (RAS) solutions for scientific high-end computing, which is also his current dissertation research topic. Specifically, his focus is on active/active and active/standby high availability for critical high-performance computing system services, efficient fault tolerance mechanisms for extreme-scale parallel and distributed systems, and super-scalable scientific algorithms that can survive failures without the need for extensive reconfiguration. His secondary research area deals with flexible, pluggable, component-based runtime environments for parallel and distributed scientific computing. Christian is part of the MOLAR research team, which concentrates on adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale scientific high-end computing as part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS). His MOLAR research focuses on high availability for Linux clusters. Christian is also a contributor to the Harness Workbench research effort of ORNL, the University of Tennessee and Emory University, which is developing next-generation development tools, deployment mechanisms and runtime environments for high-performance computing in the spirit of its predecessors, the Parallel Virtual Machine (PVM) and the Harness Distributed Virtual Machine (DVM). In the past, Christian was involved with the ORNL/IBM Blue Gene/L research initiative in super-scalable algorithms for next-generation supercomputing on systems with hundreds of thousands of processors. He was also a contributor to the Harness DVM research effort in developing a pluggable, lightweight DVM environment for heterogeneous metacomputing. For more details see: http://www.csm.ornl.gov/~engelman
Invited by o. Univ.-Prof. Dr. Jens Volkert
The Computer Science Colloquium is organized by the Department of Computer Science at JKU, the Österreichische Gesellschaft für Informatik (ÖGI) and the Österreichische Computergesellschaft (OCG).