On Switching Policies for Modular Redundancy Fault-Tolerant Computing Systems

Menachem Berg, Israel Koren

Research output: Contribution to journal › Article › peer-review

Abstract

The objective of fault-tolerant computing systems is to provide error-free operation in the presence of faults. The system has to recover from the effects of a fault by employing recovery procedures such as program rollback, reload, and restart. These recovery procedures, however, interrupt the system's operation and thus reduce the availability of the system for user applications. Fault-tolerant systems for critical applications therefore include standby spares that are ready to replace active modules that fail to recover from the effects of a fault. A standby spare may also be used to replace a module that suffers frequent faults and hence too many repetitions of the recovery process, in order to increase the availability of the system for user applications. In this case a module switching policy is needed that indicates, upon a fault occurrence, whether to retry the failing module or to switch it out and replace it with a spare, considering the remaining mission time and the probability of a system crash. This paper presents a module switching policy for dynamic redundancy systems and illustrates the resulting improvement in application-oriented availability.

Original language: English
Pages (from-to): 1052-1062
Number of pages: 11
Journal: IEEE Transactions on Computers
Volume: C-36
Issue number: 9
State: Published - Sep 1987
Externally published: Yes

Keywords

  • Application-oriented availability
  • deterioration models
  • failure rate
  • fault tolerance
  • modular redundancy
  • module switching policy
  • recovery
  • standby spare

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics
