Adaptive Fault Tolerance in Real Time Cloud Computing
- Creators
- Malik, Sheheryar
- Huet, Fabrice
Description
With the increasing demand and benefits of cloud computing infrastructure, real time computing can be performed on cloud infrastructure. A real time system can take advantage of intensive computing capabilities and scalable virtualized environment of cloud computing to execute real time tasks. In most of the real time cloud applications, processing is done on remote cloud computing nodes. So there are more chances of errors, due to the undetermined latency and loose control over computing node. On the other side, most of the real time systems are also safety critical and should be highly reliable. So there is an increased requirement for fault tolerance to achieve reliability for the real time computing on cloud infrastructure. In this paper, a fault tolerance model for real time cloud computing is proposed. In the proposed model, the system tolerates the faults and makes the decision on the basis of reliability of the processing nodes, i.e. virtual machines. The reliability of the virtual machines is adaptive, which changes after every computing cycle. If a virtual machine manages to produce a correct result within the time limit, its reliability increases. And if it fails to produce the result within time or correct result, its reliability decreases. A metric model is given for the reliability assessment. In the model, decrease in reliability is more than increase. If the node continues to fail, it is removed, and a new node is added. There is also a minimum reliability level. If any processing node does not achieve that level, the systems will perform backward recovery or safety measures. The proposed technique is based on the execution of design diverse variants on multiple virtual machines, and assigning reliability to the results produced by variants. The virtual machine instances can be of same type or of different types. The system provides both the forward and backward recovery mechanism, but main focus is on forward recovery. The main essence of the proposed technique is the adaptive behavior of the reliability weights assigned to each processing node and adding and removing of nodes on the basis of reliability.
Abstract
International audience
Additional details
- URL
- https://hal.inria.fr/hal-00639904
- URN
- urn:oai:HAL:hal-00639904v1
- Origin repository
- UNICA