A Proactive Approach to Fault Tolerance Using Predictive Machine Learning Models in Distributed Systems

Authors

Haroon, M., Siddiqui, Z. A., Husain, M., Ali, A., & Ahmad, T.

DOI:

https://doi.org/10.52756/ijerr.2024.v44spl.018

Keywords:

Cloud computing, distributed systems, preventive maintenance, proactive fault tolerance, random forest machine learning models

Abstract

In cloud computing and large-scale distributed systems, uninterrupted service and operational reliability are crucial. Conventional fault tolerance techniques are usually reactive: they address problems only after they arise, which can cause performance degradation and downtime. This research presents a proactive approach to fault tolerance for distributed systems that uses predictive machine learning models to anticipate significant failures before they occur. We combine machine learning algorithms with real-time analysis of large streams of operational data to predict system anomalies and potential breakdowns. We employ supervised learning algorithms such as Random Forests and Gradient Boosting to predict faults with high accuracy. The predictive models are trained on historical data and capture the patterns and correlations that precede system faults. Early fault detection allows preventive remedial measures to be taken, reducing downtime and preserving system integrity. To validate the approach, we designed and implemented a fault prediction framework within a simulated distributed system environment that mirrors contemporary cloud architectures. Our experiments demonstrate that the predictive models can forecast a wide range of faults, from hardware failures to network disruptions, with significant lead time, providing a critical window for implementing preventive measures. We also assessed the impact of these pre-emptive actions on overall system performance, observing improved reliability and a reduction in mean time to recovery (MTTR), and we analyse the scalability and adaptability of the proposed solution in diverse and dynamic distributed environments. Because it integrates with existing monitoring and management tools, the framework enhances fault tolerance capabilities without requiring extensive restructuring of current systems.
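
The abstract refers to two quantities that are easy to make concrete: the lead time with which a fault is predicted, and the mean time to recovery (MTTR). The sketch below is illustrative only and is not the authors' framework. It assumes telemetry samples and fault events are given as timestamps in seconds, and the function names `label_with_lead_time` and `mean_time_to_recovery` are hypothetical. The first function builds the binary training labels a supervised classifier (for example a random forest) could learn from; the second averages recovery durations.

```python
from bisect import bisect_right


def label_with_lead_time(sample_times, fault_times, horizon):
    """Label each telemetry sample 1 if a fault follows within `horizon` seconds, else 0.

    Assumption: all times are seconds on a shared clock. A sample taken at the
    exact moment of a fault is not counted as preceding that fault.
    """
    faults = sorted(fault_times)
    labels = []
    for t in sample_times:
        # Position of the first fault strictly after time t.
        i = bisect_right(faults, t)
        upcoming = i < len(faults) and faults[i] - t <= horizon
        labels.append(1 if upcoming else 0)
    return labels


def mean_time_to_recovery(incidents):
    """Average of (recovered_at - failed_at) over a list of (failed_at, recovered_at) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return sum(recovered - failed for failed, recovered in incidents) / len(incidents)


# Hypothetical data: samples every 60 s, one fault at t = 200 s, 120 s horizon.
samples = [0, 60, 120, 180, 240, 300]
print(label_with_lead_time(samples, [200], 120))          # [0, 0, 1, 1, 0, 0]
print(mean_time_to_recovery([(100, 160), (400, 430)]))    # 45.0
```

The horizon is the design choice that matters here: a longer horizon gives operators more time to act on a prediction, but it marks samples further from the fault as positive, which typically makes the classification task harder.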

Published

2024-10-30

How to Cite

Haroon, M., Siddiqui, Z. A., Husain, M., Ali, A., & Ahmad, T. (2024). A Proactive Approach to Fault Tolerance Using Predictive Machine Learning Models in Distributed Systems. International Journal of Experimental Research and Review, 44, 208–220. https://doi.org/10.52756/ijerr.2024.v44spl.018