COMPARISON OF SIMPLE MISSING DATA IMPUTATION TECHNIQUES FOR NUMERICAL AND CATEGORICAL DATASETS

  • Ramu Gautam, Department of Electrical and Computer Engineering, University of Nevada, Las Vegas
  • Shahram Latifi, Department of Electrical and Computer Engineering, University of Nevada, Las Vegas
Keywords: Statistical imputation techniques; k-nearest neighbor imputation; sensor data imputation; MCAR; MNAR

Abstract

Almost every dataset has missing data. Common causes are sensor error, equipment malfunction, human error, and translation loss. We study how accurately statistical (mean, median, mode) and machine-learning-based (k-nearest neighbors, kNN) imputation methods fill in missing data in numerical datasets with data missing not at random (MNAR) and data missing completely at random (MCAR), as well as in categorical datasets. The imputed datasets are used to make predictions on the test set, and the mean squared error (MSE) of the predictions is used as the measure of imputation performance. The mean absolute difference between the original and imputed data is also observed. When the data is MCAR, kNN imputation results in the lowest MSE for all datasets, making it the most accurate method. When less than 20% of the data is missing, mean and median imputation are effective in regression problems. kNN imputation is better at 20% missingness and significantly better when 50% or more of the data is missing. For the kNN method, k = 5 gives better results than k = 3, while k = 10 gives results similar to k = 5. For MNAR datasets, statistical methods result in similar or lower MSE compared to kNN imputation when less than 25% of instances have a missing feature. At higher missingness levels, kNN imputation is superior. Given enough data points without missing features, deleting the instances with missing data may be a better choice at lower missingness levels. For categorical data imputation, kNN and mode imputation are both effective.
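The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or data: the synthetic three-feature dataset, the 20% MCAR masking rate, the random seed, and every function name below are assumptions made for illustration. The kNN variant shown measures distance only over features observed in both rows and rescales by the number of shared features, which is one common convention and not necessarily the one used in the paper. The sketch masks cells completely at random, imputes them by column mean and by kNN with k = 5, and reports the mean absolute difference between the original and imputed values.

```python
import math
import random
import statistics


def mask_mcar(rows, frac, rng):
    """Blank each cell independently with probability `frac` (MCAR)."""
    return [[None if rng.random() < frac else v for v in row] for row in rows]


def column_means(rows):
    """Mean of the observed (non-None) values in each column."""
    return [statistics.fmean([v for v in col if v is not None]) for col in zip(*rows)]


def mean_impute(rows):
    """Fill every missing cell with the mean of its column."""
    means = column_means(rows)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def distance(a, b):
    """Euclidean distance over features observed in both rows,
    rescaled by (total features / shared features)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no common observed feature, rows are not comparable
    sq = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(sq * len(a) / len(shared))


def knn_impute(rows, k=5):
    """Fill each missing cell with the mean of that feature over the k nearest
    rows that have it observed; fall back to the column mean if none qualify."""
    means = column_means(rows)
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = sorted((d for d in donors if d[0] < math.inf),
                            key=lambda d: d[0])[:k]
            filled[i][j] = (statistics.fmean([val for _, val in donors])
                            if donors else means[j])
    return filled


def mean_abs_diff(original, masked, imputed):
    """Mean absolute difference between original and imputed values,
    computed over the cells that were masked."""
    diffs = [abs(o - f)
             for orow, mrow, frow in zip(original, masked, imputed)
             for o, m, f in zip(orow, mrow, frow) if m is None]
    return statistics.fmean(diffs)


# Hypothetical synthetic data: three correlated features per row.
rng = random.Random(0)
original = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    original.append([x, 2.0 * x + rng.gauss(0.0, 0.5), -x + rng.gauss(0.0, 0.5)])

masked = mask_mcar(original, 0.20, rng)
mean_filled = mean_impute(masked)
knn_filled = knn_impute(masked, k=5)

print("mean imputation MAD:", mean_abs_diff(original, masked, mean_filled))
print("kNN (k=5) imputation MAD:", mean_abs_diff(original, masked, knn_filled))
```

Median imputation follows the same pattern with `statistics.median` in place of `statistics.fmean`, and mode imputation for categorical features with `statistics.mode`. Because the synthetic features are strongly correlated, this toy setup favors kNN; it says nothing about the datasets evaluated in the paper.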

Author Biographies

Ramu Gautam, Department of Electrical and Computer Engineering, University of Nevada, Las Vegas

Ramu Gautam is a PhD student in the Electrical and Computer Engineering Department at the University of Nevada, Las Vegas. His current research is in computer vision, focusing on 3D and 4D biological images. He has a master's degree in Nanotechnology and a bachelor's degree in Electronics and Communication Engineering. He likes to go hiking and play table tennis in his free time.

Shahram Latifi, Department of Electrical and Computer Engineering, University of Nevada, Las Vegas

Shahram Latifi is a Professor of Electrical Engineering at the University of Nevada, Las Vegas. Dr. Latifi is the co-director of the Center for Information Technology and Algorithms (CITA) at UNLV. He has designed and taught undergraduate and graduate courses across the broad spectrum of Computer Science and Engineering over the past four decades. He has given keynotes and seminars on machine learning/AI and IT-related topics all over the world. His research has been funded by NSF, NASA, DOE, DoD, Boeing, Lockheed, and Cray Inc. Dr. Latifi is the recipient of several research awards, the most recent being the Barrick Distinguished Research Award (2021). In December 2020, Dr. Latifi was recognized as being among the top 2% of researchers around the world, according to the Stanford top 2% list (publication data in Scopus, Mendeley). He is an IEEE Fellow and a Registered Professional Engineer in the State of Nevada.

Published: 2023-04-08
Section: Articles