Breast Cancer Disease Prediction Using Random Forest Regression and Gradient Boosting Regression
DOI: https://doi.org/10.52756/ijerr.2024.v38.012

Keywords: Gradient Boosting Regression, Mean Squared Error, Mean Absolute Error, Random Forest, R-squared

Abstract
This study evaluates the efficacy of regression-based machine learning algorithms by assessing their performance across diverse metrics. Using the Breast Cancer Wisconsin (Diagnostic) dataset, we implemented both random forest regression and gradient boosting regression. Performance was analyzed with Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (the coefficient of determination), supplemented by additional metrics, to gauge each algorithm's accuracy and predictive capability. For the continuous target variable, the gradient boosting regression model clearly outperformed the alternatives on this dataset. It achieved a low MSE of 0.05, indicating minimal prediction error; an R-squared of 0.89, confirming its ability to explain the variance in the data; and an MAE of 0.14, reinforcing its accuracy in predicting continuous outcomes. Beyond these core metrics, additional measures were incorporated to give a comprehensive picture of each algorithm's behavior. The findings underscore the potential of gradient boosting regression to enhance predictive accuracy on datasets with continuous target variables, demonstrated here in the context of breast cancer diagnosis, and provide a basis for informed decision-making in medical and predictive analytics domains.
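To make the evaluation pipeline concrete, the following is a minimal sketch in Python, assuming scikit-learn and its bundled copy of the Breast Cancer Wisconsin (Diagnostic) dataset. The hyperparameters, train/test split, and random seed are illustrative assumptions, not taken from the paper, so the exact scores reported above (MSE 0.05, R-squared 0.89, MAE 0.14) are not guaranteed to be reproduced.

```python
# Minimal sketch of the evaluation described in the abstract.
# Assumptions: scikit-learn's bundled Breast Cancer Wisconsin (Diagnostic)
# dataset; default hyperparameters and an 80/20 split (not from the paper).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# The diagnostic label (malignant = 0, benign = 1) is treated as a
# continuous target, matching the paper's regression framing.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Random Forest Regression": RandomForestRegressor(random_state=42),
    "Gradient Boosting Regression": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # MSE = mean((y - y_hat)^2); MAE = mean(|y - y_hat|);
    # R^2 (coefficient of determination) = 1 - SS_res / SS_tot.
    print(f"{name}:")
    print(f"  MSE: {mean_squared_error(y_test, y_pred):.2f}")
    print(f"  MAE: {mean_absolute_error(y_test, y_pred):.2f}")
    print(f"  R^2: {r2_score(y_test, y_pred):.2f}")
```

Random forest averages independently grown trees, while gradient boosting fits trees sequentially to the residuals of the current ensemble, which is consistent with the performance edge the study reports for gradient boosting on this dataset.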