Breast Cancer Disease Prediction Using Random Forest Regression and Gradient Boosting Regression

Keywords: Gradient Boosting Regression, Mean Squared Error, Mean Absolute Error, Random Forest, R-squared

Abstract

This study evaluates regression-based machine learning algorithms on the breast cancer Wisconsin (Diagnostic) dataset, comparing random forest regression and gradient-boosting regression. Performance was assessed with Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (the coefficient of determination), supplemented by additional metrics, to gauge each algorithm's accuracy and predictive capability. For continuous target variables, the gradient-boosting regression model performed notably better than the other models on this dataset. It achieved a low MSE of 0.05, indicating minimal prediction errors; an R-squared of 0.89, showing that the model explains most of the variance in the data; and an MAE of 0.14, reinforcing its accuracy in predicting continuous outcomes. Beyond these core metrics, the study incorporated additional measures to provide a comprehensive picture of the algorithms' performance. The findings underscore the potential of gradient-boosting regression to improve predictive accuracy on datasets with continuous target variables, particularly in the context of breast cancer diagnosis, and provide a basis for informed decision-making in medical and predictive analytics.
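The evaluation pipeline described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' exact setup: the abstract does not specify the train/test split, hyperparameters, or preprocessing, so the values below (an 80/20 split, default `GradientBoostingRegressor` settings, `random_state=42`) are assumptions, and the resulting metric values will not exactly match the paper's reported MSE of 0.05, R-squared of 0.89, and MAE of 0.14.

```python
# Sketch: gradient-boosting regression on the breast cancer Wisconsin
# (Diagnostic) dataset, scored with the paper's metrics (MSE, MAE, R-squared).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load the features and the 0/1 diagnosis label, treated here as a
# continuous target so that regression metrics apply.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Default hyperparameters; the paper's configuration is not given.
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MSE={mse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
```

Swapping in `RandomForestRegressor` for the model line reproduces the study's comparison between the two algorithms under the same metrics.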

References

Acquah, H. D. G. (2010). Comparison of Akaike information criterion (AIC) and Bayesian information criterion (BIC) in selection of an asymmetric price relationship. Journal of Development and Agricultural Economics, 2(1), 001-006.

Ahmed, A., Whittington, J., & Shafaee, Z. (2023). Impact of Commission on Cancer Accreditation on Cancer Survival: A Surveillance, Epidemiology, and End Results (SEER) Database analysis. Annals of Surgical Oncology, 31(4), 2286–2294. https://doi.org/10.1245/s10434-023-14709-4

Azur, M., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49. https://doi.org/10.1002/mpr.329

Chen, R., Cai, N., Luo, Z., Wang, H., Liu, X., & Li, J. (2023). Multi-task banded regression model: A novel individual survival analysis model for breast cancer. Computers in Biology and Medicine, 162, 107080. https://doi.org/10.1016/j.compbiomed.2023.107080

Chen, S., Goo, Y. J. J., & Shen, Z. D. (2014). A hybrid approach of stepwise regression, logistic regression, support vector machine, and decision tree for forecasting fraudulent financial statements. The Scientific World Journal, 2014, 1-9. https://doi.org/10.1155/2014/968712

Chicco, D., Warrens, M. J., & Jurman, G. (2021). The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. Peerj Computer Science, 7, e623. https://doi.org/10.7717/peerj-cs.623

Choi, J. A., & Lim, K. (2020). Identifying machine learning techniques for classification of target advertising. ICT Express, 6(3), 175-180. https://doi.org/10.1016/j.icte.2020.04.012

Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., & Van Calster, B. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110, 12-22. https://doi.org/10.1016/j.jclinepi.2019.02.004

De Myttenaere, A., Golden, B., Le Grand, B., & Rossi, F. (2016). Mean absolute percentage error for regression models. Neurocomputing, 192, 38-48. https://doi.org/10.1016/j.neucom.2015.12.114

Dehkharghanian, T., Bidgoli, A. A., Riasatian, A., Mazaheri, P., Campbell, C. J., Pantanowitz, L., Tizhoosh, H. R., & Rahnamayan, S. (2023). Biased data, biased AI: deep networks predict the acquisition site of TCGA images. Diagnostic Pathology, 18(1). https://doi.org/10.1186/s13000-023-01355-3

DeMaris, A., & Selman, S. H. (2013). Converting data into evidence: A statistics primer for the medical practitioner. Springer, New York. https://doi.org/10.1007/978-1-4614-7792-1

El‐Gabbas, A., & Dormann, C. F. (2018). Improved species‐occurrence predictions in data‐poor regions: using large‐scale data and bias correction with down‐weighted Poisson regression and Maxent. Ecography, 41(7), 1161-1172. https://doi.org/10.1111/ecog.03149

Emami, N. P., Degeling, M., Bauer, L., Chow, R., Cranor, L. F., Haghighat, M. R., & Patterson, H. (2018). The influence of friends and experts on privacy decision making in IoT scenarios. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW), 1-26. https://doi.org/10.1145/3274317

Emmert-Streib, F., & Dehmer, M. (2019). High-dimensional LASSO-based computational regression models: regularization, shrinkage, and selection. Machine Learning and Knowledge Extraction, 1(1), 359-383. https://doi.org/10.3390/make1010021

Gelman, A., Goodrich, B., Gabry, J., & Vehtari, A. (2019). R-squared for Bayesian regression models. The American Statistician, 73(3), 307–309. https://doi.org/10.1080/00031305.2018.1549100

Geraci, M., & Bottai, M. (2007). Quantile regression for longitudinal data using the asymmetric Laplace distribution. Biostatistics, 8(1), 140-154. https://doi.org/10.1093/biostatistics/kxj039

He, B., Sun, H., Bao, M., Li, H., He, J., Tian, G., & Wang, B. (2023). A cross-cohort computational framework to trace tumor tissue-of-origin based on RNA sequencing. Scientific Reports, 13(1), 15356. https://doi.org/10.1038/s41598-023-42465-8

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67. https://doi.org/10.1080/00401706.1970.10488634

Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression. John Wiley & Sons. https://doi.org/10.1002/9781118548387

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. Springer Texts in Statistics, Vol. 112. Springer. https://doi.org/10.1007/978-1-0716-1418-1

Jia, Y., Kwong, S., Wu, W., Wang, R., & Gao, W. (2017). Sparse Bayesian learning-based kernel Poisson regression. IEEE Transactions on Cybernetics, 49(1), 56-68. https://doi.org/10.1109/TCYB.2017.2764099

Jie, H., & Zheng, G. (2019). Calibration of Torque Error of Permanent Magnet Synchronous Motor Base on Polynomial Linear Regression Model. In IECON 2019-45th Annual Conference of the IEEE Industrial Electronics Society (Vol. 1, pp. 318-323). IEEE. https://doi.org/10.1109/IECON.2019.8927537

Joe, H., & Zhu, R. (2005). Generalized Poisson Distribution: the Property of Mixture of Poisson and Comparison with Negative Binomial Distribution. Biometrical Journal, 47(2), 219–229. https://doi.org/10.1002/bimj.200410102

Khadhouri, S., Gallagher, K., MacKenzie, K., Shah, T. T., Gao, C., Moore, S., Zimmermann, E., Edison, E., Jefferies, M., Nambiar, A., Anbarasan, T., Mannas, M., Lee, T., Marra, G., Rivas, J. G., Marcq, G., Assmus, M., Uçar, T., Claps, F., . . . Zainuddin, Z. M. (2022). Developing a Diagnostic Multivariable Prediction Model for Urinary Tract Cancer in Patients Referred with Haematuria: Results from the IDENTIFY Collaborative Study. European Urology Focus, 8(6), 1673–1682. https://doi.org/10.1016/j.euf.2022.06.001

Li, G., & Niu, P. (2013). An enhanced extreme learning machine based on ridge regression for regression. Neural Computing and Applications, 22, 803-810. https://doi.org/10.1007/s00521-011-0771-7

Li, H., & Yamamoto, S. (2016). Polynomial regression based model-free predictive control for nonlinear systems. In 2016 55th annual conference of the society of instrument and control engineers of Japan (SICE) (pp. 578-582). IEEE. https://doi.org/10.1109/SICE.2016.7749264

Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4), 802-808. https://doi.org/10.1016/j.ijforecast.2018.06.001

Mao, X., Yang, H., Huang, S., Liu, Y., & Li, R. (2019). Extractive summarization using supervised and unsupervised learning. Expert Systems with Applications, 133, 173-181. https://doi.org/10.1016/j.eswa.2019.05.011

Mason, C. H., & Perreault Jr, W. D. (1991). Collinearity, power, and interpretation of multiple regression analysis. Journal of Marketing Research, 28(3), 268-280. https://doi.org/10.1177/002224379102800302

Maulud, D., & Abdulazeez, A. M. (2020). A review on linear regression comprehensive in machine learning. Journal of Applied Science and Technology Trends, 1(2), 140-147. https://doi.org/10.38094/jastt1457

Mohsenijam, A., Siu, M. F. F., & Lu, M. (2017). Modified stepwise regression approach to streamlining predictive analytics for construction engineering applications. Journal of Computing in Civil Engineering, 31(3), 04016066. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000636

Muthén, B., & Asparouhov, T. (2011). Beyond multilevel regression modeling: Multilevel analysis in a general latent variable framework. In Handbook of advanced multilevel analysis (pp. 15-40). Routledge. https://doi.org/10.4324/9780203848852

Muthukrishnan, R., & Rohini, R. (2016). LASSO: A feature selection technique in predictive modeling for machine learning. In 2016 IEEE international conference on advances in computer applications (ICACA) (pp. 18-20). IEEE. https://doi.org/10.1109/ICACA.2016.7887916

Nemade, V., & Fegade, V. (2023). Machine learning techniques for breast cancer prediction. Procedia Computer Science, 218, 1314-1320. https://doi.org/10.1016/j.procs.2023.01.110

Ostertagová, E. (2012). Modelling using polynomial regression. Procedia Engineering, 48, 500-506. https://doi.org/10.1016/j.proeng.2012.09.545

Rácz, A., Bajusz, D., & Héberger, K. (2019). Multi-level comparison of machine learning classifiers and their performance metrics. Molecules, 24(15), 2811. https://doi.org/10.3390/molecules24152811

Romano, Y., Patterson, E., & Candès, E. J. (2019). Conformalized quantile regression. Advances in Neural Information Processing Systems, 32.

Schober, P., Boer, C., & Schwarte, L. A. (2018). Correlation coefficients: appropriate use and interpretation. Anesthesia & Analgesia, 126(5), 1763-1768. https://doi.org/10.1213/ANE.0000000000002864

Shanableh, T., & Assaleh, K. (2010). Feature modeling using polynomial classifiers and stepwise regression. Neurocomputing, 73(10-12), 1752-1759. https://doi.org/10.1016/j.neucom.2009.11.045

Shigeto, Y., Suzuki, I., Hara, K., Shimbo, M., & Matsumoto, Y. (2015). Ridge regression, hubness, and zero-shot learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part I 15 (pp. 135-151). Springer International Publishing. https://doi.org/10.1007/978-3-319-23528-8_9

Siemsen, E., Roth, A., & Oliveira, P. (2010). Common method bias in regression models with linear, quadratic, and interaction effects. Organizational Research Methods, 13(3), 456-476. https://doi.org/10.1177/1094428109351241

Snijders, T. A., & Bosker, R. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modeling (2nd ed.). London: Sage.

Sudhaman, K., Akuthota, M., & Chaurasiya, S. K. (2022). A review on the different regression analysis in supervised learning. In Bayesian Reasoning and Gaussian Processes for Machine Learning Applications (pp. 15-32).

Tabelini, L., Berriel, R., Paixao, T. M., Badue, C., De Souza, A. F., & Oliveira-Santos, T. (2021, January). Polylanenet: Lane estimation via deep polynomial regression. In 2020 25th International Conference on Pattern Recognition (ICPR) (pp. 6150-6156). IEEE. https://doi.org/10.1109/ICPR48806.2021.9412265

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1), 267-288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

Uyanık, G. K., & Güler, N. (2013). A study on multiple linear regression analysis. Procedia-Social and Behavioral Sciences, 106, 234-240. https://doi.org/10.1016/j.sbspro.2013.12.027

Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A survey on distributed machine learning. ACM Computing Surveys (CSUR), 53(2), 1-33. https://doi.org/10.1145/3377454

Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychological Methods, 17(2), 228. https://doi.org/10.1037/a0027127

Wang, W., & Lu, Y. (2018, March). Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. In IOP conference series: materials science and engineering (Vol. 324, p. 012049). IOP Publishing. https://doi.org/10.1088/1757-899X/324/1/012049

Yang, J., Meng, X., & Mahoney, M. (2013). Quantile regression for large-scale applications. In Proceedings of the 30th International Conference on Machine Learning, PMLR, 28(3), 881-887.

Published
2024-04-30
How to Cite
Yadav, P., Bhargava, C., Gupta, D., Kumari, J., Acharya, A., & Dubey, M. (2024). Breast Cancer Disease Prediction Using Random Forest Regression and Gradient Boosting Regression. International Journal of Experimental Research and Review, 38, 132-146. https://doi.org/10.52756/ijerr.2024.v38.012