An Approach for Efficient and Accurate Phishing Website Prediction Using Improved ML Classifier Performance for Feature Selection

Anjaneya Awasthi; Noopur Goel

doi:10.52756/ijerr.2024.v40spl.006

Authors

Anjaneya Awasthi, Department of Computer Applications, VBS Purvanchal University, India. ORCID: https://orcid.org/0000-0003-4033-936X
Noopur Goel, Department of Computer Applications, VBS Purvanchal University, India. ORCID: http://orcid.org/0000-0003-3351-3761

Keywords:

Computer viruses, Cybersecurity, Phishing Website Prediction, Machine learning (ML)

Abstract

This article examines the use of machine learning (ML) to detect phishing websites, deceptive sites that mimic trusted entities to steal sensitive information. Such attacks pose significant risks to online security, which is why new methods for identifying and counteracting phishing threats continue to be needed. Traditional anti-phishing strategies such as blacklisting and heuristic searches often suffer from slow detection and high false positive rates. To improve the success rate and specificity of phishing website prediction, this study proposes an approach built on ML algorithms that targets two things: higher classifier accuracy and an optimal feature space. It introduces a novel feature selection method that extracts highly correlated features from the dataset, thereby improving classifier accuracy. Six feature selection techniques are applied to a phishing dataset, and eight classifiers are evaluated, including SVM, Logistic Regression, and Random Forest. The study finds that the Random Forest classifier combined with the Chi-2 feature selection method significantly improves model accuracy, reaching up to 96.99%.
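
To make the Chi-2 feature selection step mentioned in the abstract more concrete, the following is a minimal sketch in plain Python. It assumes categorical features and binary labels, and it uses the classic Pearson chi-squared statistic on a contingency table, which may differ from the exact variant the authors applied. The feature names, the toy data, and the helpers `chi2_score` and `select_top_k` are hypothetical and serve only as illustration.

```python
from collections import Counter


def chi2_score(feature, labels):
    """Pearson chi-squared statistic between a categorical feature and class labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))   # observed counts per (value, class) cell
    feature_totals = Counter(feature)       # row totals
    label_totals = Counter(labels)          # column totals
    score = 0.0
    for value, n_value in feature_totals.items():
        for label, n_label in label_totals.items():
            expected = n_value * n_label / n          # count expected under independence
            observed = joint.get((value, label), 0)
            score += (observed - expected) ** 2 / expected
    return score


def select_top_k(columns, labels, k):
    """Return the names of the k features with the highest chi-squared score."""
    scores = {name: chi2_score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data, made up for illustration: 1 = phishing, 0 = legitimate.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "has_ip_in_url": [1, 1, 1, 1, 0, 0, 0, 0],  # matches the label exactly
    "long_url":      [1, 1, 1, 0, 0, 0, 0, 1],  # partly informative
    "uses_https":    [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}

selected = select_top_k(columns, labels, k=2)
print(selected)
```

A higher score means the feature's distribution differs more between the two classes. In a pipeline of the kind the abstract describes, only the selected columns would then be passed to a classifier such as Random Forest.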
