Phishing Detection: A Hybrid Model with Feature Selection and Machine Learning Techniques

Rekha Pal; Mithilesh Kumar Pandey; Saurabh Pal; Dhyan Chandra Yadav

doi:10.52756/ijerr.2023.v36.009

Authors

Rekha Pal, Department of Computer Applications, VBS Purvanchal University, Jaunpur, Uttar Pradesh, India
Mithilesh Kumar Pandey, Department of Computer Applications, VBS Purvanchal University, Jaunpur, Uttar Pradesh, India
Saurabh Pal, Department of Computer Applications, VBS Purvanchal University, Jaunpur, Uttar Pradesh, India, https://orcid.org/0000-0001-9545-7481
Dhyan Chandra Yadav, Department of Computer Science, Maharshi University, Lucknow, India, https://orcid.org/0000-0003-0084-0360

Keywords:

Principal Component Analysis, Logistic Regression, Random Forest, Logit LR model, Pearson Correlation

Abstract

Phishing incidents in cyberspace have increased alongside advances in information technology. Phishing is a prominent cyber-attack rooted in social engineering: it aims to deceive individuals into divulging sensitive information, including credit card details, login credentials, and passwords. This research aims to identify the best-performing phishing detection model among various machine learning (ML) techniques. For feature selection, the paper uses an Extra Tree Classifier (ETC), forward selection, Pearson correlation, a Logit-LR model, and Principal Component Analysis (PCA). Logistic Regression (LR), Naïve Bayes (NB), Decision Tree (DT), K-Nearest Neighbor (K-NN), Support Vector Machine (SVM), Random Forest (RF), AdaBoost, and Bagging classifiers are used to develop the phishing detection model. We studied the model in four cases: Case 1 uses the 6 features selected in common by ETC, forward selection, and Pearson correlation; Case 2 uses 25 features selected by the logit model; Case 3 uses all features; and Case 4 uses principal component analysis (3 and 5 components). The highest accuracy, 97.3%, was obtained in Case 2 with the random forest model.
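The idea behind Case 1, keeping only the features that several selectors agree on, can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data, the feature names, and the hard-coded picks standing in for the Extra Tree and forward-selection results are all hypothetical, and only the Pearson ranking is actually computed here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def top_k_by_pearson(columns, labels, k):
    """Rank features by |r| against the label and return the k strongest names."""
    scores = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def common_features(*selections):
    """Features chosen by every selector (set intersection)."""
    return set.intersection(*map(set, selections))

# Hypothetical toy data, not the paper's dataset: 1 = phishing, -1 = legitimate.
labels = [1, 1, 1, -1, -1, -1]
columns = {
    "has_ip_address": [1, 1, 1, -1, -1, -1],  # perfectly aligned with the label
    "url_length":     [1, 1, -1, -1, -1, 1],  # weakly correlated
    "favicon":        [1, -1, 1, -1, 1, 1],   # uncorrelated with the label
}

pearson_pick = top_k_by_pearson(columns, labels, k=2)

# Assumed stand-ins for the other two selectors' outputs (not computed here).
extra_trees_pick = {"has_ip_address", "favicon"}
forward_pick = {"has_ip_address", "url_length"}

common = common_features(extra_trees_pick, forward_pick, pearson_pick)
print(sorted(common))
```

On real data, the two stand-in sets would come from fitted selectors, and the resulting common subset would then be passed to each of the classifiers listed in the abstract.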
