Classification and analysis for Focused Crawled Textual Dataset for retrieving Indian origin scientists

Shivani Gautam; Rajesh Bhatia; Shaily Jain

doi:10.52756/ijerr.2023.v34spl.008

Authors

Shivani Gautam, Chitkara University School of Engineering and Technology, Chitkara University, Himachal Pradesh, India. ORCID: https://orcid.org/0000-0002-7428-5155
Rajesh Bhatia, Department of Computer Science and Engineering, PEC University of Technology, Chandigarh, India
Shaily Jain, Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India. ORCID: https://orcid.org/0000-0001-6078-3607

Keywords:

Content retrieval, focused crawler, natural language processing, supervised machine learning, text classification, web scraping

Abstract

Text classification (also called text categorization or text tagging) is a crucial and widely used approach in Natural Language Processing (NLP) for assigning unseen documents to predefined categories. In this paper, we examine dataset construction and evaluation as components of text classification. We first built a new text classification dataset of Indian-origin scientists, collected using focused crawling and web scraping techniques. We then present an extensive evaluation of several supervised models on this dataset. Our evaluations show that the Random Forest model outperforms the other supervised models, and our results provide a good starting point for further research on text classification for Indian-origin scientists. Experiments comparing K-Nearest Neighbor, Logistic Regression, Support Vector Machine, and Random Forest on this dataset showed much better performance for Random Forest when combined with SMOTE and k-fold cross-validation. We use the Area Under the ROC Curve (AUC) to measure the effectiveness of the chosen models. Overall, the Random Forest classifier performed best, with a micro-average AUC of 90%.
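The abstract names two techniques without showing how they work: SMOTE oversampling and the micro-average AUC. The sketch below illustrates the core idea of each using only the Python standard library. It is not the authors' implementation. The function names (`smote_like_oversample`, `binary_auc`, `micro_average_auc`) and the toy data are hypothetical, chosen only for illustration. In practice, library implementations such as those in imbalanced-learn and scikit-learn would normally be used instead.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Hypothetical helper; requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def binary_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def micro_average_auc(y_true, y_score, n_classes):
    """Flatten the one-vs-rest indicators and scores across all classes,
    then compute a single AUC over the pooled pairs."""
    flat_labels, flat_scores = [], []
    for true_class, class_scores in zip(y_true, y_score):
        for c in range(n_classes):
            flat_labels.append(1 if true_class == c else 0)
            flat_scores.append(class_scores[c])
    return binary_auc(flat_labels, flat_scores)


# Toy data, invented for illustration only (not from the paper's dataset).
toy_minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4]]
toy_y_true = [0, 1, 2, 2]
toy_y_score = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
    [0.5, 0.1, 0.4],
]

if __name__ == "__main__":
    print(smote_like_oversample(toy_minority, n_new=3))
    print(micro_average_auc(toy_y_true, toy_y_score, n_classes=3))
```

When oversampling is combined with k-fold cross-validation, a common precaution is to oversample only the training portion of each fold, so that synthetic points derived from held-out samples do not leak into training. The abstract does not say how the two were combined in this study.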
