Classification and analysis for Focused Crawled Textual Dataset for retrieving Indian origin scientists
DOI:
https://doi.org/10.52756/ijerr.2023.v34spl.008
Keywords:
Content Retrieval, focused crawler, natural language processing, Supervised Machine Learning, Text classification, web scraping
Abstract
Text classification (also called text categorization or text tagging) is a crucial and widely used approach in Natural Language Processing (NLP) for assigning unseen documents to predefined categories. In this paper, we examine dataset construction and evaluation as components of text classification. First, we built a new text classification dataset on Indian-origin scientists, collected with focused crawling and web scraping techniques. We then present an extensive evaluation of several supervised models on this dataset: K-Nearest Neighbor, Logistic Regression, Support Vector Machine, and Random Forest. Random Forest outperforms the other models, and its performance is notably better when combined with SMOTE and k-fold cross-validation. We use the area under the ROC curve (AUC) to measure the effectiveness of the chosen models. Overall, the Random Forest classifier gave the best results, with a 90% micro-average AUC. These results provide a good starting point for further research on text classification for Indian-origin scientists.
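The micro-average AUC reported above can be illustrated with a small sketch. The abstract does not say which tooling the authors used, so this is a plain-Python illustration under stated assumptions, not their implementation: the function names `binary_auc` and `micro_average_auc`, the integer class ids, and the toy scores are all hypothetical. Micro-averaging treats every (document, class) pair as one binary decision, pools those decisions across classes, and computes a single AUC over the pool.

```python
from typing import List, Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC from the rank-sum (Mann-Whitney U) statistic; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # The run occupies 1-based ranks i+1 .. j, whose mean is (i + 1 + j) / 2.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


def micro_average_auc(
    y_true: Sequence[int],
    y_score: Sequence[Sequence[float]],
    n_classes: int,
) -> float:
    """Pool one-vs-rest decisions over all classes, then compute one AUC."""
    flat_labels: List[int] = []
    flat_scores: List[float] = []
    for true_class, row in zip(y_true, y_score):
        for k in range(n_classes):
            flat_labels.append(1 if k == true_class else 0)
            flat_scores.append(row[k])
    return binary_auc(flat_labels, flat_scores)


# Toy example (hypothetical data): three documents, three classes,
# each row holds the classifier's per-class scores for one document.
toy_true = [0, 1, 2]
toy_scores = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
]
toy_auc = micro_average_auc(toy_true, toy_scores, n_classes=3)
```

In the toy data every true-class score exceeds every other score, so the pooled AUC is 1.0. Because all classes are pooled before ranking, frequent classes weigh more heavily in a micro-average than in a macro-average, which is worth keeping in mind on an imbalanced dataset.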