Advanced News Archiving System with Machine Learning-Driven Web Scraping and AI-Powered Summarization Using T5, Pegasus, BERT and BART Architectures
DOI:
https://doi.org/10.52756/ijerr.2024.v46.017
Keywords:
News Summarization, Web Scraping, Structured Archive, News Organization, Information Retrieval, BERT, BART
Abstract
Data is central to modern technology, and it underpins the publication of news on the internet. Nevertheless, reading long reports in order to fully understand events is demanding and frequently leads to subjective judgments. The application described here addresses this problem. Its architecture categorizes news stories by day, producing a well-organized and readily accessible archive. The application uses web scraping to pull pertinent news articles from numerous internet sources, and it employs summarization models, including BERT, BART, T5, and Google Pegasus, to condense the information into a succinct and comprehensible form. T5 performs well in text summarization and other natural language processing tasks because of its text-to-text structure, and it is highly customizable. Google Pegasus, which specializes in abstractive summarization, uses self-attention mechanisms and extensive pre-training to generate high-quality, concise news summaries. In short, the system collects, stores, and summarizes news articles, and it offers a straightforward design that makes it simple to browse past news stories and their summaries.
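The collect, store, and summarize flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `ArticleParser`, `NewsArchive`, `ArchivedArticle`, and `lead_sentence` are hypothetical, the page layout (an `<h1>` headline and `<p>` body paragraphs) is an assumption, and the parser uses only Python's standard library. The `lead_sentence` function is a placeholder that returns the first sentence; in a system like the one described, a T5 or Pegasus model would be passed in its place.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from html.parser import HTMLParser
from typing import Callable


class ArticleParser(HTMLParser):
    """Collects the <h1> headline and <p> body text of one article page.

    Assumption: the page uses <h1> for the title and <p> for the body.
    """

    def __init__(self) -> None:
        super().__init__()
        self._tag = None  # the h1/p element currently open, if any
        self.title = ""
        self.paragraphs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <a>) is still collected.
        if self._tag == "h1":
            self.title += data
        elif self._tag == "p":
            self.paragraphs[-1] += data


@dataclass
class ArchivedArticle:
    title: str
    summary: str


def lead_sentence(text: str) -> str:
    """Placeholder summarizer (hypothetical): keeps only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


class NewsArchive:
    """Stores summarized articles grouped by publication day."""

    def __init__(self, summarize: Callable[[str], str] = lead_sentence) -> None:
        self._summarize = summarize
        self._by_day: dict[date, list[ArchivedArticle]] = defaultdict(list)

    def add_page(self, html: str, published: date) -> ArchivedArticle:
        parser = ArticleParser()
        parser.feed(html)
        parser.close()
        body = " ".join(p.strip() for p in parser.paragraphs if p.strip())
        article = ArchivedArticle(parser.title.strip(), self._summarize(body))
        self._by_day[published].append(article)
        return article

    def on_day(self, day: date) -> list[ArchivedArticle]:
        # Return a copy so callers cannot mutate the archive.
        return list(self._by_day.get(day, []))
```

Passing the summarizer in as a plain callable keeps the archive independent of any particular model, so the same storage and browsing code could be reused whichever of the summarization models is chosen.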
License
Copyright (c) 2024 International Academic Publishing House (IAPH)
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.