Enhancing Academic Integrity: An Analysis of Advanced Techniques for Plagiarism Detection using LESK, Word Sense Disambiguation, and SVM
DOI:
https://doi.org/10.52756/ijerr.2024.v39spl.007
Keywords:
Word Sense Disambiguation (WSD), LESK, Semantic Analysis, Support Vector Machine (SVM), Universal Language Translator (ULT)
Abstract
Plagiarism is widespread in academia, from ancient literature to modern research, where scholars' work is copied and published without authorization. Since the late 1990s, researchers have explored various methods to detect plagiarism, including Word Sense Disambiguation (WSD), LESK, and Support Vector Machine (SVM). However, these conventional techniques have shown limitations in keeping pace with contemporary writing styles. This paper proposes an improved LESK algorithm for word sense detection and an improved SVM for feature extraction, addressing the shortcomings of existing methods and offering enhanced accuracy and efficiency in identifying plagiarized content. The study evaluates the proposed system on three datasets drawn from PAN 2012, PAN 2013, and PAN 2014 documents to assess its performance across different types of text plagiarism. Results show that the system outperforms existing methods and achieves its highest classification accuracy when trained on the second dataset. A comprehensive analysis of feature significance in the training database reveals the importance of discriminative sentence similarity. The proposed system contributes to combating academic dishonesty and to ensuring the authenticity of digital content in various contexts. Future work will explore cross-lingual plagiarism detection and image duplicity identification using Word Sense Disambiguation techniques. Additionally, efforts will be made to optimize time complexity for faster execution.
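For readers unfamiliar with LESK, the following is a minimal sketch of the classic simplified Lesk idea: choose the sense whose dictionary gloss shares the most words with the surrounding context. It is not the improved algorithm proposed in the paper, whose details the abstract does not give. The sense inventory `TOY_SENSES`, the stopword list, and the function names are all hypothetical and exist only for illustration; a real system would draw glosses from a lexical resource such as WordNet.

```python
import re

# Hypothetical toy sense inventory (an assumption, not from the paper):
# word -> {sense id -> gloss}.
TOY_SENSES = {
    "bank": {
        "bank.finance": "an institution that accepts deposits and lends money",
        "bank.river": "sloping land beside a river or other body of water",
    }
}

# Small illustrative stopword list (an assumption).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to",
             "in", "on", "by", "is", "beside", "other"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(word, context, senses=TOY_SENSES):
    """Return the sense id whose gloss overlaps most with the context.

    Ties are resolved in favour of the sense listed first; an unknown
    word yields None.
    """
    context_tokens = tokens(context) - {word}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in senses.get(word, {}).items():
        overlap = len(context_tokens & tokens(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


print(simplified_lesk("bank", "She sat on the bank of the river and watched the water"))
print(simplified_lesk("bank", "He went to the bank to deposit money"))
```

In a plagiarism-detection setting, sense labels of this kind could feed sentence-similarity features for a downstream classifier such as an SVM, although how the paper combines the two is not specified in the abstract.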