Enhancing Academic Integrity: An Analysis of Advanced Techniques for Plagiarism Detection using LESK, Word Sense Disambiguation, and SVM

Keywords: Word Sense Disambiguation (WSD), LESK, Semantic Analysis, Support Vector Machine (SVM), Universal Language Translator (ULT)

Abstract

Plagiarism, the copying and publishing of scholars' work without authorization, is widespread in academia, from ancient literature to modern research. In the late 1990s, researchers explored various methods for detecting plagiarism, including Word Sense Disambiguation (WSD), the LESK algorithm, and Support Vector Machines (SVM). However, these conventional techniques have shown limitations in handling contemporary writing styles. This paper proposes an improved LESK algorithm for word sense detection and an improved SVM for feature extraction, which address the shortcomings of existing methods and offer enhanced accuracy and efficiency in identifying plagiarized content. The study evaluates the proposed system on three datasets drawn from PAN 2012, PAN 2013, and PAN 2014 documents to assess its performance across different types of text plagiarism. Results demonstrate the system's superiority, with higher classification accuracy achieved when it is trained on the second dataset. A comprehensive analysis of feature significance in the training database reveals the importance of discriminative sentence similarity. The proposed system contributes to combating academic dishonesty and helps ensure the authenticity of digital content in various contexts. Future work will explore cross-lingual plagiarism detection and duplicate image identification using Word Sense Disambiguation techniques, and will aim to optimize time complexity for faster execution.
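For readers unfamiliar with the LESK approach named above, the following is a minimal sketch of the classic simplified Lesk idea: pick the sense whose dictionary gloss shares the most words with the surrounding context. It is not the improved variant proposed in the paper. The sense inventory `SENSES`, the stopword list, and the function names are hypothetical and chosen only for illustration; a real system would draw glosses from a lexicon such as WordNet.

```python
import re

# Hypothetical toy sense inventory (assumption, not from the paper).
SENSES = {
    "bank": {
        "bank_financial": "an institution that accepts deposits of money and makes loans",
        "bank_river": "sloping land beside a river or other body of water",
    }
}

# Small illustrative stopword list (assumption).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to",
             "in", "on", "at", "by", "beside", "other"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def simplified_lesk(word, context, senses=SENSES):
    """Return the sense id whose gloss overlaps most with the context.

    Returns None when the word has no entry in the sense inventory.
    Ties are resolved in favour of the first sense listed.
    """
    context_tokens = tokenize(context) - {word}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in senses.get(word, {}).items():
        overlap = len(context_tokens & tokenize(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


if __name__ == "__main__":
    print(simplified_lesk("bank", "She deposited money at the bank to repay her loans"))
    print(simplified_lesk("bank", "They sat on the bank of the river watching the water"))
```

In a plagiarism detection setting, sense labels of this kind could be compared across two passages so that paraphrases using different surface words still match, and the resulting similarity scores could serve as input features for a classifier such as an SVM.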

Published
2024-05-30
How to Cite
Upadhyay, D., & Sinha, K. (2024). Enhancing Academic Integrity: An Analysis of Advanced Techniques for Plagiarism Detection using LESK, Word Sense Disambiguation, and SVM. International Journal of Experimental Research and Review, 39(Spl Volume), 92-108. https://doi.org/10.52756/ijerr.2024.v39spl.007