A Novel Framework for Multilingual Script Detection and Pattern Analysis in Mixed Script Queries

Anu Chaudhary; Rahul Pradhan; Shashi Shekhar

doi:10.52756/ijerr.2024.v43spl.016

Authors

Anu Chaudhary, Department of Computer Science and Engineering, GLA University, Mathura, India. https://orcid.org/0009-0005-4133-5565
Rahul Pradhan, Department of Computer Science and Engineering, GLA University, Mathura, India. https://orcid.org/0000-0002-5774-4698
Shashi Shekhar, Department of Computer Science and Engineering, Amity University, Patna, India. https://orcid.org/0000-0001-8824-1447

Keywords:

Language identification, Mixed script, Pattern analysis, Script Detection, Word identification

Abstract

Script detection systems that can handle several languages are increasingly necessary. Machine learning and deep learning have made it substantially easier to identify scripts written in various languages. Machine learning approaches to language detection have used Naive Bayes and Support Vector Machines (SVM), while the deep learning approaches reviewed in this paper draw on a range of methods, including LSTM and BERT. However, the accuracy and scalability of multilingual systems still need improvement. The present study therefore focuses on developing a framework that recognizes scripts across a variety of languages and incorporates pattern analysis for mixed-script queries. The resulting approach is designed to be scalable, efficient, and adaptive, with the aim of improving identification accuracy across a large number of languages. Performance metrics including accuracy, recall, and F1-score are computed to evaluate the proposed multilingual script identification. The results indicate that the proposed approach offers an efficient and scalable solution for multilingual script detection.
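To make the idea of script detection in a mixed-script query concrete, the following is a minimal sketch and not the framework proposed in the paper. It assumes that the first word of a letter's Unicode character name (for example LATIN or DEVANAGARI) is an acceptable coarse script label, and that a query can be split on whitespace. The function names `char_script`, `word_script`, `query_pattern`, and `is_mixed_script` are hypothetical and chosen only for illustration.

```python
import unicodedata
from collections import Counter


def char_script(ch: str) -> str:
    """Coarse script label for one character (assumption: the first word of
    the Unicode character name identifies the script)."""
    # Only letters carry script information here; digits, marks, and
    # punctuation are labelled OTHER.
    if not unicodedata.category(ch).startswith("L"):
        return "OTHER"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "OTHER"


def word_script(word: str) -> str:
    """Majority script among the letters of a word, or OTHER if it has none."""
    counts = Counter(
        s for s in (char_script(ch) for ch in word) if s != "OTHER"
    )
    return counts.most_common(1)[0][0] if counts else "OTHER"


def query_pattern(query: str) -> list[tuple[str, str]]:
    """Per-token script sequence for a whitespace-tokenized query."""
    return [(tok, word_script(tok)) for tok in query.split()]


def is_mixed_script(query: str) -> bool:
    """True if the query contains words from more than one script."""
    scripts = {s for _, s in query_pattern(query) if s != "OTHER"}
    return len(scripts) > 1


if __name__ == "__main__":
    query = "delhi का मौसम today"
    print(query_pattern(query))
    print(is_mixed_script(query))
```

Note that this labels the script, not the language: a Hindi word written in Roman letters would still be labelled LATIN, which is why word-level language identification needs a learned model on top of a script check like this one.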
