APPLICATION OF HYBRID APPROACH FOR WOLAITA LANGUAGE PART OF SPEECH TAGGING

  • Birhanesh Fikre Shirko, Wolaita Sodo University
Keywords: NLP, HMM, TBL, NLTK, Hybrid

Abstract

The main purpose of this study is to develop a part-of-speech (PoS) tagger for the Wolaita language using a hybrid approach. PoS tagging is a subtask of Natural Language Processing (NLP) that supports other NLP applications such as parsers, machine translators, speech recognizers and search engines. It is the process of assigning to each word the part-of-speech tag that describes how the word is used in a sentence. Existing PoS tagging for the Wolaita language is not yet adequate to serve as a core module in other NLP applications. In this study, a PoS tagger for the Wolaita language was developed using a hybrid approach that combines a Hidden Markov Model (HMM) with a rule-based approach. In general, an HMM needs a large amount of training data to perform well, whereas a rule-based model learns rules from the features of the language. The HMM tagger tags words according to the optimal path for a given sequence of words, while transformation-based learning (TBL) is a rule-based approach that learns rules directly from the training corpus without expert knowledge. The hybrid Wolaita PoS tagger uses the HMM tagger as the initial annotator and the rule-based tagger as a corrector, based on fixed threshold values. Python and NLTK were used for implementation and experiments. For training and testing the models, 1,256 sentences (15,268 words) were collected from three sources (the Bible, Wolaita-language social media (Wogetta FM 96.6), and the Wolaita language department) and annotated manually. A tagset of 26 PoS tags was identified. Of the entire corpus, 90% was used for training and the remaining 10% for testing. The taggers were evaluated in several experiments, in which the HMM, rule-based and hybrid taggers reached a performance of 88.14%, 93.19% and 94.82%, respectively. Overall, the hybrid approach performed best at assigning part-of-speech tags for the Wolaita language.
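
The two-stage design described in the abstract can be illustrated with a minimal Python sketch. It is not the study's implementation. It assumes that the first-stage tagger reports a confidence score per token, that "fixed threshold" means the corrector only revisits tokens whose confidence falls below that value, and that rules take the simple Brill-style form "change tag A to B when the previous tag is C". The names `Rule`, `split_corpus`, `correct` and `accuracy`, the placeholder tokens, the tags and the threshold of 0.5 are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    """Hypothetical Brill-style rule: from_tag -> to_tag if the previous tag is prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def split_corpus(sentences: list, train_fraction: float = 0.9) -> Tuple[list, list]:
    """Split a corpus into training and test parts (90% / 10% by default)."""
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]


def correct(initial: List[Tuple[str, str, float]],
            rules: List[Rule],
            threshold: float) -> List[Tuple[str, str]]:
    """Apply rules only to tokens whose first-stage confidence is below threshold.

    initial holds (word, tag, confidence) triples from the first-stage tagger
    (the HMM in the abstract); the confidence score is an assumption of this sketch.
    """
    tags = [tag for _, tag, _ in initial]
    for rule in rules:
        for i in range(1, len(tags)):
            low_confidence = initial[i][2] < threshold
            if low_confidence and tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return [(word, tag) for (word, _, _), tag in zip(initial, tags)]


def accuracy(predicted: List[Tuple[str, str]], gold: List[Tuple[str, str]]) -> float:
    """Fraction of tokens whose predicted tag matches the gold tag."""
    hits = sum(p_tag == g_tag for (_, p_tag), (_, g_tag) in zip(predicted, gold))
    return hits / len(gold)


# Placeholder sentence: tokens and tags are illustrative, not Wolaita data.
first_stage = [("w1", "N", 0.95), ("w2", "N", 0.40), ("w3", "V", 0.90)]
gold = [("w1", "N"), ("w2", "ADJ"), ("w3", "V")]
rules = [Rule(from_tag="N", to_tag="ADJ", prev_tag="N")]

baseline = [(word, tag) for word, tag, _ in first_stage]
hybrid = correct(first_stage, rules, threshold=0.5)

print(accuracy(baseline, gold))  # first stage alone: 2 of 3 tags correct
print(accuracy(hybrid, gold))    # after correction: all 3 tags correct
```

In NLTK, the two stages would most likely map onto `nltk.tag.hmm.HiddenMarkovModelTrainer` and `nltk.tag.brill_trainer.BrillTaggerTrainer`, although the abstract does not state which classes were used.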

Published
2024-05-14