PyCon Indonesia 2017: Part-of-Speech Tagger for Bahasa Indonesia Using Hidden Markov Model and Viterbi Algorithm

Each word in a sentence has its own word class. In Natural Language Processing, the word class is also known as the part of speech (POS). Examples of word classes are noun, verb, adverb, and adjective. The word class denotes the role a word plays in a sentence, and the sequence of word classes builds the structure of the sentence. For instance, many simple sentences follow the general structure of noun, verb, noun.

This talk discusses an approach for predicting the most likely sequence of parts of speech for a given sentence (a sequence of words). In probabilistic terms, the task is to find the probability of Y given X, where Y is the sequence of parts of speech and X is the sequence of words.
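
As a rough illustration of this goal (not the speaker's actual implementation), the sketch below enumerates every candidate tag sequence and keeps the one with the highest score under some scoring function; the names predict_tags and score are hypothetical. The Viterbi algorithm described later performs this search efficiently instead of by brute force.

```python
from itertools import product

def predict_tags(words, tag_set, score):
    """Return the tag sequence Y that maximises score(words, Y).

    Brute-force illustration of "argmax over Y of P(Y given X)";
    enumerating every combination is exponential in sentence length,
    which is why the Viterbi algorithm is used in practice."""
    candidates = product(tag_set, repeat=len(words))
    return max(candidates, key=lambda tags: score(words, tags))
```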

The model is built by implementing the Hidden Markov Model (HMM) together with the Viterbi algorithm. Building the model uses the transition probabilities and emission probabilities estimated from every word found in the training data. The transition probability is the probability of a part of speech occurring given the previous part of speech; for instance, P(V given N) is the probability of the verb class occurring after the noun class. The emission probability is the probability of a word occurring given a certain part of speech; for instance, the emission probability P(language given N) is the probability of the word 'language' occurring with noun as its part of speech.
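
A minimal sketch of how these two kinds of probabilities could be estimated from a tagged corpus, assuming the corpus is a list of sentences made of (word, tag) pairs; the function and variable names here are illustrative, not taken from the talk.

```python
from collections import defaultdict, Counter

def estimate_probabilities(tagged_sentences):
    """Estimate transition P(tag | previous tag) and emission
    P(word | tag) probabilities by counting over the corpus."""
    transition = defaultdict(Counter)   # transition[prev_tag][tag]
    emission = defaultdict(Counter)     # emission[tag][word]
    for sentence in tagged_sentences:
        prev_tag = "<s>"                # sentence-start marker
        for word, tag in sentence:
            transition[prev_tag][tag] += 1
            emission[tag][word.lower()] += 1
            prev_tag = tag
    # Normalise the counts into probabilities
    trans_prob = {p: {t: c / sum(cnt.values()) for t, c in cnt.items()}
                  for p, cnt in transition.items()}
    emit_prob = {t: {w: c / sum(cnt.values()) for w, c in cnt.items()}
                 for t, cnt in emission.items()}
    return trans_prob, emit_prob

# Toy example: "saya suka bahasa" ("I like language")
corpus = [[("saya", "PRP"), ("suka", "VB"), ("bahasa", "NN")]]
trans_prob, emit_prob = estimate_probabilities(corpus)
print(trans_prob["PRP"]["VB"])    # P(VB given PRP) = 1.0 in this toy corpus
print(emit_prob["NN"]["bahasa"])  # P(bahasa given NN) = 1.0
```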

The built model is then applied using the Viterbi algorithm. This algorithm searches for the best sequence of parts of speech and consists of two steps, a forward step and a backward step. The forward step computes, for each position in the sentence, the best-scoring path ending in each part of speech, whereas the backward step backtracks through the stored choices to recover the best path.
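
A minimal sketch of Viterbi decoding in Python, assuming the transition and emission dictionaries from the estimation sketch above and a small probability floor for unseen events; the names and the smoothing choice are assumptions, not the speaker's implementation.

```python
import math

def viterbi(words, tags, trans_prob, emit_prob, start="<s>"):
    """Find the most likely tag sequence for `words` in log-space.
    Forward step: fill the score table and backpointers.
    Backward step: follow the backpointers to recover the path."""
    def logp(dist, key):
        return math.log(dist.get(key, 1e-12))  # tiny floor for unseen events

    n = len(words)
    score = [{} for _ in range(n)]
    backptr = [{} for _ in range(n)]

    # Forward step: best log-score of any path ending in `tag` at position i
    for tag in tags:
        score[0][tag] = (logp(trans_prob.get(start, {}), tag) +
                         logp(emit_prob.get(tag, {}), words[0]))
        backptr[0][tag] = None
    for i in range(1, n):
        for tag in tags:
            best_prev = max(tags, key=lambda p: score[i - 1][p] +
                            logp(trans_prob.get(p, {}), tag))
            score[i][tag] = (score[i - 1][best_prev] +
                             logp(trans_prob.get(best_prev, {}), tag) +
                             logp(emit_prob.get(tag, {}), words[i]))
            backptr[i][tag] = best_prev

    # Backward step: backtrack from the best final tag to regain the path
    last = max(tags, key=lambda t: score[n - 1][t])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))
```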