Description
Overview
Natural Language Processing (NLP) is a subset of AI that focuses on the understanding
and generation of written and spoken language. This involves a series of tasks from
low-level speech recognition on audio signals up to high-level semantic understanding
and inferencing on the parsed sentences.
One task within this spectrum is Part-Of-Speech (POS) tagging. Every word and
punctuation symbol is understood to have a syntactic role in its sentence, such as
nouns (denoting people, places or things), verbs (denoting actions), adjectives (which
describe nouns) and adverbs (which describe verbs), to name a few. Each word in a
piece of text is therefore associated with a part-of-speech tag (usually assigned by
hand), where the total number of tags can depend on the organization tagging the text.
A list of all the part-of-speech tags can be found here.
While this task falls under the domain of NLP, having prior language experience doesn’t
offer any particular advantage. In the end, the main task is to create an HMM that
can infer a sequence of underlying (hidden) states given a sequence of observations.
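One standard way to recover the most likely sequence of hidden states from a sequence of observations is the Viterbi algorithm. The sketch below is illustrative only and is not part of the starter code: the function name `viterbi`, the nested-dictionary table layout, the toy probabilities and the floor value `TINY` for unseen events are all assumptions made for this example.

```python
import math

TINY = 1e-12  # assumed floor for unseen events; avoids log(0)


def viterbi(observations, states, initial, transition, emission):
    """Return the most likely state sequence (Viterbi, computed in log space)."""
    if not observations:
        return []

    def lp(p):
        return math.log(max(p, TINY))

    def trans(prev, cur):
        return lp(transition.get(prev, {}).get(cur, 0.0))

    def emit(state, obs):
        return lp(emission.get(state, {}).get(obs, 0.0))

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: lp(initial.get(s, 0.0)) + emit(s, observations[0]) for s in states}]
    back = []  # back[t - 1][s]: best predecessor of state s at step t
    for obs in observations[1:]:
        prev_best = best[-1]
        scores, pointers = {}, {}
        for s in states:
            p = max(states, key=lambda q: prev_best[q] + trans(q, s))
            scores[s] = prev_best[p] + trans(p, s) + emit(s, obs)
            pointers[s] = p
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final state to the start.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy two-tag model (hypothetical numbers, for illustration only).
states = ["N", "V"]
initial = {"N": 0.8, "V": 0.2}
transition = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emission = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.05, "bark": 0.7}}

tags = viterbi(["dogs", "bark"], states, initial, transition, emission)
print(tags)  # ['N', 'V']
```

Working in log space avoids numerical underflow on long sentences. The work per word grows with the square of the number of tags, which matters when efficiency is being graded.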
What You Need To Do:
Your task for this assignment is to create a Hidden Markov Model (HMM) for POS
tagging, including:
1. Training probability tables (i.e., initial, transition and emission) for the HMM from
training files containing text-tag pairs
2. Performing inference with your trained HMM to predict appropriate POS tags
for untagged text.
Your solution will be graded based on the learned probability tables and the accuracy
on our test files, as well as the efficiency of your algorithm. See Mark Breakdown for
more details.
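The probability tables in step 1 can be estimated by counting and normalizing. The sketch below is a minimal illustration under stated assumptions: it works on in-memory sentences given as lists of `(word, tag)` pairs rather than on the actual training file format, and the function name `train_tables`, the nested-dictionary layout and the toy data are hypothetical. It applies no smoothing, which a real tagger would likely need for unseen words.

```python
from collections import Counter, defaultdict


def train_tables(sentences):
    """Estimate initial, transition and emission tables by relative frequency.

    `sentences` is a list of sentences, each a list of (word, tag) pairs.
    """
    init_counts = Counter()
    trans_counts = defaultdict(Counter)  # trans_counts[prev_tag][tag]
    emit_counts = defaultdict(Counter)   # emit_counts[tag][word]

    for sentence in sentences:
        if not sentence:
            continue
        init_counts[sentence[0][1]] += 1  # tag of the first word
        for word, tag in sentence:
            emit_counts[tag][word] += 1
        for (_, prev_tag), (_, tag) in zip(sentence, sentence[1:]):
            trans_counts[prev_tag][tag] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    initial = normalize(init_counts)
    transition = {tag: normalize(c) for tag, c in trans_counts.items()}
    emission = {tag: normalize(c) for tag, c in emit_counts.items()}
    return initial, transition, emission


# Toy corpus (hypothetical, for illustration only).
corpus = [
    [("dogs", "N"), ("bark", "V")],
    [("cats", "N"), ("sleep", "V")],
    [("run", "V")],
]
initial, transition, emission = train_tables(corpus)
print(transition["N"]["V"])   # 1.0
print(emission["N"]["dogs"])  # 0.5
```

Each row of the resulting tables sums to 1, so `initial[tag]`, `transition[prev_tag][tag]` and `emission[tag][word]` can be read directly as probabilities.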
Starter Code & Validation Program
The starter code contains one Python starter file, a validation program and several
training and test files. You can download the code and supporting files as a zip file,
starter-code.zip. In that archive, you will find the following files:
Project File (the file you will edit and submit on Markus):
tagger.py    The file where you will implement your POS tagger; this is the only file that will be submitted and graded.
Training Files (don’t modify):
data/training1.txt – training5.txt    Training files (in text format) containing large texts with POS tags on each word.
Testing Files (don’t modify):
data/test1.txt – test5.txt    Test files (in text format), identical to the training files but without the POS tags.
Validation Files:
validation/tagger-validate.py    The public validation script for testing your solution with a set of provided files.
Running the Code
You can run the POS tagger by typing the following at a command line:
$ python3 tagger.py -d -t -o