CSCI 4152/6509 Assignment 1 solution

Natural Language Processing

1) (22 marks) Complete Lab 1 as instructed. In particular, you will need to properly:
a) (4 marks) Submit the file ‘hello.pl’ as instructed.
b) (4 marks) Submit the file ‘lab1-example2.pl’ as instructed.
c) (4 marks) Submit the file ‘lab1-example5.pl’ as instructed.
d) (5 marks) Submit the file ‘lab1-task1.pl’ as instructed.
e) (5 marks) Submit the file ‘lab1-task2.pl’ as instructed.
All examples must compile and run correctly. For example, if a syntax error is introduced
into an example program by a typing mistake or by incorrect characters picked up while
copying and pasting from a PDF file, the solution will not be accepted. The lab instructions
state that the programs should be tested before being submitted. If some code is given to
be used, you should type it rather than copy-paste it, unless specified differently. This also
gives you a better opportunity to learn the programming language and the illustrated concepts.
1 CSCI 4152/6509
2) (19 marks) Submit your answer to this question as a plain-text file called a1q2.txt
using the submit-nlp command, or the course web site. Clearly separate your answers into
parts a), b), and c).
A plain-text file is preferred, but if you want to include a figure you can instead submit
a file named a1q2.pdf, a1q2.jpg, or a1q2.png in the format specified by the file name
extension (pdf, jpg, or png). Drawing a figure by hand is okay if it is clear and readable.
a) (7 marks) List the levels of NLP, with a one-sentence description for each of them.
b) (12 marks) Consider an application on your phone or computer to which you can ask
a question using your voice, and which tries to give you an answer. Apple Siri is an example of
such an application. Choose three levels of NLP and briefly describe the processing done at those
levels that could be used in such an application.
3) (19 marks) Submit your answer to this question in a plain-text file named a1q3.txt using
the submit-nlp command, or you can submit it as a pdf, jpg, or png file.
Clearly separate parts a), b), c), and d) in the solution.
Consider an NFA (Non-deterministic Finite Automaton) described by the following graph
(note that the NFA includes epsilon transitions):
[Figure: NFA graph with states q0 to q6, transitions labelled a and b, and three ε transitions]
a) (3 marks) Give three examples of words accepted by this NFA.
b) (3 marks) Briefly describe in plain English what language is accepted by this NFA;
i.e., what words are accepted by this NFA.
c) (3 marks) Write a regular expression that is equivalent to this NFA. (You do not have
to use starting and ending anchors.)
d) (10 marks) Translate this NFA into a DFA using the process discussed in class. Submit
your solution as a table; you must follow the algorithm described in class. You can
submit your table either as part of a pdf or image file, or as part of a text file. (The filename
must be as specified at the beginning of the question.)
If you use the plain-text format, then follow the table format shown below as an
example:
State      | a      | b      | <-- (JUST AN EXAMPLE)
-----------+--------+--------+
S: q0q1    | q0q1   | q1     |
-----------+--------+--------+
q1         | q0q2   | q0q2   |
-----------+--------+--------+
F: q0q2    | q0q1   | q1     |
-----------+--------+--------+
Explanation: Use the minus, vertical line, and plus characters to draw the table. The columns
correspond to input characters. The DFA states are sets of NFA states, shown as sequences
of states sorted by index (for example, use q0q1 rather than q1q0). Use the labels S:
and F: to denote start and finish states. If a set of NFA states is the empty set, then use the word
‘empty’ to denote it.
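As an illustration of how such a table can be produced, here is a minimal Python sketch of the usual subset construction with epsilon closures. It is not necessarily the exact procedure from class, and the NFA in it is a small hypothetical one, not the NFA from the figure. The names NFA, eps_closure, step, name, subset_construction, and render are made up for this sketch. Two further assumptions: the empty set appears only inside cells and does not get its own row, and a state that is both start and finish would carry both labels.

```python
from collections import deque

# Hypothetical NFA, NOT the one from the assignment figure.
# Keys are (state, symbol); the empty string "" stands for an epsilon move.
NFA = {
    ("q0", ""): {"q1"},
    ("q0", "a"): {"q0"},
    ("q1", "b"): {"q2"},
    ("q2", "a"): {"q1"},
}
START, FINALS, ALPHABET = "q0", {"q2"}, ["a", "b"]


def eps_closure(states):
    """All NFA states reachable from `states` using only epsilon moves."""
    closure, stack = set(states), list(states)
    while stack:
        for nxt in NFA.get((stack.pop(), ""), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def step(states, symbol):
    """Epsilon closure of everything reachable from `states` on `symbol`."""
    moved = set()
    for s in states:
        moved |= NFA.get((s, symbol), set())
    return eps_closure(moved)


def name(states):
    """Name sorted by index, such as q0q1, or 'empty' for the empty set."""
    return "".join(sorted(states, key=lambda q: int(q[1:]))) or "empty"


def subset_construction():
    """Return the DFA start state and its transition table."""
    start = eps_closure({START})
    table, queue = {}, deque([start])
    while queue:
        current = queue.popleft()
        if current in table:
            continue
        table[current] = [step(current, c) for c in ALPHABET]
        # Assumption: the empty set is shown in cells but not expanded as a row.
        queue.extend(t for t in table[current] if t and t not in table)
    return start, table


def render(start, table):
    """Draw the table with minus, vertical line, and plus characters."""
    rule = "-" * 11 + "+" + ("-" * 8 + "+") * len(ALPHABET)
    header = [f"{'State':<10}"] + [f"{c:<6}" for c in ALPHABET]
    lines = [" | ".join(header) + " |", rule]
    for states, targets in table.items():
        label = ("S: " if states == start else "") + ("F: " if states & FINALS else "")
        cells = [f"{label + name(states):<10}"] + [f"{name(t):<6}" for t in targets]
        lines += [" | ".join(cells) + " |", rule]
    return "\n".join(lines)


print(render(*subset_construction()))
```

For this hypothetical NFA the start state is q0q1, because q1 is reachable from q0 by an epsilon move, and the only finish state is q2.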
4) (15 marks) Write a Perl program named a1q4.pl and submit it using the submit-nlp
command.
Sometimes we want to go through a large amount of text and search for any email
addresses that appear in the text. Your task is to write a program that searches for
lines containing an email address and, for each such line, prints the email address, followed
by a colon (:), followed by a space, and then the line containing that email address.
In order to capture realistic email addresses, we will consider an email address to be a string
that satisfies the following conditions:
1. it starts with a letter,
2. after the letter, it can be followed by an arbitrary sequence of letters, digits, minus signs
(-), equal signs (=), periods (.), plus signs (+), or underscores (_),
3. one at-sign (@),
4. after at-sign, there must be again a non-empty sequence of letters, digits, minus sign,
equal sign, period, plus sign, or underscore, which must start with a letter or digit,
must include at least one period, and the last character must be a letter.
For each line in which your program recognizes an email address, you should print that
email address, a colon, a space, and the line itself. If a line contains more than one valid
email, you must print only the first one, then colon, space, and the full line.
These requirements are far from perfect for the real task, but you must follow them
exactly for this assignment. For example, if the program reads the input line:
This is a line with email email@blah, or 123@Dal.ca, or a@b.
the program would not print the line because there is no valid email. On the other hand,
with a few minor modifications, such as:
This is a line with email email@bla.h, or 123@Dal.ca, or a@b.
This is a line with email email@blah, or 12a3@Dal.ca, or a@b.
This is a line with email email@blah, or 123@Dal.ca, or a@b.X.
all three lines contain one valid email each: ‘email@bla.h’, ‘a3@Dal.ca’, and ‘a@b.X’.
You should use the Perl ‘diamond’ operator so that the program can read either standard
input, or open files with filenames given as parameters. After reading each line, the program
must decide whether to print output or not. For example, for the above input of three lines,
which all contain valid email addresses, the output should look like this:
email@bla.h: This is a line with email email@bla.h, or 123@Dal.ca, or a@b.
a3@Dal.ca: This is a line with email email@blah, or 12a3@Dal.ca, or a@b.
a@b.X: This is a line with email email@blah, or 123@Dal.ca, or a@b.X.
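The four conditions above can be read as one regular expression. The following is a minimal sketch in Python rather than the Perl that the question requires, meant only to show how the conditions map onto a pattern. The names CHARS, EMAIL_RE, and annotate are made up for this sketch, and it assumes that "letter" means an ASCII letter.

```python
import re

# Characters allowed after the first letter and after the at-sign
# (assumption: "letter" means an ASCII letter).
CHARS = "[A-Za-z0-9=.+_-]"

EMAIL_RE = re.compile(
    rf"""
    [A-Za-z] {CHARS}*        # conditions 1 and 2: a letter, then allowed characters
    @                        # condition 3: the at-sign
    [A-Za-z0-9] {CHARS}*     # condition 4: starts with a letter or digit,
    \. {CHARS}*              #   contains at least one period,
    [A-Za-z]                 #   and ends with a letter
    """,
    re.VERBOSE,
)


def annotate(line):
    """Return 'email: line' for the first email in the line, or None."""
    match = EMAIL_RE.search(line)
    if match is None:
        return None
    return f"{match.group(0)}: {line}"


# The sample lines from the question: the first has no valid email.
samples = [
    "This is a line with email email@blah, or 123@Dal.ca, or a@b.",
    "This is a line with email email@bla.h, or 123@Dal.ca, or a@b.",
    "This is a line with email email@blah, or 12a3@Dal.ca, or a@b.",
    "This is a line with email email@blah, or 123@Dal.ca, or a@b.X.",
]
for sample in samples:
    result = annotate(sample)
    if result is not None:
        print(result)
```

Because the search returns the leftmost match and the pattern does not require a word boundary, the third sample yields a3@Dal.ca, as in the expected output above.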
If you are not familiar with the concepts of the standard input and standard output, you
should look them up, particularly in the context of Linux or Unix operating systems. The
standard input basically means that you can type the input using the keyboard after running
the program from the command line, or redirect input from a file using a command such
as ‘./prog.pl < input.txt’. The standard output means that the program prints to the
screen, or that output can also be captured into a file using a command such as ./prog.pl
< input.txt > output.txt.
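For comparison, Python's fileinput module reads input in much the same way as the Perl diamond operator: it takes the files named on the command line, or standard input when none are given. The sketch below is only an illustration of that idea; the helper read_lines and the temporary file standing in for input.txt are made up for this example.

```python
import fileinput
import os
import tempfile


def read_lines(files=None):
    """Yield input lines without the trailing newline.

    With files=None, fileinput reads the files named in sys.argv[1:],
    or standard input when no file names were given.
    """
    with fileinput.input(files=files) as stream:
        for line in stream:
            yield line.rstrip("\n")


# Demonstration with a temporary file standing in for input.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name
try:
    lines = list(read_lines([path]))
finally:
    os.remove(path)
print(lines)
```

In a real script, calling read_lines() with no argument would cover both ways of supplying input described above.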
Important: Do not print any extra output other than what is specified! Students sometimes
print additional output, such as prompts asking the user to enter input, in order to make
the interface more user-friendly. This would be an error. You must follow the specifications
of the problem precisely.
Your solution will be mainly marked on correctness. The markers will also look at the
style. There are no significant requirements on the style other than reasonable indentation,
and reasonable clarity of code. You can include comments, but they are not mandatory
other than the header comment, which must identify the file name, author, and course. This
is an example of such a comment:
#!/usr/bin/perl
# File: a1q4.pl Author: Vlado Keselj
# Solution to question 4 of assignment 1, CSCI4152/6509 Fall 2021
5) (15 marks, programming) Write and submit a program written in Perl, Python, C,
C++, or Java, named a1q5.pl, a1q5.py, a1q5.c, a1q5.cc, or a1q5.java, for the task
given below.
The program is meant to give a basic statistical analysis of HTML tags in a file. We will
consider an HTML tag to be any string that starts with a less-than sign (<) and ends with a greater-than sign (>). A tag can span more than one line. If outside of a tag we see the string <!--, it starts a comment, which ends with the string -->.
The part of the content inside a comment is ignored even if it appears to have some tags.
If a comment does not end; i.e., you cannot find the ending -->, you should consider the end
of the input to be the end of the comment.
If a tag does not end, you should not consider it to be a valid tag.
After the start of a tag, you should always look only for the end of the tag (>), even when the
tag again contains strings such as ‘<’ or ‘<!--’. [...]
the program must print out:
Tag count: 3
Min length: 2
Max length: 16
Avg length: 8.00
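The scanning rules above can be sketched as a single pass over the input. This is a minimal Python illustration under stated assumptions, not a reference solution. It assumes the length of a tag counts every character from < to > inclusive, that the average is printed with two decimals, and that the output for an input with no tags is up to the implementer. The names tag_lengths and report, and the sample text, are made up for this sketch.

```python
def tag_lengths(text):
    """Return the length of every complete tag found outside comments."""
    lengths, pos = [], 0
    while True:
        start = text.find("<", pos)
        if start == -1:
            break
        if text.startswith("<!--", start):
            end = text.find("-->", start + 4)
            if end == -1:
                break  # an unterminated comment runs to the end of the input
            pos = end + 3
            continue
        # Inside a tag only '>' matters, even if '<' or '<!--' shows up again.
        end = text.find(">", start + 1)
        if end == -1:
            break  # an unterminated tag is not a valid tag
        lengths.append(end - start + 1)  # assumption: '<' and '>' are counted
        pos = end + 1
    return lengths


def report(lengths):
    """Format the statistics in the style of the expected output."""
    if not lengths:
        return "Tag count: 0"  # assumption: this case is not specified
    return "\n".join([
        f"Tag count: {len(lengths)}",
        f"Min length: {min(lengths)}",
        f"Max length: {max(lengths)}",
        f"Avg length: {sum(lengths) / len(lengths):.2f}",
    ])


# Hypothetical sample: a comment hiding tags, a tag that spans two lines,
# and a tag that never ends.
sample = "<p>Hi <!-- <b>ignored</b> --> there <a\nhref='x'></p> <unfinished"
print(report(tag_lengths(sample)))
```

For this sample the counted tags are <p>, the two-line <a ...> tag, and </p>; the tags inside the comment and the unfinished tag are skipped.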
Your solution will be mainly marked on correctness. The markers will also look at the
style. There are no significant requirements on the style other than reasonable indentation,
and reasonable clarity of code. You can include comments, but they are not mandatory
other than the header comment, which must identify the file name, author, and course. This
is an example of such a comment in Perl:
#!/usr/bin/perl
# File: a1q5.pl Author: Vlado Keselj
# Solution to question 5 of assignment 1, CSCI4152/6509