DSCI 558 Homework 3: Entity Resolution, Blocking & Knowledge Representation solution


Building Knowledge Graphs
In this homework, you will link movies from the Internet Movie Database (IMDb) to the
American Film Institute (AFI) catalog, implement and test two blocking techniques, and then
represent some of your data using RDF. The entity resolution and blocking tasks will be
done using the Record Linkage ToolKit (RLTK), an open-source record linkage platform. You
will use RDFLib, a Python library for working with RDF, for the knowledge representation
task. We provide a Python notebook (ER_KR.ipynb) that contains instructions, code, and
descriptions of how to use these tools.

Task 1: ER (4 points)
In this task, you are given a dataset of movies from IMDb (imdb.jl) and a dataset of movies
from AFI (afi.jl). Your goal is to match records from these 2 datasets using record linkage
methods. This means you need to figure out which pairs of movies in the two datasets are
referring to the same movie.
The IMDb and AFI datasets contain several attributes; some are unique to each dataset,
while others are present in both. For the linking task, we are interested in the three fields
present in both datasets: movie title/name, release date/year, and genre.
We provide the template file hw03_tasks_1_2.py, which includes some of the code you see in
the given notebook. You may use the notebook to develop and test your code, but the final
code for Tasks 1.1 and 1.2 (and also Task 2) must be implemented in this file.
Task 1.1 (2 points)
For each of the three attributes: (1) analyze the given data and choose string similarity
measures that you think are appropriate; (2) explain your choices in the report; (3)
implement a method that computes the similarity of this field between records of the two
datasets (look for `# ** STUDENT CODE. Task 1.1` in the submission file).
Notes:
• You may customize string similarity methods or change/clean attribute values if
necessary. For example, you can choose the Levenshtein similarity method for the
movie names, applied to values you derive from the original attribute values.
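As a concrete illustration of the note above, here is a minimal, self-contained sketch of a normalized Levenshtein similarity over cleaned titles. RLTK provides its own similarity functions, so treat this stdlib version purely as an illustration of the idea; the cleaning step (lower-casing and stripping whitespace) is an assumption you should adapt to the data.

```python
# Sketch of a normalized Levenshtein similarity for movie titles.
# RLTK ships built-in similarity functions; this stdlib version only
# illustrates cleaning a value before comparing it.

def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def title_similarity(t1: str, t2: str) -> float:
    """Clean both titles, then return 1 - distance / max_length, in [0, 1]."""
    t1, t2 = t1.strip().lower(), t2.strip().lower()
    if not t1 and not t2:
        return 1.0
    return 1.0 - levenshtein_distance(t1, t2) / max(len(t1), len(t2))
```

Normalizing by the longer string keeps every field similarity on the same [0, 1] scale, which simplifies combining them in Task 1.2.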
Task 1.2 (2 points)
Design a scoring function to combine your field similarities. Explain your choices of weights
in the scoring function in the report. Implement a method that predicts the corresponding
AFI movies for the IMDb movies (in imdb.jl) using your scoring function (complete the code
in main()). Set the value to null if there is no corresponding entry in the AFI dataset. Export
your prediction to an output file (Firstname_Lastname_hw03_imdb_afi_el.json) with the format:
[ { "imdb_movie": , "afi_movie": }, … ]
For example:
[
  {
    "imdb_movie": "https://www.imdb.com/title/tt0033467/",
    "afi_movie": "https://catalog.afi.com/#dc440a1a7fa4a6bd30f183eded493ef2"
  },
  {
    "imdb_movie": "https://www.imdb.com/title/tt0108052/",
    "afi_movie": "https://catalog.afi.com/#642a1d0b14872b56d8fde9228170da6f"
  }
]
Notes:
• Look for `# ** STUDENT CODE. Task 1.2` in the submission file
• The attached notebook offers an RLTK built-in method for running evaluation. You
can use the provided code to test your performance on the provided development set
(imdb_afi_el.dev.json) before finalizing your implementation in the file.
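One possible shape for such a scoring function is a weighted sum over the three field similarities, with an acceptance threshold for the "no match" case. The weights, field names, and threshold below are illustrative assumptions to be tuned against the development set, not prescribed values:

```python
# Hedged sketch of a weighted scoring function. The field names, weights,
# and threshold are assumptions; tune them on imdb_afi_el.dev.json.

WEIGHTS = {"name": 0.6, "year": 0.3, "genre": 0.1}  # assumed weights, sum to 1
MATCH_THRESHOLD = 0.8  # assumed cutoff below which we predict no AFI match

def score(field_sims: dict) -> float:
    """Combine per-field similarities (each in [0, 1]) into one score."""
    return sum(WEIGHTS[f] * field_sims.get(f, 0.0) for f in WEIGHTS)

def best_match(imdb_record, afi_records, sim_fn):
    """Return the highest-scoring AFI record, or None under the threshold."""
    best, best_score = None, 0.0
    for afi in afi_records:
        s = score(sim_fn(imdb_record, afi))
        if s > best_score:
            best, best_score = afi, s
    return best if best_score >= MATCH_THRESHOLD else None
```

Weights that sum to 1 keep the combined score in [0, 1], which makes the threshold easier to interpret and justify in the report.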
Task 2: Blocking (3 points)
In this task, you will use RLTK to implement two blocking techniques and evaluate their
effectiveness. Your code for this task will be implemented in the same file from the previous
task (hw03_tasks_1_2.py).
Task 2.1 (1 point)
Complete the missing code for this task (Look for `# ** STUDENT CODE. Task 2.1` in the
submission file).
Task 2.2 (1 point)
Implement hash-based and token-based blocking (Look for `# ** STUDENT CODE. Task 2.2` in
the submission file).
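The two techniques differ mainly in how records are assigned to blocks: hash blocking assigns each record to exactly one block via a key function, while token blocking can place a record into several overlapping blocks. RLTK's own block generators are the intended tool for this task; the stdlib sketch below only illustrates the keying logic, and the specific keys (first title token versus all title tokens) are assumptions:

```python
# Stdlib sketch of the two blocking ideas. RLTK provides its own block
# generators; this only shows how the keys partition the records.
from collections import defaultdict

def hash_blocks(records, key_fn):
    """Each record lands in exactly one block, keyed by key_fn(record)."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key_fn(r)].append(r)
    return blocks

def token_blocks(records, tokens_fn):
    """Each record lands in one block per token, so blocks can overlap."""
    blocks = defaultdict(list)
    for r in records:
        for tok in tokens_fn(r):
            blocks[tok].append(r)
    return blocks

movies = [{"title": "The Godfather"}, {"title": "The Matrix"}]
hb = hash_blocks(movies, lambda r: r["title"].split()[0].lower())
tb = token_blocks(movies, lambda r: {t.lower() for t in r["title"].split()})
```

Because token blocking duplicates records across blocks, it usually produces more candidate pairs than hash blocking but misses fewer true matches, a trade-off you will quantify in Task 2.3.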
Task 2.3 (1 point)
To evaluate the performance of the blocking techniques you implemented, we calculate the
reduction ratio and pairs completeness.
2.3.1 Run the script hw03_eval_blocking.py with a 'hash' argument
(i.e., python hw03_eval_blocking.py hash).
Include the numbers you get in your final report.
2.3.2 Run the script with a 'token' argument (i.e., python hw03_eval_blocking.py token).
Include the numbers you get in your final report.
2.3.3 Explain the difference in the results you get between the two blocking techniques
(max two sentences, in your final report).
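For reference, the two metrics are usually defined as follows: the reduction ratio measures how many of the all-pairs comparisons blocking eliminated, and pairs completeness measures how many ground-truth matches survive blocking. A sketch under these standard definitions (the evaluation script computes its own values; this is only to make the quantities concrete):

```python
# Standard blocking metrics. "Candidate pairs" are the cross-dataset
# pairs that survive blocking; "true matches" is the ground truth.

def reduction_ratio(n_candidate_pairs: int, n_left: int, n_right: int) -> float:
    """1 - candidates / all cross-dataset pairs; higher means more pruning."""
    return 1.0 - n_candidate_pairs / (n_left * n_right)

def pairs_completeness(candidate_pairs, true_matches) -> float:
    """Fraction of ground-truth matches kept among the candidates."""
    return len(true_matches & set(candidate_pairs)) / len(true_matches)
```

A good blocking scheme pushes both numbers toward 1; the tension between them is exactly what Task 2.3.3 asks you to explain.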
Task 3: KR (3 points)
In this task, you will represent the movie data (after linking) using RDF. The ontology
(vocabulary/schema) you will use is schema.org. As this ontology may not include all
necessary classes and properties to model your data, you will need to extend the ontology
with classes and properties that you define on your own.
Task 3.1 (1 point)
Describe the model (in the report) you will use to generate the RDF data to describe the
merged movie entry with all of the available attributes from the two sources you have
matched.
There is a total of 13 attributes: title, release-date, certificate, runtime, genre, imdb-rating,
imdb-metascore, imdb-votes, gross-income, producer, writer, cinematographer, and
production-company.
Use the appropriate classes and properties from schema.org. Define your own if you cannot
locate a suitable one in schema.org. Finalize the file describing your model (model.ttl) by
adding the missing attributes, and rename it to Firstname_Lastname_hw03_model.ttl.
Notes:
• As a starting point, you may want to use the class `https://schema.org/Movie` to
represent a movie and the property `https://schema.org/datePublished` to represent
a predicate that describes that movie’s release time (as seen in model.ttl).
• The attribute 'production company' should not be represented as a plain literal attached
to the movie entry. Instead, create a local URI for each production company (as depicted
in model.ttl, the movie's production company is an instance of a class, not a literal).
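Extensions of this kind are typically declared in the model file itself. The following is a hedged sketch of what such a declaration might look like for the imdb-votes attribute; the `myns:` namespace and the property name are illustrative assumptions, not part of the provided model.ttl:

```turtle
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix myns:   <http://example.org/myns/> .

# Hypothetical custom property for an attribute with no close schema.org match.
myns:imdbVotes a rdf:Property ;
    rdfs:label "imdb-votes" ;
    schema:domainIncludes schema:Movie ;
    schema:rangeIncludes schema:Integer .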

Task 3.2 (1 point)
Implement a program that uses the data from the two datasets and your results file
(Firstname_Lastname_hw03_imdb_afi_el.json). The program should convert the combined
movie data to RDF triples (in Turtle format, .ttl) using the model you defined in Task 3.1;
the generated file should be named Firstname_Lastname_hw03_movie_triples.ttl.
Notes:
• Use the IMDb URI as the identifier of the node (subject). You can discard the AFI URI.
• See the attached notebook for an example of how to create and serialize an RDF graph
(triples) in Turtle format.
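The intended tool here is RDFLib's graph API, as shown in the notebook. Purely to illustrate the shape of the expected output, the following stdlib sketch emits a minimal Turtle fragment for one movie and its production company; the `myns:` namespace and the company URI are hypothetical:

```python
# Stdlib sketch of the triple shapes expected in the output .ttl file.
# In the actual solution, RDFLib should build and serialize the graph;
# the myns: namespace and company URI below are illustrative assumptions.

def movie_turtle(imdb_uri: str, title: str, company_uri: str) -> str:
    """Emit a minimal Turtle fragment: the movie node plus a company node."""
    return "\n".join([
        "@prefix schema: <https://schema.org/> .",
        "@prefix myns: <http://example.org/myns/> .",
        "",
        f"<{imdb_uri}> a schema:Movie ;",
        f'    schema:name "{title}" ;',
        f"    schema:productionCompany <{company_uri}> .",
        "",
        f"<{company_uri}> a schema:Organization .",
    ])

print(movie_turtle("https://www.imdb.com/title/tt0033467/",
                   "Citizen Kane",
                   "http://example.org/myns/company/rko"))
```

Note how the production company is a URI node with its own type, not a literal, matching the requirement from Task 3.1.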
Task 3.3 (1 point)
Choose two movie instances and a single production company instance (locate them in your
.ttl file; the two movies should refer to the same production company) and visualize the triple
data in your report. Use the online tool at http://www.ldf.fi/service/rdf-grapher to visualize
the triples as a graph. The result should look like what is shown in Figure 1 (the figure shows
partial data; yours should be complete).
Figure 1: An example of a graph visualization
Submission Instructions
You must submit (via Blackboard) the following files/folders in a single .zip archive named
Firstname_Lastname_hw03.zip:
• Firstname_Lastname_hw03_report.pdf: PDF file with your answers to Tasks 1, 2 & 3
• hw03_tasks_1_2.py: as described in Tasks 1+2
• Firstname_Lastname_hw03_imdb_afi_el.json: as described in Task 1.2
• Firstname_Lastname_hw03_model.ttl: as described in Task 3.1
• Firstname_Lastname_hw03_movie_triples.ttl: as described in Task 3.2
• source: This folder includes all the additional code you wrote to accomplish the tasks