Description
Data Annotation
In this assignment, we’re interested in the main topics discussed on the /r/mcgill subreddit vs. the /r/concordia
subreddit. We’ll do this using human annotation … and you’re the annotator 😊
Task 1: Data collection (10 pts)
First, let’s collect the 100 newest posts using the /new endpoint (do not use the /hot endpoint).
Write a script “collect_newest.py” that collects the 100 newest posts in the subreddit specified. It should run
as follows:
python3 collect_newest.py -o <output_file> -s <subreddit>
Collect two data files – one for mcgill and one for concordia subreddits. This involves running your script two
times. Note that in the output data files, you should have exactly one post (in JSON format) per line. Do
not indent the JSON output. The files should be named concordia.json and mcgill.json. Place them in the
root folder of the submission template. Please read the README.md file in the repository for further
instructions.
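As a rough starting point, here is a minimal sketch of collect_newest.py. It assumes the public JSON listing endpoint https://www.reddit.com/r/<subreddit>/new.json and the requests library, and the flag names simply mirror the command above; adapt it if the course requires the official Reddit API or a client such as PRAW.

import argparse
import json

import requests


def main():
    parser = argparse.ArgumentParser(description="Collect the newest posts from a subreddit.")
    parser.add_argument("-o", "--output", required=True, help="output file (one JSON post per line)")
    parser.add_argument("-s", "--subreddit", required=True, help="subreddit name, e.g. mcgill")
    args = parser.parse_args()

    # Reddit's /new listing; a single request returns at most 100 posts.
    url = f"https://www.reddit.com/r/{args.subreddit}/new.json"
    headers = {"User-Agent": "data-annotation-assignment script"}  # any descriptive UA string
    response = requests.get(url, params={"limit": 100}, headers=headers)
    response.raise_for_status()
    posts = response.json()["data"]["children"]

    with open(args.output, "w", encoding="utf-8") as f:
        for post in posts:
            # One un-indented JSON object per line, as required above.
            f.write(json.dumps(post["data"]) + "\n")


if __name__ == "__main__":
    main()

Running the script once per subreddit (e.g. with -o mcgill.json -s mcgill, then -o concordia.json -s concordia) produces the two required data files.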
Task 2: Prep for coding (10 pts)
Write a script extract_to_tsv.py that accepts one of the files you collected from Reddit and outputs a
random selection of posts from that file to a TSV (tab-separated values) file. It should function like this:
python3 extract_to_tsv.py -o <output_tsv_file> <input_json_file> <num_posts_to_output>
If <num_posts_to_output> is greater than the file length, then the script should just output all lines. If there
are more than <num_posts_to_output> posts in the file (which is likely the case), then it should randomly select
num_posts_to_output (the parameter you passed to the script) of them and output just those.
The output format (written to out_file) is:
name <TAB> title <TAB> coding
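For illustration, here is a minimal sketch of extract_to_tsv.py under the assumptions above; the positional-argument order, the lowercase header row, and the use of Reddit's "name" field (the post's fullname identifier) are assumptions, not requirements.

import argparse
import csv
import json
import random


def main():
    parser = argparse.ArgumentParser(description="Write a random sample of posts to a TSV file.")
    parser.add_argument("-o", "--output", required=True, help="output .tsv file")
    parser.add_argument("input_file", help="JSON-lines file produced by collect_newest.py")
    parser.add_argument("num_posts_to_output", type=int, help="number of posts to sample")
    args = parser.parse_args()

    with open(args.input_file, encoding="utf-8") as f:
        posts = [json.loads(line) for line in f if line.strip()]

    # Output everything if the request exceeds the number of posts available;
    # otherwise draw a random sample of the requested size.
    if args.num_posts_to_output >= len(posts):
        sample = posts
    else:
        sample = random.sample(posts, args.num_posts_to_output)

    with open(args.output, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["name", "title", "coding"])  # header row matching the format above
        for post in sample:
            # The coding column is left blank for manual annotation later.
            writer.writerow([post.get("name", ""), post.get("title", ""), ""])


if __name__ == "__main__":
    main()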