DSC 450 Assignment Module 5 solution

Database Processing for Large-Scale Analytics

Part 1

Using the company.sql database (posted with this assignment), write the following SQL queries.

1. Find the names of all employees who are directly supervised by ‘Franklin T Wong’ (you cannot use Franklin’s SSN value in the query).

2. For each project, list the project name, project number, and the total hours per week (by all employees) spent on that project.

3. For each department, retrieve the department name and the average salary of all employees working in that department. Order the output by department number in ascending order.

4. Retrieve the average salary of all female employees.

5. For each department whose average salary is greater than $42,000, retrieve the department name and the number of employees in that department.

6. Retrieve the names of employees whose salary is within $21,000 of the salary of the highest-paid employee in the company (e.g., if the highest salary in the company is $83,000, retrieve the names of all employees who make at least $62,000). Naturally, your query should work for any salary.

7. Find all female employees using:
a. Plain SELECT query
b. Sub-query

8. Find all employees who are not assigned to any project, using a SET operation in SQL.
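
Two of the patterns above (a threshold relative to the maximum salary, and a SET operation) can be tried out on a toy database before running them against company.sql. The sketch below is only an illustration: the table and column names (EMPLOYEE, WORKS_ON, Ssn, Essn, Lname, Salary) and all the rows are assumptions, and the real company.sql schema may differ.

```python
import sqlite3

# Toy stand-in for company.sql; table names, column names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (Ssn TEXT PRIMARY KEY, Lname TEXT, Salary NUMERIC);
    CREATE TABLE WORKS_ON (Essn TEXT, Pno INTEGER, Hours NUMERIC);
    INSERT INTO EMPLOYEE VALUES ('1', 'Adams', 83000), ('2', 'Baker', 62000),
                                ('3', 'Clark', 40000);
    INSERT INTO WORKS_ON VALUES ('1', 10, 20.0), ('3', 10, 15.0);
""")

# Question 6 pattern: compare each salary against a scalar sub-query,
# so the threshold follows whatever the highest salary happens to be.
near_top = conn.execute("""
    SELECT Lname FROM EMPLOYEE
    WHERE Salary >= (SELECT MAX(Salary) FROM EMPLOYEE) - 21000
    ORDER BY Lname
""").fetchall()

# Question 8 pattern: EXCEPT drops every SSN that appears in WORKS_ON.
unassigned = conn.execute("""
    SELECT Ssn FROM EMPLOYEE
    EXCEPT
    SELECT Essn FROM WORKS_ON
""").fetchall()

print(near_top)
print(unassigned)
conn.close()
```

Because the threshold is computed by the sub-query rather than typed in, the same statement keeps working when salaries change.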

Part 2

Create the table and use Python to automate loading the following file into SQLite:
http://dbgroup.cdm.depaul.edu/DSC450/Public_Chauffeurs_Short_hw3.csv

Find (using SQL)
a) how many records are in the Chauffeurs table and
b) how many of the records are missing the “Original Issue Date” entry.

The file contains comma-separated data, with two complications. First, NULL may be represented either by the string NULL or by an empty string (e.g., either ,NULL, or ,,). Second, some of the names have the form “Last, First” instead of “First Last”, which is problematic because splitting the line on commas then yields too many values to insert.

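You can use csv.reader to parse the file for you:

    import csv

    # newline='' lets the csv module handle line endings itself
    with open('Public_Chauffeurs_Short_hw3.csv', 'r', newline='') as fd:
        reader = csv.reader(fd)  # keeps quoted fields that contain commas in one piece
        for row in reader:
            print(row)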
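
The NULL handling and the two counting queries can be sketched on a small in-memory sample. Everything specific here is an assumption made for illustration: the three column names, the sample rows, the presence of a header row, and the helper name clean(). The real file has its own columns and values.

```python
import csv
import io
import sqlite3

# Made-up sample rows; the real file has different columns and values.
sample = io.StringIO(
    'Name,License,Original Issue Date\n'
    '"Doe, Jane",1001,2015-03-01\n'
    'John Roe,1002,NULL\n'
    'Ann Poe,1003,\n'
)

def clean(value):
    """Map the two NULL spellings ('NULL' and '') to Python None."""
    return None if value in ("", "NULL") else value

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Chauffeurs (name TEXT, license INTEGER, original_issue_date TEXT)"
)

reader = csv.reader(sample)
next(reader)  # skip the header row (assumed to be present in this sample)
rows = [[clean(v) for v in row] for row in reader]
# None is stored as SQL NULL by the sqlite3 module.
conn.executemany("INSERT INTO Chauffeurs VALUES (?, ?, ?)", rows)

total = conn.execute("SELECT COUNT(*) FROM Chauffeurs").fetchone()[0]
missing = conn.execute(
    "SELECT COUNT(*) FROM Chauffeurs WHERE original_issue_date IS NULL"
).fetchone()[0]
print(total, missing)
conn.close()
```

Converting both spellings to None before the insert means a single IS NULL test finds every missing date, instead of also having to check for '' and 'NULL' in the query.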

Part 3

We are going to work with a small extract of tweets (about 200 of them), available here: http://dbgroup.cdm.depaul.edu/DSC450/Module5.txt

NOTE 1: I do not recommend copy-pasting this text, because there is no telling what the paste will produce on your system. You should be able to use the “Save as…” function in your browser instead.

NOTE 2: The input data is separated by the string “EndOfTweet”, which serves as a delimiter. The text itself consists of a single line, so using readline() or even readlines() will still give you only one row, which needs to be split by the custom delimiter (i.e., .split('EndOfTweet')).
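
As a small illustration of the split, the sketch below uses a made-up string in place of the real file contents. Whether the real file ends with a trailing delimiter is an assumption, so blank pieces are filtered out either way.

```python
# Stand-in for the single line read from Module5.txt (contents are made up).
raw = '{"id_str": "1"}EndOfTweet{"id_str": "2"}EndOfTweet'

# split() leaves an empty string after a trailing delimiter, so drop blank pieces.
tweets = [chunk for chunk in raw.split("EndOfTweet") if chunk.strip()]
print(len(tweets))
```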

a. Create a SQL table to contain the following attributes of a tweet:
“created_at”, “id_str”, “text”, “source”, “in_reply_to_user_id”, “in_reply_to_screen_name”, “in_reply_to_status_id”, “retweet_count”, “contributors”. Please assign reasonable data types to each attribute and use SQLite for this assignment.

b. Write Python code to read through the Module5.txt file and populate your table from part a. Make sure your Python code reads through the file and loads the data properly (including NULLs).
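
One possible shape for parts a and b is sketched below. It assumes that each delimited piece is a JSON object (the handout does not state the format), and the table name Tweet, the column types, and the sample tweet are all illustrative choices rather than the required answer.

```python
import json
import sqlite3

COLUMNS = ["created_at", "id_str", "text", "source", "in_reply_to_user_id",
           "in_reply_to_screen_name", "in_reply_to_status_id",
           "retweet_count", "contributors"]

conn = sqlite3.connect(":memory:")
# Column types are one reasonable guess, not the only valid choice.
conn.execute("""
    CREATE TABLE Tweet (
        created_at TEXT,
        id_str TEXT PRIMARY KEY,
        text TEXT,
        source TEXT,
        in_reply_to_user_id INTEGER,
        in_reply_to_screen_name TEXT,
        in_reply_to_status_id INTEGER,
        retweet_count INTEGER,
        contributors TEXT
    )
""")

# Made-up sample standing in for the contents of Module5.txt.
raw = ('{"created_at": "Mon Jan 01 00:00:00 +0000 2024", "id_str": "1", '
       '"text": "hello", "source": "web", "in_reply_to_user_id": null, '
       '"in_reply_to_screen_name": null, "in_reply_to_status_id": null, '
       '"retweet_count": 0, "contributors": null}EndOfTweet')

for chunk in raw.split("EndOfTweet"):
    if not chunk.strip():
        continue  # skip the empty piece after a trailing delimiter
    tweet = json.loads(chunk)
    # JSON null becomes None, which sqlite3 stores as SQL NULL;
    # .get() also yields None when a key is absent.
    values = [tweet.get(col) for col in COLUMNS]
    conn.execute("INSERT INTO Tweet VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", values)

nulls = conn.execute(
    "SELECT COUNT(*) FROM Tweet WHERE in_reply_to_user_id IS NULL"
).fetchone()[0]
print(nulls)
conn.close()
```

Passing values through ? placeholders, rather than building the INSERT string by hand, is what lets None arrive in the table as a real NULL instead of the text 'None'. If a field holds a list or nested object in the real data, it would need to be converted (for example with json.dumps) before it can be bound.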