For this assignment, you will
- (Part 1)
- choose a data set,
- design a SQL data model for it,
- pose questions that can be answered using this data
- (Part 2)
- create the tables in SQLite,
- load data, and
- answer your questions.
Part 1: Design a data model
I want you to pick some business, social, scientific, or industrial phenomenon that is complex and which has some data publicly available. To get ideas, I suggest that you browse the public data sets at:
Define a relational data model using the "tables and arrows" diagram style that I used in the Lectures 11 and 12. You must:
- Have at least three significant tables. Many of the data sets at kaggle.com are too simplistic to require three tables (especially the data sets that have been gathered for machine learning tasks).
- Underline the primary keys.
- Draw arrows from foreign key columns to the columns that they refer to.
- Classify each table as either an object, event, or relationship
- Have at least three foreign keys
- List the data type of each column (int, float, text, or epoch time). This will be covered starting in Lecture 13.
You must choose a data domain for which you can somehow get real data. Your database must eventually have at least 200 rows in total. However, your data loading process should be automated, so you should feel free to tackle data sets with hundreds of thousands or millions of rows. Do not choose a data domain identical to one of the examples we already modeled in class. Please also read the Part 2 instructions so that you choose a database that will work with Part 2.
Your Part 1 submission should include
- A data model diagram (a photo of a drawing is OK).
- An explanation of what the source of your data will be. For example, you may have found certain CSV or JSON files online or you may be planning to scrape a series of web pages (scraping will be covered in Lecture 14). Your plan should not be to manually type-in all of the data.
- NOTE: A few data sets on Kaggle.com already provide sqlite files. Building that *.sqlite or *.db file is a major part of this assignment, so you may not use one of those data sets.
- List four interesting questions which can be answered using this database. At least two of these questions should combine data from multiple tables.
All of the above should be in one PDF file.
One of the teaching staff will review and return your answers to Part 1 before you can continue to Part 2. Ideally, you would come to office hours so that we can review your Part 1 in person. Please turn in Part 1 as soon as possible so that you can start Part 2 sooner.
Working with a partner
You have the option to work in a group of two, but this is not required or expected. If you do work with a partner, then we will expect a more complex project:
- A two-person project must include a significant programming component (eg., in Python or R). This could include:
- Using a web API to fetch data from an online source.
- Scraping data off a website.
- Implementing some rules to clean up inconsistent data. This is likely to be necessary if you are combining two different data sets.
- Programmatically generating plots or other reports.
- A two-person project must pose and answer at least 8 different questions.
If you do work with a partner, then only one submission is required, which should list both people. It should also list the contributions of both partners. The non-submitting partner should leave a comment in Canvas saying who they worked with.
Grading Rubric
Part 2 of the final project is worth 10 points. The basic criteria above will get you to 8 out of ten points (assuming there are no errors). You can earn an extra point by doing something extra, like:
- Additional programming steps, as describing in the "working with a partner" section above.
- Combining data from two or more data sources.
- Working with more than 100k total rows.
- Splitting a single CSV file into multiple appropriate tables. In other words, dealing with data that is not already divided into one CSV file per table.
- Anything else that looks difficult to me!