In the first half of this quarter, we will build a news search engine, similar to https://news.google.com.  In this first assignment we'll get started by writing a data transfer tool.  The first half of the instructions sets up your environment; the assignment itself is described later in four steps.

Most students have rated this as the most difficult assignment of the quarter, so please start early, read the instructions carefully, and make use of Campuswire and office hours.

NOTE: This is NOT a group project.  Each student in the class should complete and submit the assignment.  Please see the syllabus for the class' collaboration policy.

Learning objectives

Search Engine basics

In the early days of the World Wide Web, there were no search engines.  All was dark.  A search engine's first task is to discover what information is published across the vast expanse of the Web.  A web "crawler" (or "spider") is a program that starts from a small list of web pages and follows all the links it encounters, trying to discover every page on the Web.  The crawler stores information about the pages it encounters in some kind of database.  It's a huge database.  At the very least, the crawler must store the text of each webpage and its URL.  Once this data is stored, a search application can be built that answers user queries by looking for documents in the database that match the search terms (and returning the appropriate URLs).  Making this search efficient and relevant is another huge challenge (which we'll not really address in this class).
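
To make that concrete, here is a toy sketch of the crawl loop in Java.  This is an illustration only, not part of the assignment; it happens to use the jsoup library that appears later in this homework, and "https://example.com/" is just a placeholder seed URL.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.*;

public class ToyCrawler {
    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>(List.of("https://example.com/"));
        Set<String> seen = new HashSet<>(frontier);
        Map<String, String> pages = new HashMap<>();     // our tiny "database": URL -> page text
        while (!frontier.isEmpty() && pages.size() < 10) {
            String url = frontier.removeFirst();
            Document doc = Jsoup.connect(url).get();     // fetch and parse one page
            pages.put(url, doc.text());                  // store the text and the URL
            for (Element link : doc.select("a[href]")) { // follow every link we encounter
                String next = link.absUrl("href");
                if (!next.isEmpty() && seen.add(next)) frontier.addLast(next);
            }
        }
        System.out.println("crawled " + pages.size() + " pages");
    }
}

A real crawler also needs politeness delays, robots.txt handling, and vastly more storage, which is exactly why we'll use Common Crawl's data instead.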

The Common Crawl

I am not going to ask you to write and run your own web crawler.  That would be doable, but I found a way for us to skip this step.  There is a nonprofit project called Common Crawl that maintains and runs a full-scale Web crawler, and they publish their data openly for anyone to use.  That's lucky for us.

Every month Common Crawl publishes a full-web crawl.  For example, in December 2020 this was 2.64 billion web pages, comprising 270 terabytes.  That's a lot of HTML text!

EXERCISE 1: If a 12 terabyte hard drive costs about $200, how much would it cost to store a copy of all that data, at minimum?

EXERCISE 2: How much would it cost us to store all that data in AWS S3 for the duration of the quarter? (see https://aws.amazon.com/s3/pricing/)

EXERCISE 3: A decent home cable Internet connection provides about 150 Mbit/s (megabits per second).  How long would it take to download this data on that connection, at minimum?

Unfortunately, we cannot afford to build a system that can handle all that data just for a class project.  Instead, we will use Common Crawl's news dataset.  This is smaller and restricted to just news articles, but it's also updated more frequently (giving about 20 gigabytes of new articles every day).  That's still a lot of data!

Amazon Web Services (AWS)

You'll be using your own personal AWS account to complete this class' assignments.  New AWS users get a small amount of usage free in the first year, so it may not even cost you anything (https://aws.amazon.com/free/).  However, you will have to provide a credit card number to create an account.  I hope it's not necessary, but you may have to spend up to $100 on AWS resources during the course of the quarter.

NOTE: Please contact the instructor if you cannot set up a personal AWS account.

SIDE NOTE: In past versions of this class, I tried to get a grant of free credits through the "AWS Educate" program, but I learned that this program is very limited and does not allow students to use all the advanced features that you need for the class projects.

You'll have to be careful not to waste resources.  Otherwise you could end up with a big bill to pay!  Set up only the resources that are described in the assignment.  For example, the assignment might ask you to set up an Elasticsearch "cluster" with one t2.small instance.  This would cost $0.036 per hour.  If you mistakenly create a cluster of ten i3.16xlarge instances, then the cost would be $80 per hour, and there are no refunds for this kind of mistake.  (See https://aws.amazon.com/elasticsearch-service/pricing/).  Feel free to shut down resources after each assignment is completed.

SIDE NOTE: Almost everything I say in this class about AWS is also true for its competitors like Microsoft Azure, Google Cloud, and Alibaba Cloud.  Sometimes I will use "AWS" as a shorthand for "public computing platforms in general."

Log in to AWS

After you log in to AWS, look around and check out the dizzying number of tools that are available.  Notice also that you can switch geographic regions using the control in the upper right.  I suggest that you primarily use the US-East-2 region, which is in Ohio.

EXERCISE 4: Visit https://www.cloudping.info/ and click the "HTTP ping" button to test the latency between your computer and AWS data centers in various regions.  The ability to rent resources around the world allows AWS' customers to build services that have good performance worldwide.

In order for your programs to gain access to your account's resources, they will authenticate using an "Access Key ID" and a "Secret Access Key", which are basically equivalent to a username/password pair.  You can get your keys from the AWS Console by clicking your profile name in the upper right, then "my security credentials", then "access keys."

Make a note of your Access Key pair.  Don't share it with anyone and don't hard-code your Access Keys in any of the code you'll be writing.  We'll see later how we can supply these secrets at runtime.

Setting up your Java build environment

Windows, Mac, or Linux can be used to develop your code.  You will develop and test code on your own machine, and later deploy it to the cloud (usually to a Linux machine).

Download IntelliJ IDEA Community Edition.  It's free.  You don't need the Ultimate edition (their webpage says that Ultimate is needed for web development, but that's an exaggeration).  If you think you can get away with using Atom, Emacs, Vim, VSCode, or any other text editor instead of an IDE, you are mistaken.  A good IDE is essential for professional Java development work because it allows you to quickly step through your code with a debugger, and it gives you code completion and instant feedback on syntax errors.  It also has a plugin for live collaborative coding called Code With Me.  If you want to use a different IDE (like Eclipse), I can't stop you, but we also can't help you with any issues you encounter.

You will also be using Maven to build and run your Java code, but Maven actually comes bundled with IntelliJ, so a separate download and install is not required.  Maven can also be downloaded separately and run on the command line; you'll have to do that when testing on moore (as described later), but you should do your initial testing through the IDE on your own machine.
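
For reference, running the same build-and-run goals from a terminal (as you'll eventually do on moore) looks like this, using the same Maven command line that you'll configure in IntelliJ below:

mvn clean package exec:java -Dexec.mainClass="edu.northwestern.ssa.App"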

Even if you are totally new to Java, I suggest you try to get started with the code as described in the instructions below.  However, if you feel too lost in the code syntax, then you can do this short Java tutorial.

Download the project skeleton

Unfortunately, it takes a bit of work to set up even a simple Java Maven project from scratch 😢.  However, since you're probably new to Java, I am providing some sample code that you will build the first project on top of.

Clone this repo: https://github.com/starzia/ssa-skeleton

Have a look at the pom.xml file.  This tells Maven how to compile the project and which 3rd party libraries (dependencies) to download.  The pom.xml file is also used by IntelliJ to load and run the project.  The dependencies listed in the sample pom.xml are the ones I used to implement this homework.  This class is all about learning to write code like professionals.  That means we'll be using 3rd party libraries whenever it's easiest to do so (instead of building code from scratch just to exercise our hands).

SIDE NOTE: If the functionality you need is very simple, then it can be easiest to build it from scratch even if a 3rd party library provides the same functionality.  That's because the effort of learning how to use someone else's code is sometimes greater than the effort to build something simple from scratch.

Now let's open the sample code.  Open IntelliJ and choose "open."  Select the pom.xml file that you downloaded above, and choose "open as project."  From the menu choose "run" -> "run" -> "Add configuration."  Click the "+", choose "Maven" and enter the command line:

clean package exec:java -Dexec.mainClass="edu.northwestern.ssa.App"

and click "OK."  Now you can run the code by clicking the green "run" or "debug" icons.   At the bottom of the IDE you should see some compiler messages ending with something like:

Hello world!
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.621 s
[INFO] Finished at: 2019-09-25T13:10:01-05:00
[INFO] ------------------------------------------------------------------------

The only difference between "run" and "debug" is that debug will stop at breakpoints that you set in the code.  Open the App.java file under src/main/java/edu/northwestern/ssa.  If you click to the left of line 5, a red circle will be added to the editor.  This is a breakpoint, indicating that the code will stop when it gets to this point.  When the debugger is active, you can examine the values of variables, and you can even choose "run" -> "evaluate expression" to evaluate arbitrary new code and see the result.  A powerful debugger like this makes your job much easier and eliminates most of the need to print debug messages.

NOTE: IntelliJ also has a "build" button (a green hammer).  Do not use this; it will not produce a jar file.  If you need to compile, run the code as described above and then stop it.  Alternatively, create another run configuration that includes just "clean package".

The Assignment

Now that you are set up to write Java code, let's talk about the requirements for this homework.  The goal is to download the latest WARC (web archive) file from Common Crawl and post its data to your Elasticsearch database.  You're specifically writing an ETL (Extract, Transform, Load) program.  The program will extract news articles from an online data set, transform them into another format, and load them into your own database.

Breaking it into four steps, you need to:

  1. Download the latest WARC file from a public S3 bucket on AWS.
  2. Open the WARC file and parse it into a series of HTTP responses (containing HTML documents).
  3. Convert each HTML page into plain text and extract the document title.
  4. Post each document's <url, title, txt> to an Elasticsearch index on AWS.

 

Step 1: Download a WARC

The WARC file format is described here.  Common Crawl stores its data in AWS' Simple Storage Service (S3).  S3 is a huge distributed filesystem, and we'll talk later in class about how such systems are built and why they are scalable.  For now, you can just take for granted that S3 has excellent:

S3 storage is organized into buckets, which are just containers for files.

SIDE NOTE: The code samples below will use 3rd-party libraries to do a lot of the work.  The sample code skeleton pom.xml file already includes the dependencies to pull in all of the classes used below.  There are many alternative 3rd party libraries to choose from to access S3 buckets, but I highly recommend that you stick to using the same libraries because we know they work.

SIDE NOTE: You can browse the contents of the commoncrawl S3 bucket here: https://s3.console.aws.amazon.com/s3/buckets/commoncrawl?region=us-east-1&prefix=crawl-data/CC-NEWS/&showversions=false 

To download a file from Common Crawl's public S3 bucket:

// classes from the AWS SDK for Java v2 (software.amazon.awssdk)
S3Client s3 = S3Client.builder()
        .region(Region.US_EAST_1)
        .overrideConfiguration(ClientOverrideConfiguration.builder()
                .apiCallTimeout(Duration.ofMinutes(30)).build())
        .build();
GetObjectRequest request = GetObjectRequest.builder()
        .bucket("commoncrawl")
        .key("crawl-data/CC-NEWS/...")  // placeholder: the WARC's key, supplied at runtime
        .build();
File f = new File("downloaded.warc.gz");  // local destination file
s3.getObject(request, ResponseTransformer.toFile(f));

NOTE: If you are trying to complete this assignment on a very slow Internet connection, then you might find that the S3 download step is slowing down your testing.  In this case, I suggest that you download the file once and then temporarily comment out the section of code that does the download while you develop the remainder of your code.  Another option is to do your testing on moore, as described below.

Step 2: Parse the WARC file

Notice that the sample pom.xml file has the following dependency:

<dependency>
    <groupId>org.netpreserve.commons</groupId>
    <artifactId>webarchive-commons</artifactId>
    <version>1.1.9</version>
</dependency>

That library includes some helpful code for parsing files in the WARC format. 
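
Here is a minimal sketch of what the parsing loop might look like, based on my reading of the webarchive-commons API (double-check the class and header names against the library's documentation; the file name is a placeholder for the WARC you downloaded in Step 1):

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveRecord;
import org.archive.io.warc.WARCReaderFactory;
import java.io.ByteArrayOutputStream;
import java.io.File;

public class WarcSketch {
    public static void main(String[] args) throws Exception {
        ArchiveReader reader = WARCReaderFactory.get(new File("downloaded.warc.gz"));
        for (ArchiveRecord record : reader) {
            // only "response" records carry an HTTP response with an HTML body
            Object type = record.getHeader().getHeaderValue("WARC-Type");
            if (!"response".equals(type)) continue;
            String url = record.getHeader().getUrl();
            // ArchiveRecord is an InputStream positioned at the record body
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            for (int n; (n = record.read(chunk)) != -1; ) buf.write(chunk, 0, n);
            // buf now holds a raw HTTP response: status line and headers,
            // then a blank line, then the HTML document itself
        }
    }
}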

Step 3: Parse the HTML

Notice that the sample pom.xml file has the following dependency:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

Jsoup is a library for parsing HTML files.   It is well documented online.  You will have to:
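
For reference, here is a minimal sketch of the relevant jsoup calls (Jsoup.parse, title(), and body().text() are jsoup's documented entry points; the HTML string is just a stand-in for a page pulled from the WARC):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlSketch {
    public static void main(String[] args) {
        String html = "<html><head><title>Hi</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        Document doc = Jsoup.parse(html);
        String title = doc.title();      // contents of the <title> tag
        String txt = doc.body().text();  // visible text with all tags stripped
        System.out.println(title + " / " + txt);  // prints: Hi / Hello world
    }
}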

Step 4: Post the <url, title, txt> to Elasticsearch 

This is the most complicated step, and the most difficult part is the authentication of REST requests to AWS (and in particular to Elasticsearch).  However, I have included in the skeleton code a file AwsSignedRestRequest.java, which correctly implements signing of REST requests to any AWS service.  Please read through it to understand how it works.  Note that it relies on the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables being set (see the "Parameters" section below).

We'll interact with our Elasticsearch cluster using its REST API, which is documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/rest-apis.html.
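
For orientation, indexing one document boils down to an HTTP request of roughly this shape (the endpoint and the index name "news" are placeholders; your real values should come from the environment variables described under "Parameters" below):

POST https://<your-elasticsearch-endpoint>/news/_doc
Content-Type: application/json

{
  "title": "Example headline",
  "url": "https://example.com/story.html",
  "txt": "The plain text of the article ..."
}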

SIDE NOTE: REST APIs will be discussed in detail in Lecture 5.  At this point, if you don't know what a REST API is, then you should stop and do some reading on this topic.  For example, the first three chapters of this O'Reilly book cover the topic well in about an hour's time (and the Northwestern Library provides free online access to O'Reilly books!).  In summary, REST is a way to structure client-server, request-response interactions, using the HTTP protocol that was originally designed for web browsers to fetch pages from web servers.  For a 30-minute overview of HTTP, see this section of one of my CS-340 lectures.

After you're done, please delete the Elasticsearch domain through the AWS console (so that you don't run up a big bill).  You can always create another Elasticsearch domain if you need to run your code again.

Parameters

Your Java code should take several runtime parameters, allowing the user (i.e., you or the TA testing your code) to change its behavior.  In the past, you may have used command-line arguments for program parameters; however, in this class we will use environment variables.  This is a common way to configure cloud-deployed services.  Environment variables are nice because they are named and can be defined in any order.  In particular, your code should get the following constants from environment variables:

So, your code should not hard-code the values above, but rather get them from environment variables using, for example, System.getenv("ELASTIC_SEARCH_HOST").
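
For example, a fail-fast check might look like this (a minimal sketch; ELASTIC_SEARCH_HOST is the variable mentioned above):

String esHost = System.getenv("ELASTIC_SEARCH_HOST");
if (esHost == null) {
    throw new IllegalStateException("ELASTIC_SEARCH_HOST must be set");
}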

SIDE NOTE: the advantage of using environment variables for these values is that you can compile your code and give it to someone else, and that person will be able to run it against a different AWS account and a different filename without any modification to your code.  The "user" of your code might also be a piece of cloud infrastructure, like Elastic Beanstalk, which has a nice user interface for setting environment variables.

In IntelliJ, you must edit your "run configuration" to provide these environment variables (under "Runner" -> uncheck "Use project settings" -> "Environment Variables").

Advanced Features

The following are not worth any points, but if you've already done the above and are looking for a challenge, then you should implement the following performance improvements:

Testing your code

I am providing a Python script to test your code.  It is essential that you run this test script on your jar file before submitting it to Canvas.  The provided script is very similar to the auto-grading script that we will be running.  You should run it on moore.wot.eecs.northwestern.edu.  The README.md file gives the exact commands to run the tester on moore.

Connecting to the EECS servers: You should be able to ssh into the department servers using your netid and your EECS account password (different from your netid password).  If you need to reset your password, you can email help@eecs.northwestern.edu.

The tester treats your Java code as a "black box" that takes environment variables as input and has no direct output, but which (as an observable side effect) adds some data to the Elasticsearch index specified in the environment variables.

The tester expects to find a file named news-import-1.0-SNAPSHOT.jar in the current working directory.  This should be copied from the "target" subdirectory of your project.  The script needs to know where to find the Java and Maven commands on your system.  If you are running the tester on your machine you will have to edit the script to change the following constants (defined near the beginning of the file): JAVA_HOME, MVN.

NOTE: The test script requires both Java and Maven to be installed on your system.  To test for Java, try running "jar" on the command line.  To test for Maven, try running "mvn" on the command line.  Mac users can refer to the Java Runtime that comes inside IntelliJ (as is done in the provided script), but you'll likely have to install Maven.

Look for ten lines in the output that say "TEST PASSED" or "TEST FAILED".

Submission

Your code is compiled to a .jar file in the "target" subdirectory of your project.  Please submit this jar file to Canvas.