In the first half of this quarter, we will build a news search engine, similar to https://news.google.com.  In this first assignment we'll get started by writing a data transfer tool.  The first half of the instructions sets up your environment; the assignment itself is described later in four steps.

Most students have rated this as the most difficult assignment of the quarter, so please start early, read the instructions carefully, and make use of Campuswire and office hours.

NOTE: This is NOT a group project.  Each student in the class should complete and submit the assignment.  Please see the syllabus for the class' collaboration policy.

Learning objectives

Search Engine basics

In the early days of the World Wide Web, there were no search engines.  All was dark.  A search engine's first task is to discover what information is published across the vast expanse of the Web.  A web "crawler" (or "spider") is a program that starts from a small list of web pages and follows all the links it encounters, trying to discover every page on the Web.  The crawler stores information about the pages it encounters in some kind of database.  It's a huge database.  At the very least, the crawler must store the text of each webpage and its URL.  Once this data is stored, a search application can be built that answers user queries by looking for documents in the database that match the search terms (and returning the appropriate URLs).  Making this search efficient and relevant is another huge challenge (which we'll not really address in this class).
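
To make that concrete, here is a toy sketch of the crawl loop in Java.  This is an illustration only, not part of the assignment; it happens to use the jsoup library that appears later in this homework, and "https://example.com/" is just a placeholder seed URL.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.*;

public class ToyCrawler {
    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>(List.of("https://example.com/"));
        Set<String> seen = new HashSet<>(frontier);
        Map<String, String> pages = new HashMap<>();     // our tiny "database": URL -> page text
        while (!frontier.isEmpty() && pages.size() < 10) {
            String url = frontier.removeFirst();
            Document doc = Jsoup.connect(url).get();     // fetch and parse one page
            pages.put(url, doc.text());                  // store the text and the URL
            for (Element link : doc.select("a[href]")) { // follow every link we encounter
                String next = link.absUrl("href");
                if (!next.isEmpty() && seen.add(next)) frontier.addLast(next);
            }
        }
        System.out.println("crawled " + pages.size() + " pages");
    }
}

A real crawler also needs politeness delays, robots.txt handling, and vastly more storage, which is exactly why we'll use Common Crawl's data instead.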

The Common Crawl

I am not going to ask you to write and run your own web crawler.  That would be doable, but I found a way for us to skip this step.  There is a nonprofit project called Common Crawl that maintains and runs a full-scale Web crawler, and they publish their data openly for anyone to use.  That's lucky for us.

Every month Common Crawl publishes a full-web crawl.  For example, in December 2020 this was 2.64 billion web pages, comprising 270 terabytes.  That's a lot of HTML text!

EXERCISE 1: If a 12 terabyte hard drive costs about $200, how much would it cost to store a copy of all that data, at minimum?

EXERCISE 2: How much would it cost us to store all that data in AWS S3 for the duration of the quarter? (see https://aws.amazon.com/s3/pricing/)

EXERCISE 3: A decent home cable Internet connection provides about 150 Mbit/s (megabits per second).  How long would it take to download this data on that connection, at minimum?

Unfortunately, we cannot afford to build a system that can handle all that data just for a class project.  Instead, we will use Common Crawl's news dataset.  This is smaller and restricted to just news articles, but it's also updated more frequently (giving about 20 gigabytes of new articles every day).  That's still a lot of data!

Amazon Web Services (AWS)

You'll be using your own personal AWS account to complete this class' assignments.  New AWS users get a small amount of usage free in the first year, so it may not even cost you anything (https://aws.amazon.com/free/).  However, you will have to provide a credit card number to create an account.  I hope it's not necessary, but you may have to spend up to $100 on AWS resources during the course of the quarter.

NOTE: Please contact the instructor if you cannot set up a personal AWS account.

SIDE NOTE: In past versions of this class, I tried to get a grant of free credits through the "AWS Educate" program, but I learned that this program is very limited and does not allow students to use all the advanced features that you need for the class projects.

You'll have to be careful not to waste resources.  Otherwise you could end up with a big bill to pay!  Set up only the resources that are described in the assignment.  For example, the assignment might ask you to set up an Elasticsearch "cluster" with one t2.small instance.  This would cost $0.036 per hour.  If you mistakenly create a cluster of ten i3.16xlarge instances, then the cost would be $80 per hour, and there are no refunds for this kind of mistake.  (See https://aws.amazon.com/elasticsearch-service/pricing/).  Feel free to shut down resources after each assignment is completed.

SIDE NOTE: Almost everything I say in this class about AWS is also true for its competitors like Microsoft Azure, Google Cloud, and Alibaba Cloud.  Sometimes I will use "AWS" as a shorthand for "public computing platforms in general."

Log in to AWS

After you log in to AWS, look around and check out the dizzying number of tools that are available.  Notice also that you can switch geographic regions using the control in the upper right.  I suggest that you primarily use the US-East-2 region, which is in Ohio.

EXERCISE 4: Visit https://www.cloudping.info/ and click the "HTTP ping" button to test the latency between your computer and AWS data centers in various regions.  The ability to rent resources around the world allows AWS' customers to build services that have good performance worldwide.

In order for your programs to gain access to your account's resources, they will authenticate using an "Access Key ID" and a "Secret Access Key", which are basically equivalent to a username/password pair.  You can get your keys from the AWS Console by clicking your profile name in the upper right, then "my security credentials", then "access keys."

Make a note of your Access Key pair.  Don't share it with anyone and don't hard-code your Access Keys in any of the code you'll be writing.  We'll see later how we can supply these secrets at runtime.

Setting up your Java build environment

Windows, Mac, or Linux can be used to develop your code.  You will develop and test code on your own machine, and later deploy it to the cloud (usually to a Linux machine).

Download IntelliJ IDEA Community Edition.  It's free.  You don't need the Ultimate edition (their webpage says that Ultimate is needed for web development, but that's an exaggeration).  If you think you can get away with using Atom, Emacs, Vim, VSCode, or any other text editor instead of an IDE, you are mistaken.  A good IDE is essential for professional Java development work because it allows you to quickly step through your code with a debugger, and it gives you code completion and instant feedback on syntax errors.  It also has a plugin for live collaborative coding called Code With Me.  If you want to use a different IDE (like Eclipse), I can't stop you, but we also can't help you with any issues you encounter.

You will also be using Maven to build and run your Java code, but Maven actually comes bundled with IntelliJ, so a separate download and install is not required.  Maven can also be downloaded separately and run on the command line; you'll have to do that when testing on moore (as described later), but you should do your initial testing through the IDE on your own machine.
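
For reference, running the same build-and-run goals from a terminal (as you'll eventually do on moore) looks like this, using the same Maven command line that you'll configure in IntelliJ below:

mvn clean package exec:java -Dexec.mainClass="edu.northwestern.ssa.App"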

Even if you are totally new to Java, I suggest you try to get started with the code as described in the instructions below.  However, if you feel too lost in the code syntax, then you can do this short Java tutorial.

Download the project skeleton

Unfortunately, it takes a bit of work to set up even a simple Java Maven project from scratch 😢.  However, since you're probably new to Java, I am providing some sample code that you will build the first project on top of.

Clone this repo: https://github.com/starzia/ssa-skeleton

Have a look at the pom.xml file.  This tells Maven how to compile the project and which 3rd party libraries (dependencies) to download.  The pom.xml file is also used by IntelliJ to load and run the project.  The dependencies listed in the sample pom.xml are the ones I used to implement this homework.  This class is all about learning to write code like professionals.  That means we'll be using 3rd party libraries whenever it's easiest to do so (instead of building code from scratch just to exercise our hands).

SIDE NOTE: If the functionality you need is very simple, then it can be easiest to build it from scratch even if a 3rd party library provides the same functionality.  That's because the effort of learning how to use someone else's code is sometimes greater than the effort to build something simple from scratch.

Now let's open the sample code.  Open IntelliJ and choose "open."  Select the pom.xml file that you downloaded above, and choose "open as project."  From the menu choose "run" -> "run" -> "Add configuration."  Click the "+", choose "Maven" and enter the command line:

clean package exec:java -Dexec.mainClass="edu.northwestern.ssa.App"

and click "OK."  Now you can run the code by clicking the green "run" or "debug" icons.   At the bottom of the IDE you should see some compiler messages ending with something like:

Hello world!
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.621 s
[INFO] Finished at: 2019-09-25T13:10:01-05:00
[INFO] ------------------------------------------------------------------------

The only difference between "run" and "debug" is that debug will stop at breakpoints that you set in the code.  Open the App.java file under src/main/java/edu/northwestern/ssa.  If you click to the left of line 5, a red circle will be added to the editor.  This is a breakpoint, indicating that the code will stop when it gets to this point.  When the debugger is active, you can examine the values of variables, and you can even choose "run" -> "evaluate expression" to evaluate arbitrary new code and see the result.  A powerful debugger like this makes your job much easier and eliminates most of the need to print debug messages.

NOTE: IntelliJ also has a "build" button (a green hammer).  Do not use this; it will not produce a jar file.  If you need to compile, run the code as described above and then stop it.  Alternatively, create another run configuration that includes just "clean package".

The Assignment

Now that you are set up to write Java code, let's talk about the requirements for this homework.  The goal is to download the latest WARC (web archive) file from Common Crawl and post its data to your Elasticsearch database.  You're specifically writing an ETL (Extract, Transform, Load) program.  The program will extract news articles from an online data set, transform them into another format, and load them into your own database.

Breaking it into four steps, you need to:

  1. Download the latest WARC file from a public S3 bucket on AWS.
  2. Open the WARC file and parse it into a series of HTTP responses (containing HTML documents).
  3. Convert each HTML page into plain text and extract the document title.
  4. Post each document's <url, title, txt> to an Elasticsearch index on AWS.

 

Step 1: Download a WARC

The WARC file format is described here.  Common Crawl stores its data in AWS' Simple Storage Service (S3).  S3 is a huge distributed filesystem, and we'll talk later in class about how such systems are built and why they are scalable.  For now, you can just take for granted that S3 has excellent:

S3 storage is organized into buckets, which are just containers for files.

SIDE NOTE: The code samples below will use 3rd-party libraries to do a lot of the work.  The sample code skeleton pom.xml file already includes the dependencies to pull in all of the classes used below.  There are many alternative 3rd party libraries to choose from to access S3 buckets, but I highly recommend that you stick to using the same libraries because we know they work.

SIDE NOTE: You can browse the contents of the commoncrawl S3 bucket here: https://s3.console.aws.amazon.com/s3/buckets/commoncrawl?region=us-east-1&prefix=crawl-data/CC-NEWS/&showversions=false 

To download a file from Common Crawl's public S3 bucket:

// classes from the AWS SDK for Java v2 (software.amazon.awssdk)
S3Client s3 = S3Client.builder()
        .region(Region.US_EAST_1)
        .overrideConfiguration(ClientOverrideConfiguration.builder()
                .apiCallTimeout(Duration.ofMinutes(30)).build())
        .build();
GetObjectRequest request = GetObjectRequest.builder()
        .bucket("commoncrawl")
        .key("crawl-data/CC-NEWS/...")  // placeholder: the WARC's key, supplied at runtime
        .build();
File f = new File("downloaded.warc.gz");  // local destination file
s3.getObject(request, ResponseTransformer.toFile(f));

NOTE: If you are trying to complete this assignment on a very slow Internet connection, then you might find that the S3 download step is slowing down your testing.  In this case, I suggest that you download the file once and then temporarily comment out the section of code that does the download while you develop the remainder of your code.  Another option is to do your testing on moore, as described below.

Step 2: Parse the WARC file

Notice that the sample pom.xml file has the following dependency:

<dependency>
    <groupId>org.netpreserve.commons</groupId>
    <artifactId>webarchive-commons</artifactId>
    <version>1.1.9</version>
</dependency>

That library includes some helpful code for parsing files in the WARC format. 
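
Here is a minimal sketch of what the parsing loop might look like, based on my reading of the webarchive-commons API (double-check the class and header names against the library's documentation; the file name is a placeholder for the WARC you downloaded in Step 1):

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveRecord;
import org.archive.io.warc.WARCReaderFactory;
import java.io.ByteArrayOutputStream;
import java.io.File;

public class WarcSketch {
    public static void main(String[] args) throws Exception {
        ArchiveReader reader = WARCReaderFactory.get(new File("downloaded.warc.gz"));
        for (ArchiveRecord record : reader) {
            // only "response" records carry an HTTP response with an HTML body
            Object type = record.getHeader().getHeaderValue("WARC-Type");
            if (!"response".equals(type)) continue;
            String url = record.getHeader().getUrl();
            // ArchiveRecord is an InputStream positioned at the record body
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            for (int n; (n = record.read(chunk)) != -1; ) buf.write(chunk, 0, n);
            // buf now holds a raw HTTP response: status line and headers,
            // then a blank line, then the HTML document itself
        }
    }
}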

Step 3: Parse the HTML

Notice that the sample pom.xml file has the following dependency:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

Jsoup is a library for parsing HTML files.   It is well documented online.  You will have to:
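
For reference, here is a minimal sketch of the relevant jsoup calls (Jsoup.parse, title(), and body().text() are jsoup's documented entry points; the HTML string is just a stand-in for a page pulled from the WARC):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlSketch {
    public static void main(String[] args) {
        String html = "<html><head><title>Hi</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        Document doc = Jsoup.parse(html);
        String title = doc.title();      // contents of the <title> tag
        String txt = doc.body().text();  // visible text with all tags stripped
        System.out.println(title + " / " + txt);  // prints: Hi / Hello world
    }
}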

Step 4: Post the <url, title, txt> to Elasticsearch 

This is the most complicated step, and the most difficult part is the authentication of REST requests to AWS (and in particular to Elasticsearch).  However, I have included in the skeleton code a file AwsSignedRestRequest.java, which correctly implements signing of REST requests to any AWS service.  Please read through it to understand how it works.  Note that it relies on the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables being set (see the "Parameters" section below).

We'll interact with our Elasticsearch cluster using its REST API, which is documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/rest-apis.html.
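
For orientation, indexing one document boils down to an HTTP request of roughly this shape (the endpoint and the index name "news" are placeholders; your real values should come from the environment variables described under "Parameters" below):

POST https://<your-elasticsearch-endpoint>/news/_doc
Content-Type: application/json

{
  "title": "Example headline",
  "url": "https://example.com/story.html",
  "txt": "The plain text of the article ..."
}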

SIDE NOTE: REST APIs will be discussed in detail in Lecture 5.  At this point, if you don't know what a REST API is, then you should stop and do some reading on this topic.  For example, the first three chapters of this O'Reilly book cover the topic well in about an hour's time (and the Northwestern Library provides free online access to O'Reilly books!).  In summary, REST is a way to structure client-server, request-response interactions, using the HTTP protocol that was originally designed for web browsers to fetch pages from web servers.  For a 30-minute overview of HTTP, see this section of one of my CS-340 lectures.

After you're done, please delete the Elasticsearch domain through the AWS console (so that you don't run up a big bill).  You can always create another Elasticsearch domain if you need to run your code again.

Parameters

Your Java code should take several runtime parameters, allowing the user (i.e., you or the TA testing your code) to change its behavior.  In the past, you may have used command-line arguments for program parameters; however, in this class we will use environment variables.  This is a common way to configure cloud-deployed services.  Environment variables are nice because they are named and can be defined in any order.  In particular, your code should get the following constants from environment variables:

So, your code should not hard-code the values above, but rather get them from environment variables using, for example, System.getenv("ELASTIC_SEARCH_HOST").
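
For example, a fail-fast check might look like this (a minimal sketch; ELASTIC_SEARCH_HOST is the variable mentioned above):

String esHost = System.getenv("ELASTIC_SEARCH_HOST");
if (esHost == null) {
    throw new IllegalStateException("ELASTIC_SEARCH_HOST must be set");
}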

SIDE NOTE: the advantage of using environment variables for these values is that you can compile your code and give it to someone else, and that person will be able to run it against a different AWS account and a different filename without any modification to your code.  The "user" of your code might also be a piece of cloud infrastructure, like Elastic Beanstalk, which has a nice user interface for setting environment variables.

In IntelliJ, you must edit your "run configuration" to provide these environment variables (under "Runner" -> uncheck "Use project settings" -> "Environment Variables").

Advanced Features

The following are not worth any points, but if you've already done the above and are looking for a challenge, then you should implement the following performance improvements:

Testing your code

I am providing a Python script to test your code.  It is essential that you run this test script on your jar file before submitting it to Canvas.  The provided script is very similar to the auto-grading script that we will be running.  You should run it on moore.wot.eecs.northwestern.edu.  The README.md file gives the exact commands to run the tester on moore.

Connecting to the EECS servers: You should be able to ssh into the department servers using your netid and your EECS account password (different from your netid password).  If you need to reset your password, you can email help@eecs.northwestern.edu.

The tester treats your Java code as a "black box" that takes environment variables as input and has no direct output, but which (as an observable side effect) adds some data to the Elasticsearch index specified in the environment variables.

The tester expects to find a file named news-import-1.0-SNAPSHOT.jar in the current working directory.  This should be copied from the "target" subdirectory of your project.  The script needs to know where to find the Java and Maven commands on your system.  If you are running the tester on your machine you will have to edit the script to change the following constants (defined near the beginning of the file): JAVA_HOME, MVN.

NOTE: The test script requires both Java and Maven to be installed on your system.  To test for Java, try running "jar" on the command line.  To test for Maven, try running "mvn" on the command line.  Mac users can refer to the Java Runtime that comes inside IntelliJ (as is done in the provided script), but you'll likely have to install Maven.

Look for ten lines in the output that say "TEST PASSED" or "TEST FAILED".

Submission

Your code is compiled to a .jar file in the "target" subdirectory of your project.  Please submit this jar file to Canvas.