- Become familiar with a cloud computing platform (AWS).
- Learn some basic DevOps practices.
- Create a demo video for your resume!
In this assignment, you will deploy Homeworks 1, 2, and 3 to AWS, creating a public-facing news article search engine by adapting the code you previously developed.
- In this assignment (and for HW5-6), you may work in pairs. You should choose your partner yourself. If you do not already have teammates in mind, you can try asking on Campuswire to find someone who needs a team. It's also possible to work independently if you prefer to get experience with all steps of the assignment.
- Note: If your solutions to HW1, HW2, or HW3 are incomplete, it would be a good idea to team up with someone who did the missing assignment. You should be able to "mix and match" your implementations of HW1, 2, and 3. You can also use office hours to get help patching up your HW1, 2, and 3 solutions. They do not need to be absolutely perfect because we test HW4 manually at a high level.
- Your system will not share any backend infrastructure with other teams' systems. In other words, you will not be using the instructor's Elasticsearch or REST API backends. You must create a new Elasticsearch backend. As in HW1, you must use a personal account for Elasticsearch (and other AWS services).
- You will submit a video demonstration of your finished product. I suggest you store it somewhere permanent and link to it from your resume/portfolio.
Periodic News Article ETL
- You must deploy a news article ETL (like HW1) that runs periodically, every 2 hours or so. It should look for a new Common Crawl WARC file.
- Option 1a (best): If there is a new file available, import it. For this to work, you probably need to store the name of the most recently imported WARC file somewhere. You can create a new index in Elasticsearch for this purpose, or you can store the information in a local file, depending on your implementation.
- Option 1b (OK): Download and import the latest file regardless of whether it's a repeat. This is less efficient, and you must ensure that repeated articles are overwritten rather than duplicated (for example, by indexing each article under a deterministic document ID).
- Note: Periodic batch jobs like this are well suited to run on a serverless compute platform like AWS's Fargate or Lambda, because "serverless" allows us to avoid paying for idle compute resources in between runs. Fargate is a good option if you wish to challenge yourself to learn how to build and deploy a Docker Container. Lambda functions must complete in less than 15 minutes, so it won't work if your ETL tool is on the slow side. However, the instructions below take a simpler approach that requires you to dedicate an always-on virtual machine to the ETL tool. I suggest that most students just follow the instructions below.
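For Option 1a, tracking the last import can be as simple as a local state file. Here is a minimal shell sketch (the state-file path, function name, and filenames are all hypothetical):

```shell
# Record the last imported WARC filename in a local state file.
# is_new_warc NAME returns 0 (and records NAME) if NAME has not been
# imported yet, and 1 if it matches the previously recorded name.
STATE_FILE="${STATE_FILE:-/tmp/last_warc_imported}"

is_new_warc() {
  if [ -f "$STATE_FILE" ] && [ "$(cat "$STATE_FILE")" = "$1" ]; then
    return 1  # same file as last time; skip this run
  fi
  printf '%s\n' "$1" > "$STATE_FILE"
  return 0    # new file; proceed with the import
}
```

Your ETL script would call `is_new_warc` with the newest filename listed in the Common Crawl bucket and exit early when it returns 1.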
- Create a micro EC2 Instance to run your ETL. This will be the Linux Virtual Machine (VM) that runs your HW1 code. I suggest you choose these settings:
- Amazon Linux 2 AMI (HVM), SSD Volume Type. 64-bit (x86)
- t2.micro size
- Note: Later on, if your ETL tool runs out of memory, try upgrading to the t2.small size, or if you're clever add a swap volume instead.
- The streaming implementation actually uses more memory because it stores the full WARC file in memory. If you're saving to a file, then a t2.nano is probably big enough. If you're streaming, then you probably need a t2.small.
- When launching, the console will ask you to create or choose an SSH key pair. This is used to connect a standard SSH/SCP/SFTP client to the machine. Do not skip this step; the key pair is needed to copy your code to the instance.
- Log into the EC2 instance and install Java and Maven:
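On the Amazon Linux 2 AMI, something like the following should work (the exact package names are assumptions and can vary by AMI version; if yum cannot find a package, install it manually from the vendor's website, e.g. a Maven binary from the Apache Maven site):

```shell
# Install a JDK and Maven, then verify the installs.
# Package names below may differ on other AMIs or AMI versions.
sudo yum update -y
sudo amazon-linux-extras install -y java-openjdk11
sudo yum install -y maven
java -version
mvn -version
```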
- Copy your ETL code to the instance using SCP or SFTP and write a bash script that sets the proper environment variables and then launches your code. For example, here is the code for a script that sets some variables and runs a Maven project:
mvn clean package exec:java -Dexec.mainClass="edu.northwestern.ssa.App" 2>&1 | tee output.txt
- Notice that the output is both saved to a file output.txt and printed to the screen. The output.txt file will help you to debug your code, since you will not always be connected to the screen while your code runs.
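The variable-setting portion of such a script might look like the sketch below (the variable names, values, and paths are hypothetical; use whatever your HW1 code actually reads):

```shell
#!/bin/bash
# run-etl.sh (all names and values here are hypothetical examples)
export ELASTIC_SEARCH_HOST="https://my-es-domain.us-east-1.es.amazonaws.com"
export ELASTIC_SEARCH_INDEX="news"

cd /home/ec2-user/hw1 || exit 1
# 2>&1 merges stderr into stdout so errors are also captured in output.txt
mvn clean package exec:java -Dexec.mainClass="edu.northwestern.ssa.App" 2>&1 | tee output.txt
```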
- Create a cron job that periodically runs the script above (every two hours).
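Edit the crontab with `crontab -e`; an entry that runs the script every two hours might look like this (the paths are hypothetical):

```shell
# minute hour day-of-month month day-of-week  command
0 */2 * * * /home/ec2-user/run-etl.sh >> /home/ec2-user/cron.log 2>&1
```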
- Monitor your Elasticsearch stats and the CPU utilization in the EC2 console to verify that your code is really running.
- As you probably know, you can process the warc file in a streaming manner, using code like this:
InputStream is = s3.getObject(rq, ResponseTransformer.toInputStream());
ArchiveReader ar = WARCReaderFactory.get(filename, is, true);
However, doing this causes your Java app to consume about 1 GB of memory for the resettable input stream. If you run out of memory, see the tip above about t2.small or swap space.
- If you implement this correctly, your Elasticsearch database will grow and grow and grow! To make the best use of your limited space, you may discard articles not in English (look at the "lang" attribute of the root HTML element). It's OK if your system stops loading more data after it runs out of space. In fact, you may need to disable new data imports after that happens.
- The "screen" command may be useful if you want to run the ETL tool in the background, and this allows you to log out while the code remains running. However, cron is the tool you should use to permanently schedule the ETL's execution.
- Deploy your API server (HW2) to a Tomcat environment managed by AWS Elastic Beanstalk. To save yourself some money, please deploy it to an environment with just two t2.micro instances (behind a load balancer).
- NOTE: You can use the AWS web console to set "environment properties" for the cloud-based Tomcat environment that will run your HW2 code. These serve the same purpose as the environment variables in HW2, but you'll have to change your HW2 Java code to use System.getProperty(...) instead of System.getenv(...). Actually, the best approach is to try one and then fall back to the other if you get a null value.
- Elastic Beanstalk's health check will do a "GET /" request by default, which will give a 404 error in your implementation. To avoid this causing warnings in your environment health, you should enable the following settings in your Elastic Beanstalk configuration: Ignore HTTP 4xx: enabled, and Ignore load balancer 4xx: enabled. An alternative solution is to change the path of the health check in your load balancer settings in EC2.
- Your HW2 code must include an "Access-Control-Allow-Origin: *" header. If you followed my example closely, it's probably already included. In more detail: for security reasons, most browsers block JS requests to a different domain. In practice, we usually avoid this by deploying both the frontend and the backend to the same domain (e.g., "mygreatapp.com"). In HW2, you can add the appropriate CORS header to all your responses (including errors) like this:
// below header is for CORS (shown here on a javax.ws.rs Response builder; adapt to your framework)
.header("Access-Control-Allow-Origin", "*")
- Your frontend will also have to be deployed to AWS. There are two ways to do this:
- Option (recommended): Use an S3 bucket to host your JS/React app. This is pretty simple. (Ignore the instructions related to setting up a domain name/DNS.)
- Alternative: It's a bit messy to mix the frontend and backend, but for simplicity you can include your HTML, JS, and CSS files in your API server (in your HW2 Maven project under /src/main/webapp).
- You'll probably notice that your implementation of HW1 did not include the "lang" and "date" fields that are used by later parts of the assignment. You can either go back and add this functionality to HW1 or you can remove the date and language filters from the UI.
- If desired, you can detect the language of a document by looking at the first two letters of the "lang" attribute of the "html" element, if it's provided. For example: <html lang="en">
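As a rough sketch, the extraction could be done with a quick regex; the shell function below prints the first two letters of the lang attribute. (In real HW1 code you would more likely read the attribute from your HTML parser, e.g. jsoup's `doc.selectFirst("html").attr("lang")`.)

```shell
# Print the first two letters of the root element's lang attribute,
# or nothing if no lang attribute is present. A rough sketch using a
# regex, not a full HTML parser.
detect_lang() {
  printf '%s' "$1" | sed -n 's/.*<html[^>]*lang="\([a-zA-Z][a-zA-Z]\).*/\1/p' | head -n 1
}
```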
- Make sure that your search engine returns different results than you get when querying the instructor's backend. That's because the instructor's database of articles will be different from yours (the articles are older). If you're getting the same results as you got in HW3, that means your frontend is still connected to the instructor's backend, not your own.
Only one of the partners should submit the items below on Canvas. Please do not submit two video links because that might cause the TAs to grade the same thing twice. The other partner should just make a submission listing the name and netid of their partner.
You must record a screencast video demonstrating your app. The video should be less than two minutes long, and you may give a voiceover if explanations are necessary. The video should:
- Demonstrate your user interface, including all the features of HW3.
- Show the public http url of your user interface (you should not be accessing an html file on your computer).
- Show in the AWS EC2 console that your ETL instance has been busy every two hours.
- Show in the Elasticsearch console that your database has been loading new documents several times during a day.
- Show that the number of documents indexed by Elasticsearch is growing over time.
- Show the cron configuration file or Cloudwatch web console configuration that periodically runs your ETL task.
- If you used Lambda, show its configuration in the AWS web console.
- If you used Fargate, show its configuration in the AWS web console and show your dockerfile source code.
After your video is completed, you may shut down your Elasticsearch and EC2 instances (to save money).
You may record the video using any tool you like, and you may post it to any location that provides a url that the TAs can view. I suggest these tools:
Your Canvas text submission should include:
- Your partner's name and netid.
- A short description of what each partner did for the assignment (both should submit this independently).
- The url of your demo video (only one of you should include this)