Overview

In this project you will write a tool for network exploration and security auditing.  Your tool will take as input a list of domains, probe those domains/websites in several different ways, and print a report detailing some network characteristics and security features/capabilities of each domain.

You may work in pairs.

Note: this is a new project, so please let us know if you find any issues with the instructions.  If we discover any serious issues that require a significant change in the instructions we will send an announcement via Canvas.

Learning goals

After completing this project, students should:

Part 1: Scanner Framework

This part is worth 10% of the total points.

Write a Python 3 program that takes a list of web domains as an input and outputs a JSON dictionary with information about each domain.

Your program will be invoked as follows:

$ python3 scan.py [input_file.txt] [output_file.json]

where the first parameter is a filename for the input file, which should be in the current directory and should contain a list of domains to test, and the second is a filename for the output file.  You can test with the following files, and also write your own input files: test_websites.txt, popular_websites.txt, random_websites.txt.

The output of your program is a JSON dictionary written to a file.  Its keys are the domains that were scanned and its values are dictionaries with scan results.  For example:

{
    "northwestern.edu": {
        "scan_time": 1605038710.32,
        "ipv4_addresses": ["129.105.136.48"],
        "ipv6_addresses": [],
        "http_server": "Apache",
        ...
    },
    "google.com": {
        "scan_time": 1605038714.20,
        "ipv4_addresses": ["172.217.6.110", "216.58.192.206", "172.217.1.46"],
        "ipv6_addresses": ["2607:f8b0:4009:800::200e"],
        "http_server": "gws",
        ...
    }
}

The example above shows four scan results (scan_time, ipv4_addresses, ipv6_addresses, and http_server).  In Part 2 you will implement the ipv4_addresses, ipv6_addresses, and http_server scanners (and many others), but you should start by just printing scan_time, which is the time you started scanning each domain, expressed in Unix epoch seconds.

Please make your output human-readable by including indentation.  This can be done with code like:

import json

with open(file_name, "w") as f:
    json.dump(json_object, f, sort_keys=True, indent=4)

There is no provided code for this assignment.  Your code may be in one file or multiple files, and it may include subdirectories if needed.  You may use any of the standard libraries and you may even "pip install" 3rd-party libraries.  However, if you do use 3rd-party libraries, please include a requirements.txt file that lists the packages that are required.  This can be generated by running:

$ pip freeze > requirements.txt

You will submit a .tgz archive of your code, as described later.

Please use a Python virtual environment to ensure that you are starting with no 3rd party packages installed.  After you create and activate the new virtual environment, you can install additional packages.
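For example, a typical sequence looks like this (the environment name "env" and the package shown are just illustrations):

$ python3 -m venv env
$ source env/bin/activate
(env) $ pip install maxminddb
(env) $ pip freeze > requirements.txt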

Part 2: Network Scanners

This part is worth 72% of the total points (6% per scanner).

The subsections below each describe a network scan you must implement.  For many of these scans there are command-line tools that can do most of the work.  Your Python code can run a command-line process using the subprocess module.  For example, try running the following in a Python3 shell:

import subprocess
result = subprocess.check_output(
    ["nslookup", "northwestern.edu", "8.8.8.8"],
    timeout=2,
    stderr=subprocess.STDOUT).decode("utf-8")
print(result)

The result string contains the stdout from the command-line invocation of "nslookup northwestern.edu 8.8.8.8".  For example, you can parse this result in your Python code to extract the IPv4 address(es) of northwestern.edu.

In general, you should experiment with a tool on the command line first to find the right syntax before you attempt to drive it from Python.

You should do your development and testing on moore because the command-line tools on your machine might be slightly different versions, and thus print results in a different format.  For example, Windows has an nslookup command, but its output is formatted differently than that of the Linux nslookup command on moore.  We'll be testing on moore.  If you wanted to make your script robust and platform-independent, you would have to do the work entirely in Python, without relying on command-line tools (but we're recommending the quick-and-dirty, platform-specific approach).

Your code should not crash if a required command-line tool is missing.  For example, Part 2k requires the telnet command, which is installed on moore but probably not installed on your home machine.  If a required command-line tool is missing you must print an error message to stderr and skip the particular scan.  If a scan is skipped, do not include the corresponding key in the output dictionary.

Your code should not hang for a long time if a host is not reachable.  None of the commands you'll run should take more than a few seconds to complete.  Notice that in the example code above I included "timeout=2" to stop the process after two seconds.  You'll have to catch a TimeoutExpired exception.
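For example, a helper along these lines handles both failure modes (a sketch; the function name and error messages are my own):

import subprocess
import sys

def run_command(args, timeout=2):
    """Run a command-line tool and return its stdout as a string,
    or None if the tool is missing, fails, or times out."""
    try:
        return subprocess.check_output(
            args, timeout=timeout, stderr=subprocess.STDOUT).decode("utf-8")
    except FileNotFoundError:
        print("skipping scan: %s is not installed" % args[0], file=sys.stderr)
    except subprocess.TimeoutExpired:
        print("command timed out: %s" % " ".join(args), file=sys.stderr)
    except subprocess.CalledProcessError:
        print("command failed: %s" % " ".join(args), file=sys.stderr)
    return None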

a) scan_time

The time when you started scanning the domain, expressed in Unix epoch seconds (seconds since 1970).  The value should be a JSON number (integer or floating point).
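For example, the standard time module provides exactly this:

import time

scan_time = time.time()  # Unix epoch seconds, e.g. 1605038710.32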

Output example:

    "scan_time" : 1605038710.32

b) ipv4_addresses

A list of IPv4 addresses listed as DNS "A" records for the domain.  I suggest you use the nslookup command-line tool for this, as demonstrated above.

Output example:

     "ipv4_addresses": ["172.217.6.110", "216.58.192.206", "172.217.1.46"],

In more detail:

c) ipv6_addresses

A list of IPv6 addresses listed as DNS "AAAA" records for the domain.  Again, I suggest you use the nslookup command-line tool for this.  You'll have to add "-type=AAAA" to the command.  You may return an empty list if IPv6 is not supported.  All of the details above also apply to IPv6, so I suggest you write a function that runs the same basic scan with different DNS record types (A vs AAAA).
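For example, one helper can serve both parts (a sketch; the parsing assumes the Linux nslookup output format on moore, so verify it against real output):

import subprocess

def dns_lookup(domain, record_type="A"):
    """Query DNS via nslookup and return a list of the addresses in
    the answer section (empty list on any failure)."""
    try:
        output = subprocess.check_output(
            ["nslookup", "-type=" + record_type, domain, "8.8.8.8"],
            timeout=2, stderr=subprocess.STDOUT).decode("utf-8")
    except (OSError, subprocess.SubprocessError):
        return []
    addresses = []
    in_answer = False  # skip the resolver's own "Address:" line
    for line in output.splitlines():
        if line.startswith("Name:"):
            in_answer = True
        elif in_answer and line.startswith("Address:"):
            addresses.append(line.split()[-1])
    return addresses

Then dns_lookup(domain, "A") gives ipv4_addresses and dns_lookup(domain, "AAAA") gives ipv6_addresses.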

Output example:

     "ipv6_addresses": ["2607:f8b0:4009:800::200e"]

d) http_server

The web server software reported in the Server header of the HTTP response.  For this scan (and all the scans below) you may assume that all the IP addresses running the website are configured identically (you may scan just one).

There are many ways to implement this scan, including the "curl" command, python's http.client, python's requests library, or an openssl connection (as described below).

If no server header is provided in the response, you should set the value to null.  Note that this is a special JSON null value, not the string "null" (there are no quotes).
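For example, with python's http.client (a minimal sketch that checks only port 80; some sites answer only on HTTPS, which you would handle similarly with HTTPSConnection):

import http.client

def get_http_server(domain):
    """Return the Server header of an HTTP response, or None."""
    try:
        conn = http.client.HTTPConnection(domain, 80, timeout=2)
        conn.request("GET", "/")
        return conn.getresponse().getheader("Server")  # None if absent
    except (OSError, http.client.HTTPException):
        return None

Python's None value is serialized by json.dump as a JSON null.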

Output example:

    "http_server" : "Apache/2.2.34 (Amazon)"
or
"http_server" : null

Side note: In Project 1 we used the "telnet" command to create a TCP socket over which we could type out (and debug) the HTTP protocol.  If you want to do this with a server that requires an encrypted connection (HTTPS/TLS), you must create an encrypted connection using a command like the following (which connects to "google.com"):

$ openssl s_client -crlf -connect google.com:443

then to make a simple HTTP request you can type:

GET / HTTP/1.0
Host: google.com

<press return twice at the end to create a blank line>

You should see the HTTP response headers and some HTML printed to the screen.  You may also see some TLS information printed when the encryption parameters are renegotiated.

e) insecure_http

Return a JSON boolean indicating whether the website listens for unencrypted HTTP requests on port 80. 

Output example:

     "insecure_http": true

Notice above that JSON booleans do not have quotes (they are not strings).
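For example (a sketch; it counts any well-formed HTTP response on port 80 as "listening"):

import http.client

def insecure_http(domain):
    """Return True if the domain answers unencrypted HTTP on port 80."""
    try:
        conn = http.client.HTTPConnection(domain, 80, timeout=2)
        conn.request("GET", "/")
        conn.getresponse()
        return True
    except (OSError, http.client.HTTPException):
        return False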

f) redirect_to_https

Return a JSON boolean indicating whether unencrypted HTTP requests on port 80 are redirected to HTTPS requests on port 443.  Note that there are several ways for HTTP responses to indicate redirection, including a 3xx status code (e.g., 301 Moved Permanently or 302 Found) combined with a Location response header giving the new URL.

Note also that you may have to go through a chain of several redirects before finally being redirected to HTTPS.  You can give up if the website is broken and redirects you more than 10 times.  Return true if you eventually reach an HTTPS page.

If the webserver does not listen for HTTP requests at all, then you should return false.
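For example, the chain can be followed manually with http.client (a sketch; it stops as soon as the chain reaches an https:// URL, and it does not handle relative Location headers or query strings):

import http.client
from urllib.parse import urlparse

def redirects_to_https(domain, max_hops=10):
    """Follow up to max_hops redirects starting from http://domain/,
    returning True if the chain reaches an https:// URL."""
    url = "http://" + domain + "/"
    for _ in range(max_hops):
        parts = urlparse(url)
        if parts.scheme == "https":
            return True
        try:
            conn = http.client.HTTPConnection(parts.netloc, timeout=2)
            conn.request("GET", parts.path or "/")
            response = conn.getresponse()
        except (OSError, http.client.HTTPException):
            return False
        if response.status // 100 != 3:
            return False  # reached a final page without HTTPS
        url = response.getheader("Location", "")
    return False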

Output example:

    "redirect_to_https": false

g) hsts

Return a JSON boolean indicating whether the website has enabled HTTP Strict Transport Security.  This tells the browser to remember to refuse connecting to the domain except with encryption (TLS) enabled.  You should check for the appropriate HTTP response header on the final page that you are redirected to.
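For example (a sketch that checks a single HTTPS page; a full implementation must first follow the redirect chain from part (f) and check the final page):

import http.client

def has_hsts(host):
    """Return True if one HTTPS page sets the HSTS response header."""
    try:
        conn = http.client.HTTPSConnection(host, timeout=2)
        conn.request("GET", "/")
        response = conn.getresponse()
        return response.getheader("Strict-Transport-Security") is not None
    except (OSError, http.client.HTTPException):
        return False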

Output example:

    "hsts": false

h) tls_versions

List all versions of Transport Layer Security (TLS/SSL) supported by the server, as a list of strings (in no particular order).  The options are: "SSLv2", "SSLv3", "TLSv1.0", "TLSv1.1", "TLSv1.2", and "TLSv1.3".

Output example:

"tls_versions": ["TLSv1.2", "TLSv1.1", "TLSv1.0"]

Note: The powerful network scanning tool nmap can provide lots of TLS information with a command like:

$ nmap --script ssl-enum-ciphers -p 443 northwestern.edu

However, nmap does not support the latest version of TLS (1.3), so you should use openssl to test for specific versions of TLS.  For example, here is a server that supports TLSv1.3:

$ echo | openssl s_client -tls1_3 -connect tls13.cloudflare.com:443

and here is one that does not ☹️:

$ echo | openssl s_client -tls1_3 -connect stevetarzia.com:443

Note that the version of openssl installed on most Macs does not support TLSv1.3, so you should run on moore to test this part.  In Python, you can mimic the "echo |" that feeds empty input into the openssl command by passing input=b'' to your check_output call.
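For example, you can probe one version at a time (a sketch; the -ssl2 and -ssl3 flags are often not compiled into openssl, so the dictionary below covers only the TLS versions, and you should extend it as your openssl allows):

import subprocess

# map each reported version string to the corresponding openssl flag
OPENSSL_FLAGS = {
    "TLSv1.0": "-tls1",
    "TLSv1.1": "-tls1_1",
    "TLSv1.2": "-tls1_2",
    "TLSv1.3": "-tls1_3",
}

def tls_versions(domain):
    """Return the list of TLS versions for which a handshake succeeds."""
    supported = []
    for version, flag in OPENSSL_FLAGS.items():
        try:
            subprocess.check_output(
                ["openssl", "s_client", flag, "-connect", domain + ":443"],
                timeout=2, input=b"", stderr=subprocess.DEVNULL)
            supported.append(version)  # exit status 0: handshake succeeded
        except (OSError, subprocess.SubprocessError):
            pass  # handshake failed, timed out, or openssl is missing
    return supported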

i) root_ca

List the root certificate authority (CA) at the base of the chain of trust for validating this server's public key.  Just list the "organization name" (you'll find this under "O").  Openssl can give you this with a command like:

$ echo | openssl s_client -connect stevetarzia.com:443

If the domain does not support TLS then give null as the value.
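For example (a rough sketch; the regular expression below matches the "i:C = US, O = ..." issuer lines printed by recent openssl versions, which is an assumption you must verify against the openssl on moore):

import re
import subprocess

def root_ca(domain):
    """Return the organization name of the root CA, or None."""
    try:
        output = subprocess.check_output(
            ["openssl", "s_client", "-connect", domain + ":443"],
            timeout=2, input=b"", stderr=subprocess.DEVNULL).decode("utf-8")
    except (OSError, subprocess.SubprocessError):
        return None
    # In the "Certificate chain" section, the issuer ("i:") of the
    # last certificate listed is the root CA; grab its O= field.
    issuers = re.findall(r"i:.*?O = ([^,\n]+)", output)
    return issuers[-1] if issuers else None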

Output example:

"root_ca": "Digital Signature Trust Co."

j) rdns_names

List the reverse DNS names for the IPv4 addresses listed in Part B.  You can get these by querying DNS for PTR records.  Note that you may get multiple names (or no names) for each IP address; list whatever you get, even if it's an empty list.
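For example (a sketch; nslookup performs the PTR query when given a bare IP address, and the parsing again assumes the Linux output format):

import subprocess

def rdns_names(ipv4_addresses):
    """Return all PTR names found for the given IPv4 addresses."""
    names = []
    for ip in ipv4_addresses:
        try:
            output = subprocess.check_output(
                ["nslookup", "-type=PTR", ip],
                timeout=2, stderr=subprocess.STDOUT).decode("utf-8")
        except (OSError, subprocess.SubprocessError):
            continue
        # answer lines look like:
        # "110.6.217.172.in-addr.arpa  name = ord37s03-in-f14.1e100.net."
        for line in output.splitlines():
            if "name = " in line:
                names.append(line.split("name = ")[1].rstrip("."))
    return names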

Output example:

"rdns_names": ["ord37s03-in-f110.1e100.net", "ord37s03-in-f14.1e100.net",
"ord30s25-in-f14.1e100.net",  "ord30s25-in-f206.1e100.net",
"
ord37s07-in-f14.1e100.net", "ord37s07-in-f46.1e100.net"]

Side note: you may be wondering what 1e100.net is.

k) rtt_range

Report the shortest and longest round trip times (RTT) you observe when contacting all the IPv4 addresses listed in Part B.  You should give these as a list of two numbers with the RTT in milliseconds: [min, max].

I suggest using the following command to measure round trip time.  This creates a TCP connection and immediately tears it down.  You'll want to record the "real" time it returns:

$ sh -c "time echo -e '\x1dclose\x0d' | telnet 172.217.6.110 443"
Trying 172.217.6.110...
Connected to 172.217.6.110.
Escape character is '^]'.

telnet> close
Connection closed.

real	0m0.003s
user	0m0.001s
sys	0m0.001s

Above I am using the time command in the Bourne shell ("sh"), and the output tells us that the RTT was 3 ms.  In your python script you can either run the messy command above or implement something like it directly in python using time.time() and socket.socket().  If you do implement it in python, you should compare your results to those of the command above to ensure they are consistent.

If the domain is not reachable on any common ports (80, 22, 443), then return a null value.
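The python version looks roughly like this (a sketch that tries a single port; your code should fall back to ports 80 and 22 if 443 is closed):

import socket
import time

def measure_rtt(ip, port=443):
    """Return the RTT in milliseconds for one TCP handshake, or None."""
    try:
        start = time.time()
        sock = socket.create_connection((ip, port), timeout=2)
        rtt = (time.time() - start) * 1000  # seconds -> milliseconds
        sock.close()
        return rtt
    except OSError:
        return None

rtt_range is then [min(rtts), max(rtts)] over the measurements for all the IPv4 addresses.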

Output example:

"rtt_range": [4, 20]

L) geo_locations

List the set of real-world locations (city, province, country) for all the IPv4 addresses listed in Part B.  Do not repeat locations.  You should use the MaxMind IP Geolocation database via the maxminddb Python library.  I have uploaded a recent database file to Canvas: GeoLite2-City_20201103.tar.gz

You should extract the file "GeoLite2-City.mmdb" from the archive above and place it in your working directory.  The maxminddb library requires that database file.  When we test your code we will copy the file "GeoLite2-City.mmdb" into the current directory.

You can test your results with this website: https://www.maxmind.com/en/geoip2-precision-demo 
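For example (a sketch; reader.get() returns None for unknown addresses and some records lack the city or subdivisions fields, so real code must tolerate missing keys):

import maxminddb

reader = maxminddb.open_database("GeoLite2-City.mmdb")
record = reader.get("129.105.136.48")
city = record["city"]["names"]["en"]                 # e.g. "Evanston"
province = record["subdivisions"][0]["names"]["en"]  # e.g. "Illinois"
country = record["country"]["names"]["en"]           # e.g. "United States"
location = ", ".join([city, province, country])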

Output example:

"geo_locations": ["Evanston, Illinois, United States", "New York, New York, United States"]

Part 3: Report

This part is worth 18% of the total points.

Write a python script "report.py" that prints an ASCII text report summarizing the results from Part 2.  It will take as parameters a filename for a json file in the format you generated in Part 2 and a filename for the text file to which it will print the report.

Your program will be invoked as follows:

$ python3 report.py [input_file.json] [output_file.txt]

To make your report readable and attractive, I suggest you use the texttable library (a minimal example follows the list below).  The report should contain:

  1. A textual or tabular listing of all the information returned in Part 2, with a section for each domain.
  2. A table showing the RTT ranges for all domains, sorted by the minimum RTT (ordered from fastest to slowest).
  3. A table showing the number of occurrences for each observed root certificate authority (from Part 2i), sorted from most popular to least.
  4. A table showing the number of occurrences of each web server (from Part 2d), ordered from most popular to least.
  5. A table showing the percentage of scanned domains supporting:
     - each version of TLS/SSL listed in part (h)
     - plain, unencrypted HTTP (part (e))
     - redirection from HTTP to HTTPS (part (f))
     - HSTS (part (g))
     - IPv6 (part (c))
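Here is a minimal texttable example (the rows shown are made up):

import texttable

table = texttable.Texttable()
table.header(["domain", "min RTT (ms)", "max RTT (ms)"])
table.add_row(["google.com", 4, 20])
table.add_row(["northwestern.edu", 6, 9])
print(table.draw())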

Submission

Testing

Code will be graded on moore.wot.eecs.northwestern.edu.  Please test your code there before submitting.

After you make your tarball, unpack it and run your scripts in a fresh directory to make sure the archive includes everything that's needed for the TA to run it.  For example (the archive and directory names below are placeholders):
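$ mkdir fresh_dir && cd fresh_dir
$ tar xzf ../project.tgz
$ cp ../GeoLite2-City.mmdb .
$ python3 scan.py test_websites.txt output.json
$ python3 report.py output.json report.txt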