Assignment 7

Goals

The goal of this assignment is to work with the file system, concurrency, and structural pattern matching in Python.

Instructions

You will be doing your work in a Jupyter notebook for this assignment. You may choose to work on this assignment in a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.13 for your work. (Older versions may work, but your code will be checked with Python 3.13.) To use tiger, use the credentials you received. If you work remotely, make sure to download the .ipynb file to turn in. If you choose to work locally, Anaconda, miniforge, or uv are probably the easiest ways to install and manage Python. You can start JupyterLab either from the Navigator application (Anaconda) or via the command line as jupyter-lab or jupyter lab.

In this assignment, we will be working with synthetic cybersecurity logs. We will be using two types of logs: one that mirrors the output from the psutil library and one that stores network connection information in binary form. These files have been gathered from different data centers as archived directories stored as zip files. You will download these zip files, extract and filter the files from the archives, and parse and filter the log data to find suspicious activity. You will use concurrency to download and process the data.

Due Date

The assignment is due at 11:59pm on Friday, November 21.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a7.ipynb.

Details

Please make sure to follow instructions to receive full credit. Because you will be writing classes and adding to them, you do not need to separate each part of the assignment. Please document any shortcomings with your code. You may put the code for each part into one or more cells. Note that CSCI 503 students must use asyncio; it is optional for CSCI 490 students.

0. Name & Z-ID (5 pts)

The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.

1. Download & Extract Log Files

There are six zip files posted on the course web site that you will download using Python. Their filenames are us-east-1.zip, us-west-2.zip, eu-central-1.zip, ap-south-1.zip, sa-east-1.zip, and ap-southeast-1.zip, signifying the regions of the data centers the logs were gathered from. CSCI 503 students will use aiohttp for this task, while CSCI 490 students may use the requests library. (If you have trouble with this part and wish to continue with other parts of the assignment, download the files manually.)

1a. [CSCI 490] Use requests (20 pts)

To download the files, use the requests library. (This library is installed on tiger, but if you work locally, you may need to install it; conda install requests should work.) Note that each download is a two-step process: first “get”-ing the file and then writing the response to a file. Consult the documentation. After downloading the archives, you should extract all the zip files into the local directory. Remember the DRY principle here. Also, once you download and extract the files into the working directory, rerunning the code to test it may not work as expected because the files will already be there. To handle this situation, write code that checks (1) whether an archive has already been downloaded before downloading it, and (2) whether the extracted directory already exists before extracting it. You may also write code that deletes those files so you can verify your code works, but be careful that you do not delete other files in the process! A minimal sketch appears after the hints below.

Hints
  • Note that all the archives expand into a directory named data.
  • When writing the archive, you will probably want to open the file in binary write mode.
  • Use os.path.exists or pathlib.Path.exists to check whether a path exists.
  • Either zipfile or shutil will be useful in extracting the archive.
  • shutil’s rmtree can be used to delete a directory.
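
Here is a minimal sketch of one possible shape for this part, not the required solution. The BASE_URL value is a placeholder for the actual course web site location, and extracting each archive into its own region-named directory (so the six data directories do not collide) is an assumption; adjust both to your situation.

    import zipfile
    from pathlib import Path

    import requests

    BASE_URL = "https://example.edu/csci503/a7"   # placeholder URL
    REGIONS = ["us-east-1", "us-west-2", "eu-central-1",
               "ap-south-1", "sa-east-1", "ap-southeast-1"]

    def download_and_extract(region):
        archive = Path(f"{region}.zip")
        if not archive.exists():                  # (1) skip completed downloads
            resp = requests.get(f"{BASE_URL}/{archive.name}")
            resp.raise_for_status()
            archive.write_bytes(resp.content)     # binary write
        dest = Path(region)
        if not dest.exists():                     # (2) skip completed extractions
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(dest)               # expands to dest/data/...

    for region in REGIONS:
        download_and_extract(region)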

1b. [CSCI 503] Use aiohttp (30 pts)

To download the files, use the aiohttp library and asyncio to download all the files concurrently. (This library is installed on tiger, but if you work locally, you may need to install it; conda install aiohttp should work. You may also need to install nest_asyncio via pip.) While you can refer to the example we showed in class, remember that you need to save each file locally after downloading it. This should be handled in a separate async coroutine, but note that the file-writing code itself will be synchronous because operating systems generally do not support asynchronous file I/O. After downloading the archives, you should extract all the zip files into the local directory. This may be done synchronously! Remember the DRY principle here. Also, once you download and extract the files into the working directory, rerunning the code to test it may not work as expected because the files will already be there. To handle this situation, write code that checks (1) whether an archive has already been downloaded before downloading it, and (2) whether the extracted directory already exists before extracting it. You may also write code that deletes those files so you can verify your code works, but be careful that you do not delete other files in the process! A hedged sketch appears after the hints below.

Hints
  • You will probably need to run import nest_asyncio; nest_asyncio.apply() in the notebook
  • Note that all the archives expand into a directory named data.
  • When writing the archive, you will probably want to open the file in binary write mode.
  • Use os.path.exists or pathlib.Path.exists to check whether a path exists.
  • Either zipfile or shutil will be useful in extracting the archive.
  • shutil’s rmtree can be used to delete a directory.
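
A hedged sketch along the same lines, reusing the placeholder BASE_URL and REGIONS names from the Part 1a sketch:

    import asyncio
    import zipfile
    from pathlib import Path

    import aiohttp
    import nest_asyncio

    nest_asyncio.apply()   # allow asyncio.run inside Jupyter's running event loop

    async def download(session, region):
        archive = Path(f"{region}.zip")
        if not archive.exists():                  # (1) skip completed downloads
            async with session.get(f"{BASE_URL}/{archive.name}") as resp:
                resp.raise_for_status()
                data = await resp.read()
            archive.write_bytes(data)             # the file write itself is synchronous
        return archive

    async def download_all():
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(download(session, r) for r in REGIONS))

    for region, archive in zip(REGIONS, asyncio.run(download_all())):
        dest = Path(region)
        if not dest.exists():                     # (2) skip completed extractions
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(dest)               # extraction can be synchronous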

2. Find Matching Files (10 pts)

The zip files are structured differently depending on the location. Regardless, any psutil logs are in a psutil subdirectory and end with the .json extension. Any network logs are stored in a netevent subdirectory and end with the .bin extension. Ignore any files with extensions that are not .json or .bin for their respective directories (e.g. netevent/machine1-2025-04-03.json is not a valid log). Create two lists of pathlib.Path objects: one for psutil paths and one for netevent paths. Note that some files may reside in subdirectories (or sub-subdirectories). Do not move the files into different directories; keep them where they were originally extracted! A sketch appears after the hints below.

Hints
  • There is more than one way to accomplish this, depending on which libraries you use.
  • Remember that your code needs to check subdirectories.
  • You can use recursive search (**) more than once in glob expressions.
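
One possible sketch, assuming the per-region extraction directories from the Part 1 sketches; if you extracted everything into one place, glob from that directory instead:

    from pathlib import Path

    psutil_paths, netevent_paths = [], []
    for region in REGIONS:
        root = Path(region)
        # .json files under any psutil directory, at any depth
        psutil_paths.extend(root.glob("**/psutil/**/*.json"))
        # .bin files under any netevent directory, at any depth
        netevent_paths.extend(root.glob("**/netevent/**/*.bin"))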

3. Parse and Filter the psutil Logs

a. Read the Logs (5 pts)

The logs are in JSON format. Write a method read_psutil_log that takes a file path as input, reads the JSON data from the file using the json module, and returns a list of dictionaries, each entry representing one process’s information in the psutil.as_dict schema.
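
A minimal sketch, assuming each log file holds one top-level JSON list:

    import json

    def read_psutil_log(path):
        # return a list of process-info dicts (psutil.as_dict schema)
        with open(path) as f:
            return json.load(f)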

b. Filter and Label Suspicious Processes (15 pts)

Use structural pattern matching to identify and classify process anomalies. You should locate and label for the following issues:

  • Any process whose status is “zombie” (label: “zombie process”)
  • Any process that has high CPU (> 90%) or memory usage (> 2e9) but is not apt, gcc, chrome, or ffmpeg (label: “high resource usage”).
  • Any process named python, bash, nc, or ncat being run by the root user (label: “suspicious root process”).
  • Any process run by the root user that is running in /tmp, /home, or /dev (this path will be in the exe field) (label: “suspicious root path”).

Write a method filter_suspicious_processes that takes a list of process information dictionaries and returns a list of dictionaries describing the suspicious processes. Add the “suspicious” key to each returned dictionary with its value specifying the issue (see labels above). Each case should identify exactly one of these types (i.e. you cannot nest if statements inside a single case), and you must use the mapping pattern in your cases.
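
A hedged sketch of one way to structure the match statement. The keys status, name, username, cpu_percent, and exe follow the psutil.as_dict naming, but memory_rss is a guess at how the synthetic logs store memory usage; check the actual data and adjust the key names.

    def filter_suspicious_processes(processes):
        suspicious = []
        for proc in processes:
            match proc:
                case {"status": "zombie"}:
                    proc["suspicious"] = "zombie process"
                case {"name": name, "cpu_percent": cpu, "memory_rss": mem} if (
                    (cpu > 90 or mem > 2e9)
                    and name not in ("apt", "gcc", "chrome", "ffmpeg")
                ):
                    proc["suspicious"] = "high resource usage"
                case {"username": "root", "name": "python" | "bash" | "nc" | "ncat"}:
                    proc["suspicious"] = "suspicious root process"
                case {"username": "root", "exe": exe} if exe.startswith(
                    ("/tmp", "/home", "/dev")
                ):
                    proc["suspicious"] = "suspicious root path"
                case _:
                    continue          # not suspicious; skip the append below
            suspicious.append(proc)
        return suspicious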

4. Parse and Filter the netevent Logs

a. NetworkEvent class (5 pts)

Write a class NetworkEvent with fields source_ip, dest_ip, src_port, dest_port, protocol, bytes_sent, bytes_received. This can be a dataclass. It should have a constructor to initialize instance variables.
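
A minimal sketch using a dataclass, which generates the constructor automatically:

    from dataclasses import dataclass

    @dataclass
    class NetworkEvent:
        source_ip: list        # e.g. [198, 51, 100, 41] once parsed in Part 4b
        dest_ip: list
        src_port: int
        dest_port: int
        protocol: object       # becomes a Protocol enum member in Part 4b
        bytes_sent: int
        bytes_received: int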

b. Parse method (15 pts)

Write a method read_net_log that takes a file path as input, reads the binary data from the file, unpacks it using the struct module, and returns a list of NetworkEvent instances. The binary data is in network byte order (big-endian) and consists of a magic number followed by a sequence of entries. The magic number (4 bytes, unsigned int) will always be 0xff4e4554 to identify the file as a network event log. Then, each record is 21 bytes, structured as:

  • source_ip: 4 bytes (unsigned int)
  • dest_ip: 4 bytes (unsigned int)
  • src_port: 2 bytes (unsigned short)
  • dest_port: 2 bytes (unsigned short)
  • protocol: 1 byte (unsigned char)
  • bytes_sent: 4 bytes (unsigned int)
  • bytes_received: 4 bytes (unsigned int)

You should create an enumeration for protocol types (ICMP=1, ST=5, TCP=6, UDP=17, and SCTP=132) and convert each protocol value into a member of that enumeration, and parse the IP addresses (source_ip and dest_ip) into lists of integers. Each IP address is 4 bytes that can be translated to [x, y, z, w], where each letter is a single byte (translated to an integer) of the address. So an IP stored as 0xc6336429 represents 198.51.100.41, and we want to store it as [198, 51, 100, 41]. A hedged sketch appears after the hints below.

Hints:
  • Check that the magic number is correctly read in before processing the records. Remember to set the endianness.
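
A hedged sketch of the parsing. Note that Protocol(value) raises ValueError for values outside the enumeration, so this assumes the logs contain only the five listed protocols:

    import struct
    from enum import Enum
    from pathlib import Path

    class Protocol(Enum):
        ICMP = 1
        ST = 5
        TCP = 6
        UDP = 17
        SCTP = 132

    MAGIC = 0xFF4E4554
    RECORD = struct.Struct(">IIHHBII")   # big-endian, 4+4+2+2+1+4+4 = 21 bytes

    def ip_to_list(addr):
        # split a 32-bit address into its four bytes: 0xc6336429 -> [198, 51, 100, 41]
        return [(addr >> shift) & 0xFF for shift in (24, 16, 8, 0)]

    def read_net_log(path):
        data = Path(path).read_bytes()
        (magic,) = struct.unpack_from(">I", data)     # magic number at offset 0
        if magic != MAGIC:
            raise ValueError(f"{path}: bad magic number {magic:#x}")
        events = []
        for offset in range(4, len(data), RECORD.size):
            src, dst, sport, dport, proto, sent, recv = RECORD.unpack_from(data, offset)
            events.append(NetworkEvent(ip_to_list(src), ip_to_list(dst), sport,
                                       dport, Protocol(proto), sent, recv))
        return events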

c. Filter and Label Suspicious Connections (15 pts)

Use structural pattern matching to identify and classify suspicious network connections. You should look for:

  • Any traffic on ports 4444, 4242, 1337, and 31337 (label: “suspicious port”).
  • Large uploads: bytes_sent > 1e6 and bytes_sent > 10 * bytes_received (label: “large upload”)
  • Unknown protocols (not TCP or UDP) (label: “unknown protocol”)
  • Mismatched protocols for ports: TCP protocol being used on ports 123, 69, and 1900. UDP protocol being used on ports 22, 25, and 80. (label: “mismatched protocol”)
  • Destination IP addresses in 198.51.100.0/24 and 203.0.113.0/24 (/24 means the last number can be anything, so 198.51.100.45 and 203.0.113.123 are both problems) (label: “suspicious destination”)
  • Destination IP addresses of 127.0.0.1 when source ip is not 127.0.0.1 (label: “loopback destination”)

Write a method filter_suspicious_connections that takes a list of NetworkEvent instances and returns the suspicious connections as a list of dictionaries (JSON-style data). Again, add the “suspicious” key to each dictionary with its value specifying the issue. If you use a dataclass, you can use the dataclasses.asdict function to generate the dictionary automatically. You must use at least one sequence pattern and at least one or pattern in your cases. A hedged sketch appears after the hints below.

Hints
  • Remember that the IP addresses are lists
  • Guards may also be helpful
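
A hedged sketch of the match statement, using or patterns for the port lists and sequence patterns for the IP addresses. It reads “traffic on a port” as matching either endpoint of the connection; adjust if your interpretation differs.

    from dataclasses import asdict

    def filter_suspicious_connections(events):
        suspicious = []
        for ev in events:
            match ev:
                case (NetworkEvent(src_port=4444 | 4242 | 1337 | 31337)
                      | NetworkEvent(dest_port=4444 | 4242 | 1337 | 31337)):
                    label = "suspicious port"
                case NetworkEvent(bytes_sent=sent, bytes_received=recv) if (
                    sent > 1e6 and sent > 10 * recv
                ):
                    label = "large upload"
                case NetworkEvent(protocol=p) if p not in (Protocol.TCP, Protocol.UDP):
                    label = "unknown protocol"
                case (NetworkEvent(protocol=Protocol.TCP, dest_port=123 | 69 | 1900)
                      | NetworkEvent(protocol=Protocol.UDP, dest_port=22 | 25 | 80)):
                    label = "mismatched protocol"
                case NetworkEvent(dest_ip=[198, 51, 100, _] | [203, 0, 113, _]):
                    label = "suspicious destination"
                case NetworkEvent(dest_ip=[127, 0, 0, 1], source_ip=src) if src != [127, 0, 0, 1]:
                    label = "loopback destination"
                case _:
                    continue
            record = asdict(ev)            # works because NetworkEvent is a dataclass
            record["suspicious"] = label
            suspicious.append(record)
        return suspicious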

5. Process All Logs Concurrently (15 pts)

Finally, we are going to process all the logs using threads via concurrent.futures. Use a ThreadPoolExecutor to run the functions from Parts 3 and 4 on each matching file from Part 2. Take the suspicious results from each log and concatenate them together. You should find between 13,000 and 14,000 suspicious results. A sketch appears after the hint below.

Hints
  • Use concurrent.futures to run each thread, and consider the map function to wait for all results to complete
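
A minimal sketch, assuming the path lists from Part 2 and the functions from Parts 3 and 4 are in scope:

    from concurrent.futures import ThreadPoolExecutor

    def process_psutil_log(path):
        return filter_suspicious_processes(read_psutil_log(path))

    def process_net_log(path):
        return filter_suspicious_connections(read_net_log(path))

    with ThreadPoolExecutor() as executor:
        # map blocks until every per-file result is available
        psutil_results = executor.map(process_psutil_log, psutil_paths)
        net_results = executor.map(process_net_log, netevent_paths)
        all_suspicious = [rec for result in psutil_results for rec in result]
        all_suspicious += [rec for result in net_results for rec in result]

    print(len(all_suspicious))   # expect between 13,000 and 14,000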

Extra Credit

  • CSCI 490 students may do Part 1b instead of Part 1a for 10 points of extra credit.
  • Extract the machine name and date information from the filenames and include it in the reported JSON data (10 pts).