The goal of this assignment is to work with the file system, concurrency, and structural pattern matching in Python.
You will be doing your work in a Jupyter notebook for this
assignment. You may choose to work on this assignment on a hosted
environment (e.g. tiger)
or on your own local installation of Jupyter and Python. You should use
Python 3.13 for your work. (Older versions may work, but your code will
be checked with Python 3.13.) To use tiger, use the credentials you
received. If you work remotely, make sure to download the .ipynb file to
turn in. If you choose to work locally, Anaconda, miniforge, or uv are probably the easiest ways
to install and manage Python. You can start JupyterLab either from the
Navigator application (Anaconda) or via the command line as
jupyter-lab or jupyter lab.
In this assignment, we will be working with synthetic cybersecurity logs. We will be using two types of logs: one that mirrors the output from the psutil library and one that stores network connection information in binary form. These files have been gathered from different data centers as archived directories stored as zip files. You will download these zip files, extract and filter the files from the archives, and parse and filter the log data to find suspicious activity. You will use concurrency to download and process the data.
The assignment is due at 11:59pm on Friday, November 21.
You should submit the completed notebook file required for this
assignment on Blackboard. The
filename of the notebook should be a7.ipynb.
Please make sure to follow instructions to receive full credit. Because you will be writing classes and adding to them, you do not need to separate each part of the assignment. Please document any shortcomings with your code. You may put the code for each part into one or more cells. Note that CSCI 503 students must use asyncio, which is optional for CSCI 490 students.
The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
There are a bunch of zip files posted on the course web
site that you will download using Python. Their filenames are
us-east-1.zip, us-west-2.zip,
eu-central-1.zip, ap-south-1.zip,
sa-east-1.zip, ap-southeast-1.zip, signifying
the regions of the data centers we have gathered the logs from. CSCI 503
students will use aiohttp for this task
while CSCI 490 students may use the requests library. (If you have
trouble with this part and wish to continue with other parts of the
assignment, download the files manually.)
CSCI 490 students: to download the files, use the requests library. (This library
is installed on tiger, but if you work locally, you may need to install
it; conda install requests should work.) Note that each
download is a two-step process: first “get”-ing the file and then
writing the response to a file. Consult the documentation.
After downloading the archives, you should extract all the zip files
into the local directory. Remember the DRY principle here. Also, once
you download and extract the files into the working directory, rerunning
the code to test it may not work as expected because they will already
be there. To check for this situation, write code to check (1) if an
archive has already been downloaded before downloading it, and (2) if
the extracted directory already exists before extracting it. You may
also write code to delete those files in order to check that your code
works, but be careful that you do not delete other
files in the process!
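A minimal sketch of that loop, assuming a hypothetical BASE_URL (substitute the actual course address) and that each archive should unpack into a directory named after its region:

```python
import zipfile
from pathlib import Path

import requests

# Hypothetical URL -- substitute the actual course web site address.
BASE_URL = "https://example.edu/assign7"
REGIONS = ["us-east-1", "us-west-2", "eu-central-1",
           "ap-south-1", "sa-east-1", "ap-southeast-1"]

def fetch_and_extract(region):
    """Download one archive and extract it, skipping completed steps."""
    archive = Path(f"{region}.zip")
    if not archive.exists():                 # (1) already downloaded?
        response = requests.get(f"{BASE_URL}/{archive.name}")
        response.raise_for_status()
        archive.write_bytes(response.content)
    target = Path(region)
    if not target.exists():                  # (2) already extracted?
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)

for region in REGIONS:
    fetch_and_extract(region)
```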
Hint: use the exists method (os.path.exists or pathlib.Path.exists) to check for path existence.

CSCI 503 students: use the aiohttp library and asyncio to download all the files. (This library is installed on tiger,
but if you work locally, you may need to install it;
conda install aiohttp should work. You may also need to
install nest_asyncio via pip.) While you can refer to the
example
we showed in class, remember that you need to save each
file locally after downloading it. This should be handled as a separate
async coroutine, but note that the code in the method will be
synchronous because operating systems generally do not support
asynchronous file I/O. After downloading the archives, you should
extract all the zip files into the local directory. This may be done
synchronously! Remember the DRY principle here. Also,
once you download and extract the files into the working directory,
rerunning the code to test it may not work as expected because they will
already be there. To check for this situation, write code to check (1)
if an archive has already been downloaded before downloading it, and (2)
if the extracted directory already exists before extracting it. You may
also write code to delete those files in order to check that your code
works, but be careful that you do not delete other
files in the process!
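One possible shape for the asyncio version, again with a hypothetical BASE_URL; saving goes through a separate coroutine whose body is ordinary synchronous code:

```python
import asyncio
import zipfile
from pathlib import Path

import aiohttp
import nest_asyncio

nest_asyncio.apply()   # allow asyncio.run inside Jupyter's event loop

BASE_URL = "https://example.edu/assign7"   # hypothetical URL
REGIONS = ["us-east-1", "us-west-2", "eu-central-1",
           "ap-south-1", "sa-east-1", "ap-southeast-1"]

async def save_file(path, data):
    # Separate coroutine for saving; the body itself is synchronous
    # because operating systems generally lack async file I/O.
    path.write_bytes(data)

async def fetch_region(session, region):
    archive = Path(f"{region}.zip")
    if not archive.exists():                 # skip finished downloads
        async with session.get(f"{BASE_URL}/{archive.name}") as resp:
            await save_file(archive, await resp.read())
    target = Path(region)
    if not target.exists():                  # extraction may stay synchronous
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)

async def fetch_all():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch_region(session, r) for r in REGIONS))

asyncio.run(fetch_all())
```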
Hint: run import nest_asyncio; nest_asyncio.apply() in the notebook if asyncio.run complains about an already-running event loop.
Hint: use the exists method (os.path.exists or pathlib.Path.exists) to check for path existence.

The zip files are structured differently depending on the location.
Regardless, any psutil logs are in a psutil
subdirectory and end with the .json extension. Any network
logs are stored in a netevent subdirectory and end with the
.bin extension. Ignore any files with
extensions that are not .json or .bin for
their respective directories
(e.g. netevent/machine1-2025-04-03.json is not a valid
log). Create two lists of pathlib.Path objects: one for
psutil paths and one for netevent paths. Note that some files may reside
in subdirectories (or sub-subdirectories). Do not move
the files into different directories; keep them where they were
originally extracted!
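One way to build the two lists, assuming the extracted trees live under the working directory; filtering on the directory name rather than a fixed depth handles the nested cases with a single recursive wildcard:

```python
from pathlib import Path

root = Path(".")

# rglob uses one recursive wildcard; checking the path components
# catches psutil/netevent directories at any depth.
psutil_paths = [p for p in root.rglob("*.json") if "psutil" in p.parts]
netevent_paths = [p for p in root.rglob("*.bin") if "netevent" in p.parts]
```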
Hint: you cannot use the recursive wildcard (**) more than once in glob expressions.

The logs are in JSON format. Write a method
read_psutil_log that takes a file path as input, reads the
JSON data from the file using the json module, and returns
a list of dictionaries, each entry representing a process’s information
in psutil.as_dict schema.
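A minimal version, assuming each log file holds one JSON array of process dictionaries:

```python
import json

def read_psutil_log(path):
    """Return the list of psutil.as_dict-style dictionaries in one log."""
    with open(path) as f:
        return json.load(f)
```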
Use structural pattern matching to identify and classify process anomalies. You should locate and label the following issues:

- Processes named apt, gcc, chrome, or ffmpeg with high resource usage (label: “high resource usage”).
- python, bash, nc, or ncat being run by the root user (label: “suspicious root process”).
- Root processes whose executable lives in /tmp, /home, or /dev (this path will be in the exe field) (label: “suspicious root path”).

Write a method filter_suspicious_processes that takes a
list of process information dictionaries and returns a list of
suspicious process dictionaries. Add the “suspicious” key
to each dictionary with its value specifying the issue (see labels
above). Each case should identify one of these types (i.e. you cannot
nest if statements inside a single case), and you must use the mapping
pattern in your cases.
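A sketch of the match statement's shape using mapping patterns. The keys follow psutil.as_dict naming (name, username, exe, cpu_percent); the 90% CPU threshold in the first case is a placeholder assumption, so substitute the exact resource condition from the assignment:

```python
def filter_suspicious_processes(processes):
    suspicious = []
    for proc in processes:
        match proc:
            # Placeholder threshold -- substitute the assignment's condition.
            case {"name": "apt" | "gcc" | "chrome" | "ffmpeg",
                  "cpu_percent": cpu} if cpu > 90:
                suspicious.append(proc | {"suspicious": "high resource usage"})
            case {"name": "python" | "bash" | "nc" | "ncat",
                  "username": "root"}:
                suspicious.append(proc | {"suspicious": "suspicious root process"})
            case {"username": "root", "exe": str() as exe} if exe.startswith(
                    ("/tmp", "/home", "/dev")):
                suspicious.append(proc | {"suspicious": "suspicious root path"})
    return suspicious
```

Each case uses a mapping pattern and identifies exactly one label; the or patterns and guards keep the conditions inside the case clauses rather than in nested if statements.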
Write a class NetworkEvent with fields
source_ip, dest_ip, src_port,
dest_port, protocol, bytes_sent,
bytes_received. This can be a dataclass. It should have a
constructor to initialize instance variables.
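A dataclass keeps this to the field list; the generated __init__ serves as the constructor (the Protocol annotation refers to the enum you will define in the next part):

```python
from dataclasses import dataclass

@dataclass
class NetworkEvent:
    source_ip: list[int]      # four octets, e.g. [198, 51, 100, 41]
    dest_ip: list[int]
    src_port: int
    dest_port: int
    protocol: "Protocol"      # enum defined alongside read_net_log
    bytes_sent: int
    bytes_received: int
```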
Write a method read_net_log that takes a file path as
input, reads the binary data from the file, unpacks it using the struct
module, and returns a list of NetworkEvent instances. The
binary data is in network byte order (big-endian) and
consists of a magic number followed by a sequence of entries. The magic
number (4 bytes, unsigned int) will always be 0xff4e4554 to
identify the file as a network event log. Then, each record is 21 bytes, structured as:

- source_ip: 4 bytes (unsigned int)
- dest_ip: 4 bytes (unsigned int)
- src_port: 2 bytes (unsigned short)
- dest_port: 2 bytes (unsigned short)
- protocol: 1 byte (unsigned char)
- bytes_sent: 4 bytes (unsigned int)
- bytes_received: 4 bytes (unsigned int)
You should create an enumeration for
protocol types (ICMP=1, ST=5, TCP=6, UDP=17, and SCTP=132) and convert
the protocol into an object from that enumeration, and parse the IP
addresses (source_ip and dest_ip) into integer lists. The IP address
is 4 bytes that can be translated to [x,y,z,w] where each
letter is a single byte (translated to an
integer) of the address. So an IP stored as 0xc6336429
represents 198.51.100.41 and we want to store it as
[198, 51, 100, 41].
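A sketch of the parser for the layout above; the ! prefix in the struct format string selects network byte order, and IIHHBII totals the 21 bytes per record:

```python
import struct
from enum import Enum
from pathlib import Path

class Protocol(Enum):
    ICMP = 1
    ST = 5
    TCP = 6
    UDP = 17
    SCTP = 132

MAGIC = 0xFF4E4554
RECORD = struct.Struct("!IIHHBII")   # 4+4+2+2+1+4+4 = 21 bytes

def ip_to_octets(addr):
    """Split a 32-bit address into four integers, high byte first."""
    return [(addr >> shift) & 0xFF for shift in (24, 16, 8, 0)]

def read_net_log(path):
    data = Path(path).read_bytes()
    (magic,) = struct.unpack_from("!I", data)    # 4-byte header
    if magic != MAGIC:
        raise ValueError(f"{path} is not a network event log")
    events = []
    for offset in range(4, len(data), RECORD.size):
        src, dst, sport, dport, proto, sent, received = \
            RECORD.unpack_from(data, offset)
        events.append(NetworkEvent(ip_to_octets(src), ip_to_octets(dst),
                                   sport, dport, Protocol(proto),
                                   sent, received))
    return events
```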
Use structural pattern matching to identify and classify suspicious network connections. You should look for:

- Connections with bytes_sent > 1e6 and bytes_sent > 10 * bytes_received (label: “large upload”).
- Connections whose destination IP falls in the 198.51.100.0/24 or 203.0.113.0/24 subnet. (The /24 means that the last number can be anything, so 198.51.100.45 and 203.0.113.123 are both problems.) (label: “suspicious destination”)

Write a method filter_suspicious_connections that takes
a list of NetworkEvent instances and returns a list of
suspicious connections as dictionaries. Again, add the “suspicious” key to each
dictionary with its value specifying the issue. If you use a dataclass,
you can use the dataclasses.asdict function to generate the
dictionary automatically. You must use at least one
sequence pattern and at least one or pattern in your cases.
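A sketch combining a sequence pattern over the octet lists with an or pattern for the two subnets, assuming asdict-style dictionaries as the match subject:

```python
from dataclasses import asdict

def filter_suspicious_connections(events):
    suspicious = []
    for event in events:
        record = asdict(event)
        match record:
            case {"bytes_sent": sent, "bytes_received": received} \
                    if sent > 1e6 and sent > 10 * received:
                suspicious.append(record | {"suspicious": "large upload"})
            # Sequence pattern for the octets, or pattern for the two /24s.
            case {"dest_ip": [198, 51, 100, _] | [203, 0, 113, _]}:
                suspicious.append(record | {"suspicious": "suspicious destination"})
    return suspicious
```

Note that case order matters here: a large upload to a suspicious destination is labeled by the first matching case only.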
Finally, we are going to process all the logs using threads via concurrent.futures. Use a ThreadPoolExecutor to run the functions from Parts 3 and 4 on each matching file from Part 2. Take the suspicious results from each log and concatenate them. You should find between 13,000 and 14,000 suspicious results.
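One way to wire everything together, reusing the path lists from Part 2 and the helpers sketched above:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_psutil(path):
    return filter_suspicious_processes(read_psutil_log(path))

def scan_net(path):
    return filter_suspicious_connections(read_net_log(path))

with ThreadPoolExecutor() as pool:
    psutil_hits = pool.map(scan_psutil, psutil_paths)
    net_hits = pool.map(scan_net, netevent_paths)
    # Flatten the per-file result lists into one combined list.
    all_suspicious = [hit for result in [*psutil_hits, *net_hits]
                      for hit in result]

print(len(all_suspicious))   # expect between 13,000 and 14,000
```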