The goal of this assignment is to work with files, iterators, strings, and string formatting in Python.
You will be doing your work in a Jupyter notebook for this
assignment. You may choose to work on this assignment on a hosted
environment (e.g. tiger)
or on your own local installation of Jupyter and Python. You should use
Python 3.14 for your work. To use tiger, use the credentials you
received. If you work remotely, make sure to download the .ipynb file to
turn in. If you choose to work locally, miniforge, Anaconda, pixi or uv are probably the easiest ways
to install and manage Python. If you work locally, you may launch
Jupyter Lab either from the Navigator application (anaconda) or via the
command-line as jupyter-lab or
jupyter lab.
In this assignment, we will be working with data from Brazil’s Agência Nacional de Vigilância Sanitária about controlled medications. Because the original Jan. 2026 dataset is large, I have downloaded the data and created a smaller sampled version here, maintaining the original iso-8859-1 encoding. There are a number of fields in the data. I have used translation software to understand the fields using the “Documentação e Dicionário de Dados SNGPC - Industrializados” file and have listed those we will be using here:
SG_UF_VENDA: Abbreviation of state where medication was dispensed
NO_MUNICIPIO_VENDA: Municipality (city) where the medication was dispensed
DS_PRINCIPIO_ATIVO: Active ingredient(s); if there is more than one, the ingredients are separated by the + character
SG_SEXO: The sex of the patient receiving the medication (1=male, 2=female)
NU_IDADE: The age of the patient receiving the medication (can be in months or years, see next field)
NU_UNIDADE_IDADE: The unit of the age (1=years, 2=months)

The data is located here. You may use the following code to download this file:
from pathlib import Path
from urllib.request import urlretrieve

# download the data if we don't have it locally
url = "https://faculty.cs.niu.edu/~dakoop/cs503-2026sp/a4/brazil-medications.csv"
local_fname = "brazil-medications.csv"
if not Path(local_fname).exists():
    urlretrieve(url, local_fname)

You will read the data from that file, update some of the text, parse the updated data into a dictionary, calculate the number of times an ingredient was dispensed for each sex, and write a new output file in a similar format.
The assignment is due at 11:59pm on Thursday, March 5.
You should submit the completed notebook file required for this
assignment on Blackboard. The
filename of the notebook should be a4.ipynb. You
should not turn in the
medication-summary.txt file, as your notebook should
contain the code to create it.
Please make sure to follow instructions to receive full credit. Use a markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.
The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.
Either download the file and upload it to Jupyter, or use the provided code above to download the file. You should open the file and read all of the data into a list of strings, one string per line. Do not remove newline characters. Use the correct method to do this. Remember to set the encoding correctly.

Hint: The encoding keyword argument to the open function helps decode the file.

Write code to loop through the list of lines to find the set of all unique non-ASCII characters in the text. Note that you can check whether a character is in the ASCII subset by checking whether its integer value is less than 128 (recall ASCII mapped the first 128 characters to the values 0 through 127). Then, compute the subset of those that are non-alphabetic characters. Output both the set of non-ASCII characters and the subset of non-alphabetic, non-ASCII characters. You should find 12 non-ASCII characters, only one of which is non-alphabetic.
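As a rough illustration of the two steps above (reading with an explicit encoding, then testing code points), here is a minimal sketch. The sample text, the temporary file, and the variable names are all made up for the example and are not part of the real dataset.

```python
import tempfile
from pathlib import Path

# Hypothetical sample standing in for the real CSV (not taken from the dataset)
sample = "CIDADE;VALOR\nSÃO JOSÉ;1\nBELÉM\xa0NOVO;2\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.csv"  # hypothetical file name
    path.write_bytes(sample.encode("iso-8859-1"))

    # readlines() keeps the trailing newline on every line
    with open(path, encoding="iso-8859-1") as f:
        lines = f.readlines()

# A character is outside ASCII when its integer value is 128 or higher
non_ascii = {ch for line in lines for ch in line if ord(ch) >= 128}
# Of those, keep the ones that are not letters
non_alpha = {ch for ch in non_ascii if not ch.isalpha()}

print(sorted(non_ascii))
print(non_alpha)
```

Using sets means each character is reported once no matter how often it appears in the text.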
Hint: The ord function will be useful to obtain the integer value of a character.
Hint: The string.is... methods may be useful to find non-alphabetic characters.

If you print the single non-alphabetic, non-ASCII character, you will notice that it looks like a space. It is a non-breaking space. Create a new list of lines where all of the occurrences of this character are replaced with a standard space character.
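One possible way to do the replacement is a list comprehension over the lines, sketched below. The two sample lines are invented for illustration; "\xa0" is the escape for the non-breaking space (U+00A0).

```python
# Hypothetical lines; the first contains a non-breaking space
lines = ["BELÉM\xa0NOVO;2\n", "SÃO JOSÉ;1\n"]

NBSP = "\xa0"
# str.replace returns a new string, so the original list is left untouched
updated_lines = [line.replace(NBSP, " ") for line in lines]

print(updated_lines)
```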
You should have noticed that this file is in a variant of the
comma-separated values (CSV) format, but a semi-colon (;)
separates each field. We can still use the csv.DictReader
to parse this file but will have to specify a different delimiter. In
addition, because we have already read the file and converted the
non-breaking spaces, we need to have our list of strings look like a
file handle. io.StringIO
can help us with that. Assuming the output from Part 2 is in the
variable updated_lines, we can do

import io
new_f = io.StringIO(''.join(updated_lines))

and use new_f as we would a file handle. Now, use the
DictReader to read in a list of dictionaries representing
the data.
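The approach described above might look like the following sketch. The three sample lines are hypothetical (only the column names come from the field list in this assignment), and records is just an illustrative variable name.

```python
import csv
import io

# Hypothetical stand-in for the list of updated lines
updated_lines = [
    "SG_UF_VENDA;NO_MUNICIPIO_VENDA;SG_SEXO\n",
    "RN;NATAL;1\n",
    "SP;SANTOS;\n",
]

new_f = io.StringIO("".join(updated_lines))
# The first line supplies the field names; delimiter overrides the default comma
reader = csv.DictReader(new_f, delimiter=";")
records = list(reader)

print(records[0])
print(records[1])
```

Note that a missing value, such as the sex in the second record, comes back as an empty string rather than None.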
In the data represented as a list of dictionaries, we will update the
municipality (NO_MUNICIPIO_VENDA) and active ingredients
(DS_PRINCIPIO_ATIVO) fields. First, there may be more than
one active ingredient per entry; the documentation tells us each
ingredient is separated by a + character. Update the
DS_PRINCIPIO_ATIVO field to a list of
strings and remove any leading and trailing whitespace from each
ingredient.
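A minimal sketch of the split-and-strip step, using the ingredient string from the example entry in this assignment (the variable names are illustrative only):

```python
raw = "HIDROCORTISONA + SULFATO DE NEOMICINA + SULFATO DE POLIMIXINA B"

# Split on the + separator, then trim the spaces left around each piece
ingredients = [part.strip() for part in raw.split("+")]

print(ingredients)
```

A value with a single ingredient still produces a list, just with one element, so later code can treat every record the same way.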
All of the data is in uppercase. We wish to convert the municipality
and ingredient names to mixed case where the first letter of each word
is capitalized. However, we want a few words
["do", "de", "da", "dos", "das", "e"] to be
all lowercase. Make sure these are
words and not just parts of a word! You may do this
using string methods or in a single pass using a regular expression.
Since we converted all non-breaking spaces, you may assume that the
character on either side of these words is a normal space character. If
you use a regular expression, the \b word-boundary assertion is helpful and
more robust.
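One way the single-pass regular expression approach could be sketched is with re.sub and a replacement function. The helper name to_mixed_case and the constant LOWER_WORDS are hypothetical, and this is only one of several reasonable designs.

```python
import re

LOWER_WORDS = {"do", "de", "da", "dos", "das", "e"}

def to_mixed_case(text):
    """Capitalize each word, keeping the listed connecting words lowercase."""
    def fix(match):
        word = match.group(0).lower()
        return word if word in LOWER_WORDS else word.capitalize()
    # \b\w+\b matches whole words, so a connecting word is only lowercased
    # when it stands alone (the "DE" in "DESERTO" stays inside its word)
    return re.sub(r"\b\w+\b", fix, text)

print(to_mixed_case("SÃO JOSÉ DO SERIDÓ"))
print(to_mixed_case("SULFATO DE POLIMIXINA B"))
print(to_mixed_case("DESERTO"))
```

Because the pattern matches runs of word characters, the two halves of a hyphenated word are capitalized separately.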
After both of these steps, the entry

{'NU_ANO_VENDA': '2026',
 'NU_MES_VENDA': '01',
 'SG_UF_VENDA': 'RN',
 'NO_MUNICIPIO_VENDA': 'SÃO JOSÉ DO SERIDÓ',
 'DS_PRINCIPIO_ATIVO': 'HIDROCORTISONA + SULFATO DE NEOMICINA + SULFATO DE POLIMIXINA B',
 ...
}

should become

{'NU_ANO_VENDA': '2026',
 'NU_MES_VENDA': '01',
 'SG_UF_VENDA': 'RN',
 'NO_MUNICIPIO_VENDA': 'São José do Seridó',
 'DS_PRINCIPIO_ATIVO': ['Hidrocortisona',
                        'Sulfato de Neomicina',
                        'Sulfato de Polimixina B'],
 ...
}

Hint: DS_PRINCIPIO_ATIVO will be a list of strings, and the mixed-case conversion has to happen for each item.

To get a summary of the medications dispensed, we want to create
one or more data structures that store the total number of times an active
ingredient (DS_PRINCIPIO_ATIVO) has been dispensed, as well
as the known totals for male and female patients. For CSCI 503 students,
these aggregations should also be broken down by state
(SG_UF_VENDA). Information about a patient’s sex is stored
in the SG_SEXO attribute where '1' indicates
male and '2' indicates female. Note that some records do
not have a value for sex. Include these in the totals. You are free to
store this data as you wish, but you will use that structure to write
the output in Part 6.
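Since the choice of structure is open, the following is only one possible sketch: a nested defaultdict keyed by ingredient and then by state. The sample records are invented, the names counts and SEX_KEYS are hypothetical, and the sketch assumes a record with several ingredients counts once for each of them.

```python
from collections import defaultdict

# Hypothetical records in the shape produced by the earlier steps
records = [
    {"SG_UF_VENDA": "SP", "SG_SEXO": "1",
     "DS_PRINCIPIO_ATIVO": ["Hidrocortisona", "Sulfato de Neomicina"]},
    {"SG_UF_VENDA": "SP", "SG_SEXO": "2",
     "DS_PRINCIPIO_ATIVO": ["Hidrocortisona"]},
    {"SG_UF_VENDA": "BA", "SG_SEXO": "",
     "DS_PRINCIPIO_ATIVO": ["Hidrocortisona"]},
]

SEX_KEYS = {"1": "male", "2": "female"}

# counts[ingredient][state] -> {"total": n, "male": n, "female": n}
counts = defaultdict(
    lambda: defaultdict(lambda: {"total": 0, "male": 0, "female": 0})
)

for rec in records:
    for ingredient in rec["DS_PRINCIPIO_ATIVO"]:
        entry = counts[ingredient][rec["SG_UF_VENDA"]]
        entry["total"] += 1  # records without a sex still count toward the total
        key = SEX_KEYS.get(rec["SG_SEXO"])
        if key is not None:
            entry[key] += 1

print(dict(counts["Hidrocortisona"]))
```

Per-ingredient totals across all states can then be obtained by summing over the inner dictionaries.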
Write a new file named medication-summary.txt that
outputs the name of each medication, the total number of times the
ingredient was dispensed, the number of times it was dispensed to a male
patient, the number of times it was dispensed to a female patient, and
the signed difference between those counts. For CSCI
503 students, for each medication, there should then be lines for each
state with the same counts.
The file should be written in a fixed-width format where in each row, the medication is listed first and left-aligned, then the state (for CSCI 503), and then the numbers for total, male, female, and the difference (male-female). All the numbers should be right-aligned. The difference should show a plus or minus sign (ok for zero to have a sign). There should be at least one space between each column. You should calculate the number of characters used for each field based on the longest name or the number of digits in the maximum value of the numeric fields.
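The width calculation and alignment rules above can be sketched with f-string format specifications. This example leaves out the state column for brevity, uses two rows of numbers borrowed from the sample output in this assignment, and all variable names are illustrative.

```python
# (name, total, male, female)
rows = [
    ("Amoxicilina Tri-Hidratada", 4106, 1876, 2087),
    ("Ácido Retinóico", 1, 0, 0),
]

# Each column is as wide as its header or its widest value, whichever is larger
name_w = max(len("Ingredient"), max(len(r[0]) for r in rows))
tot_w = max(len("Tot"), len(str(max(r[1] for r in rows))))
m_w = max(len("M"), len(str(max(r[2] for r in rows))))
f_w = max(len("F"), len(str(max(r[3] for r in rows))))
diff_w = max(len("Diff"), max(len(f"{r[2] - r[3]:+d}") for r in rows))

lines = [
    f"{'Ingredient':<{name_w}} {'Tot':>{tot_w}} {'M':>{m_w}} {'F':>{f_w}} {'Diff':>{diff_w}}"
]
for name, total, male, female in rows:
    # "<" left-aligns, ">" right-aligns, and "+" forces a sign on the difference
    lines.append(
        f"{name:<{name_w}} {total:>{tot_w}} {male:>{m_w}} {female:>{f_w}} "
        f"{male - female:>+{diff_w}d}"
    )

print("\n".join(lines))
```

Nesting the computed width inside the format specification avoids hard-coding column sizes, so the layout adapts to whatever the data contains.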
The output file should look similar to the following example (CSCI 490 students will not have the lines for each state):
Ingredient                         St  Tot    M    F Diff
Amoxicilina Tri-Hidratada             4106 1876 2087 -211
                                   SP 1089  480  564  -84
                                   MG  625  273  322  -49
                                   RS  349  167  174   -7
                                   RJ  268  127  129   -2
                                   PR  245  126  109  +17
                                   SC  223  105  114   -9
                                   BA  179   87   88   -1
                                   GO  166   71   92  -21
...
Ácido Retinóico                          1    0    0   +0
                                   BA    1    0    0   +0
Hidrogenotartarato de Rivastigmina       1    0    0   +0
                                   SP    1    0    0   +0

Hint: Use print function calls to write a few data items to stdout first; then, when you are satisfied the format is correct, write all of the data items to a file.
Hint: The print function can also write to a file.
Hint: Use a with statement to make sure all data is written to the file (or make sure to call close).
Hint: Pass the appropriate mode to open to be able to write to a file.
Hint: You can check the beginning of the output file with the !head command:

!head medication-summary.txt
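
As a small sketch of the file-writing hints, the example below opens a file in write mode inside a with statement and sends each line to it through the file argument of print. The file name, the two sample rows, and the UTF-8 output encoding are assumptions made for the example, not requirements stated in this assignment.

```python
import tempfile
from pathlib import Path

rows = ["Ingredient Tot", "Exemplo A    3"]  # hypothetical pre-formatted lines

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "summary-demo.txt"  # hypothetical name, not the required file
    # Mode "w" opens the file for writing; the with block closes it on exit
    with open(out_path, "w", encoding="utf-8") as out:
        for row in rows:
            print(row, file=out)  # print adds the newline for us
    written = out_path.read_text(encoding="utf-8")

print(written, end="")
```

Swapping file=out for the default stdout is an easy way to preview the format before writing the full output.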