Assignment 4

Goals

The goal of this assignment is to work with files, iterators, strings, and string formatting in Python.

Instructions

You will be doing your work in a Jupyter notebook for this assignment. You may choose to work on this assignment in a hosted environment (e.g. tiger) or on your own local installation of Jupyter and Python. You should use Python 3.14 for your work. To use tiger, use the credentials you received. If you work remotely, make sure to download the .ipynb file to turn in. If you choose to work locally, miniforge, Anaconda, pixi, or uv are probably the easiest ways to install and manage Python. If you work locally, you may launch Jupyter Lab either from the Navigator application (Anaconda) or via the command line as jupyter-lab or jupyter lab.

In this assignment, we will be working with data from Brazil’s Agência Nacional de Vigilância Sanitária about controlled medications. Because the original Jan. 2026 dataset is large, I have downloaded the data and created a smaller sampled version here, maintaining the original iso-8859-1 encoding. There are a number of fields in the data. I have used translation software to understand the fields using the “Documentação e Dicionário de Dados SNGPC - Industrializados” file and have listed those we will be using here:

  • SG_UF_VENDA: Abbreviation of state where medication was dispensed
  • NO_MUNICIPIO_VENDA: Municipality (city) where the medication was dispensed
  • DS_PRINCIPIO_ATIVO: Active ingredient(s); if there is more than one, the ingredients are separated by the + character
  • SG_SEXO: The sex of the patient receiving the medication (1=male, 2=female)
  • NU_IDADE: The age of the patient receiving the medication (can be in months or years, see next field)
  • NU_UNIDADE_IDADE: The unit of the age (1=years, 2=months)

The data is located here. You may use the following code to download this file:

from pathlib import Path
from urllib.request import urlretrieve

# download the data if we don't have it locally
url = "https://faculty.cs.niu.edu/~dakoop/cs503-2026sp/a4/brazil-medications.csv"
local_fname = "brazil-medications.csv"
if not Path(local_fname).exists():
    urlretrieve(url, local_fname)

You will read the data from that file, update some of the text, parse the updated data into a dictionary, calculate the number of times an ingredient was dispensed for each sex, and write a new output file in a similar format.

Due Date

The assignment is due at 11:59pm on Thursday, March 5.

Submission

You should submit the completed notebook file required for this assignment on Blackboard. The filename of the notebook should be a4.ipynb. You should not turn in the medication-summary.txt file, as your notebook should contain the code to create it.

Details

Please make sure to follow instructions to receive full credit. Use a markdown cell to label each part of the assignment with the number of the section you are completing. You may put the code for each part into one or more cells.

0. Name & Z-ID (5 pts)

The first cell of your notebook should be a markdown cell with a line for your name and a line for your Z-ID. If you wish to add other information (the assignment name, a description of the assignment), you may do so after these two lines.

1. Read Data (10 pts)

Either download the file and upload it to Jupyter, or use the provided code above to download the file. You should open the file and read all of the data into a list of strings, one string per line. Do not remove newline characters. Use the correct method to do this. Remember to set the encoding correctly.

Hints
  • You can use the encoding keyword argument to the open function to help decode the file.
  • Remember there are a few read methods that are available in python; there is one that does exactly what we need.
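As a rough illustration of the encoding hint, the sketch below writes a tiny made-up file in iso-8859-1 and reads it back as a list of lines. The file name sample.csv and its contents are hypothetical and exist only to keep the example self-contained; readlines is one method that keeps the newline characters, which is what this part asks for.

```python
import tempfile
from pathlib import Path

# Hypothetical two-line sample, encoded the same way as the assignment data.
sample_bytes = "SÃO PAULO;1\nBELÉM;2\n".encode("iso-8859-1")

with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = Path(tmpdir) / "sample.csv"  # made-up file name
    sample_path.write_bytes(sample_bytes)

    # The encoding keyword tells open how to decode the bytes on disk.
    with open(sample_path, encoding="iso-8859-1") as f:
        lines = f.readlines()  # one string per line, newlines kept

print(lines)
```

Opening the same file without the encoding argument would typically fall back to UTF-8 and fail to decode the accented characters.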

2. Non-ASCII Characters

a. Find All non-ASCII Characters (10 pts)

Write code to loop through the list of lines to find the set of all unique non-ASCII characters in the text. Note that you can check whether a character is in the ASCII subset by checking whether its integer value is less than 128 (recall that ASCII maps its 128 characters to the values 0 through 127). Then, compute the subset of those that are non-alphabetic characters. Output both the set of non-ASCII characters and the subset of non-alphabetic, non-ASCII characters. You should find 12 non-ASCII characters, only one of which is non-alphabetic.

Hints
  • Remember that you will need to iterate through each character in each line
  • The ord function will be useful to obtain the integer value of a character.
  • Python’s string.is... methods may be useful to find non-alphabetic characters.
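The hints above can be sketched on a small made-up list of lines (sample_lines is hypothetical, not the assignment data). This is one possible approach, using ord for the ASCII test and str.isalpha for the alphabetic test.

```python
# Hypothetical lines; "\u00a0" is a non-breaking space.
sample_lines = ["SÃO JOSÉ;10\u00a0mg\n", "ABC;20\n"]

non_ascii = set()
for line in sample_lines:
    for ch in line:
        # ASCII characters have integer values 0 through 127.
        if ord(ch) >= 128:
            non_ascii.add(ch)

# Keep only the non-ASCII characters that are not letters.
non_alpha = {ch for ch in non_ascii if not ch.isalpha()}

print(non_ascii)
print(non_alpha)
```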

b. Convert non-Alphabetic, non-ASCII Characters (5 pts)

If you print the single non-alphabetic, non-ASCII character, you will notice that it looks like a space. It is a non-breaking space. Create a new list of lines where all of the occurrences of this character are replaced with a standard space character.
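A minimal sketch of the replacement, again on hypothetical lines: str.replace returns a new string, so a list comprehension builds the new list without modifying the original.

```python
NBSP = "\u00a0"  # the non-breaking space character

# Hypothetical lines; only the first contains a non-breaking space.
sample_lines = ["ACIDO\u00a0X;1\n", "PLAIN;2\n"]

# Build a new list, leaving sample_lines untouched.
updated_sample = [line.replace(NBSP, " ") for line in sample_lines]

print(updated_sample)
```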

3. Parse DSV (10 pts)

You should have noticed that this file is in a variant of the comma-separated values (CSV) format, but a semi-colon (;) separates each field. We can still use the csv.DictReader to parse this file but will have to specify a different delimiter. In addition, because we have already read the file and converted the non-breaking spaces, we need to have our list of strings look like a file handle. io.StringIO can help us with that. Assuming the output from Part 2 is in the variable updated_lines, we can do

import io

new_f = io.StringIO(''.join(updated_lines))

and use new_f as we would a file handle. Now, use the DictReader to read in a list of dictionaries representing the data.
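As an illustration of the delimiter argument, the sketch below parses three made-up lines (a header and two rows) rather than the real file. The field names are taken from the list above, but the row values are hypothetical.

```python
import csv
import io

# Hypothetical header and rows in the same semicolon-separated layout.
sample_lines = [
    "SG_UF_VENDA;NO_MUNICIPIO_VENDA\n",
    "RN;NATAL\n",
    "SP;SÃO PAULO\n",
]

new_f = io.StringIO("".join(sample_lines))

# delimiter=";" overrides the default comma separator.
reader = csv.DictReader(new_f, delimiter=";")
records = list(reader)  # one dictionary per data row

print(records[0])
```

The first line supplies the dictionary keys, so records holds only the two data rows.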

4. Update Fields (20 pts)

In the data represented as a list of dictionaries, we will update the municipality (NO_MUNICIPIO_VENDA) and active ingredients (DS_PRINCIPIO_ATIVO) fields. First, there may be more than one active ingredient per entry; the documentation tells us each ingredient is separated by a + character. Update the DS_PRINCIPIO_ATIVO field to a list of strings and remove any leading and trailing whitespace from each ingredient.

All of the data is in uppercase. We wish to convert the municipality and ingredient names to mixed case where the first letter of each word is capitalized. However, we want a few words ["do", "de", "da", "dos", "das", "e"] to be all lowercase. Make sure these are whole words and not just parts of a word! You may do this using string methods or in a single pass using a regular expression. Since we converted all non-breaking spaces, you may assume that the character on either side of these words is a normal space character. If you use a regular expression, the \b word-boundary anchor is helpful and more robust.

After both of these steps, the entry

{'NU_ANO_VENDA': '2026',
 'NU_MES_VENDA': '01',
 'SG_UF_VENDA': 'RN',
 'NO_MUNICIPIO_VENDA': 'SÃO JOSÉ DO SERIDÓ',
 'DS_PRINCIPIO_ATIVO': 'HIDROCORTISONA + SULFATO DE NEOMICINA + SULFATO DE POLIMIXINA B',
 ...
}

should become

{'NU_ANO_VENDA': '2026',
 'NU_MES_VENDA': '01',
 'SG_UF_VENDA': 'RN',
 'NO_MUNICIPIO_VENDA': 'São José do Seridó',
 'DS_PRINCIPIO_ATIVO': ['Hidrocortisona',
  'Sulfato de Neomicina',
  'Sulfato de Polimixina B'],
 ...
}

Hints:
  • Remember that the DS_PRINCIPIO_ATIVO will be a list of strings and the mixed case conversion has to happen for each item
  • There is a string method that will break up strings based on a delimiter.
  • There is a string method that will do most of the specified mixed case conversion, leaving you to update the all-lowercase words.
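One way the two steps might fit together is sketched below, using the regular-expression route. The helper names mixed_case and split_ingredients are hypothetical, and this is only one of several workable approaches. Note that this sketch would also lowercase one of the listed words if it appeared as the first word of a name, which may or may not be what you want.

```python
import re

LOWER_WORDS = ["do", "de", "da", "dos", "das", "e"]

# str.title() produces "Do", "De", ..., so match those forms as whole words.
# \b ensures "De" inside "Desterro" is not matched.
pattern = re.compile(r"\b(" + "|".join(w.title() for w in LOWER_WORDS) + r")\b")


def mixed_case(text):
    """Capitalize each word, then lowercase the listed connecting words."""
    return pattern.sub(lambda m: m.group(0).lower(), text.title())


def split_ingredients(raw):
    """Split on '+', strip whitespace, and convert each ingredient's case."""
    return [mixed_case(part.strip()) for part in raw.split("+")]


print(mixed_case("SÃO JOSÉ DO SERIDÓ"))
print(split_ingredients("HIDROCORTISONA + SULFATO DE NEOMICINA"))
```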

5. Aggregate Active Ingredient Counts (10/15 pts)

To get a summary of the medications dispensed, we want to create one or more data structures that store the total number of times an active ingredient (DS_PRINCIPIO_ATIVO) has been dispensed, as well as the known totals for male and female patients. For CSCI 503 students, these aggregations should also be broken down by state (SG_UF_VENDA). Information about a patient’s sex is stored in the SG_SEXO attribute where '1' indicates male and '2' indicates female. Note that some records do not have a value for sex. Include these records in the totals, even though they add to neither the male nor the female count. You are free to store this data as you wish, but you will use that structure to write the output in Part 6.

Hints:
  • You will likely use a nested data structure. Remember that if you have nested dictionaries, you will need to create a dictionary whenever one does not already exist.
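The nested-dictionary hint can be sketched as follows. The records, the key names "total", "male", and "female", and the choice to count each ingredient of a multi-ingredient record separately are all assumptions made for illustration; you are free to structure the data differently.

```python
# Hypothetical records, already in the shape produced by Part 4.
sample_records = [
    {"DS_PRINCIPIO_ATIVO": ["Amoxicilina"], "SG_UF_VENDA": "SP", "SG_SEXO": "1"},
    {"DS_PRINCIPIO_ATIVO": ["Amoxicilina", "Clavulanato"], "SG_UF_VENDA": "SP", "SG_SEXO": "2"},
    {"DS_PRINCIPIO_ATIVO": ["Amoxicilina"], "SG_UF_VENDA": "MG", "SG_SEXO": ""},
]

SEX_KEYS = {"1": "male", "2": "female"}  # other values count toward total only

counts = {}  # ingredient -> state -> {"total", "male", "female"}
for rec in sample_records:
    sex_key = SEX_KEYS.get(rec["SG_SEXO"])  # None when sex is missing
    for ingredient in rec["DS_PRINCIPIO_ATIVO"]:
        # setdefault creates the inner dictionary only if it does not exist yet.
        by_state = counts.setdefault(ingredient, {})
        entry = by_state.setdefault(
            rec["SG_UF_VENDA"], {"total": 0, "male": 0, "female": 0}
        )
        entry["total"] += 1
        if sex_key is not None:
            entry[sex_key] += 1

print(counts["Amoxicilina"])
```

Per-ingredient totals across all states can then be obtained by summing the per-state entries.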

6. Write Output (20/25 pts)

Write a new file named medication-summary.txt that outputs the name of each medication, the total number of times the ingredient was dispensed, the number of times it was dispensed to a male patient, the number of times it was dispensed to a female patient, and the signed difference between those counts. For CSCI 503 students, for each medication, there should then be lines for each state with the same counts.

The file should be written in a fixed-width format where in each row, the medication is listed first and left-aligned, then the state (for CSCI 503), and then the numbers for total, male, female, and the difference (male-female). All the numbers should be right-aligned. The difference should show a plus or minus sign (ok for zero to have a sign). There should be at least one space between each column. You should calculate the number of characters used for each field based on the longest name or the number of digits in the maximum value of the numeric fields.

The output file should look similar to the following example (CSCI 490 students will not have the lines for each state):

Ingredient                                          St  Tot    M    F  Diff
Amoxicilina Tri-Hidratada                              4106 1876 2087  -211
                                                    SP 1089  480  564   -84
                                                    MG  625  273  322   -49
                                                    RS  349  167  174    -7
                                                    RJ  268  127  129    -2
                                                    PR  245  126  109   +17
                                                    SC  223  105  114    -9
                                                    BA  179   87   88    -1
                                                    GO  166   71   92   -21
...
Ácido Retinóico                                           1    0    0    +0
                                                    BA    1    0    0    +0
Hidrogenotartarato de Rivastigmina                        1    0    0    +0
                                                    SP    1    0    0    +0

Hints
  • Use print function calls to write a few data items to stdout first, then when you are satisfied the format is correct, write all of the data items to a file.
  • You can use the file keyword argument with the print function.
  • Use a with statement to make sure all data is written to the file (or make sure to call close).
  • Remember to pass the write flag to open to be able to write to a file.
  • Consult the Format Specification Mini-Language for the various flags (alignment, sign, and width).
  • You can see the first few lines of the file you wrote in the notebook using the !head command: !head medication-summary.txt
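A minimal sketch of the fixed-width formatting, without the state column, is shown below. The two rows reuse numbers from the example output above; the variable names are hypothetical, and an io.StringIO object stands in for the open output file so the example does not write to disk.

```python
import io

# (ingredient, total, male, female), taken from the example output above.
rows = [
    ("Amoxicilina Tri-Hidratada", 4106, 1876, 2087),
    ("Ácido Retinóico", 1, 0, 0),
]

# Column widths come from the longest name and the widest number or header.
name_w = max(len("Ingredient"), max(len(r[0]) for r in rows))
num_w = max(len("Tot"), max(len(str(r[1])) for r in rows))
diff_w = max(len("Diff"), max(len(f"{r[2] - r[3]:+d}") for r in rows))

lines = [
    f"{'Ingredient':<{name_w}} {'Tot':>{num_w}} {'M':>{num_w}} "
    f"{'F':>{num_w}} {'Diff':>{diff_w}}"
]
for name, total, male, female in rows:
    # "<" left-aligns, ">" right-aligns, and "+" forces a sign on the difference.
    lines.append(
        f"{name:<{name_w}} {total:>{num_w}d} {male:>{num_w}d} "
        f"{female:>{num_w}d} {male - female:>+{diff_w}d}"
    )

out = io.StringIO()  # stand-in for a file opened with the write flag
for line in lines:
    print(line, file=out)

print(out.getvalue(), end="")
```

Nesting the width inside the format specification, as in {name:<{name_w}}, lets the widths be computed from the data instead of being hard-coded.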

Extra Credit

  • CSCI 490 students may complete the CSCI 503 work for extra credit (10 pts)
  • (5 pts) Sort all values by decreasing totals
  • (10 pts) Create a second file that calculates the number of medications per municipality (total, not broken down by ingredients). For each municipality, show its name, state, and the total medications with breakdowns for male, female, and difference.