# Data Cleaning Courselet

**Authors:** 
 - Manvitha Kuncham (Z1959412@students.niu.edu)
 - Siddarth Vijayakumar Sivakala (Z2061678@students.niu.edu)
 - David Koop (dakoop@niu.edu)

**Last Updated:** 2025-10-06

This courselet provides information on using polars to do data cleaning. Data cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets. It is an important step in the data preparation process before analysis or modeling can be performed. It is a time-consuming process, especially for large datasets. However, the benefits of high-quality data are significant and can improve decision-making, increase efficiency, and reduce costs. Conversely, poor data quality can lead to inaccurate results, incorrect conclusions, and costly mistakes. Data cleaning involves various techniques such as handling missing values, removing duplicates, converting data types, and checking for consistency. These techniques help to ensure that the data is accurate, reliable, and consistent for analysis. This courselet uses a modified version of the [Kaggle Netflix dataset](https://www.kaggle.com/datasets/shivamb/netflix-shows) that is available [here](https://github.com/dakoop/fount-data-cleaning/raw/refs/heads/main/netflix-titles.csv.gz).

In [1]:
%config InteractiveShell.ast_node_interactivity = 'last_expr_or_assign'

In [1]:
import urllib.request
from pathlib import Path

fname = Path("netflix-titles.csv.gz")
url = "https://raw.githubusercontent.com/dakoop/fount-data-cleaning/refs/heads/main/netflix-titles.csv.gz"
if not fname.exists():
    urllib.request.urlretrieve(url, fname)

To begin, we should inspect the dataset and see how it looks.

In [2]:
import polars as pl

df = pl.read_csv("netflix-titles.csv.gz")

index,Unnamed: 1,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,str,str,str,i64,str,str,str,str,f64
0,,"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""","""September 25, 2021""",2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
1,,"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""","""September 24, 2021""",2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
2,,"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,"""September 24, 2021""",2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
3,,"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,"""September 24, 2021""",2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
4,,"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""","""September 24, 2021""",2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
9665,,"""s9665""","""Movie""","""InuYasha: The Movie 2: The Cas…","""Toshiya Shinohara""","""Kappei Yamaguchi, Satsuki Yuki…","""Japan""","""September 1, 2017""",2002,"""TV-14""","""99 min""","""Action & Adventure, Anime Feat…","""With their biggest foe seeming…",3766.591645
9666,,"""s9666""","""Movie""","""You're Everything To Me""","""Tolga Örnek""","""Tolga Çevik, Cengiz Bozkurt, M…","""Turkey""","""March 12, 2021""",2016,"""TV-PG""","""107 min""","""Comedies, Dramas, Independent …","""When an old fling shows up wit…",7064.946622
9667,,"""s9667""","""TV Show""","""Best.Worst.Weekend.Ever.""",,"""Sam Ashe Arnold, Cole Sand, Br…","""United States""","""October 19, 2018""",2018,"""TV-PG""","""1 Season""","""Kids' TV, TV Comedies""","""Teenage friends plan an epic t…",9192.518555
9668,,"""s9668""","""Movie""","""The Delivery Boy""","""Adekunle Nodash Adejuyigbe""","""Jamal Ibrahim, Jemima Osunde, …","""Nigeria""","""May 14, 2020""",2018,"""TV-14""","""67 min""","""International Movies, Thriller…","""A teen criminal and a young se…",2576.945027


### Dropping Unwanted Columns

Looking at the list of columns, we can see the `Unnamed: 1` column that seems to be filled with Null values. Let's verify this is the case.

In [4]:
df.null_count()

index,Unnamed: 1,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,9670,0,0,0,2900,911,923,11,0,5,3,0,0,1004


In [5]:
df.select(pl.col("Unnamed: 1").is_null().all())

Unnamed: 1
bool
True


Dropping unwanted columns is important to ensure that you are working with a clean and concise DataFrame that contains only the information needed for your analysis. A column that is entirely null, like this one, carries no information. For example, if you were trying to calculate summary statistics on the DataFrame, such columns would clutter the results and make the output harder to interpret. Additionally, these columns take up unnecessary space and make the DataFrame more difficult to work with. The `drop` method in polars removes columns.

In [6]:
df = df.drop("Unnamed: 1")

index,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,str,str,i64,str,str,str,str,f64
0,"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""","""September 25, 2021""",2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
1,"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""","""September 24, 2021""",2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
2,"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,"""September 24, 2021""",2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
3,"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,"""September 24, 2021""",2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
4,"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""","""September 24, 2021""",2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…,…
9665,"""s9665""","""Movie""","""InuYasha: The Movie 2: The Cas…","""Toshiya Shinohara""","""Kappei Yamaguchi, Satsuki Yuki…","""Japan""","""September 1, 2017""",2002,"""TV-14""","""99 min""","""Action & Adventure, Anime Feat…","""With their biggest foe seeming…",3766.591645
9666,"""s9666""","""Movie""","""You're Everything To Me""","""Tolga Örnek""","""Tolga Çevik, Cengiz Bozkurt, M…","""Turkey""","""March 12, 2021""",2016,"""TV-PG""","""107 min""","""Comedies, Dramas, Independent …","""When an old fling shows up wit…",7064.946622
9667,"""s9667""","""TV Show""","""Best.Worst.Weekend.Ever.""",,"""Sam Ashe Arnold, Cole Sand, Br…","""United States""","""October 19, 2018""",2018,"""TV-PG""","""1 Season""","""Kids' TV, TV Comedies""","""Teenage friends plan an epic t…",9192.518555
9668,"""s9668""","""Movie""","""The Delivery Boy""","""Adekunle Nodash Adejuyigbe""","""Jamal Ibrahim, Jemima Osunde, …","""Nigeria""","""May 14, 2020""",2018,"""TV-14""","""67 min""","""International Movies, Thriller…","""A teen criminal and a young se…",2576.945027


At this point, we might wish to begin analyzing the dataset. Data cleaning is usually not an isolated task totally completed before any analysis begins but is often part of the analysis process. Sometimes queries help surface issues with the dataset.

Let's look at all content added in 2021. We see that there is a `date_added` column and attempt to query it using the datetime (`dt`) accessor.

In [7]:
df.filter(pl.col("date_added").dt.year() == 2021)

ColumnNotFoundError: unable to find column "date_added"; valid columns: ["index", "show_id", "type", "title", "director", "cast", "country", "date_added ", "release_year", "rating", "duration", "listed_in", "description", "view count"]

Resolved plan until failure:

	---> FAILED HERE RESOLVING 'filter' <---
DF ["index", "show_id", "type", "title"]; PROJECT */14 COLUMNS; SELECTION: None

##### Exercise

The error message claims that the column is missing. How could this be? Fix this issue.

##### Solution

Let's look at the column names in the dataframe.

In [8]:
df.columns

['index',
 'show_id',
 'type',
 'title',
 'director',
 'cast',
 'country',
 'date_added ',
 'release_year',
 'rating',
 'duration',
 'listed_in',
 'description',
 'view count']

We see that `date_added` has a space at the end of it! Let's rename it so that we don't have to deal with this typo.

In [9]:
df = df.rename({"date_added ": "date_added"})

index,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,str,str,i64,str,str,str,str,f64
0,"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""","""September 25, 2021""",2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
1,"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""","""September 24, 2021""",2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
2,"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,"""September 24, 2021""",2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
3,"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,"""September 24, 2021""",2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
4,"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""","""September 24, 2021""",2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…,…
9665,"""s9665""","""Movie""","""InuYasha: The Movie 2: The Cas…","""Toshiya Shinohara""","""Kappei Yamaguchi, Satsuki Yuki…","""Japan""","""September 1, 2017""",2002,"""TV-14""","""99 min""","""Action & Adventure, Anime Feat…","""With their biggest foe seeming…",3766.591645
9666,"""s9666""","""Movie""","""You're Everything To Me""","""Tolga Örnek""","""Tolga Çevik, Cengiz Bozkurt, M…","""Turkey""","""March 12, 2021""",2016,"""TV-PG""","""107 min""","""Comedies, Dramas, Independent …","""When an old fling shows up wit…",7064.946622
9667,"""s9667""","""TV Show""","""Best.Worst.Weekend.Ever.""",,"""Sam Ashe Arnold, Cole Sand, Br…","""United States""","""October 19, 2018""",2018,"""TV-PG""","""1 Season""","""Kids' TV, TV Comedies""","""Teenage friends plan an epic t…",9192.518555
9668,"""s9668""","""Movie""","""The Delivery Boy""","""Adekunle Nodash Adejuyigbe""","""Jamal Ibrahim, Jemima Osunde, …","""Nigeria""","""May 14, 2020""",2018,"""TV-14""","""67 min""","""International Movies, Thriller…","""A teen criminal and a young se…",2576.945027


### Converting Data Types

Now, let's try our query to find 2021 additions again.

In [10]:
df.filter(pl.col("date_added").dt.year() == 2021)

InvalidOperationError: `year` operation not supported for dtype `str`

It still didn't work! This time, the error occurs because the `date_added` column is stored as strings (`str`), not as a date type, so polars cannot extract the year from it. To fix this, we can use the `str.to_date()` method to convert the `date_added` column to the `date` type.

In [11]:
df = df.with_columns(pl.col("date_added").str.to_date())

ComputeError: could not find an appropriate format to parse dates, please define a format

In some cases, polars is able to **infer** the format from the data, but in this case, it tells us it cannot. Thus, we need to refer to the [documentation](https://docs.rs/chrono/latest/chrono/format/strftime/index.html) to specify the format. In this case, we need the format string `"%B %d, %Y"`.

In [12]:
df = df.with_columns(pl.col("date_added").str.to_date("%B %d, %Y"))

index,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,str,date,i64,str,str,str,str,f64
0,"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
1,"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
2,"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
3,"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
4,"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…,…
9665,"""s9665""","""Movie""","""InuYasha: The Movie 2: The Cas…","""Toshiya Shinohara""","""Kappei Yamaguchi, Satsuki Yuki…","""Japan""",2017-09-01,2002,"""TV-14""","""99 min""","""Action & Adventure, Anime Feat…","""With their biggest foe seeming…",3766.591645
9666,"""s9666""","""Movie""","""You're Everything To Me""","""Tolga Örnek""","""Tolga Çevik, Cengiz Bozkurt, M…","""Turkey""",2021-03-12,2016,"""TV-PG""","""107 min""","""Comedies, Dramas, Independent …","""When an old fling shows up wit…",7064.946622
9667,"""s9667""","""TV Show""","""Best.Worst.Weekend.Ever.""",,"""Sam Ashe Arnold, Cole Sand, Br…","""United States""",2018-10-19,2018,"""TV-PG""","""1 Season""","""Kids' TV, TV Comedies""","""Teenage friends plan an epic t…",9192.518555
9668,"""s9668""","""Movie""","""The Delivery Boy""","""Adekunle Nodash Adejuyigbe""","""Jamal Ibrahim, Jemima Osunde, …","""Nigeria""",2020-05-14,2018,"""TV-14""","""67 min""","""International Movies, Thriller…","""A teen criminal and a young se…",2576.945027


Now, let's check our query to verify it finally works.

In [13]:
df.filter(pl.col("date_added").dt.year() == 2021)

index,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,str,date,i64,str,str,str,str,f64
0,"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
1,"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
2,"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
3,"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
4,"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…,…
9650,"""s9650""","""Movie""","""Dark Skies""","""Scott Stewart""","""Keri Russell, Josh Hamilton, J…","""United States""",2021-09-19,2013,"""PG-13""","""97 min""","""Horror Movies, Sci-Fi & Fantas…","""A family’s idyllic suburban li…",6364.32282
9655,"""s9655""","""TV Show""","""Two Sentence Horror Stories""",,"""Nicole Kang, Jim Parrack, Tara…","""United States""",2021-02-24,2021,"""TV-14""","""2 Seasons""","""TV Horror, Teen TV Shows""","""This anthology series of terro…",5257.704297
9660,"""s9660""","""Movie""","""Chhota Bheem Aur Hanuman""","""Rajiv Chilaka""","""Vatsal Dubey, Julie Tejwani, R…",,2021-07-22,2012,"""TV-Y7""","""68 min""","""Children & Family Movies""","""When two evil entities kidnap …",8812.140161
9662,"""s9662""","""TV Show""","""Kim's Convenience""",,"""Paul Sun-Hyung Lee, Jean Yoon,…","""Canada""",2021-07-06,2021,"""TV-MA""","""5 Seasons""","""International TV Shows, TV Com…","""While running a convenience st…",4343.603147


## Detecting Errors

Often, it is helpful to start by **detecting errors** in the dataset instead of triaging them as they arise. Even if we don't immediately fix them, it is important to know that these issues can affect our analyses. There are a number of techniques that may help here, depending on data types and known constraints. In some cases, we may know the possible values for a column and can use those to find values that don't belong. In other cases, we may be able to apply statistical techniques to identify issues.

### Columns with Known Domains

For the `rating` column, we can look at the [TV Parental Guidelines](https://en.wikipedia.org/wiki/TV_Parental_Guidelines) and [MPAA Movie Ratings](https://en.wikipedia.org/wiki/Motion_Picture_Association_film_rating_system) to identify the valid values. We can then use the `is_in` function to find values that are not in this list.

In [14]:
tv_ratings = ["TV-Y", "TV-Y7", "TV-G", "TV-PG", "TV-14", "TV-MA"]
movie_ratings = ["G", "PG", "PG-13", "R", "NC-17"]

df.filter(~pl.col("rating").is_in(tv_ratings + movie_ratings))["rating"].value_counts(
    sort=True
)

rating,count
str,u32
"""NR""",86
"""TV-Y7-FV""",8
"""UR""",3
"""74 min""",1
"""84 min""",1
"""66 min""",1


This left a few unmatched values. With a bit of research, we find that "UR" means "Unrated", "NR" means "Not Rated", and "TV-Y7-FV" means "TV-Y7-Fantasy Violence". All of these seem like valid ratings. We can add these to our list, leaving only three values outside of our domain.

In [15]:
tv_ratings = ["TV-Y", "TV-Y7", "TV-G", "TV-PG", "TV-14", "TV-MA", "TV-Y7-FV"]
movie_ratings = ["G", "PG", "PG-13", "R", "NC-17", "UR", "NR"]

df.filter(~pl.col("rating").is_in(tv_ratings + movie_ratings))

index,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,str,date,i64,str,str,str,str,f64
5541,"""s5542""","""Movie""","""Louis C.K. 2017""","""Louis C.K.""","""Louis C.K.""","""United States""",2017-04-04,2017,"""74 min""",,"""Movies""","""Louis C.K. muses on religion, …",5098.898959
5794,"""s5795""","""Movie""","""Louis C.K.: Hilarious""","""Louis C.K.""","""Louis C.K.""","""United States""",2016-09-16,2010,"""84 min""",,"""Movies""","""Emmy-winning comedy writer Lou…",5793.630643
5813,"""s5814""","""Movie""","""Louis C.K.: Live at the Comedy…","""Louis C.K.""","""Louis C.K.""","""United States""",2016-08-15,2015,"""66 min""",,"""Movies""","""The comic puts his trademark h…",9140.760807


In this case, we can see that the duration seems to have ended up in the wrong column. Later, we can fix this by moving these values to the correct column. For now, we can just note that these are issues that need to be fixed.

### Statistical Anomalies

For numeric columns, we can start by examining the statistical distribution of values. We can pull out all numeric columns using a selector, and then run `describe` to obtain statistics.

In [16]:
import polars.selectors as cs

df.select(cs.numeric()).describe()

statistic,index,release_year,view count
str,f64,f64,f64
"""count""",9670.0,9670.0,8666.0
"""null_count""",0.0,0.0,1004.0
"""mean""",4834.5,1978.196381,5473.269073
"""std""",2791.632885,262.597782,2630.588147
"""min""",0.0,71.0,1000.007723
"""25%""",2417.0,2013.0,3172.995137
"""50%""",4835.0,2017.0,5433.747701
"""75%""",7252.0,2019.0,7818.207527
"""max""",9669.0,2021.0,9998.971387


In this case, we see that `release_year` has a mean value that is very different from its median. In addition, the minimum value of 71 seems suspicious. The z-score measures how many standard deviations a value lies from the mean. Values whose z-scores have an absolute value greater than three are often outliers.

In [17]:
y_mean = df["release_year"].mean()
y_std = df["release_year"].std()

release_years = df.select("release_year").with_columns(
    ((pl.col("release_year") - y_mean) / y_std).alias("z_score")
)
release_years.filter(pl.col("z_score").abs() > 3).sort("release_year")

release_year,z_score
i64,f64
71,-7.262805
71,-7.262805
72,-7.258996
72,-7.258996
72,-7.258996
…,…
99,-7.156178
99,-7.156178
99,-7.156178
99,-7.156178


Here we can see that some of the release years are likely written as two-digit years ('71, '99) instead of 1971 or 1999. Again, we can fix these later if our analyses need this information.

### Detecting Duplicates

Another potential issue is duplicate items in the dataset. If we calculate statistics that include duplicates (e.g. the number of movies from each decade), we will get incorrect results. One way to start looking for duplicates is to call the `unique()` method.

In [18]:
uniq_df = df.unique()

index,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,str,date,i64,str,str,str,str,f64
5100,"""s5101""","""TV Show""","""Dave Chappelle: Equanimity & T…","""Stan Lathan""","""Dave Chappelle""","""United States""",2017-12-31,2017,"""TV-MA""","""1 Season""","""Stand-Up Comedy & Talk Shows, …","""Comedy titan Dave Chappelle ca…",4800.640911
303,"""s304""","""Movie""","""Esperando la carroza""","""Alejandro Doria""","""Luis Brandoni, China Zorrilla,…","""Argentina""",2021-08-05,85,"""TV-MA""","""95 min""","""Comedies, Cult Movies, Interna…","""Cora has three sons and a daug…",3015.552178
4555,"""s4556""","""Movie""","""The Meaning of Monty Python""","""Albert Sharpe""",,"""United Kingdom""",2018-10-02,2013,"""TV-MA""","""60 min""","""Documentaries""","""Five Pythons reflect on their …",8627.898873
6842,"""s6843""","""Movie""","""Get Smart""","""Peter Segal""","""Steve Carell, Anne Hathaway, D…","""United States""",2019-04-01,2008,"""PG-13""","""110 min""","""Action & Adventure, Comedies""","""When the identities of secret …",5599.648173
7510,"""s7511""","""Movie""","""Morris from America""","""Chad Hartigan""","""Markees Christmas, Craig Robin…","""Germany, United States""",2018-11-01,2016,"""R""","""91 min""","""Dramas, Independent Movies, In…","""When his father moves from the…",3343.639706
…,…,…,…,…,…,…,…,…,…,…,…,…,…
4522,"""s4523""","""Movie""","""22 July""","""Paul Greengrass""","""Anders Danielsen Lie, Jon Øiga…","""Norway, Iceland, United States""",2018-10-10,2018,"""R""","""144 min""","""Dramas, Thrillers""","""After devastating terror attac…",4044.202071
7931,"""s7932""","""Movie""","""Sanai Choughade""","""Rajeev Patil""","""Shreyas Talpade, Subodh Bhave,…","""India""",2018-01-01,2008,"""TV-14""","""122 min""","""Comedies, Dramas, Internationa…","""In an effort to honor the fami…",1432.101537
592,"""s593""","""Movie""","""She's Out of My League""","""Jim Field Smith""","""Jay Baruchel, Alice Eve, T.J. …","""United States""",2021-07-01,2010,"""R""","""106 min""","""Comedies, Romantic Movies""","""Kirk's a 5. His new girlfriend…",1558.662762
3806,"""s3807""","""TV Show""","""Prince of Peoria""",,"""Gavin Lewis, Theodore Barnes, …","""United States""",2019-05-20,2019,"""TV-G""","""2 Seasons""","""Kids' TV, TV Comedies""","""A prankster prince who wants t…",2676.123658


We can see that in this case, the number of rows stayed exactly the same, but the order changed. By default, `unique` does not maintain the original order of the rows (the index column was in order before). We can pass a `maintain_order` argument that will maintain the original order but may take a bit longer.

In [19]:
df.unique(maintain_order=True)

index,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,str,date,i64,str,str,str,str,f64
0,"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
1,"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
2,"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
3,"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
4,"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…,…
9665,"""s9665""","""Movie""","""InuYasha: The Movie 2: The Cas…","""Toshiya Shinohara""","""Kappei Yamaguchi, Satsuki Yuki…","""Japan""",2017-09-01,2002,"""TV-14""","""99 min""","""Action & Adventure, Anime Feat…","""With their biggest foe seeming…",3766.591645
9666,"""s9666""","""Movie""","""You're Everything To Me""","""Tolga Örnek""","""Tolga Çevik, Cengiz Bozkurt, M…","""Turkey""",2021-03-12,2016,"""TV-PG""","""107 min""","""Comedies, Dramas, Independent …","""When an old fling shows up wit…",7064.946622
9667,"""s9667""","""TV Show""","""Best.Worst.Weekend.Ever.""",,"""Sam Ashe Arnold, Cole Sand, Br…","""United States""",2018-10-19,2018,"""TV-PG""","""1 Season""","""Kids' TV, TV Comedies""","""Teenage friends plan an epic t…",9192.518555
9668,"""s9668""","""Movie""","""The Delivery Boy""","""Adekunle Nodash Adejuyigbe""","""Jamal Ibrahim, Jemima Osunde, …","""Nigeria""",2020-05-14,2018,"""TV-14""","""67 min""","""International Movies, Thriller…","""A teen criminal and a young se…",2576.945027


While it might be surprising that our items are all unique, this method requires that **all** columns match, and you might notice that the `index` column is always unique. Let's drop this column since it is unnecessary for our analysis.

In [20]:
df = df.drop("index")

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
str,str,str,str,str,str,date,i64,str,str,str,str,f64
"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…
"""s9665""","""Movie""","""InuYasha: The Movie 2: The Cas…","""Toshiya Shinohara""","""Kappei Yamaguchi, Satsuki Yuki…","""Japan""",2017-09-01,2002,"""TV-14""","""99 min""","""Action & Adventure, Anime Feat…","""With their biggest foe seeming…",3766.591645
"""s9666""","""Movie""","""You're Everything To Me""","""Tolga Örnek""","""Tolga Çevik, Cengiz Bozkurt, M…","""Turkey""",2021-03-12,2016,"""TV-PG""","""107 min""","""Comedies, Dramas, Independent …","""When an old fling shows up wit…",7064.946622
"""s9667""","""TV Show""","""Best.Worst.Weekend.Ever.""",,"""Sam Ashe Arnold, Cole Sand, Br…","""United States""",2018-10-19,2018,"""TV-PG""","""1 Season""","""Kids' TV, TV Comedies""","""Teenage friends plan an epic t…",9192.518555
"""s9668""","""Movie""","""The Delivery Boy""","""Adekunle Nodash Adejuyigbe""","""Jamal Ibrahim, Jemima Osunde, …","""Nigeria""",2020-05-14,2018,"""TV-14""","""67 min""","""International Movies, Thriller…","""A teen criminal and a young se…",2576.945027


In [21]:
df.unique(maintain_order=True)

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
str,str,str,str,str,str,date,i64,str,str,str,str,f64
"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…
"""s9665""","""Movie""","""InuYasha: The Movie 2: The Cas…","""Toshiya Shinohara""","""Kappei Yamaguchi, Satsuki Yuki…","""Japan""",2017-09-01,2002,"""TV-14""","""99 min""","""Action & Adventure, Anime Feat…","""With their biggest foe seeming…",3766.591645
"""s9666""","""Movie""","""You're Everything To Me""","""Tolga Örnek""","""Tolga Çevik, Cengiz Bozkurt, M…","""Turkey""",2021-03-12,2016,"""TV-PG""","""107 min""","""Comedies, Dramas, Independent …","""When an old fling shows up wit…",7064.946622
"""s9667""","""TV Show""","""Best.Worst.Weekend.Ever.""",,"""Sam Ashe Arnold, Cole Sand, Br…","""United States""",2018-10-19,2018,"""TV-PG""","""1 Season""","""Kids' TV, TV Comedies""","""Teenage friends plan an epic t…",9192.518555
"""s9668""","""Movie""","""The Delivery Boy""","""Adekunle Nodash Adejuyigbe""","""Jamal Ibrahim, Jemima Osunde, …","""Nigeria""",2020-05-14,2018,"""TV-14""","""67 min""","""International Movies, Thriller…","""A teen criminal and a young se…",2576.945027


Now, the number of rows in the unique dataframe has dropped from 9,670 to 9,142. However, we don't know if there are any shows that are the same but have different `show_id` values. We can check the `show_id` column for duplicates.

In [22]:
uniq_df = df.unique(subset="show_id", maintain_order=True)

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
str,str,str,str,str,str,date,i64,str,str,str,str,f64
"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…
"""s9665""","""Movie""","""InuYasha: The Movie 2: The Cas…","""Toshiya Shinohara""","""Kappei Yamaguchi, Satsuki Yuki…","""Japan""",2017-09-01,2002,"""TV-14""","""99 min""","""Action & Adventure, Anime Feat…","""With their biggest foe seeming…",3766.591645
"""s9666""","""Movie""","""You're Everything To Me""","""Tolga Örnek""","""Tolga Çevik, Cengiz Bozkurt, M…","""Turkey""",2021-03-12,2016,"""TV-PG""","""107 min""","""Comedies, Dramas, Independent …","""When an old fling shows up wit…",7064.946622
"""s9667""","""TV Show""","""Best.Worst.Weekend.Ever.""",,"""Sam Ashe Arnold, Cole Sand, Br…","""United States""",2018-10-19,2018,"""TV-PG""","""1 Season""","""Kids' TV, TV Comedies""","""Teenage friends plan an epic t…",9192.518555
"""s9668""","""Movie""","""The Delivery Boy""","""Adekunle Nodash Adejuyigbe""","""Jamal Ibrahim, Jemima Osunde, …","""Nigeria""",2020-05-14,2018,"""TV-14""","""67 min""","""International Movies, Thriller…","""A teen criminal and a young se…",2576.945027


This gives us the same number of rows, which means that whenever the show_id is the same, the rest of the row is also the same. It does not tell us, however, whether the same title appears under **different** show ids. To check that, we use the inverse subset: every column except `show_id`.

In [23]:
uniq_df.unique(pl.exclude("show_id"), maintain_order=True)

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
str,str,str,str,str,str,date,i64,str,str,str,str,f64
"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…
"""s9522""","""Movie""","""Benji's Very Own Christmas Sto…","""Joe Camp""","""Ron Moody, Patsy Garrett, Cynt…","""United States""",2018-10-03,1978,"""TV-G""","""25 min""","""Children & Family Movies""","""While on a press tour, Benji g…",3617.016158
"""s9560""","""Movie""","""Major Payne""","""Nick Castle""","""Damon Wayans, Karyn Parsons, W…","""United States""",2021-08-01,1995,"""PG-13""","""97 min""","""Comedies""","""A hardened Marine is given his…",3259.901162
"""s9585""","""Movie""","""Candyman""","""Bernard Rose""","""Virginia Madsen, Tony Todd, Xa…","""United States, United Kingdom""",2019-10-01,92,"""R""","""99 min""","""Cult Movies, Horror Movies""","""Grad student Helen Lyle uninte…",
"""s9593""","""Movie""","""Basic Instinct""","""Paul Verhoeven""","""Michael Douglas, Sharon Stone,…","""United States, France""",2020-10-01,92,"""R""","""128 min""","""Classic Movies, Thrillers""","""A detective investigating a ro…",5453.100902


This result has fewer rows, which suggests that some titles are listed twice, with different show_id values. Let's examine the duplicates. Here, we can create a struct from all the columns except show_id and use it to check for duplicates.

In [24]:
uniq_df.filter(pl.struct(pl.exclude("show_id")).is_duplicated()).sort("title")

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
str,str,str,str,str,str,date,i64,str,str,str,str,f64
"""s2141""","""TV Show""","""(Un)Well""",,,"""United States""",2020-08-12,2020,"""TV-MA""","""1 Season""","""Reality TV""","""This docuseries takes a deep d…",4672.449561
"""s9624""","""TV Show""","""(Un)Well""",,,"""United States""",2020-08-12,2020,"""TV-MA""","""1 Season""","""Reality TV""","""This docuseries takes a deep d…",4672.449561
"""s3809""","""TV Show""","""1994""","""Diego Enrique Osorno""",,"""Mexico""",2019-05-17,2019,"""TV-MA""","""1 Season""","""Crime TV Shows, Docuseries, In…","""Archival video and new intervi…",9873.145515
"""s9498""","""TV Show""","""1994""","""Diego Enrique Osorno""",,"""Mexico""",2019-05-17,2019,"""TV-MA""","""1 Season""","""Crime TV Shows, Docuseries, In…","""Archival video and new intervi…",9873.145515
"""s6012""","""Movie""","""30 Days of Luxury""","""Hani Hamdi""","""Taher Farouz, Sad Al-Saghir, A…","""Egypt""",2019-04-18,2016,"""TV-14""","""91 min""","""Comedies, International Movies""","""With the help of his friends, …",4998.741581
…,…,…,…,…,…,…,…,…,…,…,…,…
"""s9588""","""Movie""","""Yucatán""","""Daniel Monzón""","""Luis Tosar, Rodrigo de la Sern…","""Spain""",2019-02-15,2018,"""TV-MA""","""130 min""","""Comedies, International Movies""","""Competing con artists attempt …",6142.439604
"""s3157""","""Movie""","""Zero Hour""","""Robert O. Peters""","""Richard Mofe-Damijo, Alex Ekub…",,2019-12-13,2018,"""TV-MA""","""89 min""","""International Movies, Thriller…","""After his father passes, the h…",3087.248982
"""s9574""","""Movie""","""Zero Hour""","""Robert O. Peters""","""Richard Mofe-Damijo, Alex Ekub…",,2019-12-13,2018,"""TV-MA""","""89 min""","""International Movies, Thriller…","""After his father passes, the h…",3087.248982
"""s1774""","""TV Show""","""Zumbo's Just Desserts""",,"""Adriano Zumbo, Rachel Khoo""","""Australia""",2020-10-31,2019,"""TV-PG""","""1 Season""","""International TV Shows, Realit…","""Dessert wizard Adriano Zumbo l…",7108.050036


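The `release_year` issue may also be a problem here: some rows store a two-digit year (for example, 92 instead of 1992), so two copies of the same title will not be detected as duplicates if only one of them has the shortened year. Let's fix that now.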

##### Exercise

Convert any `release_year` values that are two-digit years instead of four-digit years.

##### Solution

In [25]:
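# Assumption: every two-digit year belongs to the 1900s (e.g. 92 -> 1992).
df = df.with_columns(
    pl.when(pl.col("release_year") < 100)
    .then(pl.col("release_year") + 1900)
    .otherwise(pl.col("release_year"))
    .alias("release_year")  # overwrite the original column
)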

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
str,str,str,str,str,str,date,i64,str,str,str,str,f64
"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…
"""s9665""","""Movie""","""InuYasha: The Movie 2: The Cas…","""Toshiya Shinohara""","""Kappei Yamaguchi, Satsuki Yuki…","""Japan""",2017-09-01,2002,"""TV-14""","""99 min""","""Action & Adventure, Anime Feat…","""With their biggest foe seeming…",3766.591645
"""s9666""","""Movie""","""You're Everything To Me""","""Tolga Örnek""","""Tolga Çevik, Cengiz Bozkurt, M…","""Turkey""",2021-03-12,2016,"""TV-PG""","""107 min""","""Comedies, Dramas, Independent …","""When an old fling shows up wit…",7064.946622
"""s9667""","""TV Show""","""Best.Worst.Weekend.Ever.""",,"""Sam Ashe Arnold, Cole Sand, Br…","""United States""",2018-10-19,2018,"""TV-PG""","""1 Season""","""Kids' TV, TV Comedies""","""Teenage friends plan an epic t…",9192.518555
"""s9668""","""Movie""","""The Delivery Boy""","""Adekunle Nodash Adejuyigbe""","""Jamal Ibrahim, Jemima Osunde, …","""Nigeria""",2020-05-14,2018,"""TV-14""","""67 min""","""International Movies, Thriller…","""A teen criminal and a young se…",2576.945027


### Removing Duplicates

Now, we can reexamine our duplicates, but first we have to recompute the unique rows, since `uniq_df` was built before the release year was fixed.

In [26]:
uniq_df = df.unique(maintain_order=True)

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
str,str,str,str,str,str,date,i64,str,str,str,str,f64
"""s1""","""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
"""s2""","""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
"""s3""","""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
"""s4""","""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
"""s5""","""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…
"""s9665""","""Movie""","""InuYasha: The Movie 2: The Cas…","""Toshiya Shinohara""","""Kappei Yamaguchi, Satsuki Yuki…","""Japan""",2017-09-01,2002,"""TV-14""","""99 min""","""Action & Adventure, Anime Feat…","""With their biggest foe seeming…",3766.591645
"""s9666""","""Movie""","""You're Everything To Me""","""Tolga Örnek""","""Tolga Çevik, Cengiz Bozkurt, M…","""Turkey""",2021-03-12,2016,"""TV-PG""","""107 min""","""Comedies, Dramas, Independent …","""When an old fling shows up wit…",7064.946622
"""s9667""","""TV Show""","""Best.Worst.Weekend.Ever.""",,"""Sam Ashe Arnold, Cole Sand, Br…","""United States""",2018-10-19,2018,"""TV-PG""","""1 Season""","""Kids' TV, TV Comedies""","""Teenage friends plan an epic t…",9192.518555
"""s9668""","""Movie""","""The Delivery Boy""","""Adekunle Nodash Adejuyigbe""","""Jamal Ibrahim, Jemima Osunde, …","""Nigeria""",2020-05-14,2018,"""TV-14""","""67 min""","""International Movies, Thriller…","""A teen criminal and a young se…",2576.945027


We also might be curious about which `show_id` values are duplicated, but if we sort by that column, we get an alphabetical sort with "s10" following "s1" instead of "s2". Let's convert this id to an integer instead. First, let's verify that each value starts with an "s".

In [27]:
uniq_df.filter(~pl.col("show_id").str.starts_with("s"))

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
str,str,str,str,str,str,date,i64,str,str,str,str,f64


The filter returns no rows, so every value has the prefix. Now, we can remove it and convert the rest to an integer using casting.

In [28]:
uniq_df = uniq_df.with_columns(
    pl.col("show_id").str.replace("s", "").cast(pl.Int64)
).sort("show_id")

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
1,"""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
2,"""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
3,"""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
4,"""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
5,"""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…
9665,"""Movie""","""InuYasha: The Movie 2: The Cas…","""Toshiya Shinohara""","""Kappei Yamaguchi, Satsuki Yuki…","""Japan""",2017-09-01,2002,"""TV-14""","""99 min""","""Action & Adventure, Anime Feat…","""With their biggest foe seeming…",3766.591645
9666,"""Movie""","""You're Everything To Me""","""Tolga Örnek""","""Tolga Çevik, Cengiz Bozkurt, M…","""Turkey""",2021-03-12,2016,"""TV-PG""","""107 min""","""Comedies, Dramas, Independent …","""When an old fling shows up wit…",7064.946622
9667,"""TV Show""","""Best.Worst.Weekend.Ever.""",,"""Sam Ashe Arnold, Cole Sand, Br…","""United States""",2018-10-19,2018,"""TV-PG""","""1 Season""","""Kids' TV, TV Comedies""","""Teenage friends plan an epic t…",9192.518555
9668,"""Movie""","""The Delivery Boy""","""Adekunle Nodash Adejuyigbe""","""Jamal Ibrahim, Jemima Osunde, …","""Nigeria""",2020-05-14,2018,"""TV-14""","""67 min""","""International Movies, Thriller…","""A teen criminal and a young se…",2576.945027


Let's look at the distribution of show_ids in the duplicates.

In [29]:
uniq_df.filter(pl.struct(pl.exclude("show_id")).is_duplicated())["show_id"].hist(
    bin_count=10
)

breakpoint,category,count
f64,cat,u32
969.6,"""(-6.666, 969.6]""",32
1936.2,"""(969.6, 1936.2]""",39
2902.8,"""(1936.2, 2902.8]""",34
3869.4,"""(2902.8, 3869.4]""",39
4836.0,"""(3869.4, 4836.0]""",35
5802.6,"""(4836.0, 5802.6]""",33
6769.2,"""(5802.6, 6769.2]""",44
7735.8,"""(6769.2, 7735.8]""",30
8702.4,"""(7735.8, 8702.4]""",48
9669.0,"""(8702.4, 9669.0]""",336


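Most of these (336 of 670) fall in the last bin, which covers show_id values from roughly 8,700 to 9,669, suggesting that the latest rows added are duplicates. If we remove duplicates but keep the first by show_id, we can see that many, if not all, of the removed duplicates have show_id values above 8806, the largest show_id that remains.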

In [30]:
uniq_df = uniq_df.sort("show_id").unique(
    subset=pl.exclude("show_id"), keep="first", maintain_order=True
)

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
1,"""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
2,"""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
3,"""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
4,"""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
5,"""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…
8803,"""Movie""","""Zodiac""","""David Fincher""","""Mark Ruffalo, Jake Gyllenhaal,…","""United States""",2019-11-20,2007,"""R""","""158 min""","""Cult Movies, Dramas, Thrillers""","""A political cartoonist, a crim…",3029.19655
8804,"""TV Show""","""Zombie Dumb""",,,,2019-07-01,2018,"""TV-Y7""","""2 Seasons""","""Kids' TV, Korean TV Shows, TV …","""While living alone in a spooky…",
8805,"""Movie""","""Zombieland""","""Ruben Fleischer""","""Jesse Eisenberg, Woody Harrels…","""United States""",2019-11-01,2009,"""R""","""88 min""","""Comedies, Horror Movies""","""Looking to survive in a world …",3877.856967
8806,"""Movie""","""Zoom""","""Peter Hewitt""","""Tim Allen, Courteney Cox, Chev…","""United States""",2020-01-11,2006,"""PG""","""88 min""","""Children & Family Movies, Come…","""Dragged from civilian life, a …",2554.09238


### Handling Missing Values

If we don't handle missing values, those rows remain in the dataframe and may cause issues when we analyze the data or perform operations on the affected column. For example, if you were calculating the viewership for movies directed by a particular director, rows with a missing `view count` would be left out of the calculation, which could give a misleading result.

#### Detecting Missing Values

Polars uses the `is_null()` method to detect if any values are missing. Let's see all missing values by using all columns (`pl.all()`). We can also count all of the values that are null by summing these booleans (which converts them to integers).

In [31]:
uniq_df.select(pl.all().is_null())

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool
false,false,false,false,true,false,false,false,false,false,false,false,true
false,false,false,true,false,false,false,false,false,false,false,false,false
false,false,false,false,false,true,false,false,false,false,false,false,false
false,false,false,true,true,true,false,false,false,false,false,false,false
false,false,false,true,false,false,false,false,false,false,false,false,false
…,…,…,…,…,…,…,…,…,…,…,…,…
false,false,false,false,false,false,false,false,false,false,false,false,false
false,false,false,true,true,true,false,false,false,false,false,false,true
false,false,false,false,false,false,false,false,false,false,false,false,false
false,false,false,false,false,false,false,false,false,false,false,false,false


In [32]:
uniq_df.select(pl.all().is_null().sum())

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,2634,825,831,10,0,4,3,0,0,921


Note that this result tells us which columns have null values, not which rows have null values. We can use horizontal operations to find rows with null values. In this case, `any_horizontal` finds every row that has a null value in any column.

In [33]:
uniq_df.filter(pl.any_horizontal(pl.all().is_null()))

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
1,"""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
2,"""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
3,"""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
4,"""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
5,"""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…
8797,"""TV Show""","""Yunus Emre""",,"""Gökhan Atalay, Payidar Tüfekçi…","""Turkey""",2017-01-17,2016,"""TV-PG""","""2 Seasons""","""International TV Shows, TV Dra…","""During the Mongol invasions, Y…",
8798,"""TV Show""","""Zak Storm""",,"""Michael Johnston, Jessica Gee-…","""United States, France, South K…",2018-09-13,2016,"""TV-Y7""","""3 Seasons""","""Kids' TV""","""Teen surfer Zak Storm is myste…",6227.299595
8801,"""TV Show""","""Zindagi Gulzar Hai""",,"""Sanam Saeed, Fawad Khan, Ayesh…","""Pakistan""",2016-12-15,2012,"""TV-PG""","""1 Season""","""International TV Shows, Romant…","""Strong-willed, middle-class Ka…",1080.009852
8804,"""TV Show""","""Zombie Dumb""",,,,2019-07-01,2018,"""TV-Y7""","""2 Seasons""","""Kids' TV, Korean TV Shows, TV …","""While living alone in a spooky…",


This shows that we have 4015 rows with some null values.

#### Removing Rows with Missing Data

We could choose to remove any row with missing data from the dataset.

In [34]:
uniq_df.filter(~pl.any_horizontal(pl.all().is_null()))

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
8,"""Movie""","""Sankofa""","""Haile Gerima""","""Kofi Ghanaba, Oyafunmike Ogunl…","""United States, Ghana, Burkina …",2021-09-24,1993,"""TV-MA""","""125 min""","""Dramas, Independent Movies, In…","""On a photo shoot in Ghana, an …",7834.118691
9,"""TV Show""","""The Great British Baking Show""","""Andy Devonshire""","""Mel Giedroyc, Sue Perkins, Mar…","""United Kingdom""",2021-09-24,2021,"""TV-14""","""9 Seasons""","""British TV Shows, Reality TV""","""A talented batch of amateur ba…",1921.94935
10,"""Movie""","""The Starling""","""Theodore Melfi""","""Melissa McCarthy, Chris O'Dowd…","""United States""",2021-09-24,2021,"""PG-13""","""104 min""","""Comedies, Dramas""","""A woman adjusting to life afte…",8955.964467
13,"""Movie""","""Je Suis Karl""","""Christian Schwochow""","""Luna Wedler, Jannis Niewöhner,…","""Germany, Czech Republic""",2021-09-23,2021,"""TV-MA""","""127 min""","""Dramas, International Movies""","""After most of her family is mu…",2757.058961
25,"""Movie""","""Jeans""","""S. Shankar""","""Prashanth, Aishwarya Rai Bachc…","""India""",2021-09-21,1998,"""TV-14""","""166 min""","""Comedies, International Movies…","""When the father of the man she…",5492.522311
…,…,…,…,…,…,…,…,…,…,…,…,…
8800,"""Movie""","""Zenda""","""Avadhoot Gupte""","""Santosh Juvekar, Siddharth Cha…","""India""",2018-02-15,2009,"""TV-14""","""120 min""","""Dramas, International Movies""","""A change in the leadership of …",4763.368397
8802,"""Movie""","""Zinzana""","""Majid Al Ansari""","""Ali Suliman, Saleh Bakri, Yasa…","""United Arab Emirates, Jordan""",2016-03-09,2015,"""TV-MA""","""96 min""","""Dramas, International Movies, …","""Recovering alcoholic Talal wak…",4586.137163
8803,"""Movie""","""Zodiac""","""David Fincher""","""Mark Ruffalo, Jake Gyllenhaal,…","""United States""",2019-11-20,2007,"""R""","""158 min""","""Cult Movies, Dramas, Thrillers""","""A political cartoonist, a crim…",3029.19655
8805,"""Movie""","""Zombieland""","""Ruben Fleischer""","""Jesse Eisenberg, Woody Harrels…","""United States""",2019-11-01,2009,"""R""","""88 min""","""Comedies, Horror Movies""","""Looking to survive in a world …",3877.856967


However, this removes nearly half of the data. Another approach would be to drop the rows only when certain columns are null. If we are most interested in analyzing viewership, and we have no viewership numbers for a show, we can remove that row.

In [35]:
updated_df = uniq_df.filter(~pl.col("view count").is_null())

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
2,"""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
3,"""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
4,"""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
5,"""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
6,"""TV Show""","""Midnight Mass""","""Mike Flanagan""","""Kate Siegel, Zach Gilford, Ham…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""TV Dramas, TV Horror, TV Myste…","""The arrival of a charismatic y…",1209.969778
…,…,…,…,…,…,…,…,…,…,…,…,…
8801,"""TV Show""","""Zindagi Gulzar Hai""",,"""Sanam Saeed, Fawad Khan, Ayesh…","""Pakistan""",2016-12-15,2012,"""TV-PG""","""1 Season""","""International TV Shows, Romant…","""Strong-willed, middle-class Ka…",1080.009852
8802,"""Movie""","""Zinzana""","""Majid Al Ansari""","""Ali Suliman, Saleh Bakri, Yasa…","""United Arab Emirates, Jordan""",2016-03-09,2015,"""TV-MA""","""96 min""","""Dramas, International Movies, …","""Recovering alcoholic Talal wak…",4586.137163
8803,"""Movie""","""Zodiac""","""David Fincher""","""Mark Ruffalo, Jake Gyllenhaal,…","""United States""",2019-11-20,2007,"""R""","""158 min""","""Cult Movies, Dramas, Thrillers""","""A political cartoonist, a crim…",3029.19655
8805,"""Movie""","""Zombieland""","""Ruben Fleischer""","""Jesse Eisenberg, Woody Harrels…","""United States""",2019-11-01,2009,"""R""","""88 min""","""Comedies, Horror Movies""","""Looking to survive in a world …",3877.856967


#### Imputing Missing Values

While dropping rows is one solution, we can also choose to fill in each null value with a non-null value using a particular strategy. In older datasets, especially in columns that stored integers, null values were often represented with a special value outside the column's domain. For view counts, for example, the dataset creator might use the value -999 to flag counts that are unknown.

In [36]:
view_counts = uniq_df.select(
    "show_id", "view count", pl.exclude("show_id", "view count")
)
view_count_ints = view_counts.with_columns(
    pl.col("view count").fill_null(-999).cast(pl.Int64),
)

show_id,view count,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
i64,i64,str,str,str,str,str,date,i64,str,str,str,str
1,-999,"""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…"
2,5672,"""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…"
3,8329,"""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…"
4,9694,"""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …"
5,7951,"""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …"
…,…,…,…,…,…,…,…,…,…,…,…,…
8803,3029,"""Movie""","""Zodiac""","""David Fincher""","""Mark Ruffalo, Jake Gyllenhaal,…","""United States""",2019-11-20,2007,"""R""","""158 min""","""Cult Movies, Dramas, Thrillers""","""A political cartoonist, a crim…"
8804,-999,"""TV Show""","""Zombie Dumb""",,,,2019-07-01,2018,"""TV-Y7""","""2 Seasons""","""Kids' TV, Korean TV Shows, TV …","""While living alone in a spooky…"
8805,3877,"""Movie""","""Zombieland""","""Ruben Fleischer""","""Jesse Eisenberg, Woody Harrels…","""United States""",2019-11-01,2009,"""R""","""88 min""","""Comedies, Horror Movies""","""Looking to survive in a world …"
8806,2554,"""Movie""","""Zoom""","""Peter Hewitt""","""Tim Allen, Courteney Cox, Chev…","""United States""",2020-01-11,2006,"""PG""","""88 min""","""Children & Family Movies, Come…","""Dragged from civilian life, a …"


The inverse operation, replacing the special values with null, is more useful when reading these older files. Note that Polars translates Python's `None` value to the null dataframe value.

In [37]:
view_count_ints.with_columns(pl.col("view count").replace(-999, None))

show_id,view count,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
i64,i64,str,str,str,str,str,date,i64,str,str,str,str
1,,"""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…"
2,5672,"""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…"
3,8329,"""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…"
4,9694,"""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …"
5,7951,"""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …"
…,…,…,…,…,…,…,…,…,…,…,…,…
8803,3029,"""Movie""","""Zodiac""","""David Fincher""","""Mark Ruffalo, Jake Gyllenhaal,…","""United States""",2019-11-20,2007,"""R""","""158 min""","""Cult Movies, Dramas, Thrillers""","""A political cartoonist, a crim…"
8804,,"""TV Show""","""Zombie Dumb""",,,,2019-07-01,2018,"""TV-Y7""","""2 Seasons""","""Kids' TV, Korean TV Shows, TV …","""While living alone in a spooky…"
8805,3877,"""Movie""","""Zombieland""","""Ruben Fleischer""","""Jesse Eisenberg, Woody Harrels…","""United States""",2019-11-01,2009,"""R""","""88 min""","""Comedies, Horror Movies""","""Looking to survive in a world …"
8806,2554,"""Movie""","""Zoom""","""Peter Hewitt""","""Tim Allen, Courteney Cox, Chev…","""United States""",2020-01-11,2006,"""PG""","""88 min""","""Children & Family Movies, Come…","""Dragged from civilian life, a …"


Another approach might use a statistical measure to fill in the missing values. For example, using the mean value may be a reasonable solution.

In [38]:
view_counts.with_columns(
    pl.col("view count").fill_null(pl.col("view count").mean()),
)

show_id,view count,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
i64,f64,str,str,str,str,str,date,i64,str,str,str,str
1,5475.391886,"""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…"
2,5672.976753,"""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…"
3,8329.431642,"""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…"
4,9694.184545,"""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …"
5,7951.019068,"""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …"
…,…,…,…,…,…,…,…,…,…,…,…,…
8803,3029.19655,"""Movie""","""Zodiac""","""David Fincher""","""Mark Ruffalo, Jake Gyllenhaal,…","""United States""",2019-11-20,2007,"""R""","""158 min""","""Cult Movies, Dramas, Thrillers""","""A political cartoonist, a crim…"
8804,5475.391886,"""TV Show""","""Zombie Dumb""",,,,2019-07-01,2018,"""TV-Y7""","""2 Seasons""","""Kids' TV, Korean TV Shows, TV …","""While living alone in a spooky…"
8805,3877.856967,"""Movie""","""Zombieland""","""Ruben Fleischer""","""Jesse Eisenberg, Woody Harrels…","""United States""",2019-11-01,2009,"""R""","""88 min""","""Comedies, Horror Movies""","""Looking to survive in a world …"
8806,2554.09238,"""Movie""","""Zoom""","""Peter Hewitt""","""Tim Allen, Courteney Cox, Chev…","""United States""",2020-01-11,2006,"""PG""","""88 min""","""Children & Family Movies, Come…","""Dragged from civilian life, a …"


The problem here is that a single global mean treats TV shows and movies alike across all genres, countries, ratings, and release years. We can instead estimate each missing view count from similar titles. This uses polars' `over` expression, which is similar to window functions in databases.

In [39]:
view_counts.with_columns(
    pl.when(pl.col("view count").is_null())
    .then(
        pl.col("view count").mean().over(["type", "country", "release_year", "rating"])
    )
    .otherwise(pl.col("view count"))
)

show_id,view count,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
i64,f64,str,str,str,str,str,date,i64,str,str,str,str
1,4726.325187,"""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…"
2,5672.976753,"""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…"
3,8329.431642,"""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…"
4,9694.184545,"""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …"
5,7951.019068,"""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …"
…,…,…,…,…,…,…,…,…,…,…,…,…
8803,3029.19655,"""Movie""","""Zodiac""","""David Fincher""","""Mark Ruffalo, Jake Gyllenhaal,…","""United States""",2019-11-20,2007,"""R""","""158 min""","""Cult Movies, Dramas, Thrillers""","""A political cartoonist, a crim…"
8804,5709.170534,"""TV Show""","""Zombie Dumb""",,,,2019-07-01,2018,"""TV-Y7""","""2 Seasons""","""Kids' TV, Korean TV Shows, TV …","""While living alone in a spooky…"
8805,3877.856967,"""Movie""","""Zombieland""","""Ruben Fleischer""","""Jesse Eisenberg, Woody Harrels…","""United States""",2019-11-01,2009,"""R""","""88 min""","""Comedies, Horror Movies""","""Looking to survive in a world …"
8806,2554.09238,"""Movie""","""Zoom""","""Peter Hewitt""","""Tim Allen, Courteney Cox, Chev…","""United States""",2020-01-11,2006,"""PG""","""88 min""","""Children & Family Movies, Come…","""Dragged from civilian life, a …"


The `date_added` column also has missing values. We notice, however, that this value seems to be somewhat correlated with `show_id`: the most recently added titles have `show_id` values closer to 0.

In [40]:
uniq_df.filter(pl.col("date_added").is_not_null()).sort("date_added")

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
5958,"""Movie""","""To and From New York""","""Sorin Dan Mihalcescu""","""Barbara King, Shaana Diya, Joh…","""United States""",2008-01-01,2006,"""TV-MA""","""81 min""","""Dramas, Independent Movies, Th…","""While covering a story in New …",1477.845408
6612,"""TV Show""","""Dinner for Five""",,,"""United States""",2008-02-04,2007,"""TV-MA""","""1 Season""","""Stand-Up Comedy & Talk Shows""","""In each episode, four celebrit…",8619.289384
5957,"""Movie""","""Just Another Love Story""","""Ole Bornedal""","""Anders W. Berthelsen, Rebecka …","""Denmark""",2009-05-05,2007,"""TV-MA""","""104 min""","""Dramas, International Movies""","""When he causes a car accident …",8147.491405
5956,"""Movie""","""Splatter""","""Joe Dante""","""Corey Feldman, Tony Todd, Tara…","""United States""",2009-11-18,2009,"""TV-MA""","""29 min""","""Horror Movies""","""After committing suicide, a wa…",4195.239183
7371,"""Movie""","""Mad Ron's Prevues from Hell""","""Jim Monaco""","""Nick Pawlow, Jordu Schell, Jay…","""United States""",2010-11-01,1987,"""NR""","""84 min""","""Cult Movies, Horror Movies""","""This collection cherry-picks t…",4735.769142
…,…,…,…,…,…,…,…,…,…,…,…,…
8,"""Movie""","""Sankofa""","""Haile Gerima""","""Kofi Ghanaba, Oyafunmike Ogunl…","""United States, Ghana, Burkina …",2021-09-24,1993,"""TV-MA""","""125 min""","""Dramas, Independent Movies, In…","""On a photo shoot in Ghana, an …",7834.118691
9,"""TV Show""","""The Great British Baking Show""","""Andy Devonshire""","""Mel Giedroyc, Sue Perkins, Mar…","""United Kingdom""",2021-09-24,2021,"""TV-14""","""9 Seasons""","""British TV Shows, Reality TV""","""A talented batch of amateur ba…",1921.94935
10,"""Movie""","""The Starling""","""Theodore Melfi""","""Melissa McCarthy, Chris O'Dowd…","""United States""",2021-09-24,2021,"""PG-13""","""104 min""","""Comedies, Dramas""","""A woman adjusting to life afte…",8955.964467
11,"""TV Show""","""Vendetta: Truth, Lies and The …",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, Docuseries, In…","""Sicily boasts a bold ""Anti-Maf…",9504.425582


Grouping by month and averaging `show_id` supports this: the mean `show_id` generally decreases as the month gets more recent.

In [41]:
uniq_df.group_by(pl.col("date_added").dt.truncate("1mo")).agg(
    pl.col("show_id").mean()
).sort("date_added").tail(20)

date_added,show_id
date,f64
2020-02-01,3321.692982
2020-03-01,3484.459854
2020-04-01,3347.062147
2020-05-01,3012.050955
2020-06-01,3016.762821
…,…
2021-05-01,891.5
2021-06-01,722.0
2021-07-01,490.0
2021-08-01,272.5


Thus, it may be reasonable to impute the missing `date_added` values with the values from rows that are adjacent to the missing value when sorted by `show_id`. polars has a `fill_null` method that supports different strategies including "forward" and "backward". Remember that we already sorted by `show_id`; otherwise, we would need to sort first.

In [42]:
uniq_df.filter(pl.col("date_added").is_null())

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
6067,"""TV Show""","""A Young Doctor's Notebook and …",,"""Daniel Radcliffe, Jon Hamm, Ad…","""United Kingdom""",,2013,"""TV-MA""","""2 Seasons""","""British TV Shows, TV Comedies,…","""Set during the Russian Revolut…",8357.142556
6175,"""TV Show""","""Anthony Bourdain: Parts Unknow…",,"""Anthony Bourdain""","""United States""",,2018,"""TV-PG""","""5 Seasons""","""Docuseries""","""This CNN original series has c…",5907.632893
6796,"""TV Show""","""Frasier""",,"""Kelsey Grammer, Jane Leeves, D…","""United States""",,2003,"""TV-PG""","""11 Seasons""","""Classic & Cult TV, TV Comedies""","""Frasier Crane is a snooty but …",1307.748154
6807,"""TV Show""","""Friends""",,"""Jennifer Aniston, Courteney Co…","""United States""",,2003,"""TV-14""","""10 Seasons""","""Classic & Cult TV, TV Comedies""","""This hit sitcom follows the me…",
6902,"""TV Show""","""Gunslinger Girl""",,"""Yuuka Nanri, Kanako Mitsuhashi…","""Japan""",,2008,"""TV-14""","""2 Seasons""","""Anime Series, Crime TV Shows""","""On the surface, the Social Wel…",9142.947848
7197,"""TV Show""","""Kikoriki""",,"""Igor Dmitriev""",,,2010,"""TV-Y""","""2 Seasons""","""Kids' TV""","""A wacky rabbit and his gang of…",3813.250843
7255,"""TV Show""","""La Familia P. Luche""",,"""Eugenio Derbez, Consuelo Duval…","""United States""",,2012,"""TV-14""","""3 Seasons""","""International TV Shows, Spanis…","""This irreverent sitcom featues…",1758.695289
7407,"""TV Show""","""Maron""",,"""Marc Maron, Judd Hirsch, Josh …","""United States""",,2016,"""TV-MA""","""4 Seasons""","""TV Comedies""","""Marc Maron stars as Marc Maron…",8228.971578
7848,"""TV Show""","""Red vs. Blue""",,"""Burnie Burns, Jason Saldaña, G…","""United States""",,2015,"""NR""","""13 Seasons""","""TV Action & Adventure, TV Come…","""This parody of first-person sh…",9727.349148
8183,"""TV Show""","""The Adventures of Figaro Pho""",,"""Luke Jurevicius, Craig Behenna…","""Australia""",,2015,"""TV-Y7""","""2 Seasons""","""Kids' TV, TV Comedies""","""Imagine your worst fears, then…",2812.301701


In [43]:
fill_df = uniq_df.filter((pl.col("show_id") >= 6065) & (pl.col("show_id") < 6070))

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
6065,"""Movie""","""A Week in Watts""","""Gregory Caruso""",,"""United States""",2018-02-19,2017,"""TV-14""","""91 min""","""Documentaries""","""Los Angeles police officers em…",1893.766991
6066,"""Movie""","""A Wrinkle in Time""","""Ava DuVernay""","""Storm Reid, Oprah Winfrey, Ree…","""United States""",2018-09-25,2018,"""PG""","""110 min""","""Children & Family Movies""","""Years after their father disap…",1247.992414
6067,"""TV Show""","""A Young Doctor's Notebook and …",,"""Daniel Radcliffe, Jon Hamm, Ad…","""United Kingdom""",,2013,"""TV-MA""","""2 Seasons""","""British TV Shows, TV Comedies,…","""Set during the Russian Revolut…",8357.142556
6068,"""TV Show""","""A.D. Kingdom and Empire""",,"""Juan Pablo Di Pace, Adam Levy,…","""United States""",2017-12-15,2015,"""TV-14""","""1 Season""","""TV Dramas""","""In the wake of Jesus Christ's …",8396.272773
6069,"""Movie""","""A.M.I.""","""Rusty Nixon""","""Debs Howard, Philip Granger, S…","""Canada""",2020-10-01,2019,"""TV-MA""","""77 min""","""Horror Movies""","""After losing her mother, a tee…",5454.846942


In [44]:
fill_df.with_columns(pl.col("date_added").fill_null(strategy="forward"))

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
6065,"""Movie""","""A Week in Watts""","""Gregory Caruso""",,"""United States""",2018-02-19,2017,"""TV-14""","""91 min""","""Documentaries""","""Los Angeles police officers em…",1893.766991
6066,"""Movie""","""A Wrinkle in Time""","""Ava DuVernay""","""Storm Reid, Oprah Winfrey, Ree…","""United States""",2018-09-25,2018,"""PG""","""110 min""","""Children & Family Movies""","""Years after their father disap…",1247.992414
6067,"""TV Show""","""A Young Doctor's Notebook and …",,"""Daniel Radcliffe, Jon Hamm, Ad…","""United Kingdom""",2018-09-25,2013,"""TV-MA""","""2 Seasons""","""British TV Shows, TV Comedies,…","""Set during the Russian Revolut…",8357.142556
6068,"""TV Show""","""A.D. Kingdom and Empire""",,"""Juan Pablo Di Pace, Adam Levy,…","""United States""",2017-12-15,2015,"""TV-14""","""1 Season""","""TV Dramas""","""In the wake of Jesus Christ's …",8396.272773
6069,"""Movie""","""A.M.I.""","""Rusty Nixon""","""Debs Howard, Philip Granger, S…","""Canada""",2020-10-01,2019,"""TV-MA""","""77 min""","""Horror Movies""","""After losing her mother, a tee…",5454.846942


In [45]:
fill_df.with_columns(pl.col("date_added").fill_null(strategy="backward"))

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
6065,"""Movie""","""A Week in Watts""","""Gregory Caruso""",,"""United States""",2018-02-19,2017,"""TV-14""","""91 min""","""Documentaries""","""Los Angeles police officers em…",1893.766991
6066,"""Movie""","""A Wrinkle in Time""","""Ava DuVernay""","""Storm Reid, Oprah Winfrey, Ree…","""United States""",2018-09-25,2018,"""PG""","""110 min""","""Children & Family Movies""","""Years after their father disap…",1247.992414
6067,"""TV Show""","""A Young Doctor's Notebook and …",,"""Daniel Radcliffe, Jon Hamm, Ad…","""United Kingdom""",2017-12-15,2013,"""TV-MA""","""2 Seasons""","""British TV Shows, TV Comedies,…","""Set during the Russian Revolut…",8357.142556
6068,"""TV Show""","""A.D. Kingdom and Empire""",,"""Juan Pablo Di Pace, Adam Levy,…","""United States""",2017-12-15,2015,"""TV-14""","""1 Season""","""TV Dramas""","""In the wake of Jesus Christ's …",8396.272773
6069,"""Movie""","""A.M.I.""","""Rusty Nixon""","""Debs Howard, Philip Granger, S…","""Canada""",2020-10-01,2019,"""TV-MA""","""77 min""","""Horror Movies""","""After losing her mother, a tee…",5454.846942


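We extract the rows around `show_id` 6067 to show the difference between these two strategies: "forward" copies the date from the previous row (2018-09-25), while "backward" copies it from the next row (2017-12-15).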

## Repairing Data

We have not yet fixed the issue with the `rating` field containing some duration values. Repairing this requires moving those values to the `duration` field and setting the `rating` field to null.

In [46]:
uniq_df.filter(pl.col("rating").str.ends_with("min"))

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
5542,"""Movie""","""Louis C.K. 2017""","""Louis C.K.""","""Louis C.K.""","""United States""",2017-04-04,2017,"""74 min""",,"""Movies""","""Louis C.K. muses on religion, …",5098.898959
5795,"""Movie""","""Louis C.K.: Hilarious""","""Louis C.K.""","""Louis C.K.""","""United States""",2016-09-16,2010,"""84 min""",,"""Movies""","""Emmy-winning comedy writer Lou…",5793.630643
5814,"""Movie""","""Louis C.K.: Live at the Comedy…","""Louis C.K.""","""Louis C.K.""","""United States""",2016-08-15,2015,"""66 min""",,"""Movies""","""The comic puts his trademark h…",9140.760807


In [47]:
repaired_data = uniq_df.with_columns(
    pl.when(pl.col("rating").str.ends_with("min"))
    .then(pl.lit(None))
    .otherwise(pl.col("rating"))
    .alias("rating"),
    pl.when(pl.col("rating").str.ends_with("min"))
    .then(pl.col("rating"))
    .otherwise(pl.col("duration"))
    .alias("duration"),
)

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
1,"""Movie""","""Dick Johnson Is Dead""","""Kirsten Johnson""",,"""United States""",2021-09-25,2020,"""PG-13""","""90 min""","""Documentaries""","""As her father nears the end of…",
2,"""TV Show""","""Blood & Water""",,"""Ama Qamata, Khosi Ngema, Gail …","""South Africa""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, TV Dra…","""After crossing paths at a part…",5672.976753
3,"""TV Show""","""Ganglands""","""Julien Leclercq""","""Sami Bouajila, Tracy Gotoas, S…",,2021-09-24,2021,"""TV-MA""","""1 Season""","""Crime TV Shows, International …","""To protect his family from a p…",8329.431642
4,"""TV Show""","""Jailbirds New Orleans""",,,,2021-09-24,2021,"""TV-MA""","""1 Season""","""Docuseries, Reality TV""","""Feuds, flirtations and toilet …",9694.184545
5,"""TV Show""","""Kota Factory""",,"""Mayur More, Jitendra Kumar, Ra…","""India""",2021-09-24,2021,"""TV-MA""","""2 Seasons""","""International TV Shows, Romant…","""In a city of coaching centers …",7951.019068
…,…,…,…,…,…,…,…,…,…,…,…,…
8803,"""Movie""","""Zodiac""","""David Fincher""","""Mark Ruffalo, Jake Gyllenhaal,…","""United States""",2019-11-20,2007,"""R""","""158 min""","""Cult Movies, Dramas, Thrillers""","""A political cartoonist, a crim…",3029.19655
8804,"""TV Show""","""Zombie Dumb""",,,,2019-07-01,2018,"""TV-Y7""","""2 Seasons""","""Kids' TV, Korean TV Shows, TV …","""While living alone in a spooky…",
8805,"""Movie""","""Zombieland""","""Ruben Fleischer""","""Jesse Eisenberg, Woody Harrels…","""United States""",2019-11-01,2009,"""R""","""88 min""","""Comedies, Horror Movies""","""Looking to survive in a world …",3877.856967
8806,"""Movie""","""Zoom""","""Peter Hewitt""","""Tim Allen, Courteney Cox, Chev…","""United States""",2020-01-11,2006,"""PG""","""88 min""","""Children & Family Movies, Come…","""Dragged from civilian life, a …",2554.09238


In [48]:
repaired_data.filter(pl.col("show_id").is_in([5542, 5795, 5814]))

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,view count
i64,str,str,str,str,str,date,i64,str,str,str,str,f64
5542,"""Movie""","""Louis C.K. 2017""","""Louis C.K.""","""Louis C.K.""","""United States""",2017-04-04,2017,,"""74 min""","""Movies""","""Louis C.K. muses on religion, …",5098.898959
5795,"""Movie""","""Louis C.K.: Hilarious""","""Louis C.K.""","""Louis C.K.""","""United States""",2016-09-16,2010,,"""84 min""","""Movies""","""Emmy-winning comedy writer Lou…",5793.630643
5814,"""Movie""","""Louis C.K.: Live at the Comedy…","""Louis C.K.""","""Louis C.K.""","""United States""",2016-08-15,2015,,"""66 min""","""Movies""","""The comic puts his trademark h…",9140.760807


More complex data repair often requires entity matching, where we detect that a value is likely the same but was changed in some way. For example, while English refers to the country of Brazil with a "z", Portuguese writes it with an "s". We will cover this topic more with data integration and data fusion.
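
As a small illustration of the idea, the following sketch uses Python's standard `difflib` module to map a spelling variant onto a list of known names. The list `KNOWN_COUNTRIES`, the helper `canonicalize`, and the 0.8 cutoff are assumptions made for this example; real entity matching typically needs more careful techniques.

```python
import difflib

# Hypothetical list of canonical country names.
KNOWN_COUNTRIES = ["Brazil", "India", "United States"]


def canonicalize(name, known=KNOWN_COUNTRIES, cutoff=0.8):
    """Return the closest known name, or the input unchanged if none is close."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else name


print(canonicalize("Brasil"))  # Brazil
print(canonicalize("Narnia"))  # Narnia (no close match, left unchanged)
```

A lower cutoff catches more variants but also raises the risk of merging values that are genuinely different.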

### Final Exercise

Impute the missing `duration` values with the average duration for the type of content (TV Show or Movie). Note that `duration` is currently a string, so this will take some data processing.