Cleaning Data#
Prerequisites

Outcomes

- Be able to use string methods to clean data that comes as a string
- Be able to drop missing data
- Use cleaning methods to prepare and analyze a real dataset

Data

Item information from about 3,000 Chipotle meals from about 1,800 Grubhub orders
# Uncomment following line to install on colab
#! pip install
import pandas as pd
import numpy as np
Cleaning Data#
For many data projects, a significant proportion of time is spent collecting and cleaning the data — not performing the analysis.
This non-analysis work is often called “data cleaning”.
pandas provides very powerful data cleaning tools, which we will demonstrate using the following dataset.
df = pd.DataFrame({"numbers": ["#23", "#24", "#18", "#14", "#12", "#10", "#35"],
"nums": ["23", "24", "18", "14", np.nan, "XYZ", "35"],
"colors": ["green", "red", "yellow", "orange", "purple", "blue", "pink"],
"other_column": [0, 1, 0, 2, 1, 0, 2]})
df
numbers | nums | colors | other_column | |
---|---|---|---|---|
0 | #23 | 23 | green | 0 |
1 | #24 | 24 | red | 1 |
2 | #18 | 18 | yellow | 0 |
3 | #14 | 14 | orange | 2 |
4 | #12 | NaN | purple | 1 |
5 | #10 | XYZ | blue | 0 |
6 | #35 | 35 | pink | 2 |
What would happen if we tried to compute the mean of the numbers column?
df["numbers"].mean()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/nanops.py:1630, in _ensure_numeric(x)
1629 try:
-> 1630 x = float(x)
1631 except (TypeError, ValueError):
1632 # e.g. "1+1j" or "foo"
ValueError: could not convert string to float: '#23#24#18#14#12#10#35'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/nanops.py:1634, in _ensure_numeric(x)
1633 try:
-> 1634 x = complex(x)
1635 except ValueError as err:
1636 # e.g. "foo"
ValueError: complex() arg is a malformed string
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 df["numbers"].mean()
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/generic.py:11847, in NDFrame._add_numeric_operations.<locals>.mean(self, axis, skipna, level, numeric_only, **kwargs)
11829 @doc(
11830 _num_doc,
11831 desc="Return the mean of the values over the requested axis.",
(...)
11845 **kwargs,
11846 ):
> 11847 return NDFrame.mean(self, axis, skipna, level, numeric_only, **kwargs)
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/generic.py:11401, in NDFrame.mean(self, axis, skipna, level, numeric_only, **kwargs)
11393 def mean(
11394 self,
11395 axis: Axis | None | lib.NoDefault = lib.no_default,
(...)
11399 **kwargs,
11400 ) -> Series | float:
> 11401 return self._stat_function(
11402 "mean", nanops.nanmean, axis, skipna, level, numeric_only, **kwargs
11403 )
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/generic.py:11353, in NDFrame._stat_function(self, name, func, axis, skipna, level, numeric_only, **kwargs)
11343 warnings.warn(
11344 "Using the level keyword in DataFrame and Series aggregations is "
11345 "deprecated and will be removed in a future version. Use groupby "
(...)
11348 stacklevel=find_stack_level(),
11349 )
11350 return self._agg_by_level(
11351 name, axis=axis, level=level, skipna=skipna, numeric_only=numeric_only
11352 )
> 11353 return self._reduce(
11354 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
11355 )
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/series.py:4816, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
4812 raise NotImplementedError(
4813 f"Series.{name} does not implement {kwd_name}."
4814 )
4815 with np.errstate(all="ignore"):
-> 4816 return op(delegate, skipna=skipna, **kwds)
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/nanops.py:93, in disallow.__call__.<locals>._f(*args, **kwargs)
91 try:
92 with np.errstate(invalid="ignore"):
---> 93 return f(*args, **kwargs)
94 except ValueError as e:
95 # we want to transform an object array
96 # ValueError message to the more typical TypeError
97 # e.g. this is normally a disallowed function on
98 # object arrays that contain strings
99 if is_object_dtype(args[0]):
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/nanops.py:155, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
153 result = alt(values, axis=axis, skipna=skipna, **kwds)
154 else:
--> 155 result = alt(values, axis=axis, skipna=skipna, **kwds)
157 return result
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/nanops.py:418, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
415 if datetimelike and mask is None:
416 mask = isna(values)
--> 418 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
420 if datetimelike:
421 result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/nanops.py:706, in nanmean(values, axis, skipna, mask)
703 dtype_count = dtype
705 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 706 the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
708 if axis is not None and getattr(the_sum, "ndim", False):
709 count = cast(np.ndarray, count)
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/nanops.py:1637, in _ensure_numeric(x)
1634 x = complex(x)
1635 except ValueError as err:
1636 # e.g. "foo"
-> 1637 raise TypeError(f"Could not convert {x} to numeric") from err
1638 return x
TypeError: Could not convert #23#24#18#14#12#10#35 to numeric
It throws an error!
Can you figure out why?
Hint
When looking at error messages, start at the very bottom.
The final error says: TypeError: Could not convert #23#24... to numeric.
Exercise
See exercise 1 in the exercise list.
String Methods#
Our solution to the previous exercise was to remove the # by using the replace string method: int(c2n.replace("#", "")).
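As a concrete illustration, the single-string version of this fix (using the same c2n variable as in exercise 1) looks like this:

```python
# Strip the "#" from a single string, then parse the result as an integer
c2n = "#39"
cleaned = int(c2n.replace("#", ""))  # replace returns a new string; int() parses it
print(cleaned)  # 39
```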
One way to make this change to every element of a column would be to loop through all elements of the column and apply the desired string methods…
%%time
# Iterate over all rows
for row in df.iterrows():
    # `iterrows` produces a tuple with two elements...
    # The first element is an index and the second is a Series with the data from that row
    index_value, column_values = row

    # Apply the string method
    clean_number = int(column_values["numbers"].replace("#", ""))

    # The `at` method is very similar to the `loc` method, but it is specialized
    # for accessing single elements at a time... We use it here to give
    # the loop the best chance against the faster method we show next.
    df.at[index_value, "numbers_loop"] = clean_number
CPU times: user 1.65 ms, sys: 153 µs, total: 1.8 ms
Wall time: 1.73 ms
While this is fast for a small dataset like ours, this approach becomes very slow for larger datasets.
One significantly faster (and easier) method is to apply a string method to an entire column of data.
Most methods that are available to a Python string (we learned a few of them in the strings lecture) are also available to a pandas Series that has dtype object.

We access them by doing s.str.method_name, where method_name is the name of the method.
When we apply the method to a Series, it is applied to all rows in the Series in one shot!
Let’s redo our previous example using a pandas .str method.
%%time
# ~2x faster than loop... However, speed gain increases with size of DataFrame. The
# speedup can be in the ballpark of ~100-500x faster for big DataFrames.
# See appendix at the end of the lecture for an application on a larger DataFrame
df["numbers_str"] = df["numbers"].str.replace("#", "")
CPU times: user 698 µs, sys: 65 µs, total: 763 µs
Wall time: 761 µs
We can use .str to access almost any string method that works on normal strings. (See the official documentation for more information.)
df["colors"].str.contains("p")
0 False
1 False
2 False
3 False
4 True
5 False
6 True
Name: colors, dtype: bool
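Because str.contains returns a boolean Series, it can be used directly to filter rows. A small sketch on a standalone toy Series (not part of the lecture's DataFrame):

```python
import pandas as pd

# A toy Series of color names
colors = pd.Series(["green", "red", "purple", "pink"], name="colors")

mask = colors.str.contains("p")  # boolean Series: which entries contain "p"
print(colors[mask].tolist())     # ['purple', 'pink']
```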
df["colors"].str.capitalize()
0 Green
1 Red
2 Yellow
3 Orange
4 Purple
5 Blue
6 Pink
Name: colors, dtype: object
Exercise
See exercise 2 in the exercise list.
Type Conversions#
In our example above, the dtype of the numbers_str column shows that pandas still treats it as a string even after we have removed the "#".

We need to convert this column to numbers.

The best way to do this is with the pd.to_numeric function.

This function attempts to convert whatever is stored in a Series into numeric values.

For example, now that the "#" has been removed, the values in the numbers_str column are ready to be converted to actual numbers.
df["numbers_numeric"] = pd.to_numeric(df["numbers_str"])
df.dtypes
numbers object
nums object
colors object
other_column int64
numbers_loop float64
numbers_str object
numbers_numeric int64
dtype: object
df.head()
numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric | |
---|---|---|---|---|---|---|---|
0 | #23 | 23 | green | 0 | 23.0 | 23 | 23 |
1 | #24 | 24 | red | 1 | 24.0 | 24 | 24 |
2 | #18 | 18 | yellow | 0 | 18.0 | 18 | 18 |
3 | #14 | 14 | orange | 2 | 14.0 | 14 | 14 |
4 | #12 | NaN | purple | 1 | 12.0 | 12 | 12 |
We can convert to other types as well.

Using the astype method, we can convert to any of the supported pandas dtypes (recall the intro lecture).

Below are some examples. (Pay attention to the reported dtype.)
df["numbers_numeric"].astype(str)
0 23
1 24
2 18
3 14
4 12
5 10
6 35
Name: numbers_numeric, dtype: object
df["numbers_numeric"].astype(float)
0 23.0
1 24.0
2 18.0
3 14.0
4 12.0
5 10.0
6 35.0
Name: numbers_numeric, dtype: float64
Exercise
See exercise 3 in the exercise list.
Missing Data#
Many datasets have missing data.
In our example, we are missing an element from the "nums" column.
df
numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric | |
---|---|---|---|---|---|---|---|
0 | #23 | 23 | green | 0 | 23.0 | 23 | 23 |
1 | #24 | 24 | red | 1 | 24.0 | 24 | 24 |
2 | #18 | 18 | yellow | 0 | 18.0 | 18 | 18 |
3 | #14 | 14 | orange | 2 | 14.0 | 14 | 14 |
4 | #12 | NaN | purple | 1 | 12.0 | 12 | 12 |
5 | #10 | XYZ | blue | 0 | 10.0 | 10 | 10 |
6 | #35 | 35 | pink | 2 | 35.0 | 35 | 35 |
We can find missing data by using the isnull method.
df.isnull()
numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric | |
---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False |
4 | False | True | False | False | False | False | False |
5 | False | False | False | False | False | False | False |
6 | False | False | False | False | False | False | False |
We might want to know whether particular rows or columns have any missing data.

To do this, we can use the .any method on the boolean DataFrame df.isnull().
df.isnull().any(axis=0)
numbers False
nums True
colors False
other_column False
numbers_loop False
numbers_str False
numbers_numeric False
dtype: bool
df.isnull().any(axis=1)
0 False
1 False
2 False
3 False
4 True
5 False
6 False
dtype: bool
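Relatedly, since True counts as 1 in arithmetic, chaining .sum() instead of .any() counts missing values rather than just flagging them. A small sketch on a hypothetical toy DataFrame:

```python
import pandas as pd
import numpy as np

# A toy frame with one missing value in each column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

print(toy.isnull().sum())        # number of missing values per column
print(toy.isnull().sum().sum())  # total missing values in the whole frame
```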
Many approaches have been developed to deal with missing data, but the two most commonly used (and the corresponding DataFrame method) are:
- Exclusion: Ignore any data that is missing (.dropna).
- Imputation: Compute “predicted” values for the data that is missing (.fillna).
For the advantages and disadvantages of these (and other) approaches, consider reading the Wikipedia article.
For now, let’s see some examples.
# drop all rows containing a missing observation
df.dropna()
numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric | |
---|---|---|---|---|---|---|---|
0 | #23 | 23 | green | 0 | 23.0 | 23 | 23 |
1 | #24 | 24 | red | 1 | 24.0 | 24 | 24 |
2 | #18 | 18 | yellow | 0 | 18.0 | 18 | 18 |
3 | #14 | 14 | orange | 2 | 14.0 | 14 | 14 |
5 | #10 | XYZ | blue | 0 | 10.0 | 10 | 10 |
6 | #35 | 35 | pink | 2 | 35.0 | 35 | 35 |
# fill the missing values with a specific value
df.fillna(value=100)
numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric | |
---|---|---|---|---|---|---|---|
0 | #23 | 23 | green | 0 | 23.0 | 23 | 23 |
1 | #24 | 24 | red | 1 | 24.0 | 24 | 24 |
2 | #18 | 18 | yellow | 0 | 18.0 | 18 | 18 |
3 | #14 | 14 | orange | 2 | 14.0 | 14 | 14 |
4 | #12 | 100 | purple | 1 | 12.0 | 12 | 12 |
5 | #10 | XYZ | blue | 0 | 10.0 | 10 | 10 |
6 | #35 | 35 | pink | 2 | 35.0 | 35 | 35 |
# use the _next_ valid observation to fill the missing data
df.fillna(method="bfill")
numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric | |
---|---|---|---|---|---|---|---|
0 | #23 | 23 | green | 0 | 23.0 | 23 | 23 |
1 | #24 | 24 | red | 1 | 24.0 | 24 | 24 |
2 | #18 | 18 | yellow | 0 | 18.0 | 18 | 18 |
3 | #14 | 14 | orange | 2 | 14.0 | 14 | 14 |
4 | #12 | XYZ | purple | 1 | 12.0 | 12 | 12 |
5 | #10 | XYZ | blue | 0 | 10.0 | 10 | 10 |
6 | #35 | 35 | pink | 2 | 35.0 | 35 | 35 |
# use the _previous_ valid observation to fill missing data
df.fillna(method="ffill")
numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric | |
---|---|---|---|---|---|---|---|
0 | #23 | 23 | green | 0 | 23.0 | 23 | 23 |
1 | #24 | 24 | red | 1 | 24.0 | 24 | 24 |
2 | #18 | 18 | yellow | 0 | 18.0 | 18 | 18 |
3 | #14 | 14 | orange | 2 | 14.0 | 14 | 14 |
4 | #12 | 14 | purple | 1 | 12.0 | 12 | 12 |
5 | #10 | XYZ | blue | 0 | 10.0 | 10 | 10 |
6 | #35 | 35 | pink | 2 | 35.0 | 35 | 35 |
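Beyond constants and neighboring observations, a common imputation strategy is to fill with a column statistic such as the mean. A minimal sketch on a standalone toy Series (not part of the lecture's example):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

# Fill the missing entry with the mean of the non-missing values (2.0)
filled = s.fillna(s.mean())
print(filled.tolist())  # [1.0, 2.0, 3.0]
```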
We will see more examples of dealing with missing data in future chapters.
Case Study#
We will now use data from an article written by The Upshot at the NYTimes.
This data has order information from almost 2,000 Chipotle orders and includes information on what was ordered and how much it cost.
url = "https://datascience.quantecon.org/assets/data/chipotle_raw.csv.zip"
chipotle = pd.read_csv(url)
chipotle.head()
Unnamed: 0 | order_id | quantity | item_name | choice_description | item_price | |
---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
1 | 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
2 | 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
3 | 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
4 | 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
Exercise
See exercise 4 in the exercise list.
Appendix: Performance of .str Methods#

Let’s repeat the “remove the #” example from above, but this time on a much larger dataset.
import numpy as np
test = pd.DataFrame({"floats": np.round(100*np.random.rand(100000), 2)})
test["strings"] = test["floats"].astype(str) + "%"
test.head()
floats | strings | |
---|---|---|
0 | 82.03 | 82.03% |
1 | 41.31 | 41.31% |
2 | 7.27 | 7.27% |
3 | 63.95 | 63.95% |
4 | 19.13 | 19.13% |
%%time
for row in test.iterrows():
index_value, column_values = row
clean_number = column_values["strings"].replace("%", "")
test.at[index_value, "numbers_loop"] = clean_number
CPU times: user 5.72 s, sys: 28 µs, total: 5.72 s
Wall time: 5.72 s
%%time
test["numbers_str_method"] = test["strings"].str.replace("%", "")
CPU times: user 23.9 ms, sys: 4 ms, total: 27.9 ms
Wall time: 28 ms
test["numbers_str_method"].equals(test["numbers_loop"])
True
We got the exact same result in a fraction of the time!
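One caveat: both approaches above leave the cleaned column with string values, so to do arithmetic you would still need the type conversion from earlier in the lecture. A minimal sketch with two made-up values:

```python
import pandas as pd

# Strings as they look after stripping the "%"
cleaned = pd.Series(["82.03", "41.31"])

# Convert to numbers before computing statistics
as_floats = pd.to_numeric(cleaned)  # or cleaned.astype(float)
print(as_floats.mean())  # ~61.67
```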
Exercises#
Exercise 1#
Convert the string below into a number.
c2n = "#39"
Exercise 2#
Make a new column called colors_upper that contains the elements of colors with all uppercase letters.
Exercise 3#
Convert the column "nums" to a numeric type using pd.to_numeric and save it to the DataFrame as "nums_tonumeric".

Notice that there is a missing value, and a value that is not a number.

Look at the documentation for pd.to_numeric and think about how to overcome this.

Think about why this could be a bad idea if used without knowing what your data looks like. (Think about what happens when you apply it to the "numbers" column before replacing the "#".)
Exercise 4#
We’d like you to use this data to answer the following questions.
What is the average price of an item with chicken?
What is the average price of an item with steak?
Did chicken or steak produce more revenue (total)?
How many missing items are there in this dataset? How many missing items in each column?
Hint
Before you will be able to do any of these things, you will need to make sure the item_price column has a numeric dtype (probably float).