QuantEcon DataScience

Introduction to Economic Modeling and Data Science

Cleaning Data

Prerequisites

Outcomes

  • Be able to use string methods to clean data that comes as a string
  • Be able to drop missing data
  • Use cleaning methods to prepare and analyze a real dataset

Data

  • Item information from about 3,000 Chipotle meals from about 1,800 Grubhub orders
In [1]:
# Uncomment following line to install on colab
#! pip install qeds
In [2]:
import pandas as pd
import numpy as np
import qeds

Cleaning Data

For many data projects, a significant proportion of time is spent collecting and cleaning the data — not performing the analysis.

This non-analysis work is often called “data cleaning”.

pandas provides very powerful data cleaning tools, which we will demonstrate using the following dataset.

In [3]:
df = pd.DataFrame({"numbers": ["#23", "#24", "#18", "#14", "#12", "#10", "#35"],
                   "nums": ["23", "24", "18", "14", np.nan, "XYZ", "35"],
                   "colors": ["green", "red", "yellow", "orange", "purple", "blue", "pink"],
                   "other_column": [0, 1, 0, 2, 1, 0, 2]})
df
Out[3]:
numbers nums colors other_column
0 #23 23 green 0
1 #24 24 red 1
2 #18 18 yellow 0
3 #14 14 orange 2
4 #12 NaN purple 1
5 #10 XYZ blue 0
6 #35 35 pink 2

What would happen if we tried to compute the mean of the "numbers" column?

df["numbers"].mean()

It throws an error!

Can you figure out why?

Hint: When looking at error messages, start at the very bottom.

The final error says, TypeError: Could not convert #23#24... to numeric.
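The failure can be reproduced on a throwaway Series (a minimal sketch, not part of the lecture's df): when asked for the mean of object-dtype strings, pandas effectively concatenates them while summing and then cannot convert the result to a number.

```python
import pandas as pd

s = pd.Series(["#23", "#24", "#18"])  # same kind of data as the "numbers" column

try:
    s.mean()
    mean_failed = False
except TypeError as e:
    # summing object-dtype strings concatenates them ("#23#24#18"),
    # which cannot then be converted to a numeric value
    mean_failed = True
    print("mean failed:", e)
```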

See exercise 1 in the exercise list

String Methods

Our solution to the previous exercise was to remove the # by using the replace string method: int(c2n.replace("#", "")).

One way to make this change to every element of a column would be to loop through all elements of the column and apply the desired string methods…

In [4]:
%%time

# Iterate over all rows
for row in df.iterrows():

    # `iterrows` method produces a tuple with two elements...
    # The first element is an index and the second is a Series with the data from that row
    index_value, column_values = row

    # Apply string method
    clean_number = int(column_values["numbers"].replace("#", ""))

    # The `at` method is very similar to the `loc` method, but it is specialized
    # for accessing single elements at a time... We wanted to use it here to give
    # the loop the best chance to beat a faster method which we show you next.
    df.at[index_value, "numbers_loop"] = clean_number
CPU times: user 2.81 ms, sys: 0 ns, total: 2.81 ms
Wall time: 2.6 ms

While this is fast for a small dataset like ours, this approach slows considerably as the dataset grows.

One significantly faster (and easier) method is to apply a string method to an entire column of data.

Most methods that are available to a Python string (we learned a few of them in the strings lecture) are also available to a pandas Series that has dtype object.

We access them by doing s.str.method_name where method_name is the name of the method.

When we apply the method to a Series, it is applied to all rows in the Series in one shot!

Let’s redo our previous example using a pandas .str method.

In [5]:
%%time

# ~2x faster than loop... However, speed gain increases with size of DataFrame. The
# speedup can be in the ballpark of ~100-500x faster for big DataFrames.
# See appendix at the end of the lecture for an application on a larger DataFrame
df["numbers_str"] = df["numbers"].str.replace("#", "")
CPU times: user 734 µs, sys: 117 µs, total: 851 µs
Wall time: 784 µs

We can use .str to access almost any string method that works on normal strings. (See the official documentation for more information.)

In [6]:
df["colors"].str.contains("p")
Out[6]:
0    False
1    False
2    False
3    False
4     True
5    False
6     True
Name: colors, dtype: bool
In [7]:
df["colors"].str.capitalize()
Out[7]:
0     Green
1       Red
2    Yellow
3    Orange
4    Purple
5      Blue
6      Pink
Name: colors, dtype: object
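The same pattern extends to other string methods. As one more sketch (on a standalone Series mirroring the colors column), .str.len returns the length of each string:

```python
import pandas as pd

colors = pd.Series(["green", "red", "yellow", "orange", "purple", "blue", "pink"])

# number of characters in each element
lengths = colors.str.len()
print(lengths)
```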

See exercise 2 in the exercise list

Type Conversions

In our example above, the dtype of the numbers_str column shows that pandas still treats it as a string even after we have removed the "#".

We need to convert this column to numbers.

The best way to do this is using the pd.to_numeric function.

This function attempts to convert whatever is stored in a Series into numeric values.

For example, now that the "#" has been removed, the values in the "numbers_str" column are ready to be converted to actual numbers.

In [8]:
df["numbers_numeric"] = pd.to_numeric(df["numbers_str"])
In [9]:
df.dtypes
Out[9]:
numbers             object
nums                object
colors              object
other_column         int64
numbers_loop       float64
numbers_str         object
numbers_numeric      int64
dtype: object
In [10]:
df.head()
Out[10]:
numbers nums colors other_column numbers_loop numbers_str numbers_numeric
0 #23 23 green 0 23.0 23 23
1 #24 24 red 1 24.0 24 24
2 #18 18 yellow 0 18.0 18 18
3 #14 14 orange 2 14.0 14 14
4 #12 NaN purple 1 12.0 12 12

We can convert to other types as well.

Using the astype method, we can convert to any of the supported pandas dtypes (recall the intro lecture).

Below are some examples. (Pay attention to the reported dtype.)

In [11]:
df["numbers_numeric"].astype(str)
Out[11]:
0    23
1    24
2    18
3    14
4    12
5    10
6    35
Name: numbers_numeric, dtype: object
In [12]:
df["numbers_numeric"].astype(float)
Out[12]:
0    23.0
1    24.0
2    18.0
3    14.0
4    12.0
5    10.0
6    35.0
Name: numbers_numeric, dtype: float64
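astype also accepts dtype names given as strings. As a sketch (on a standalone Series, not the lecture's df), a string dtype name works the same way as the type objects above:

```python
import pandas as pd

s = pd.Series([23, 24, 18])

# dtype names passed as strings behave like the type objects above
as_float32 = s.astype("float32")
as_category = s.astype("category")

print(as_float32.dtype)
print(as_category.dtype)
```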

See exercise 3 in the exercise list

Missing Data

Many datasets have missing data.

In our example, we are missing an element from the "nums" column.

In [13]:
df
Out[13]:
numbers nums colors other_column numbers_loop numbers_str numbers_numeric
0 #23 23 green 0 23.0 23 23
1 #24 24 red 1 24.0 24 24
2 #18 18 yellow 0 18.0 18 18
3 #14 14 orange 2 14.0 14 14
4 #12 NaN purple 1 12.0 12 12
5 #10 XYZ blue 0 10.0 10 10
6 #35 35 pink 2 35.0 35 35

We can find missing data by using the isnull method.

In [14]:
df.isnull()
Out[14]:
numbers nums colors other_column numbers_loop numbers_str numbers_numeric
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False True False False False False False
5 False False False False False False False
6 False False False False False False False

We might want to know whether particular rows or columns have any missing data.

To do this we can use the .any method on the boolean DataFrame df.isnull().

In [15]:
df.isnull().any(axis=0)
Out[15]:
numbers            False
nums                True
colors             False
other_column       False
numbers_loop       False
numbers_str        False
numbers_numeric    False
dtype: bool
In [16]:
df.isnull().any(axis=1)
Out[16]:
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

Many approaches have been developed to deal with missing data, but the two most commonly used (and the corresponding DataFrame method) are:

  • Exclusion: Ignore any data that is missing (.dropna).
  • Imputation: Compute “predicted” values for the data that is missing (.fillna).

For the advantages and disadvantages of these (and other) approaches, consider reading the Wikipedia article.

For now, let’s see some examples.

In [17]:
# drop all rows containing a missing observation
df.dropna()
Out[17]:
numbers nums colors other_column numbers_loop numbers_str numbers_numeric
0 #23 23 green 0 23.0 23 23
1 #24 24 red 1 24.0 24 24
2 #18 18 yellow 0 18.0 18 18
3 #14 14 orange 2 14.0 14 14
5 #10 XYZ blue 0 10.0 10 10
6 #35 35 pink 2 35.0 35 35
In [18]:
# fill the missing values with a specific value
df.fillna(value=100)
Out[18]:
numbers nums colors other_column numbers_loop numbers_str numbers_numeric
0 #23 23 green 0 23.0 23 23
1 #24 24 red 1 24.0 24 24
2 #18 18 yellow 0 18.0 18 18
3 #14 14 orange 2 14.0 14 14
4 #12 100 purple 1 12.0 12 12
5 #10 XYZ blue 0 10.0 10 10
6 #35 35 pink 2 35.0 35 35
In [19]:
# use the _next_ valid observation to fill the missing data
df.fillna(method="bfill")
Out[19]:
numbers nums colors other_column numbers_loop numbers_str numbers_numeric
0 #23 23 green 0 23.0 23 23
1 #24 24 red 1 24.0 24 24
2 #18 18 yellow 0 18.0 18 18
3 #14 14 orange 2 14.0 14 14
4 #12 XYZ purple 1 12.0 12 12
5 #10 XYZ blue 0 10.0 10 10
6 #35 35 pink 2 35.0 35 35
In [20]:
# use the _previous_ valid observation to fill missing data
df.fillna(method="ffill")
Out[20]:
numbers nums colors other_column numbers_loop numbers_str numbers_numeric
0 #23 23 green 0 23.0 23 23
1 #24 24 red 1 24.0 24 24
2 #18 18 yellow 0 18.0 18 18
3 #14 14 orange 2 14.0 14 14
4 #12 14 purple 1 12.0 12 12
5 #10 XYZ blue 0 10.0 10 10
6 #35 35 pink 2 35.0 35 35
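Beyond a fixed constant or a neighboring observation, another common imputation choice (a sketch, under the assumption that the column mean is a sensible prediction for your data) is to fill missing values with a summary statistic such as the mean:

```python
import pandas as pd
import numpy as np

s = pd.Series([23.0, 24.0, 18.0, 14.0, np.nan, 10.0, 35.0])

# fill the missing value with the mean of the observed values
filled = s.fillna(s.mean())
print(filled)
```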

We will see more examples of dealing with missing data in future chapters.

Case Study

We will now use data from an article written by The Upshot at the NYTimes.

This data has order information from almost 2,000 Chipotle orders and includes information on what was ordered and how much it cost.

In [21]:
chipotle = qeds.data.load("chipotle_raw")
chipotle.head()
Out[21]:
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98

See exercise 4 in the exercise list

Appendix: Performance of .str Methods

Let’s repeat the “remove the #” example from above, but this time on a much larger dataset.

In [22]:
import numpy as np
test = pd.DataFrame({"floats": np.round(100*np.random.rand(100000), 2)})
test["strings"] = test["floats"].astype(str) + "%"
test.head()
Out[22]:
floats strings
0 96.20 96.2%
1 7.64 7.64%
2 2.63 2.63%
3 98.48 98.48%
4 74.47 74.47%
In [23]:
%%time

for row in test.iterrows():
    index_value, column_values = row
    clean_number = column_values["strings"].replace("%", "")
    test.at[index_value, "numbers_loop"] = clean_number
CPU times: user 8.58 s, sys: 18.9 ms, total: 8.6 s
Wall time: 8.68 s
In [24]:
%%time
test["numbers_str_method"] = test["strings"].str.replace("%", "")
CPU times: user 47.3 ms, sys: 0 ns, total: 47.3 ms
Wall time: 47.4 ms
In [25]:
test["numbers_str_method"].equals(test["numbers_loop"])
Out[25]:
True

We got the exact same result in a fraction of the time!

Exercises

Exercise 1

Convert the string below into a number.

In [26]:
c2n = "#39"

(back to text)

Exercise 2

Make a new column called colors_upper that contains the elements of colors with all uppercase letters.

(back to text)

Exercise 3

Convert the column "nums" to a numeric type using pd.to_numeric and save it to the DataFrame as "nums_tonumeric".

Notice that there is a missing value, and a value that is not a number.

Look at the documentation for pd.to_numeric and think about how to overcome this.

Think about why this could be a bad idea if used without knowing what your data looks like. (Think about what happens when you apply it to the "numbers" column before replacing the "#".)

(back to text)

Exercise 4

We'd like you to use this data to answer the following questions.

  • What is the average price of an item with chicken?
  • What is the average price of an item with steak?
  • Did chicken or steak produce more revenue (total)?
  • How many missing items are there in this dataset? How many missing items in each column?

Hint: Before you can do any of these things, you will need to make sure the item_price column has a numeric dtype (probably float).

(back to text)
