Problem Set 7

import matplotlib.colors as mplc
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import pandas as pd

Question 1

From Data Visualization: Rules and Guidelines

Create a bar chart of the data below on Canadian GDP growth. Use a non-red color for the years 2000 to 2008, red for 2009, and the first color again for 2010 to 2017 (the last year in the series).

ca_gdp = pd.Series(
    [5.2, 1.8, 3.0, 1.9, 3.1, 3.2, 2.8, 2.2, 1.0, -2.8, 3.2, 3.1, 1.7, 2.5, 2.9, 1.0, 1.4, 3.0],
    index=list(range(2000, 2018))
)

fig, ax = plt.subplots()

for side in ["right", "top", "left", "bottom"]:
    ax.spines[side].set_visible(False)
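One possible way to approach the coloring is to build a list with one color per bar and pass it to `ax.bar`. The sketch below is only an illustration, not the intended solution: the base color `tab:blue`, the title, and the axis label (including the percent unit) are assumptions that do not come from the problem statement.

```python
import matplotlib.pyplot as plt
import pandas as pd

ca_gdp = pd.Series(
    [5.2, 1.8, 3.0, 1.9, 3.1, 3.2, 2.8, 2.2, 1.0, -2.8, 3.2, 3.1, 1.7, 2.5, 2.9, 1.0, 1.4, 3.0],
    index=list(range(2000, 2018)),
)

# One color per bar: red for 2009, an assumed non-red base color otherwise
base_color = "tab:blue"
colors = ["red" if year == 2009 else base_color for year in ca_gdp.index]

fig, ax = plt.subplots()
ax.bar(ca_gdp.index, ca_gdp.values, color=colors)

# Label every second year to keep the axis readable
ax.set_xticks(list(ca_gdp.index[::2]))
ax.set_ylabel("GDP growth (percent, assumed unit)")
ax.set_title("Canadian GDP growth")

for side in ["right", "top", "left", "bottom"]:
    ax.spines[side].set_visible(False)
```

Passing a list to `color` keeps the highlighting logic in one place, so changing the highlighted year only requires editing the list comprehension.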

Question 2

From Data Visualization: Rules and Guidelines

Draft another way to organize time and education by modifying the code below. That is, have two subplots (one for each education level) and four groups of points (one for each year).

Why do you think they chose to organize the information the way they did rather than this way?

# Read in data
df = pd.read_csv("https://datascience.quantecon.org/assets/data/density_wage_data.csv")
df["year"] = df.year.astype(int)  # Convert year to int


def single_scatter_plot(df, year, educ, ax, color):
    """
    This function creates a single year's and education level's
    log density to log wage plot
    """
    # Filter data to keep only the data of interest
    _df = df.query("(year == @year) & (group == @educ)")
    _df.plot(
        kind="scatter", x="density_log", y="wages_logs", ax=ax, color=color
    )

    return ax

# Create initial plot
fig, ax = plt.subplots(1, 4, figsize=(16, 6), sharey=True)

for (i, year) in enumerate(df.year.unique()):
    single_scatter_plot(df, year, "college", ax[i], "b")
    single_scatter_plot(df, year, "noncollege", ax[i], "r")
    ax[i].set_title(str(year))
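As a starting point for the alternative layout, the sketch below draws one subplot per education level and one color per year. The helper name `plot_by_education`, the `viridis` colormap, and the figure size are illustrative assumptions. The column names (`year`, `group`, `density_log`, `wages_logs`) and the group labels are taken from the code above.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_by_education(df, groups=("college", "noncollege")):
    """Hypothetical helper: one subplot per education level, one color per year."""
    years = sorted(df["year"].unique())

    # Assumed color scheme: evenly spaced colors from a sequential colormap
    cmap = plt.get_cmap("viridis")
    colors = {y: cmap(i / max(len(years) - 1, 1)) for i, y in enumerate(years)}

    fig, axes = plt.subplots(1, len(groups), figsize=(12, 6), sharey=True)
    for ax, educ in zip(axes, groups):
        for year in years:
            # Keep only the rows for this year and education level
            sub = df[(df["year"] == year) & (df["group"] == educ)]
            ax.scatter(
                sub["density_log"], sub["wages_logs"],
                color=colors[year], label=str(year), alpha=0.6,
            )
        ax.set_title(educ)
        ax.set_xlabel("density_log")
        ax.legend(title="Year")
    axes[0].set_ylabel("wages_logs")
    return fig, axes
```

Calling `plot_by_education(df)` on the data read in above should produce the two-panel version, which you can then compare with the four-panel original when answering the second part of the question.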

Questions 3-5

These questions use a dataset from the Bureau of Transportation Statistics that describes the causes of all US domestic flight delays in December 2016. We used the same data in the previous problem set.

url = "https://datascience.quantecon.org/assets/data/airline_performance_dec16.csv.zip"
air_perf = pd.read_csv(url)[["CRSDepTime", "Carrier", "CarrierDelay", "ArrDelay"]]
air_perf.info()
air_perf.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 460949 entries, 0 to 460948
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   CRSDepTime    460949 non-null  object 
 1   Carrier       460949 non-null  object 
 2   CarrierDelay  460949 non-null  float64
 3   ArrDelay      452229 non-null  float64
dtypes: float64(2), object(2)
memory usage: 14.1+ MB
            CRSDepTime Carrier  CarrierDelay  ArrDelay
0  2016-12-18 15:58:00      AA           0.0      20.0
1  2016-12-19 15:58:00      AA           0.0      20.0
2  2016-12-20 15:58:00      AA           0.0      -3.0
3  2016-12-21 15:58:00      AA           0.0     -10.0
4  2016-12-22 15:58:00      AA           0.0      -8.0

The following questions are intentionally somewhat open-ended. For each one, carefully choose the type of visualization you’ll create. Put some effort into choosing colors, labels, and other formatting.

Question 3

Create a visualization of the relationship between airline (carrier) and delays.
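One reasonable starting point is a sorted bar chart of the mean arrival delay for each carrier. The sketch below is only one of many valid designs. The helper name `plot_mean_delay_by_carrier`, the choice of the mean as the summary statistic, and the assumption that `ArrDelay` is measured in minutes are mine, not part of the problem statement.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_mean_delay_by_carrier(air_perf):
    """Hypothetical helper: horizontal bar chart of mean arrival delay per carrier."""
    mean_delay = (
        air_perf.groupby("Carrier")["ArrDelay"]
        .mean()          # missing arrival delays are skipped by default
        .sort_values()   # sorting makes the ranking easy to read
    )

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh(mean_delay.index, mean_delay.values, color="tab:blue")
    ax.axvline(0, color="black", linewidth=0.8)  # reference line: on time
    ax.set_xlabel("Mean arrival delay (minutes, assumed unit)")
    ax.set_ylabel("Carrier")
    ax.set_title("Mean arrival delay by carrier")
    for side in ["right", "top"]:
        ax.spines[side].set_visible(False)
    return mean_delay, ax
```

A mean can hide a long right tail, so a box plot or the share of flights delayed beyond some threshold may be worth trying as alternatives.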

Question 4

Create a visualization of the relationship between date and delays.
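The `info()` output above shows that `CRSDepTime` is stored with dtype `object`, so it needs to be parsed before grouping by date. The sketch below shows one way to do this and to plot the daily mean arrival delay. The helper name `plot_daily_mean_delay` and the choice of a daily mean are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_daily_mean_delay(air_perf):
    """Hypothetical helper: line chart of mean arrival delay per scheduled departure date."""
    # Parse the timestamp strings and drop the time-of-day component
    dep_date = pd.to_datetime(air_perf["CRSDepTime"]).dt.normalize()

    # Mean arrival delay for each calendar day (missing delays are skipped)
    daily = air_perf["ArrDelay"].groupby(dep_date).mean()

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(daily.index, daily.values, marker="o", color="tab:blue")
    ax.axhline(0, color="black", linewidth=0.8)  # reference line: on time
    ax.set_xlabel("Scheduled departure date")
    ax.set_ylabel("Mean arrival delay")
    ax.set_title("Mean arrival delay by day")
    fig.autofmt_xdate()  # rotate date labels so they do not overlap
    return daily, ax
```

Grouping on `dep_date.dt.dayofweek` or `dep_date.dt.hour` instead (before normalizing, in the case of the hour) would show weekly or within-day patterns rather than the day-by-day series.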

Question 5

Create a visualization of the relationship between location (origin and/or destination) and delays.
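Note that the `air_perf` DataFrame loaded above keeps only four columns, none of which describes location, so this question requires reading additional columns from the file. The sketch below assumes the file contains an airport column named `Origin`. That name is a guess and should be checked against the actual column list, for example with `pd.read_csv(url, nrows=5).columns`. The helper name `plot_worst_locations` and the `top_n` cutoff are also illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_worst_locations(flights, location_col="Origin", top_n=15):
    """Hypothetical helper: bar chart of the locations with the highest mean arrival delay.

    `location_col` is an assumed column name; pass whichever location
    column the data actually contains.
    """
    worst = (
        flights.groupby(location_col)["ArrDelay"]
        .mean()            # missing arrival delays are skipped by default
        .nlargest(top_n)   # keep only the most delayed locations
        .sort_values()     # ascending, so the largest bar is drawn at the top
    )

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh(worst.index, worst.values, color="tab:blue")
    ax.set_xlabel("Mean arrival delay")
    ax.set_ylabel(location_col)
    ax.set_title(f"Locations with the highest mean arrival delay (top {top_n})")
    for side in ["right", "top"]:
        ax.spines[side].set_visible(False)
    return worst, ax
```

Airports with very few flights can dominate a ranking by mean, so filtering on a minimum number of flights before ranking may give a more informative picture.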