Time series#

Prerequisites

Outcomes

  • Know how pandas handles dates

  • Understand how to parse strings into datetime objects

  • Know how to write dates as custom formatted strings

  • Be able to access day, month, year, etc. for a DateTimeIndex and a column with dtype datetime

  • Understand both rolling and re-sampling operations and the difference between the two

Data

  • Bitcoin to USD exchange rates from March 2014 to the present

# Uncomment following line to install on colab
#! pip install 
import os
import pandas as pd
import matplotlib.pyplot as plt
import quandl

# see section on API keys at end of lecture!
quandl.ApiConfig.api_key = os.environ.get("QUANDL_AUTH", "Dn6BtVoBhzuKTuyo6hbp")
start_date = "2014-05-01"

%matplotlib inline

Intro#

pandas has extensive support for handling dates and times.

We will loosely refer to data with date or time information as time series data.

In this lecture, we will cover the most useful parts of pandas’ time series functionality.

Among these topics are:

  • Parsing strings as dates

  • Writing datetime objects as strings (inverse operation of previous point)

  • Extracting data from a DataFrame or Series with date information in the index

  • Shifting data through time (taking leads or lags)

  • Re-sampling data to a different frequency and rolling operations

However, even more than with previous topics, we will skip a lot of the functionality pandas offers, and we urge you to refer to the official documentation for more information.

Parsing Strings as Dates#

When working with time series data, we almost always receive the data with dates encoded as strings.

Hopefully, the date strings follow a structured format or pattern.

One common pattern is YYYY-MM-DD: 4 numbers for the year, 2 for the month, and 2 for the day with each section separated by a -.

For example, we write Christmas day 2017 in this format as

christmas_str = "2017-12-25"

To convert a string into a time-aware object, we use the pd.to_datetime function.

christmas = pd.to_datetime(christmas_str)
print("The type of christmas is", type(christmas))
christmas
The type of christmas is <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Timestamp('2017-12-25 00:00:00')

The pd.to_datetime function is pretty smart at guessing the format of the date…

for date in ["December 25, 2017", "Dec. 25, 2017",
             "Monday, Dec. 25, 2017", "25 Dec. 2017", "25th Dec. 2017"]:
    print("pandas interprets {} as {}".format(date, pd.to_datetime(date)))
pandas interprets December 25, 2017 as 2017-12-25 00:00:00
pandas interprets Dec. 25, 2017 as 2017-12-25 00:00:00
pandas interprets Monday, Dec. 25, 2017 as 2017-12-25 00:00:00
pandas interprets 25 Dec. 2017 as 2017-12-25 00:00:00
pandas interprets 25th Dec. 2017 as 2017-12-25 00:00:00

However, sometimes we will need to give pandas a hint.

For example, that same time (midnight on Christmas) would be reported on an Amazon transaction report as

christmas_amzn = "2017-12-25T00:00:00+ 00 :00"

If we try to pass this to pd.to_datetime, it will fail.

pd.to_datetime(christmas_amzn)
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/_libs/tslib.pyx:605, in pandas._libs.tslib.array_to_datetime()

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/_libs/tslibs/parsing.pyx:318, in pandas._libs.tslibs.parsing.parse_datetime_string()

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/dateutil/parser/_parser.py:1368, in parse(timestr, parserinfo, **kwargs)
   1367 else:
-> 1368     return DEFAULTPARSER.parse(timestr, **kwargs)

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/dateutil/parser/_parser.py:643, in parser.parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    642 if res is None:
--> 643     raise ParserError("Unknown string format: %s", timestr)
    645 if len(res) == 0:

ParserError: Unknown string format: 2017-12-25T00:00:00+ 00 :00

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/_libs/tslib.pyx:616, in pandas._libs.tslib.array_to_datetime()

TypeError: invalid string coercion to datetime for "2017-12-25T00:00:00+ 00 :00" at position 0

During handling of the above exception, another exception occurred:

ParserError                               Traceback (most recent call last)
Cell In[6], line 1
----> 1 pd.to_datetime(christmas_amzn)

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/tools/datetimes.py:1102, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1100         result = convert_listlike(argc, format)
   1101 else:
-> 1102     result = convert_listlike(np.array([arg]), format)[0]
   1103     if isinstance(arg, bool) and isinstance(result, np.bool_):
   1104         result = bool(result)  # TODO: avoid this kludge.

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/tools/datetimes.py:438, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    436 assert format is None or infer_datetime_format
    437 utc = tz == "utc"
--> 438 result, tz_parsed = objects_to_datetime64ns(
    439     arg,
    440     dayfirst=dayfirst,
    441     yearfirst=yearfirst,
    442     utc=utc,
    443     errors=errors,
    444     require_iso8601=require_iso8601,
    445     allow_object=True,
    446 )
    448 if tz_parsed is not None:
    449     # We can take a shortcut since the datetime64 numpy array
    450     # is in UTC
    451     dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:2177, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2175 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
   2176 try:
-> 2177     result, tz_parsed = tslib.array_to_datetime(
   2178         data.ravel("K"),
   2179         errors=errors,
   2180         utc=utc,
   2181         dayfirst=dayfirst,
   2182         yearfirst=yearfirst,
   2183         require_iso8601=require_iso8601,
   2184         allow_mixed=allow_mixed,
   2185     )
   2186     result = result.reshape(data.shape, order=order)
   2187 except OverflowError as err:
   2188     # Exception is raised when a part of date is greater than 32 bit signed int

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/_libs/tslib.pyx:427, in pandas._libs.tslib.array_to_datetime()

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/_libs/tslib.pyx:683, in pandas._libs.tslib.array_to_datetime()

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/_libs/tslib.pyx:829, in pandas._libs.tslib._array_to_datetime_object()

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/_libs/tslib.pyx:819, in pandas._libs.tslib._array_to_datetime_object()

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/pandas/_libs/tslibs/parsing.pyx:318, in pandas._libs.tslibs.parsing.parse_datetime_string()

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/dateutil/parser/_parser.py:1368, in parse(timestr, parserinfo, **kwargs)
   1366     return parser(parserinfo).parse(timestr, **kwargs)
   1367 else:
-> 1368     return DEFAULTPARSER.parse(timestr, **kwargs)

File /usr/share/miniconda3/envs/lecture-datascience/lib/python3.9/site-packages/dateutil/parser/_parser.py:643, in parser.parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    640 res, skipped_tokens = self._parse(timestr, **kwargs)
    642 if res is None:
--> 643     raise ParserError("Unknown string format: %s", timestr)
    645 if len(res) == 0:
    646     raise ParserError("String does not contain a date: %s", timestr)

ParserError: Unknown string format: 2017-12-25T00:00:00+ 00 :00 present at position 0

To parse a date with this format, we need to specify the format argument for pd.to_datetime.

amzn_strftime = "%Y-%m-%dT%H:%M:%S+ 00 :00"
pd.to_datetime(christmas_amzn, format=amzn_strftime)
Timestamp('2017-12-25 00:00:00')

Can you guess what amzn_strftime represents?

Let’s take a closer look at amzn_strftime and christmas_amzn.

print(amzn_strftime)
print(christmas_amzn)
%Y-%m-%dT%H:%M:%S+ 00 :00
2017-12-25T00:00:00+ 00 :00

Notice that both of the strings have a similar form, but that instead of actual numerical values, amzn_strftime has placeholders.

Specifically, anywhere the % shows up is a signal to the pd.to_datetime function that it is where relevant information is stored.

For example, the %Y is a stand-in for a four digit year, %m is for 2 a digit month, and so on…

The official Python documentation contains a complete list of possible %something patterns that are accepted in the format argument.

Exercise

See exercise 1 in the exercise list.

Multiple Dates#

If we have dates in a Series (e.g. column of DataFrame) or a list, we can pass the entire collection to pd.to_datetime and get a collection of dates back.

We’ll just show an example of that here as the mechanics are the same as a single date.

pd.to_datetime(["2017-12-25", "2017-12-31"])
DatetimeIndex(['2017-12-25', '2017-12-31'], dtype='datetime64[ns]', freq=None)

Date Formatting#

We can use the %pattern format to have pandas write datetime objects as specially formatted strings using the strftime (string format time) method.

For example,

christmas.strftime("We love %A %B %d (also written %c)")
'We love Monday December 25 (also written Mon Dec 25 00:00:00 2017)'

Exercise

See exercise 2 in the exercise list.

Extracting Data#

When the index of a DataFrame has date information and pandas recognizes the values as datetime values, we can leverage some convenient indexing features for extracting data.

The flexibility of these features is best understood through example, so let’s load up some data and take a look.

btc_usd = quandl.get("BCHARTS/BITSTAMPUSD", start_date=start_date)
btc_usd.info()
btc_usd.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2599 entries, 2014-05-01 to 2021-06-20
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Open               2599 non-null   float64
 1   High               2599 non-null   float64
 2   Low                2599 non-null   float64
 3   Close              2599 non-null   float64
 4   Volume (BTC)       2599 non-null   float64
 5   Volume (Currency)  2599 non-null   float64
 6   Weighted Price     2599 non-null   float64
dtypes: float64(7)
memory usage: 162.4 KB
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2014-05-01 449.00 465.88 447.97 460.97 9556.037886 4.381969e+06 458.554960
2014-05-02 460.97 462.99 444.51 454.50 8269.891417 3.731061e+06 451.162018
2014-05-03 452.00 454.50 431.00 439.00 7431.626480 3.271086e+06 440.157544
2014-05-04 439.00 442.83 429.55 438.04 5154.407794 2.245293e+06 435.606483
2014-05-05 435.88 445.00 425.00 433.00 8188.082795 3.547855e+06 433.294968

Here, we have the Bitcoin (BTC) to US dollar (USD) exchange rate from March 2014 until today.

Notice that the type of index is DateTimeIndex.

This is the key that enables things like…

Extracting all data for the year 2015 by passing "2015" to .loc.

btc_usd.loc["2015"]
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2015-01-01 321.00 321.00 312.60 313.81 3087.436554 9.745096e+05 315.637119
2015-01-02 313.82 317.01 311.96 315.42 3468.281375 1.092446e+06 314.981849
2015-01-03 315.42 316.58 280.00 282.00 21752.719146 6.475952e+06 297.707695
2015-01-04 280.00 289.39 255.00 264.00 41441.278553 1.126676e+07 271.872950
2015-01-05 264.55 280.00 264.07 276.80 9528.271002 2.596898e+06 272.546601
... ... ... ... ... ... ... ...
2015-12-27 417.11 425.42 410.50 421.76 6968.573659 2.921475e+06 419.235752
2015-12-28 421.78 429.86 417.01 421.46 7560.562992 3.204102e+06 423.791428
2015-12-29 420.81 433.33 418.55 431.82 10419.585366 4.444308e+06 426.534057
2015-12-30 431.70 434.97 420.75 425.84 7717.510263 3.315825e+06 429.649630
2015-12-31 426.09 433.89 419.99 430.89 6634.863167 2.833320e+06 427.035137

365 rows × 7 columns

We can also narrow down to specific months.

# By month's name
btc_usd.loc["August 2017"]
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2017-08-01 2855.81 2929.17 2615.00 2731.00 12525.076691 3.432280e+07 2740.326259
2017-08-02 2732.00 2760.00 2650.00 2703.51 9486.625526 2.570111e+07 2709.193699
2017-08-03 2703.51 2807.44 2698.83 2793.37 7963.697999 2.193830e+07 2754.788542
2017-08-04 2793.34 2877.52 2765.91 2855.00 7635.821672 2.165009e+07 2835.331752
2017-08-05 2851.01 3339.66 2848.32 3263.62 16996.273101 5.386193e+07 3169.043337
2017-08-06 3263.51 3296.51 3146.10 3222.75 5998.735789 1.941266e+07 3236.124519
2017-08-07 3216.78 3430.00 3186.00 3387.55 12046.117265 4.022585e+07 3339.320971
2017-08-08 3387.54 3490.00 3300.00 3412.41 15835.370208 5.405482e+07 3413.549765
2017-08-09 3408.46 3423.10 3178.72 3342.99 14286.844138 4.735646e+07 3314.690350
2017-08-10 3342.99 3448.00 3311.17 3413.03 9031.121280 3.065844e+07 3394.754600
2017-08-11 3410.00 3705.00 3390.67 3645.06 11927.373334 4.188802e+07 3511.923462
2017-08-12 3651.74 3934.00 3586.95 3855.10 12351.074661 4.679980e+07 3789.127641
2017-08-13 3855.04 4190.00 3841.71 4053.87 15889.829788 6.367210e+07 4007.097892
2017-08-14 4053.87 4329.43 3964.96 4306.23 14212.304207 5.983600e+07 4210.154943
2017-08-15 4320.95 4400.00 3800.00 4155.67 25515.718014 1.046514e+08 4101.447155
2017-08-16 4154.99 4379.78 3926.06 4378.84 12923.637280 5.419433e+07 4193.426713
2017-08-17 4361.99 4480.00 4167.21 4276.50 14573.185929 6.322860e+07 4338.694675
2017-08-18 4260.47 4368.00 3964.96 4100.00 17516.993361 7.322403e+07 4180.171091
2017-08-19 4100.00 4188.00 3900.00 4099.55 15036.184051 6.060491e+07 4030.604133
2017-08-20 4091.99 4125.95 4000.00 4058.68 6237.972896 2.528964e+07 4054.143713
2017-08-21 4058.64 4080.00 3949.78 3987.52 9782.056594 3.920391e+07 4007.736878
2017-08-22 3987.51 4139.31 3600.00 4085.00 23522.758166 9.160033e+07 3894.115013
2017-08-23 4078.00 4248.97 4051.94 4108.12 14979.403481 6.211729e+07 4146.846811
2017-08-24 4121.78 4350.00 4082.57 4300.34 10782.694367 4.517407e+07 4189.497275
2017-08-25 4308.80 4449.98 4270.00 4355.98 9699.610034 4.227406e+07 4358.326082
2017-08-26 4348.17 4369.78 4232.43 4333.38 6559.668604 2.822792e+07 4303.253256
2017-08-27 4333.38 4393.30 4290.32 4337.68 3979.124453 1.726397e+07 4338.634650
2017-08-28 4329.91 4399.72 4169.01 4379.99 8641.002446 3.707720e+07 4290.845145
2017-08-29 4385.00 4649.78 4336.26 4578.82 11879.642831 5.349117e+07 4502.758977
2017-08-30 4578.82 4642.22 4479.00 4573.20 8720.035040 3.980054e+07 4564.263673
2017-08-31 4573.15 4765.21 4566.66 4734.26 8911.412459 4.172621e+07 4682.334106
# By month's number
btc_usd.loc["08/2017"]
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2017-08-01 2855.81 2929.17 2615.00 2731.00 12525.076691 3.432280e+07 2740.326259
2017-08-02 2732.00 2760.00 2650.00 2703.51 9486.625526 2.570111e+07 2709.193699
2017-08-03 2703.51 2807.44 2698.83 2793.37 7963.697999 2.193830e+07 2754.788542
2017-08-04 2793.34 2877.52 2765.91 2855.00 7635.821672 2.165009e+07 2835.331752
2017-08-05 2851.01 3339.66 2848.32 3263.62 16996.273101 5.386193e+07 3169.043337
2017-08-06 3263.51 3296.51 3146.10 3222.75 5998.735789 1.941266e+07 3236.124519
2017-08-07 3216.78 3430.00 3186.00 3387.55 12046.117265 4.022585e+07 3339.320971
2017-08-08 3387.54 3490.00 3300.00 3412.41 15835.370208 5.405482e+07 3413.549765
2017-08-09 3408.46 3423.10 3178.72 3342.99 14286.844138 4.735646e+07 3314.690350
2017-08-10 3342.99 3448.00 3311.17 3413.03 9031.121280 3.065844e+07 3394.754600
2017-08-11 3410.00 3705.00 3390.67 3645.06 11927.373334 4.188802e+07 3511.923462
2017-08-12 3651.74 3934.00 3586.95 3855.10 12351.074661 4.679980e+07 3789.127641
2017-08-13 3855.04 4190.00 3841.71 4053.87 15889.829788 6.367210e+07 4007.097892
2017-08-14 4053.87 4329.43 3964.96 4306.23 14212.304207 5.983600e+07 4210.154943
2017-08-15 4320.95 4400.00 3800.00 4155.67 25515.718014 1.046514e+08 4101.447155
2017-08-16 4154.99 4379.78 3926.06 4378.84 12923.637280 5.419433e+07 4193.426713
2017-08-17 4361.99 4480.00 4167.21 4276.50 14573.185929 6.322860e+07 4338.694675
2017-08-18 4260.47 4368.00 3964.96 4100.00 17516.993361 7.322403e+07 4180.171091
2017-08-19 4100.00 4188.00 3900.00 4099.55 15036.184051 6.060491e+07 4030.604133
2017-08-20 4091.99 4125.95 4000.00 4058.68 6237.972896 2.528964e+07 4054.143713
2017-08-21 4058.64 4080.00 3949.78 3987.52 9782.056594 3.920391e+07 4007.736878
2017-08-22 3987.51 4139.31 3600.00 4085.00 23522.758166 9.160033e+07 3894.115013
2017-08-23 4078.00 4248.97 4051.94 4108.12 14979.403481 6.211729e+07 4146.846811
2017-08-24 4121.78 4350.00 4082.57 4300.34 10782.694367 4.517407e+07 4189.497275
2017-08-25 4308.80 4449.98 4270.00 4355.98 9699.610034 4.227406e+07 4358.326082
2017-08-26 4348.17 4369.78 4232.43 4333.38 6559.668604 2.822792e+07 4303.253256
2017-08-27 4333.38 4393.30 4290.32 4337.68 3979.124453 1.726397e+07 4338.634650
2017-08-28 4329.91 4399.72 4169.01 4379.99 8641.002446 3.707720e+07 4290.845145
2017-08-29 4385.00 4649.78 4336.26 4578.82 11879.642831 5.349117e+07 4502.758977
2017-08-30 4578.82 4642.22 4479.00 4573.20 8720.035040 3.980054e+07 4564.263673
2017-08-31 4573.15 4765.21 4566.66 4734.26 8911.412459 4.172621e+07 4682.334106

Or even a day…

# By date name
btc_usd.loc["August 1, 2017"]
Open                 2.855810e+03
High                 2.929170e+03
Low                  2.615000e+03
Close                2.731000e+03
Volume (BTC)         1.252508e+04
Volume (Currency)    3.432280e+07
Weighted Price       2.740326e+03
Name: 2017-08-01 00:00:00, dtype: float64
# By date number
btc_usd.loc["08-01-2017"]
Open                 2.855810e+03
High                 2.929170e+03
Low                  2.615000e+03
Close                2.731000e+03
Volume (BTC)         1.252508e+04
Volume (Currency)    3.432280e+07
Weighted Price       2.740326e+03
Name: 2017-08-01 00:00:00, dtype: float64

What can we pass as the .loc argument when we have a DateTimeIndex?

Anything that can be converted to a datetime using pd.to_datetime, without having to specify the format argument.

When that condition holds, pandas will return all rows whose date in the index “belong” to that date or period.

We can also use the range shorthand notation to give a start and end date for selection.

btc_usd.loc["April 1, 2015":"April 10, 2015"]
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2015-04-01 243.93 246.83 239.32 246.69 6226.016464 1.513601e+06 243.109050
2015-04-02 246.68 256.96 244.52 253.28 9806.822203 2.453664e+06 250.199655
2015-04-03 253.22 256.67 251.23 254.19 5048.577376 1.283171e+06 254.164902
2015-04-04 254.19 255.85 250.76 253.70 2769.281658 7.002845e+05 252.875870
2015-04-05 253.60 261.00 251.65 260.54 5759.360160 1.479483e+06 256.883285
2015-04-06 260.57 262.98 254.00 255.58 5960.677633 1.535495e+06 257.604180
2015-04-07 255.54 256.62 251.50 253.72 6010.267582 1.528034e+06 254.237260
2015-04-08 253.71 254.96 243.06 244.58 11663.656155 2.878712e+06 246.810407
2015-04-09 244.84 246.30 238.47 243.43 7943.710541 1.932558e+06 243.281478
2015-04-10 243.75 243.94 231.00 235.99 11549.630656 2.728444e+06 236.236497

Exercise

See exercise 3 in the exercise list.

Accessing Date Properties#

Sometimes, we would like to directly access a part of the date/time.

If our date/time information is in the index, we can to df.index.XX where XX is replaced by year, month, or whatever we would like to access.

btc_usd.index.year
Int64Index([2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,
            ...
            2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021],
           dtype='int64', name='Date', length=2599)
btc_usd.index.day
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
            ...
            11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
           dtype='int64', name='Date', length=2599)

We can also do the same if the date/time information is stored in a column, but we have to use a slightly different syntax.

df["column_name"].dt.XX
btc_date_column = btc_usd.reset_index()
btc_date_column.head()
Date Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
0 2014-05-01 449.00 465.88 447.97 460.97 9556.037886 4.381969e+06 458.554960
1 2014-05-02 460.97 462.99 444.51 454.50 8269.891417 3.731061e+06 451.162018
2 2014-05-03 452.00 454.50 431.00 439.00 7431.626480 3.271086e+06 440.157544
3 2014-05-04 439.00 442.83 429.55 438.04 5154.407794 2.245293e+06 435.606483
4 2014-05-05 435.88 445.00 425.00 433.00 8188.082795 3.547855e+06 433.294968
btc_date_column["Date"].dt.year.head()
0    2014
1    2014
2    2014
3    2014
4    2014
Name: Date, dtype: int64
btc_date_column["Date"].dt.month.head()
0    5
1    5
2    5
3    5
4    5
Name: Date, dtype: int64

Leads and Lags: df.shift#

When doing time series analysis, we often want to compare data at one date against data at another date.

pandas can help us with this if we leverage the shift method.

Without any additional arguments, shift() will move all data forward one period, filling the first row with missing data.

# so we can see the result of shift clearly
btc_usd.head()
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2014-05-01 449.00 465.88 447.97 460.97 9556.037886 4.381969e+06 458.554960
2014-05-02 460.97 462.99 444.51 454.50 8269.891417 3.731061e+06 451.162018
2014-05-03 452.00 454.50 431.00 439.00 7431.626480 3.271086e+06 440.157544
2014-05-04 439.00 442.83 429.55 438.04 5154.407794 2.245293e+06 435.606483
2014-05-05 435.88 445.00 425.00 433.00 8188.082795 3.547855e+06 433.294968
btc_usd.shift().head()
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2014-05-01 NaN NaN NaN NaN NaN NaN NaN
2014-05-02 449.00 465.88 447.97 460.97 9556.037886 4.381969e+06 458.554960
2014-05-03 460.97 462.99 444.51 454.50 8269.891417 3.731061e+06 451.162018
2014-05-04 452.00 454.50 431.00 439.00 7431.626480 3.271086e+06 440.157544
2014-05-05 439.00 442.83 429.55 438.04 5154.407794 2.245293e+06 435.606483

We can use this to compute the percent change from one day to the next. (Quiz: Why does that work? Remember how pandas uses the index to align data.)

((btc_usd - btc_usd.shift()) / btc_usd.shift()).head()
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2014-05-01 NaN NaN NaN NaN NaN NaN NaN
2014-05-02 0.026659 -0.006203 -0.007724 -0.014036 -0.134590 -0.148542 -0.016122
2014-05-03 -0.019459 -0.018337 -0.030393 -0.034103 -0.101363 -0.123282 -0.024391
2014-05-04 -0.028761 -0.025677 -0.003364 -0.002187 -0.306423 -0.313594 -0.010340
2014-05-05 -0.007107 0.004900 -0.010592 -0.011506 0.588559 0.580130 -0.005306

Setting the first argument to n tells pandas to shift the data down n rows (apply an n period lag).

btc_usd.shift(3).head()
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2014-05-01 NaN NaN NaN NaN NaN NaN NaN
2014-05-02 NaN NaN NaN NaN NaN NaN NaN
2014-05-03 NaN NaN NaN NaN NaN NaN NaN
2014-05-04 449.00 465.88 447.97 460.97 9556.037886 4.381969e+06 458.554960
2014-05-05 460.97 462.99 444.51 454.50 8269.891417 3.731061e+06 451.162018

A negative value will shift the data up or apply a lead.

btc_usd.shift(-2).head()
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2014-05-01 452.00 454.50 431.00 439.00 7431.626480 3.271086e+06 440.157544
2014-05-02 439.00 442.83 429.55 438.04 5154.407794 2.245293e+06 435.606483
2014-05-03 435.88 445.00 425.00 433.00 8188.082795 3.547855e+06 433.294968
2014-05-04 431.64 434.00 420.27 428.01 8041.198415 3.439331e+06 427.713734
2014-05-05 429.00 452.00 425.67 440.00 13248.349023 5.842712e+06 441.014383
btc_usd.shift(-2).tail()
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2021-06-16 38110.25 38208.68 35143.68 35840.00 4240.404867 1.560566e+08 36802.295855
2021-06-17 35849.18 36440.00 34718.21 35507.06 3129.334722 1.118542e+08 35743.751401
2021-06-18 35494.70 36125.00 33337.00 35675.47 3865.408811 1.344830e+08 34791.393581
2021-06-19 NaN NaN NaN NaN NaN NaN NaN
2021-06-20 NaN NaN NaN NaN NaN NaN NaN

Exercise

See exercise 4 in the exercise list.

Rolling Computations: .rolling#

pandas has facilities that enable easy computation of rolling statistics.

These are best understood by example, so we will dive right in.

# first take only the first 6 rows so we can easily see what is going on
btc_small = btc_usd.head(6)
btc_small
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2014-05-01 449.00 465.88 447.97 460.97 9556.037886 4.381969e+06 458.554960
2014-05-02 460.97 462.99 444.51 454.50 8269.891417 3.731061e+06 451.162018
2014-05-03 452.00 454.50 431.00 439.00 7431.626480 3.271086e+06 440.157544
2014-05-04 439.00 442.83 429.55 438.04 5154.407794 2.245293e+06 435.606483
2014-05-05 435.88 445.00 425.00 433.00 8188.082795 3.547855e+06 433.294968
2014-05-06 431.64 434.00 420.27 428.01 8041.198415 3.439331e+06 427.713734

Below, we compute the 2 day moving average (for all columns).

btc_small.rolling("2d").mean()
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2014-05-01 449.000 465.880 447.970 460.970 9556.037886 4.381969e+06 458.554960
2014-05-02 454.985 464.435 446.240 457.735 8912.964652 4.056515e+06 454.858489
2014-05-03 456.485 458.745 437.755 446.750 7850.758948 3.501074e+06 445.659781
2014-05-04 445.500 448.665 430.275 438.520 6293.017137 2.758190e+06 437.882013
2014-05-05 437.440 443.915 427.275 435.520 6671.245295 2.896574e+06 434.450725
2014-05-06 433.760 439.500 422.635 430.505 8114.640605 3.493593e+06 430.504351

To do this operation, pandas starts at each row (date) then looks backwards the specified number of periods (here 2 days) and then applies some aggregation function (mean) on all the data in that window.

If pandas cannot look back the full length of the window (e.g. when working on the first row), it fills as much of the window as possible and then does the operation. Notice that the value at 2014-05-01 is the same in both DataFrames.

Below, we see a visual depiction of the rolling maximum on a 21 day window for the whole dataset.

fig, ax = plt.subplots(figsize=(10, 4))
btc_usd["Open"].plot(ax=ax, linestyle="--", alpha=0.8)
btc_usd.rolling("21d").max()["Open"].plot(ax=ax, alpha=0.8, linewidth=3)
ax.legend(["Original", "21 day max"])
<matplotlib.legend.Legend at 0x7f5e2b2b4160>
../_images/0c7e437edd40da0b483cf5d4a716bde1e302b75291a14c4902f8d37a6574655b.png

We can also ask pandas to apply custom functions, similar to what we saw when studying GroupBy.

def is_volatile(x):
    "Returns a 1 if the variance is greater than 1, otherwise returns 0"
    if x.var() > 1.0:
        return 1.0
    else:
        return 0.0
btc_small.rolling("2d").apply(is_volatile)
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2014-05-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2014-05-02 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2014-05-03 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2014-05-04 1.0 1.0 1.0 0.0 1.0 1.0 1.0
2014-05-05 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2014-05-06 1.0 1.0 1.0 1.0 1.0 1.0 1.0

Exercise

See exercise 5 in the exercise list.

To make the optimal decision, we need to know the maximum difference between the close price at the end of the window and the open price at the start of the window.

Exercise

See exercise 6 in the exercise list.

Changing Frequencies: .resample#

In addition to computing rolling statistics, we can also change the frequency of the data.

For example, instead of a monthly moving average, suppose that we wanted to compute the average within each calendar month.

We will use the resample method to do this.

Below are some examples.

# business quarter
btc_usd.resample("BQ").mean()
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2014-06-30 546.296557 560.478689 534.375574 549.274426 9692.850513 5.418439e+06 547.779109
2014-09-30 536.939130 544.464239 523.571087 533.646087 8921.203260 4.473120e+06 533.775317
2014-12-31 358.113261 367.361413 347.553261 357.277935 14487.630863 5.182557e+06 357.025437
2015-03-31 242.648111 250.373889 233.306667 241.823889 15871.493002 3.827028e+06 242.062305
2015-06-30 236.023956 239.404396 232.456593 236.209451 7568.650463 1.787362e+06 236.050265
2015-09-30 255.205435 259.369348 250.738478 254.964565 15057.220448 3.752239e+06 255.216907
2015-12-31 343.968043 354.843587 335.019783 346.029674 21987.668969 7.534306e+06 345.209689
2016-03-31 409.723956 415.345165 402.351319 409.552747 7276.401365 2.957477e+06 409.345434
2016-06-30 508.830549 519.757143 491.237582 511.820440 5981.137422 3.332013e+06 509.388134
2016-09-30 614.719891 621.399783 604.309022 614.092500 3888.241125 2.364365e+06 613.211596
2016-12-30 722.921538 732.714176 714.873846 726.758462 4658.746976 3.501319e+06 724.799640
2017-03-31 1031.095275 1058.529451 998.373077 1032.510000 9574.254814 9.767086e+06 1029.350854
2017-06-30 1886.373956 1961.381868 1823.321319 1902.486484 11808.708093 2.482040e+07 1897.089126
2017-09-29 3452.060000 3578.992637 3312.851758 3471.192088 14591.656328 5.001818e+07 3446.717092
2017-12-29 9132.594945 9602.524725 8655.917473 9245.341978 15339.474034 1.498917e+08 9150.932048
2018-03-30 10654.949560 11107.304725 9981.257363 10571.343077 16971.607663 1.711173e+08 10536.375918
2018-06-29 7760.852637 7950.978132 7543.325275 7754.252637 10412.011976 8.196025e+07 7744.048061
2018-09-28 6792.599011 6950.023956 6639.782857 6797.711209 7098.911451 4.863160e+07 6790.558832
2018-12-31 5199.605319 5297.541809 5063.396915 5169.268191 8805.848653 3.969870e+07 5178.241610
2019-03-29 3737.812500 3799.089886 3676.291136 3742.059432 6446.776373 2.414045e+07 3737.221237
2019-06-28 7026.032308 7308.565604 6829.843407 7117.250110 10796.082022 8.051510e+07 7077.940876
2019-09-30 10429.423617 10693.213511 10038.982872 10384.737660 9112.727636 9.513540e+07 10360.841564
2019-12-31 7997.030978 8178.661087 7803.425109 7983.524457 6393.955934 5.178886e+07 7981.938611
2020-03-31 8273.623956 8502.892637 8005.682418 8262.963297 9395.033509 6.978281e+07 8261.113739
2020-06-30 8631.383077 8850.968242 8409.878022 8660.272637 9782.113853 8.484144e+07 8638.033563
2020-09-30 10611.801087 10792.899239 10400.647391 10629.811196 6548.697693 7.025685e+07 10605.035045
2020-12-31 16399.218222 16885.984444 16024.660667 16580.653222 7732.499161 1.321875e+08 16479.764167
2021-03-31 46009.360843 47795.721928 44029.390000 46240.197349 8859.348156 3.778755e+08 46047.178937
2021-06-30 48335.081235 49808.311235 46184.388025 48049.587407 5468.834803 2.532804e+08 48105.596908

Note that unlike with rolling, a single number is returned for each column for each quarter.

The resample method will alter the frequency of the data and the number of rows in the result will be different from the number of rows in the input.

On the other hand, with rolling, the size and frequency of the result are the same as the input.

We can sample at other frequencies and aggregate with multiple aggregations function at once.

# multiple functions at 2 start-of-quarter frequency
btc_usd.resample("2BQS").agg(["min", "max"])
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
min max min max min max min max min max min max min max
Date
2014-04-01 374.17 668.90 386.03 683.26 365.20 651.70 374.20 670.14 1467.591402 29732.720362 9.159133e+05 1.561239e+07 376.976877 667.690345
2014-10-01 0.00 426.64 0.00 453.92 0.00 390.48 0.00 426.63 0.000000 124188.885083 0.000000e+00 2.357627e+07 0.000000 420.127183
2015-04-01 209.76 310.55 222.88 317.99 198.12 292.19 209.72 310.55 1946.293030 42308.005630 4.732609e+05 9.091325e+06 214.884260 306.748292
2015-10-01 235.87 464.53 239.06 502.00 235.00 453.50 237.15 464.53 1253.006376 105959.259141 5.210775e+05 4.719959e+07 237.116083 461.494358
2016-04-01 414.66 767.37 416.99 778.85 1.50 740.11 416.31 766.62 719.159825 33056.289644 4.709121e+05 2.225764e+07 415.569853 754.723539
2016-10-03 607.19 1287.38 610.50 1350.00 604.99 1255.00 607.19 1285.33 888.660021 36018.861120 5.460154e+05 3.883046e+07 607.560859 1275.581651
2017-04-03 1076.59 4921.71 1145.00 4979.90 1076.19 4671.09 1134.58 4921.70 1804.450797 60278.946542 2.128068e+06 2.031684e+08 1127.151197 4808.168193
2017-10-02 4219.74 19187.78 4343.00 19666.00 4137.96 18465.00 4219.53 19187.78 4646.405621 70961.369658 2.032007e+07 7.721430e+08 4233.863791 19110.244062
2018-04-02 5845.20 9827.04 6165.49 9948.98 5774.72 9670.68 5848.33 9823.28 1098.628060 33035.904045 7.093171e+06 3.032200e+08 5936.398196 9827.536792
2018-10-01 3180.84 6604.76 3230.00 6756.00 3122.28 6553.13 3179.54 6604.75 839.297665 39775.389439 5.373482e+06 1.773528e+08 3171.722851 6593.879882
2019-04-01 4092.02 12927.44 4150.00 13880.00 4052.56 12030.43 4136.32 12920.54 1572.155427 37487.802426 1.506857e+07 4.769830e+08 4121.008519 12723.686028
2019-10-01 4841.67 10364.04 5353.04 10500.50 3850.00 10247.35 4841.67 10364.04 1119.109693 58513.389189 8.052271e+06 3.644311e+08 4876.575039 10345.096195
2020-04-01 6428.74 12301.27 6710.21 12473.00 6137.71 11823.00 6619.00 12298.38 1358.285320 34465.186572 1.244918e+07 2.992075e+08 6357.380528 12166.197291
2020-10-01 10543.17 61177.03 10605.79 61781.83 10380.00 58959.57 10553.66 61165.19 1.165644 40786.964254 3.152391e+04 1.369434e+09 10515.174869 60429.059780
2021-04-01 33413.19 63605.69 34085.61 64895.22 30066.00 61998.90 33411.69 63564.48 1440.631820 32468.863970 8.294065e+07 1.233364e+09 32661.425472 63314.685801

As with groupby and rolling, you can also provide custom functions to .resample(...).agg and .resample(...).apply

Exercise

See exercise 7 in the exercise list.

To make the optimal decision we need to, for each month, compute the maximum value of the close price on any day minus the open price on the first day of the month.

Exercise

See exercise 8 in the exercise list.

Optional: API keys#

Recall above that we had the line of code:

quandl.ApiConfig.api_key = "Dn6BtVoBhzuKTuyo6hbp"

This line told the quandl library that when obtaining making requests for data, it should use the API key Dn6BtVoBhzuKTuyo6hbp.

An API key is a sort of password that web services (like the Quandl API) require you to provide when you make requests.

Using this password, we were able to make a request to Quandl to obtain data directly from them.

The API key used here is one that we requested on behalf of this course.

If you plan to use Quandl more extensively, you should obtain your own personal API key from their website and re-run the quandl.ApiConfig.api_key... line of code with your new API key on the right-hand side.

Exercises#

Exercise 1#

By referring to table found at the link above, figure out the correct argument to pass as format in order to parse the dates in the next three cells below.

Test your work by passing your format string to pd.to_datetime.

christmas_str2 = "2017:12:25"
dbacks_win = "M:11 D:4 Y:2001 9:15 PM"
america_bday = "America was born on July 4, 1776"

(back to text)

Exercise 2#

Use pd.to_datetime to express the birthday of one of your friends or family members as a datetime object.

Then use the strftime method to write a message of the format:

NAME's birthday is June 10, 1989 (a Saturday)

(where the name and date are replaced by the appropriate values)

(back to text)

Exercise 3#

For each item in the list, extract the specified data from btc_usd:

  • July 2017 through August 2017 (inclusive)

  • April 25, 2015 to June 10, 2016

  • October 31, 2017

(back to text)

Exercise 4#

Using the shift function, determine the week with the largest percent change in the volume of trades (the "Volume (BTC)" column).

Repeat the analysis at the bi-weekly and monthly frequencies.

Hint

We have data at a daily frequency and one week is 7 days.

Hint

Approximate a month by 30 days.

# your code here

(back to text)

Exercise 5#

Imagine that you have access to the DeLorean time machine from “Back to the Future”.

You are allowed to use the DeLorean only once, subject to the following conditions:

  • You may travel back to any day in the past.

  • On that day, you may purchase one bitcoin at market open.

  • You can then take the time machine 30 days into the future and sell your bitcoin at market close.

  • Then you return to the present, pocketing the profits.

How would you pick the day?

Think carefully about what you would need to compute to make the optimal choice. Try writing it out in the markdown cell below so you have a clear description of the want operator that we will apply after the exercise.

(Note: Don’t look too far below, because in the next non-empty cell we have written out our answer.)

To make this decision, we want to know …

Your answer here

(back to text)

Exercise 6#

Do the following:

  1. Write a pandas function that implements your strategy.

  2. Pass it to the agg method of rolling_btc.

  3. Extract the "Open" column from the result.

  4. Find the date associated with the maximum value in that column.

How much money did you make? Compare with your neighbor.

def daily_value(df):
    # DELETE `pass` below and replace it with your code
    pass

rolling_btc = btc_usd.rolling("30d")

# do steps 2-4 here

(back to text)

Exercise 7#

Now suppose you still have access to the DeLorean, but the conditions are slightly different.

You may now:

  • Travel back to the first day of any month in the past.

  • On that day, you may purchase one bitcoin at market open.

  • You can then travel to any day in that month and sell the bitcoin at market close.

  • Then return to the present, pocketing the profits.

To which month would you travel? On which day of that month would you return to sell the bitcoin?

Discuss with your neighbor what you would need to compute to make the optimal choice. Try writing it out in the markdown cell below so you have a clear description of the want operator that we will apply after the exercise.

(Note: Don’t look too many cells below, because we have written out our answer.)

To make the optimal decision we need …

Your answer here

(back to text)

Exercise 8#

Do the following:

  1. Write a pandas function that implements your strategy.

  2. Pass it to the agg method of resampled_btc.

  3. Extract the "Open" column from the result.

  4. Find the date associated with the maximum value in that column.

How much money did you make? Compare with your neighbor.

Was this strategy more profitable than the previous one? By how much?

def monthly_value(df):
    # DELETE `pass` below and replace it with your code
    pass

resampled_btc = btc_usd.resample("MS")

# Do steps 2-4 here

(back to text)