QuantEcon DataScience

Introduction to Economic Modeling and Data Science

Working with Text

Outcomes

  • Use text as features for classification
  • Understand latent topic analysis
  • Use folium to create an interactive map
  • Request and combine JSON data from a web server
In [1]:
# Uncomment following line to install on colab
#! pip install qeds fiona geopandas xgboost gensim folium pyLDAvis descartes

Introduction

Many data sources contain both numerical data and text.

We can use text to create features for any of the prediction methods that we have discussed.

Doing so requires encoding text into some numerical representation.

A good encoding preserves the meaning of the original text, while keeping dimensionality manageable.
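The simplest such encoding is a bag of words: each document becomes a vector of word counts over a shared vocabulary. The minimal sketch below (the toy sentences are invented for illustration) shows the idea; real applications use library implementations that also handle tokenization and weighting.

```python
from collections import Counter

def bag_of_words(docs):
    # Build a shared vocabulary, then count how often each word appears in each document
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(word for doc in tokenized for word in doc))
    counts = [Counter(doc) for doc in tokenized]
    return vocab, [[c[word] for word in vocab] for c in counts]

docs = ["new snow and strong wind", "strong wind and warm temperatures"]
vocab, X = bag_of_words(docs)
print(vocab)  # ['and', 'new', 'snow', 'strong', 'temperatures', 'warm', 'wind']
print(X)      # [[1, 1, 1, 1, 0, 0, 1], [1, 0, 0, 1, 1, 1, 1]]
```

Even this tiny example needs a 7-dimensional vector; with real forecast text, the vocabulary (and hence the dimensionality) grows quickly, which is why dimension-reduction techniques such as latent topic analysis are useful.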

In this lecture, we will learn how to work with text through an application — predicting fatalities from avalanche forecasts.

Avalanches

Snow avalanches are a hazard in the mountains.

Avalanches can be partially predicted based on snow conditions, weather, and terrain.

Avalanche Canada produces daily avalanche forecasts for various Canadian mountainous regions.

These forecasts consist of 1-5 ratings for each of three elevation bands, as well as textual descriptions of recent avalanche observations, snowpack, and weather.

Avalanche Canada also maintains a list of fatal avalanche incidents.

In this lecture, we will attempt to predict fatal incidents from the text of avalanche forecasts.

Since fatal incidents are rare, this prediction task will be quite difficult.

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


%matplotlib inline
# activate plot theme
import qeds
qeds.themes.mpl_style();

Data

Avalanche Canada has an unstable JSON API.

The API seems largely tailored to displaying information on the various Avalanche Canada websites, which does not make it easy to obtain large amounts of data.

Nonetheless, getting information from the API is easier than scraping the website.

Generally, whenever you're considering scraping a website, you should first check whether the site has an API available.
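The incident API used below returns results a page at a time: each JSON response contains a "results" list and a "next" URL (or null on the last page). A helper like this sketch collects all pages; the optional get_json argument is our addition so the pagination logic can be exercised without network access.

```python
import json
import urllib.request

def fetch_all_pages(url, get_json=None):
    # Follow a paginated JSON API where each page has "results" and a "next" URL
    if get_json is None:
        def get_json(u):
            with urllib.request.urlopen(u) as response:
                return json.loads(response.read().decode("utf-8"))
    results = []
    while url is not None:
        page = get_json(url)
        results.extend(page["results"])
        url = page["next"]
    return results

# e.g. fetch_all_pages("http://incidents.avalanche.ca/public/incidents/?format=json")
```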

Incident Data

In [3]:
# Get data on avalanche forecasts and incidents from Avalanche Canada
# Avalanche Canada has an unstable public api
# https://github.com/avalanche-canada/ac-web
# Since API might change, this code might break
import json
import os
import urllib.request
import pandas as pd
import time
import requests
import io
import zipfile
import warnings


# Incidents
url = "http://incidents.avalanche.ca/public/incidents/?format=json"
req = urllib.request.Request(url)
with urllib.request.urlopen(req) as response:
    result = json.loads(response.read().decode('utf-8'))
incident_list = result["results"]
while result["next"] is not None:
    req = urllib.request.Request(result["next"])
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    incident_list = incident_list + result["results"]
incidents_brief = pd.DataFrame.from_dict(incident_list,orient="columns")
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
incidents_brief
Out[3]:
date group_activity id location location_province num_fatal num_injured num_involved
0 2020-03-14 Snowshoeing dd556e1a-90f6-46a9-a31b-02508504f2d7 Brunswick Mountain BC 1 0.0 1.0
1 2020-02-19 Skiing 9fcf2d9f-964b-4329-b3d2-54a7f1c6c9ee Joe Louis aux Mines Madeleine QC 1 1.0 2.0
2 2020-02-13 Snowmobiling 3c2ecc57-f40c-4ae5-afeb-a2abd624d8a8 Willmore Wilderness Park AB 1 0.0 2.0
3 2020-02-02 Snowmobiling c5b412c4-9857-44df-8d90-8385fac3948c Upper Burnt BC 1 0.0 1.0
4 2020-01-10 Skiing 223dceb9-1d88-43ab-a893-f5728024802d Mount Hector AB 1 0.0 1.0
5 2020-01-04 Snowmobiling 1953493c-cf2b-4d67-b934-72c69f987884 Cabin Lake BC 1 0.0 1.0
6 2019-12-30 Snowboarding 5b97e577-add7-41a8-9fdd-6b1ff9a5d3a2 Chuck Creek Parking Area BC 2 1.0 3.0
7 2019-04-20 Backcountry Skiing 49c180c3-aa8e-4f9e-805d-603e510a4485 Yoho National Park BC 1 0.0 3.0
8 2019-04-16 Ice Climbing bc4a88cc-7b79-4382-912d-f2ba5cd1f7ec Howse Peak AB 3 0.0 3.0
9 2019-03-16 Ski touring 37d909e4-c6de-43f1-8416-57a34cd48255 Pharaoh Lake AB 1 0.0 2.0
... ... ... ... ... ... ... ... ...
468 1863-01-16 Unknown a3405ee9-7384-4064-9df5-58b789718064 Lévis QC 2 NaN NaN
469 1856-02-05 Inside Building bc4899c4-70ce-47df-b586-445641e6068d Cape Breton - Big Pond NS 5 NaN NaN
470 1852-02-01 Inside Building 1888475d-8f26-4b0d-bd23-b9fb82e165a7 Promontoir de Québec QC 7 NaN NaN
471 1850-01-01 Hunting/Fishing 6418e7b2-39d6-4ac4-bc2e-7f5518ecf798 Northumberland Inlet - Kingaite 1 NaN NaN
472 1843-12-18 Unknown 56ef01bd-6e1d-450c-90bb-fc53fe2922e9 Cap Blanc (Ville de Québec) 18 QC 1 NaN NaN
473 1840-02-01 Unknown 101c517b-29a4-4c49-8934-f6c56ddd882d Château-Richer QC 1 NaN NaN
474 1836-02-09 Unknown b2e1c50a-1533-4145-a1a2-0befca0154d5 Quebec QC 1 NaN NaN
475 1833-05-24 Unknown 18e8f963-da33-4682-9312-57ca2cc9ad8d Carbonear NL 1 0.0 NaN
476 1825-02-04 Inside Building 083d22df-ed50-4687-b9ab-1649960a0fbe Saint-Joseph de Lévis QC 5 NaN NaN
477 1782-01-01 Unknown f498c48a-981d-43cf-ac16-151b8794435c Nain NL 22 NaN NaN

478 rows × 8 columns

In [4]:
# We can get more information about these incidents e.g. "https://www.avalanche.ca/incidents/37d909e4-c6de-43f1-8416-57a34cd48255"
# this information is also available through the API
def get_incident_details(incident_id):
    url = "http://incidents.avalanche.ca/public/incidents/{}?format=json".format(incident_id)
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    return(result)


incidentsfile = "avalanche_incidents.csv"

# To avoid overloading the Avalanche Canada servers, we save the incident details locally.
if (not os.path.isfile(incidentsfile)):
    incident_detail_list = incidents_brief.id.apply(get_incident_details).to_list()
    incidents = pd.DataFrame.from_dict(incident_detail_list, orient="columns")
    incidents.to_csv(incidentsfile)
else:
    incidents = pd.read_csv(incidentsfile)

incidents
Out[4]:
Unnamed: 0 avalanche_obs comment documents ... snowpack_obs weather_comment weather_obs alt_coord
0 0 [{'size': '2.0', 'type': 'S', 'trigger': 'Sa',... A snowshoer left on a solo hike March 14 and w... [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [-123.201389, 49.487777]
1 1 [{'size': '2.0', 'type': 'S', 'trigger': 'Sa',... 2 personnes sur 4 ont été complètement ensevel... [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [None, None]
2 2 [{'size': '2.5', 'type': 'S', 'trigger': 'Ma',... A snowmobiler was caught in an avalanche in st... [{'date': '2020-02-20', 'title': 'Annotated ov... ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [-119.055839, 53.7013]
3 3 [{'size': '2.5', 'type': 'S', 'trigger': 'Ma',... Late in the afternoon on February 2, 2020 auth... [{'date': '2020-02-12', 'title': 'RCMP Info Fe... ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [None, None]
4 4 [{'size': '2.5', 'type': 'S', 'trigger': 'Sa',... A party of three were skiing on a south-facing... [{'date': '2020-01-23', 'title': 'Upper slide ... ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [-116.2594444, 51.575]
5 5 [{'size': '2.5', 'type': 'S', 'trigger': 'Ma',... Two people were snowmobiling on a small but st... [] ... {'hs': 175, 'hn24': None, 'hst': None, 'hst_re... Weather stations west and north of the scene r... {'temp_present': None, 'temp_max': None, 'temp... [-121.221667, 49.973889]
6 6 [{'size': '2.0', 'type': 'S', 'trigger': 'Sa',... A party of three snowboarders were climbing on... [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... A series of storms with significant precipitat... {'temp_present': None, 'temp_max': None, 'temp... [None, None]
7 7 [{'size': '3.0', 'type': 'S', 'trigger': 'Sa',... A party of three backcountry skiers were invol... [{'date': '2019-04-21', 'title': 'Parks Canada... ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [-116.5, 51.5]
8 8 [{'size': None, 'type': None, 'trigger': None,... A party of three ice climbers were attempting ... [{'date': '2019-04-21', 'title': 'Howse Peak 2... ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [-116.6808333, 51.8133333]
9 9 [{'size': '2.0', 'type': 'S', 'trigger': 'Sa',... A ski-tourer and split boarder triggered a per... [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... Nearest weather station temperature was about ... {'temp_present': None, 'temp_max': None, 'temp... [-115.9105556, 51.1144444]
... ... ... ... ... ... ... ... ... ...
468 468 [{'observation_date': '1863-01-16', 'size': ''... tué par une aval. [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [-71.1833333, 46.8]
469 469 [{'observation_date': '1856-02-05', 'size': ''... In house, estimated that the avalanche struck ... [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... From Cape Breton Newspaper article - unsure of... {'temp_present': None, 'temp_max': None, 'temp... [None, None]
470 470 [{'observation_date': '1852-02-01', 'size': ''... NaN [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [None, None]
471 471 [{'observation_date': '1800-01-01', 'size': ''... DATE ESTIMATED, NOT ACCURATE!!! Based on arti... [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [None, None]
472 472 [{'observation_date': '1843-12-18', 'size': ''... No information [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [None, None]
473 473 [{'observation_date': '1840-02-01', 'size': ''... mort étouffé lors d'une aval. [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [-71.0166667, 46.9666667]
474 474 [{'observation_date': '1836-02-09', 'size': ''... No information is provided. No location. [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [-71.216667, 46.816667]
475 475 [] Carbonear Star, May 29, 1833 \nThe definition ... [{'title': 'Carbonear, May 24, 1833', 'source'... ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [-53.2326611, 47.7315028]
476 476 [{'observation_date': '1825-02-04', 'size': ''... autre aval. Cap Diamant [] ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [-71.1819444, 46.8105556]
477 477 [] NAC MG17 D1 B-X-1 pp.38,750 et ff., microfilm ... [{'title': 'Nain, 1781-2', 'source': 'NFLD Geo... ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... NaN {'temp_present': None, 'temp_max': None, 'temp... [-61.718942, 56.536319]

478 rows × 21 columns

Many incidents include coordinates, but others do not. Most, however, do include a place name. We can use Natural Resources Canada’s Geolocation Service to retrieve coordinates from place names.

In [5]:
# geocode locations without coordinates
def geolocate(location, province):
    url = "http://geogratis.gc.ca/services/geolocation/en/locate?q={},%20{}"
    req = urllib.request.Request(url.format(urllib.parse.quote(location),province))
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    if (len(result)==0):
        return([None,None])
    else:
        return(result[0]['geometry']['coordinates'])
if "alt_coord" not in incidents.columns:
    incidents["alt_coord"] = [
        geolocate(incidents.location[i], incidents.location_province[i])
        for i in incidents.index
    ]
    incidents.to_csv(incidentsfile)

Now that we have incident data, let’s create some figures.

In [6]:
# clean up activity names
incidents.group_activity.unique()
Out[6]:
array(['Snowshoeing', 'Skiing', 'Snowmobiling', 'Snowboarding',
       'Backcountry Skiing', 'Ice Climbing', 'Ski touring', 'Heliskiing',
       'Snowshoeing & Hiking', 'Mechanized Skiing', 'Work',
       'Mountaineering', 'Other Recreational', 'Out-of-bounds Skiing',
       'At Outdoor Worksite', 'Lift Skiing Closed', 'Lift Skiing Open',
       'Hunting/Fishing', 'Out-of-Bounds Skiing', 'Control Work',
       'Inside Building', 'Car/Truck on Road', 'Inside Car/Truck on Road',
       'Unknown', 'Outside Building'], dtype=object)
In [7]:
# Consolidate similar activities into a single label
# (both capitalizations of "Out-of-bounds Skiing" appear in the data)
activity_map = {
    "Ski touring": "Backcountry Skiing",
    "Out-of-Bounds Skiing": "Backcountry Skiing",
    "Out-of-bounds Skiing": "Backcountry Skiing",
    "Lift Skiing Closed": "Backcountry Skiing",
    "Skiing": "Backcountry Skiing",
    "Snowshoeing": "Snowshoeing & Hiking",
    "Snowshoeing and Hiking": "Snowshoeing & Hiking",
    "Mechanized Skiing": "Heli or Cat Skiing",
    "Heliskiing": "Heli or Cat Skiing",
    "At Outdoor Worksite": "Work",
    "Control Work": "Work",
    "Hunting/Fishing": "Other Recreational",
    "Inside Car/Truck on Road": "Car/Truck/Building",
    "Car/Truck on Road": "Car/Truck/Building",
    "Inside Building": "Car/Truck/Building",
    "Outside Building": "Car/Truck/Building",
}
incidents.group_activity = incidents.group_activity.replace(activity_map)


incidents.group_activity.unique()

fig, ax = plt.subplots(1,2, sharey=True, figsize=(12,4))
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
incidents.groupby(['group_activity']).id.count().plot(kind='bar', title="Incidents by Activity", ax=ax[0])
incidents.groupby(['group_activity']).num_fatal.sum().plot(kind='bar', title="Deaths by Activity", ax=ax[1], color=colors[1])
ax[0].set_xlabel(None)
ax[1].set_xlabel(None);
In [8]:
incidents["date"] = pd.to_datetime(incidents.ob_date)
incidents["year"] = incidents.date.apply(lambda x: x.year)
incidents.date = incidents.date.dt.date
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
f = incidents.groupby(["year"]).num_fatal.sum()
n = incidents.groupby(["year"]).id.count()
yearstart = 1950
f = f[f.index > yearstart]
n = n[n.index > yearstart]
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
n.plot(ax=ax)
f.plot(ax=ax)
ax.set_ylabel("Count")
ax.annotate("Incidents", (2010, 4), color=colors[0])
ax.annotate("Deaths", (2011, 15), color=colors[1]);

Mapping Incidents

Since the incident data includes coordinates, we might as well make a map too. Unfortunately, some latitudes and longitudes contain obvious errors. Here, we try to fix them.

In [9]:
import re

# fix errors in latitude, longitude
latlon = incidents.location_coords
def makenumeric(cstr):
    if cstr is None:
        return [None, None]
    elif isinstance(cstr, str):
        return [float(s) for s in re.findall(r'-?\d+\.?\d*', cstr)]
    else:
        return cstr

latlon = latlon.apply(makenumeric)

def good_lat(lat):
    return 41.6 <= lat <= 83.12  # min & max latitude for Canada

def good_lon(lon):
    return -141 <= lon <= -52.6  # min & max longitude for Canada

def fixlatlon(c):
    if (len(c)<2 or type(c[0])!=float or type(c[1])!=float):
        c = [None, None]
        return(c)
    lat = c[0]
    lon = c[1]
    if not good_lat(lat) and good_lat(lon):
        tmp = lat
        lat = lon
        lon = tmp
    if not good_lon(lon) and good_lon(-lon):
        lon = -lon
    if not good_lon(lon) and good_lon(lat):
        tmp = lat
        lat = lon
        lon = tmp
    if not good_lon(lon) and good_lon(-lat):
        tmp = -lat
        lat = lon
        lon = tmp
    if not good_lat(lat) or not good_lon(lon):
        c[0] = None
        c[1] = None
    else:
        c[0] = lat
        c[1] = lon
    return(c)

incidents["latlon"] = latlon.apply(fixlatlon)
In [10]:
def fill_missing_coords(c, a):
    # Fall back to the geocoded alternative coordinates when the originals are missing.
    # Note: alt_coord is stored as [longitude, latitude], so we swap the order.
    if isinstance(a, str):
        a = [float(s) for s in re.findall(r'-?\d+\.?\d*', a)]
    if len(a) < 2:
        a = [None, None]
    return [a[1], a[0]] if type(c[0]) != float else c

incidents["latlon_filled"] = [fill_missing_coords(c, a)
                              for c, a in zip(incidents["latlon"], incidents["alt_coord"])]
nmiss = sum(a[0] is None for a in incidents.latlon_filled)
n = len(incidents.latlon_filled)
print("{} of {} incidents have latitude & longitude".format(n - nmiss, n))
317 of 478 incidents have latitude & longitude
In [11]:
# download forecast region definitions
# req = urllib.request.Request("https://www.avalanche.ca/api/forecasts")
# The above link doesn't work since COVID-19 lockdown. Currently we use an old cached version instead
#req = ("https://web.archive.org/web/20150319031605if_/http://www.avalanche.ca/api/forecasts")
#with urllib.request.urlopen(req) as response:
#    forecastregions = json.loads(response.read().decode('utf-8'))
req = "https://faculty.arts.ubc.ca/pschrimpf/forecast-regions2015.json"
with urllib.request.urlopen(req) as response:
    regions2015 = json.loads(response.read().decode('utf-8'))

req = "https://faculty.arts.ubc.ca/pschrimpf/forecast-regions2019.json"
with urllib.request.urlopen(req) as response:
    regions2019 = json.loads(response.read().decode('utf-8'))

forecastregions = regions2019
ids = [r['id'] for r in forecastregions['features']]
for r in regions2015['features'] :
     if not r['id'] in ids :
            forecastregions['features'].append(r)

You may need to uncomment the pip install line in the cell below if folium is not installed.

In [12]:
# Map forecast regions and incidents
#!pip install --user folium
import folium
import matplotlib

cmap = matplotlib.cm.get_cmap('Set1')
fmap = folium.Map(location=[60, -98],
                  zoom_start=3,
                  tiles='Stamen Terrain')
with urllib.request.urlopen(req) as response:
    regions_tmp = json.loads(response.read().decode('utf-8'))
folium.GeoJson(regions_tmp,
               tooltip=folium.GeoJsonTooltip(fields=["name"], aliases=[""]),
               highlight_function=lambda x: {'weight': 10},
               style_function=lambda x: {'weight': 1}).add_to(fmap)
activities = incidents.group_activity.unique()
for i in incidents.index:
    if incidents.latlon_filled[i][0] is not None and incidents.latlon_filled[i][1] is not None:
        cindex = [j for j, x in enumerate(activities) if x == incidents.group_activity[i]][0]
        txt = "{}, {}<br>{} deaths".format(incidents.group_activity[i],
                                           incidents.ob_date[i],
                                           incidents.num_fatal[i])
        pop = folium.Popup(incidents.comment[i], parse_html=True, max_width=400)
        folium.CircleMarker(incidents.latlon_filled[i],
                            tooltip=txt,
                            popup=pop,
                            color=matplotlib.colors.to_hex(cmap(cindex)),
                            fill=True, radius=5).add_to(fmap)
fmap
Out[12]: