Working with Text#
Author
Prerequisites
Outcomes
Use text as features for classification
Understand latent topic analysis
Use folium to create an interactive map
Request and combine json data from a web server
# Uncomment following line to install on colab
#! pip install fiona geopandas xgboost gensim folium pyLDAvis descartes
Introduction#
Many data sources contain both numerical data and text.
We can use text to create features for any of the prediction methods that we have discussed.
Doing so requires encoding text into some numerical representation.
A good encoding preserves the meaning of the original text, while keeping dimensionality manageable.
In this lecture, we will learn how to work with text through an application — predicting fatalities from avalanche forecasts.
Avalanches#
Snow avalanches are a hazard in the mountains. Avalanches can be partially predicted based on snow conditions, weather, and terrain. Avalanche Canada produces daily avalanche forecasts for various Canadian mountainous regions. These forecasts consist of 1-5 ratings for each of three elevation bands, as well as textual descriptions of recent avalanche observations, snowpack, and weather. Avalanche Canada also maintains a list of fatal avalanche incidents . In this lecture, we will attempt to predict fatal incidents from the text of avalanche forecasts. Since fatal incidents are rare, this prediction task will be quite difficult.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
Data#
Avalanche Canada has an unstable json api. The api seems to be largely tailored to displaying the information on various Avalanche Canada websites, which does not make it easy to obtain large amounts of data. Nonetheless, getting information from the API is easier than scraping the website. Generally, whenever you’re considering scraping a website, you should first check whether the site has an API available.
Incident Data#
# Get data on avalanche forecasts and incidents from Avalanche Canada
# Avalanche Canada has an unstable public api
# https://github.com/avalanche-canada/ac-web
# Since API might change, this code might break
import json
import os
import urllib.request
import pandas as pd
import time
import requests
import io
import zipfile
import warnings
# Incidents
url = "http://incidents.avalanche.ca/public/incidents/?format=json"
req = urllib.request.Request(url)
with urllib.request.urlopen(req) as response:
result = json.loads(response.read().decode('utf-8'))
incident_list = result["results"]
while (result["next"] != None):
req = urllib.request.Request(result["next"])
with urllib.request.urlopen(req) as response:
result = json.loads(response.read().decode('utf-8'))
incident_list = incident_list + result["results"]
incidents_brief = pd.DataFrame.from_dict(incident_list,orient="columns")
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
incidents_brief
id | date | location | location_province | group_activity | num_involved | num_injured | num_fatal | |
---|---|---|---|---|---|---|---|---|
0 | 6dad9a7f-e732-43cf-8058-5fcac24c7041 | 2023-04-22 | Lake Louise West Bowl | AB | Lift Skiing Closed | 2.0 | 1.0 | 1 |
1 | a3968149-3ee5-4360-8efb-ed8bc3f331c4 | 2023-04-15 | Thunderwater Lake | BC | Snowmobiling | 2.0 | 0.0 | 1 |
2 | 913e84e6-4785-409e-859e-f6456b2fb27c | 2023-04-11 | Treaty Creek | BC | Mechanized Skiing | 5.0 | 1.0 | 1 |
3 | 645d13aa-4a96-4f11-a787-ded7a0051d63 | 2023-03-01 | Coppercrown Mountain | BC | Mechanized Skiing | 10.0 | 4.0 | 3 |
4 | a764c492-bb8d-401d-b7b4-ec5d7915b98e | 2023-02-16 | Terminator 2.5 | BC | Out-of-Bounds Skiing/Snowboarding | 6.0 | 1.0 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
498 | 101c517b-29a4-4c49-8934-f6c56ddd882d | 1840-02-01 | Château-Richer | QC | Unknown | NaN | NaN | 1 |
499 | b2e1c50a-1533-4145-a1a2-0befca0154d5 | 1836-02-09 | Quebec | QC | Unknown | NaN | NaN | 1 |
500 | 18e8f963-da33-4682-9312-57ca2cc9ad8d | 1833-05-24 | Carbonear | NL | Unknown | NaN | 0.0 | 1 |
501 | 083d22df-ed50-4687-b9ab-1649960a0fbe | 1825-02-04 | Saint-Joseph de Lévis | QC | Inside Building | NaN | NaN | 5 |
502 | f498c48a-981d-43cf-ac16-151b8794435c | 1782-01-01 | Nain | NL | Unknown | NaN | NaN | 22 |
503 rows × 8 columns
# We can get more information about these incidents e.g. "https://www.avalanche.ca/incidents/37d909e4-c6de-43f1-8416-57a34cd48255"
# this information is also available through the API
def get_incident_details(id):
url = "http://incidents.avalanche.ca/public/incidents/{}?format=json".format(id)
req = urllib.request.Request(url)
with urllib.request.urlopen(req) as response:
result = json.loads(response.read().decode('utf-8'))
return(result)
incidentsfile = "https://datascience.quantecon.org/assets/data/avalanche_incidents.csv"
# To avoid loading the avalanche Canada servers, we save the incident details locally.
if (not os.path.isfile(incidentsfile)):
incident_detail_list = incidents_brief.id.apply(get_incident_details).to_list()
incidents = pd.DataFrame.from_dict(incident_detail_list, orient="columns")
incidents.to_csv(incidentsfile)
else:
incidents = pd.read_csv(incidentsfile)
incidents
id | ob_date | location | location_desc | ... | weather_comment | snowpack_obs | snowpack_comment | documents | |
---|---|---|---|---|---|---|---|---|---|
0 | 6dad9a7f-e732-43cf-8058-5fcac24c7041 | 2023-04-22 | Lake Louise West Bowl | Lake Louise Ski Resort | ... | {'hs': None, 'hn24': None, 'hst': None, 'hst_r... | [{'date': '2023-04-22', 'title': 'Aerial Photo... | ||
1 | a3968149-3ee5-4360-8efb-ed8bc3f331c4 | 2023-04-15 | Thunderwater Lake | Approximately 35 km west of Radium Hot Springs | ... | {'hs': None, 'hn24': None, 'hst': None, 'hst_r... | [{'date': '2023-04-15', 'title': 'Aerial photo... | ||
2 | 913e84e6-4785-409e-859e-f6456b2fb27c | 2023-04-11 | Treaty Creek | Approximately 70 km north of Stewart | ... | {'hs': None, 'hn24': None, 'hst': None, 'hst_r... | [] | ||
3 | 645d13aa-4a96-4f11-a787-ded7a0051d63 | 2023-03-01 | Coppercrown Mountain | Approximately 30 km southwest of Invermere, BC | ... | {'hs': None, 'hn24': None, 'hst': None, 'hst_r... | [] | ||
4 | a764c492-bb8d-401d-b7b4-ec5d7915b98e | 2023-02-16 | Terminator 2.5 | Near Golden BC | ... | {'hs': None, 'hn24': None, 'hst': None, 'hst_r... | [{'date': '2023-02-16', 'title': 'Photo of sta... | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
498 | 101c517b-29a4-4c49-8934-f6c56ddd882d | 1840-02-01 | Château-Richer | ... | {'hs': None, 'hn24': None, 'hst': None, 'hst_r... | [] | |||
499 | b2e1c50a-1533-4145-a1a2-0befca0154d5 | 1836-02-09 | Quebec | more details unknown | ... | {'hs': None, 'hn24': None, 'hst': None, 'hst_r... | [] | ||
500 | 18e8f963-da33-4682-9312-57ca2cc9ad8d | 1833-05-24 | Carbonear | ... | {'hs': None, 'hn24': None, 'hst': None, 'hst_r... | [{'title': 'Carbonear, May 24, 1833', 'source'... | |||
501 | 083d22df-ed50-4687-b9ab-1649960a0fbe | 1825-02-04 | Saint-Joseph de Lévis | Pointe Lévis | ... | {'hs': None, 'hn24': None, 'hst': None, 'hst_r... | [] | ||
502 | f498c48a-981d-43cf-ac16-151b8794435c | 1782-01-01 | Nain | ... | {'hs': None, 'hn24': None, 'hst': None, 'hst_r... | [{'title': 'Nain, 1781-2', 'source': 'NFLD Geo... |
503 rows × 19 columns
Many incidents include coordinates, but others do not. Most however, do include a place name. We can use Natural Resources Canada’s Geolocation Service to retrieve coordinates from place names.
# geocode locations without coordinates
def geolocate(location, province):
url = "http://geogratis.gc.ca/services/geolocation/en/locate?q={},%20{}"
req = urllib.request.Request(url.format(urllib.parse.quote(location),province))
with urllib.request.urlopen(req) as response:
result = json.loads(response.read().decode('utf-8'))
if (len(result)==0):
return([None,None])
else:
return(result[0]['geometry']['coordinates'])
if not "alt_coord" in incidents.columns:
incidents["alt_coord"] = [
geolocate(incidents.location[i], incidents.location_province[i])
for i in incidents.index
]
incidents.to_csv(incidentsfile)
Now that we have incident data, let’s create some figures.
# clean up activity names
incidents.group_activity.unique()
array(['Lift Skiing Closed', 'Snowmobiling', 'Mechanized Skiing',
'Out-of-Bounds Skiing/Snowboarding', 'Backcountry Skiing',
'Mountaineering', 'Lift Skiing Open',
'Backcountry Skiing/Snowboarding', 'Skiing/Snowboarding',
'Snow Biking', 'Snowshoeing', 'Snowboarding', 'Ice Climbing',
'Ski touring', 'Skiing', 'Heliskiing', 'Snowshoeing & Hiking',
'Work', 'Other Recreational', 'Out-of-bounds Skiing',
'At Outdoor Worksite', 'Hunting/Fishing', 'Out-of-Bounds Skiing',
'Control Work', 'Inside Building', 'Car/Truck on Road',
'Inside Car/Truck on Road', 'Unknown', 'Outside Building'],
dtype=object)
incidents.group_activity=incidents.group_activity.replace("Ski touring","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Out-of-Bounds Skiing","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Lift Skiing Closed","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Skiing","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Snowshoeing","Snowshoeing & Hiking")
incidents.group_activity=incidents.group_activity.replace("Snowshoeing and Hiking","Snowshoeing & Hiking")
incidents.group_activity=incidents.group_activity.replace("Mechanized Skiing","Heli or Cat Skiing")
incidents.group_activity=incidents.group_activity.replace("Heliskiing","Heli or Cat Skiing")
incidents.group_activity=incidents.group_activity.replace("At Outdoor Worksite","Work")
incidents.group_activity=incidents.group_activity.replace("Control Work","Work")
incidents.group_activity=incidents.group_activity.replace("Hunting/Fishing","Other Recreational")
incidents.group_activity=incidents.group_activity.replace("Inside Car/Truck on Road","Car/Truck/Building")
incidents.group_activity=incidents.group_activity.replace("Car/Truck on Road","Car/Truck/Building")
incidents.group_activity=incidents.group_activity.replace("Inside Building","Car/Truck/Building")
incidents.group_activity=incidents.group_activity.replace("Outside Building","Car/Truck/Building")
incidents.group_activity.unique()
fig, ax = plt.subplots(1,2, sharey=True, figsize=(12,4))
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
incidents.groupby(['group_activity']).id.count().plot(kind='bar', title="Incidents by Activity", ax=ax[0])
incidents.groupby(['group_activity']).num_fatal.sum().plot(kind='bar', title="Deaths by Activity", ax=ax[1], color=colors[1])
ax[0].set_xlabel(None)
ax[1].set_xlabel(None);

incidents["date"] = pd.to_datetime(incidents.ob_date)
incidents["year"] = incidents.date.apply(lambda x: x.year)
incidents.date = incidents.date.dt.date
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
f = incidents.groupby(["year"]).num_fatal.sum()
n = incidents.groupby(["year"]).id.count()
yearstart=1950
f=f[f.index>yearstart]
n=n[n.index>yearstart]
fig,ax = plt.subplots(1,1,figsize=(12,4))
n.plot(ax=ax)
f.plot(ax=ax)
ax.set_ylabel("Count")
ax.annotate("Incidents", (2010, 4), color=colors[0])
ax.annotate("Deaths", (2011, 15), color=colors[1]);

Mapping Incidents#
Since the incident data includes coordinates, we might as well make a map too. Unfortunately, some latitude and longitudes contain obvious errors. Here, we try to fix them.
import re
# fix errors in latitude, longitude
latlon = incidents.location_coords
def makenumeric(cstr):
if cstr is None:
return([None,None])
elif (type(cstr)==str):
return([float(s) for s in re.findall(r'-?\d+\.?\d*',cstr)])
else:
return(cstr)
latlon = latlon.apply(makenumeric)
def good_lat(lat):
return(lat >= 41.6 and lat <= 83.12) # min & max for Canada
def good_lon(lon):
return(lon >= -141 and lon<= -52.6)
def fixlatlon(c):
if (len(c)<2 or type(c[0])!=float or type(c[1])!=float):
c = [None, None]
return(c)
lat = c[0]
lon = c[1]
if not good_lat(lat) and good_lat(lon):
tmp = lat
lat = lon
lon = tmp
if not good_lon(lon) and good_lon(-lon):
lon = -lon
if not good_lon(lon) and good_lon(lat):
tmp = lat
lat = lon
lon = tmp
if not good_lon(lon) and good_lon(-lat):
tmp = -lat
lat = lon
lon = tmp
if not good_lat(lat) or not good_lon(lon):
c[0] = None
c[1] = None
else:
c[0] = lat
c[1] = lon
return(c)
incidents["latlon"] = latlon.apply(fixlatlon)
def foo(c, a):
if (type(a)==str):
a = [float(s) for s in re.findall(r'-?\d+\.?\d*',a)]
if len(a) <2:
a = [None,None]
return([a[1],a[0]] if type(c[0])!=float else c)
incidents["latlon_filled"]=[foo(c,a) for c,a in zip(incidents["latlon"],incidents["alt_coord"])]
nmiss = sum([a[0]==None for a in incidents.latlon_filled])
n = len(incidents.latlon_filled)
print("{} of {} incidents have latitude & longitude".format(n-nmiss, n))
342 of 503 incidents have latitude & longitude
# download forecast region definitions
# req = urllib.request.Request("https://www.avalanche.ca/api/forecasts")
# The above link doesn't work since COVID-19 lockdown. Currently we use an old cached version instead
#req = ("https://web.archive.org/web/20150319031605if_/http://www.avalanche.ca/api/forecasts")
#with urllib.request.urlopen(req) as response:
# forecastregions = json.loads(response.read().decode('utf-8'))
req = "https://faculty.arts.ubc.ca/pschrimpf/forecast-regions2015.json"
with urllib.request.urlopen(req) as response:
regions2015 = json.loads(response.read().decode('utf-8'))
req = "https://faculty.arts.ubc.ca/pschrimpf/forecast-regions2019.json"
with urllib.request.urlopen(req) as response:
regions2019 = json.loads(response.read().decode('utf-8'))
forecastregions = regions2019
ids = [r['id'] for r in forecastregions['features']]
for r in regions2015['features'] :
if not r['id'] in ids :
forecastregions['features'].append(r)
You may have to uncomment the second line below if folium is not installed.
# Map forecast regions and incidents
#!pip install --user folium
import folium
import matplotlib
cmap = matplotlib.cm.get_cmap('Set1')
fmap = folium.Map(location=[60, -98],
zoom_start=3,
tiles='Stamen Terrain')
with urllib.request.urlopen(req) as response:
regions_tmp = json.loads(response.read().decode('utf-8'))
folium.GeoJson(regions_tmp,
tooltip=folium.GeoJsonTooltip(fields=["name"], aliases=[""]),
highlight_function=lambda x: { 'weight': 10},
style_function=lambda x: {'weight':1}).add_to(fmap)
activities = incidents.group_activity.unique()
for i in incidents.index:
if incidents.latlon_filled[i][0] is not None and incidents.latlon_filled[i][1] is not None:
cindex=[j for j,x in enumerate(activities) if x==incidents.group_activity[i]][0]
txt = "{}, {}<br>{} deaths"
txt = txt.format(incidents.group_activity[i],
incidents.ob_date[i],
incidents.num_fatal[i]
)
pop = folium.Popup(incidents.comment[i], parse_html=True, max_width=400)
folium.CircleMarker(incidents.latlon_filled[i],
tooltip=txt,
popup=pop,
color=matplotlib.colors.to_hex(cmap(cindex)), fill=True, radius=5).add_to(fmap)
fmap
/tmp/ipykernel_3127/2943874502.py:6: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
cmap = matplotlib.cm.get_cmap('Set1')