Working with Text#

Author

Prerequisites

Outcomes

  • Use text as features for classification

  • Understand latent topic analysis

  • Use folium to create an interactive map

  • Request and combine json data from a web server

# Uncomment following line to install on colab
#! pip install fiona geopandas xgboost gensim folium pyLDAvis descartes

Introduction#

Many data sources contain both numerical data and text.

We can use text to create features for any of the prediction methods that we have discussed.

Doing so requires encoding text into some numerical representation.

A good encoding preserves the meaning of the original text, while keeping dimensionality manageable.

In this lecture, we will learn how to work with text through an application — predicting fatalities from avalanche forecasts.

Avalanches#

Snow avalanches are a hazard in the mountains. Avalanches can be partially predicted based on snow conditions, weather, and terrain. Avalanche Canada produces daily avalanche forecasts for various Canadian mountainous regions. These forecasts consist of 1-5 ratings for each of three elevation bands, as well as textual descriptions of recent avalanche observations, snowpack, and weather. Avalanche Canada also maintains a list of fatal avalanche incidents . In this lecture, we will attempt to predict fatal incidents from the text of avalanche forecasts. Since fatal incidents are rare, this prediction task will be quite difficult.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


%matplotlib inline

Data#

Avalanche Canada has an unstable json api. The api seems to be largely tailored to displaying the information on various Avalanche Canada websites, which does not make it easy to obtain large amounts of data. Nonetheless, getting information from the API is easier than scraping the website. Generally, whenever you’re considering scraping a website, you should first check whether the site has an API available.

Incident Data#

# Get data on avalanche forecasts and incidents from Avalanche Canada
# Avalanche Canada has an unstable public api
# https://github.com/avalanche-canada/ac-web
# Since API might change, this code might break
import json
import os
import urllib.request
import pandas as pd
import time
import requests
import io
import zipfile
import warnings


# Incidents
url = "http://incidents.avalanche.ca/public/incidents/?format=json"
req = urllib.request.Request(url)
with urllib.request.urlopen(req) as response:
    result = json.loads(response.read().decode('utf-8'))
incident_list = result["results"]
while (result["next"] != None):
    req = urllib.request.Request(result["next"])
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    incident_list = incident_list + result["results"]
incidents_brief = pd.DataFrame.from_dict(incident_list,orient="columns")
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
incidents_brief
id date location location_province group_activity num_involved num_injured num_fatal
0 6dad9a7f-e732-43cf-8058-5fcac24c7041 2023-04-22 Lake Louise West Bowl AB Lift Skiing Closed 2.0 1.0 1
1 a3968149-3ee5-4360-8efb-ed8bc3f331c4 2023-04-15 Thunderwater Lake BC Snowmobiling 2.0 0.0 1
2 913e84e6-4785-409e-859e-f6456b2fb27c 2023-04-11 Treaty Creek BC Mechanized Skiing 5.0 1.0 1
3 645d13aa-4a96-4f11-a787-ded7a0051d63 2023-03-01 Coppercrown Mountain BC Mechanized Skiing 10.0 4.0 3
4 a764c492-bb8d-401d-b7b4-ec5d7915b98e 2023-02-16 Terminator 2.5 BC Out-of-Bounds Skiing/Snowboarding 6.0 1.0 2
... ... ... ... ... ... ... ... ...
498 101c517b-29a4-4c49-8934-f6c56ddd882d 1840-02-01 Château-Richer QC Unknown NaN NaN 1
499 b2e1c50a-1533-4145-a1a2-0befca0154d5 1836-02-09 Quebec QC Unknown NaN NaN 1
500 18e8f963-da33-4682-9312-57ca2cc9ad8d 1833-05-24 Carbonear NL Unknown NaN 0.0 1
501 083d22df-ed50-4687-b9ab-1649960a0fbe 1825-02-04 Saint-Joseph de Lévis QC Inside Building NaN NaN 5
502 f498c48a-981d-43cf-ac16-151b8794435c 1782-01-01 Nain NL Unknown NaN NaN 22

503 rows × 8 columns

# We can get more information about these incidents e.g. "https://www.avalanche.ca/incidents/37d909e4-c6de-43f1-8416-57a34cd48255"
# this information is also available through the API
def get_incident_details(id):
    url = "http://incidents.avalanche.ca/public/incidents/{}?format=json".format(id)
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    return(result)


incidentsfile = "https://datascience.quantecon.org/assets/data/avalanche_incidents.csv"

# To avoid loading the avalanche Canada servers, we save the incident details locally.
if (not os.path.isfile(incidentsfile)):
    incident_detail_list = incidents_brief.id.apply(get_incident_details).to_list()
    incidents = pd.DataFrame.from_dict(incident_detail_list, orient="columns")
    incidents.to_csv(incidentsfile)
else:
    incidents = pd.read_csv(incidentsfile)

incidents
id ob_date location location_desc ... weather_comment snowpack_obs snowpack_comment documents
0 6dad9a7f-e732-43cf-8058-5fcac24c7041 2023-04-22 Lake Louise West Bowl Lake Louise Ski Resort ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'date': '2023-04-22', 'title': 'Aerial Photo...
1 a3968149-3ee5-4360-8efb-ed8bc3f331c4 2023-04-15 Thunderwater Lake Approximately 35 km west of Radium Hot Springs ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'date': '2023-04-15', 'title': 'Aerial photo...
2 913e84e6-4785-409e-859e-f6456b2fb27c 2023-04-11 Treaty Creek Approximately 70 km north of Stewart ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
3 645d13aa-4a96-4f11-a787-ded7a0051d63 2023-03-01 Coppercrown Mountain Approximately 30 km southwest of Invermere, BC ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
4 a764c492-bb8d-401d-b7b4-ec5d7915b98e 2023-02-16 Terminator 2.5 Near Golden BC ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'date': '2023-02-16', 'title': 'Photo of sta...
... ... ... ... ... ... ... ... ... ...
498 101c517b-29a4-4c49-8934-f6c56ddd882d 1840-02-01 Château-Richer ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
499 b2e1c50a-1533-4145-a1a2-0befca0154d5 1836-02-09 Quebec more details unknown ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
500 18e8f963-da33-4682-9312-57ca2cc9ad8d 1833-05-24 Carbonear ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'title': 'Carbonear, May 24, 1833', 'source'...
501 083d22df-ed50-4687-b9ab-1649960a0fbe 1825-02-04 Saint-Joseph de Lévis Pointe Lévis ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
502 f498c48a-981d-43cf-ac16-151b8794435c 1782-01-01 Nain ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'title': 'Nain, 1781-2', 'source': 'NFLD Geo...

503 rows × 19 columns

Many incidents include coordinates, but others do not. Most however, do include a place name. We can use Natural Resources Canada’s Geolocation Service to retrieve coordinates from place names.

# geocode locations without coordinates
def geolocate(location, province):
    url = "http://geogratis.gc.ca/services/geolocation/en/locate?q={},%20{}"
    req = urllib.request.Request(url.format(urllib.parse.quote(location),province))
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    if (len(result)==0):
        return([None,None])
    else:
        return(result[0]['geometry']['coordinates'])
if not "alt_coord" in incidents.columns:
    incidents["alt_coord"] = [
        geolocate(incidents.location[i], incidents.location_province[i])
        for i in incidents.index
    ]
    incidents.to_csv(incidentsfile)

Now that we have incident data, let’s create some figures.

# clean up activity names
incidents.group_activity.unique()
array(['Lift Skiing Closed', 'Snowmobiling', 'Mechanized Skiing',
       'Out-of-Bounds Skiing/Snowboarding', 'Backcountry Skiing',
       'Mountaineering', 'Lift Skiing Open',
       'Backcountry Skiing/Snowboarding', 'Skiing/Snowboarding',
       'Snow Biking', 'Snowshoeing', 'Snowboarding', 'Ice Climbing',
       'Ski touring', 'Skiing', 'Heliskiing', 'Snowshoeing & Hiking',
       'Work', 'Other Recreational', 'Out-of-bounds Skiing',
       'At Outdoor Worksite', 'Hunting/Fishing', 'Out-of-Bounds Skiing',
       'Control Work', 'Inside Building', 'Car/Truck on Road',
       'Inside Car/Truck on Road', 'Unknown', 'Outside Building'],
      dtype=object)
incidents.group_activity=incidents.group_activity.replace("Ski touring","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Out-of-Bounds Skiing","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Lift Skiing Closed","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Skiing","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Snowshoeing","Snowshoeing & Hiking")
incidents.group_activity=incidents.group_activity.replace("Snowshoeing and Hiking","Snowshoeing & Hiking")
incidents.group_activity=incidents.group_activity.replace("Mechanized Skiing","Heli or Cat Skiing")
incidents.group_activity=incidents.group_activity.replace("Heliskiing","Heli or Cat Skiing")
incidents.group_activity=incidents.group_activity.replace("At Outdoor Worksite","Work")
incidents.group_activity=incidents.group_activity.replace("Control Work","Work")
incidents.group_activity=incidents.group_activity.replace("Hunting/Fishing","Other Recreational")
incidents.group_activity=incidents.group_activity.replace("Inside Car/Truck on Road","Car/Truck/Building")
incidents.group_activity=incidents.group_activity.replace("Car/Truck on Road","Car/Truck/Building")
incidents.group_activity=incidents.group_activity.replace("Inside Building","Car/Truck/Building")
incidents.group_activity=incidents.group_activity.replace("Outside Building","Car/Truck/Building")


incidents.group_activity.unique()

fig, ax = plt.subplots(1,2, sharey=True, figsize=(12,4))
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
incidents.groupby(['group_activity']).id.count().plot(kind='bar', title="Incidents by Activity", ax=ax[0])
incidents.groupby(['group_activity']).num_fatal.sum().plot(kind='bar', title="Deaths by Activity", ax=ax[1], color=colors[1])
ax[0].set_xlabel(None)
ax[1].set_xlabel(None);
../_images/3207b44adb9e280a631c3cc82a8d658d0ae8db1a5233c77f51f276469a32deda.png
incidents["date"] = pd.to_datetime(incidents.ob_date)
incidents["year"] = incidents.date.apply(lambda x: x.year)
incidents.date = incidents.date.dt.date
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
f = incidents.groupby(["year"]).num_fatal.sum()
n = incidents.groupby(["year"]).id.count()
yearstart=1950
f=f[f.index>yearstart]
n=n[n.index>yearstart]
fig,ax = plt.subplots(1,1,figsize=(12,4))
n.plot(ax=ax)
f.plot(ax=ax)
ax.set_ylabel("Count")
ax.annotate("Incidents", (2010, 4), color=colors[0])
ax.annotate("Deaths", (2011, 15), color=colors[1]);
../_images/e6588b83d1e2900340bb52bf6c06f4d8efdd6be91653b4f924ab36c8af9740dc.png

Mapping Incidents#

Since the incident data includes coordinates, we might as well make a map too. Unfortunately, some latitude and longitudes contain obvious errors. Here, we try to fix them.

import re

# fix errors in latitude, longitude
latlon = incidents.location_coords
def makenumeric(cstr):
    if cstr is None:
        return([None,None])
    elif (type(cstr)==str):
        return([float(s) for s in re.findall(r'-?\d+\.?\d*',cstr)])
    else:
        return(cstr)

latlon = latlon.apply(makenumeric)

def good_lat(lat):
    return(lat >= 41.6 and lat <= 83.12) # min & max for Canada

def good_lon(lon):
    return(lon >= -141 and lon<= -52.6)

def fixlatlon(c):
    if (len(c)<2 or type(c[0])!=float or type(c[1])!=float):
        c = [None, None]
        return(c)
    lat = c[0]
    lon = c[1]
    if not good_lat(lat) and good_lat(lon):
        tmp = lat
        lat = lon
        lon = tmp
    if not good_lon(lon) and good_lon(-lon):
        lon = -lon
    if not good_lon(lon) and good_lon(lat):
        tmp = lat
        lat = lon
        lon = tmp
    if not good_lon(lon) and good_lon(-lat):
        tmp = -lat
        lat = lon
        lon = tmp
    if not good_lat(lat) or not good_lon(lon):
        c[0] = None
        c[1] = None
    else:
        c[0] = lat
        c[1] = lon
    return(c)

incidents["latlon"] = latlon.apply(fixlatlon)
def foo(c, a):
    if (type(a)==str):
        a = [float(s) for s in re.findall(r'-?\d+\.?\d*',a)]
    if len(a) <2:
        a = [None,None]
    return([a[1],a[0]] if type(c[0])!=float else c)
incidents["latlon_filled"]=[foo(c,a) for c,a in zip(incidents["latlon"],incidents["alt_coord"])]
nmiss = sum([a[0]==None for a in incidents.latlon_filled])
n = len(incidents.latlon_filled)
print("{} of {} incidents have latitude & longitude".format(n-nmiss, n))
342 of 503 incidents have latitude & longitude
# download forecast region definitions
# req = urllib.request.Request("https://www.avalanche.ca/api/forecasts")
# The above link doesn't work since COVID-19 lockdown. Currently we use an old cached version instead
#req = ("https://web.archive.org/web/20150319031605if_/http://www.avalanche.ca/api/forecasts")
#with urllib.request.urlopen(req) as response:
#    forecastregions = json.loads(response.read().decode('utf-8'))
req = "https://faculty.arts.ubc.ca/pschrimpf/forecast-regions2015.json"
with urllib.request.urlopen(req) as response:
    regions2015 = json.loads(response.read().decode('utf-8'))

req = "https://faculty.arts.ubc.ca/pschrimpf/forecast-regions2019.json"
with urllib.request.urlopen(req) as response:
    regions2019 = json.loads(response.read().decode('utf-8'))

forecastregions = regions2019
ids = [r['id'] for r in forecastregions['features']]
for r in regions2015['features'] :
     if not r['id'] in ids :
            forecastregions['features'].append(r)

You may have to uncomment the second line below if folium is not installed.

# Map forecast regions and incidents
#!pip install --user folium
import folium
import matplotlib

cmap = matplotlib.cm.get_cmap('Set1')
fmap = folium.Map(location=[60, -98],
                            zoom_start=3,
                            tiles='Stamen Terrain')
with urllib.request.urlopen(req) as response:
    regions_tmp = json.loads(response.read().decode('utf-8'))
folium.GeoJson(regions_tmp,
               tooltip=folium.GeoJsonTooltip(fields=["name"], aliases=[""]),
               highlight_function=lambda x: { 'weight': 10},
              style_function=lambda x: {'weight':1}).add_to(fmap)
activities = incidents.group_activity.unique()
for i in incidents.index:
    if incidents.latlon_filled[i][0] is not None and  incidents.latlon_filled[i][1] is not None:
        cindex=[j for j,x in enumerate(activities) if x==incidents.group_activity[i]][0]
        txt = "{}, {}<br>{} deaths"
        txt = txt.format(incidents.group_activity[i],
                        incidents.ob_date[i],
                        incidents.num_fatal[i]
                        )
        pop = folium.Popup(incidents.comment[i], parse_html=True, max_width=400)
        folium.CircleMarker(incidents.latlon_filled[i],
                      tooltip=txt,
                      popup=pop,
                      color=matplotlib.colors.to_hex(cmap(cindex)), fill=True, radius=5).add_to(fmap)
fmap
/tmp/ipykernel_3127/2943874502.py:6: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = matplotlib.cm.get_cmap('Set1')
Make this Notebook Trusted to load map: File -> Trust Notebook