Working with Text

Author

Prerequisites

Outcomes

  • Use text as features for classification

  • Understand latent topic analysis

  • Use folium to create an interactive map

  • Request and combine json data from a web server

# Uncomment following line to install on colab
#! pip install fiona geopandas xgboost gensim folium pyLDAvis descartes

Introduction

Many data sources contain both numerical data and text.

We can use text to create features for any of the prediction methods that we have discussed.

Doing so requires encoding text into some numerical representation.

A good encoding preserves the meaning of the original text, while keeping dimensionality manageable.
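
As a minimal illustration of what such an encoding can look like, the sketch below builds a bag-of-words representation using only the standard library: each document becomes a vector of word counts over a shared vocabulary. The sample sentences and the helper names `tokenize` and `bag_of_words` are made up for this example; they are not part of the lecture's data or code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    counts = [Counter(tokenize(d)) for d in docs]
    # Shared, sorted vocabulary so that every vector has the same columns
    vocab = sorted(set().union(*counts))
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors


# Hypothetical snippets, written only for this example
docs = [
    "Wind slab on lee slopes",
    "Storm snow and wind loading on lee slopes",
]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Each vector has one entry per vocabulary word, so the dimensionality grows with the number of distinct words in the corpus. This is why keeping the dimensionality manageable becomes a concern with real text.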

In this lecture, we will learn how to work with text through an application: predicting fatalities from avalanche forecasts.

Avalanches

Snow avalanches are a hazard in the mountains. Avalanches can be partially predicted based on snow conditions, weather, and terrain. Avalanche Canada produces daily avalanche forecasts for various Canadian mountainous regions. These forecasts consist of 1-5 ratings for each of three elevation bands, as well as textual descriptions of recent avalanche observations, snowpack, and weather. Avalanche Canada also maintains a list of fatal avalanche incidents. In this lecture, we will attempt to predict fatal incidents from the text of avalanche forecasts. Since fatal incidents are rare, this prediction task will be quite difficult.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


%matplotlib inline

Data

Avalanche Canada has an unstable JSON API. The API seems to be largely tailored to displaying information on the various Avalanche Canada websites, which does not make it easy to obtain large amounts of data. Nonetheless, getting information from the API is easier than scraping the website. Generally, whenever you’re considering scraping a website, you should first check whether the site has an API available.

Incident Data

# Get data on avalanche forecasts and incidents from Avalanche Canada
# Avalanche Canada has an unstable public api
# https://github.com/avalanche-canada/ac-web
# Since API might change, this code might break
import json
import os
import urllib.request
import pandas as pd
import time
import requests
import io
import zipfile
import warnings


# Incidents
url = "http://incidents.avalanche.ca/public/incidents/?format=json"
req = urllib.request.Request(url)
with urllib.request.urlopen(req) as response:
    result = json.loads(response.read().decode('utf-8'))
incident_list = result["results"]
# The API is paginated: keep following the "next" link until there is none
while result["next"] is not None:
    req = urllib.request.Request(result["next"])
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    incident_list = incident_list + result["results"]
incidents_brief = pd.DataFrame.from_dict(incident_list,orient="columns")
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
incidents_brief
     id                                    date        location               location_province  group_activity       num_involved  num_injured  num_fatal
0    0be6f146-74fc-45fb-a748-799f0599aa23  2022-04-13  Mount Des Poilus       BC                 Mountaineering       1.0           1.0          1
1    88fb4749-8b10-4640-8aa0-acdffeebfe76  2022-04-05  West Ridge, Blue Moon  BC                 Lift Skiing Open     1.0           1.0          1
2    bb4569b4-a158-4ca7-bb4c-6cfc58b6498d  2022-03-06  Charlie's Bottom       NL                 Snowmobiling         1.0           0.0          1
3    ddbe24b9-40bc-40ba-b8b5-94dd4e02e9dd  2022-02-20  Mt. Kenney             BC                 Mechanized Skiing    4.0           2.0          1
4    ce706316-3687-45f0-a8db-fb63b7000ea5  2022-02-05  Cassiope Peak          BC                 skiing/snowboarding  4.0           1.0          1
...
489  101c517b-29a4-4c49-8934-f6c56ddd882d  1840-02-01  Château-Richer         QC                 Unknown              NaN           NaN          1
490  b2e1c50a-1533-4145-a1a2-0befca0154d5  1836-02-09  Quebec                 QC                 Unknown              NaN           NaN          1
491  18e8f963-da33-4682-9312-57ca2cc9ad8d  1833-05-24  Carbonear              NL                 Unknown              NaN           0.0          1
492  083d22df-ed50-4687-b9ab-1649960a0fbe  1825-02-04  Saint-Joseph de Lévis  QC                 Inside Building      NaN           NaN          5
493  f498c48a-981d-43cf-ac16-151b8794435c  1782-01-01  Nain                   NL                 Unknown              NaN           NaN          22

494 rows × 8 columns
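
The request loop above follows the `next` link of each response until none is left. A minimal sketch of that pattern as a reusable function is shown below. The names `fetch_json` and `fetch_all_pages` and the two fake pages are assumptions made for illustration; the only thing taken from the code above is that each response carries a `results` list and a `next` link.

```python
import json
import urllib.request


def fetch_json(url):
    """Download a URL and decode its JSON body (not called in this example)."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all_pages(first_url, fetch=fetch_json):
    """Collect the "results" of every page by following "next" links.

    `fetch` is any callable that maps a URL to a decoded JSON dict.
    """
    results = []
    url = first_url
    while url is not None:
        page = fetch(url)
        results.extend(page["results"])
        url = page.get("next")
    return results


# Hypothetical two-page response that stands in for the real server
fake_pages = {
    "page1": {"results": [{"id": "a"}, {"id": "b"}], "next": "page2"},
    "page2": {"results": [{"id": "c"}], "next": None},
}
all_items = fetch_all_pages("page1", fetch=fake_pages.__getitem__)
print(len(all_items))
```

Passing the download step in as an argument makes it possible to try the loop on fake pages without contacting the server.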

# We can get more information about these incidents e.g. "https://www.avalanche.ca/incidents/37d909e4-c6de-43f1-8416-57a34cd48255"
# this information is also available through the API
def get_incident_details(id):
    url = "http://incidents.avalanche.ca/public/incidents/{}?format=json".format(id)
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    return(result)


# Saved copy of the incident details, so that we avoid loading the Avalanche Canada servers.
incidentsfile = "https://datascience.quantecon.org/assets/data/avalanche_incidents.csv"

# os.path.isfile() is always False for a URL, so we try to read the saved copy
# and only query the API for each incident if the download fails.
try:
    incidents = pd.read_csv(incidentsfile)
except OSError:
    incident_detail_list = incidents_brief.id.apply(get_incident_details).to_list()
    incidents = pd.DataFrame.from_dict(incident_detail_list, orient="columns")

incidents
id ob_date location location_desc ... weather_comment snowpack_obs snowpack_comment documents
0 0be6f146-74fc-45fb-a748-799f0599aa23 2022-04-13 Mount Des Poilus Approximately 24km NNW of Field BC ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'date': '2022-04-18', 'title': 'overview, pu...
1 88fb4749-8b10-4640-8aa0-acdffeebfe76 2022-04-05 West Ridge, Blue Moon Approx. 6km SE of Whistler, BC ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
2 bb4569b4-a158-4ca7-bb4c-6cfc58b6498d 2022-03-06 Charlie's Bottom Long Range Mountains, Blow Me Down area, appro... ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
3 ddbe24b9-40bc-40ba-b8b5-94dd4e02e9dd 2022-02-20 Mt. Kenney Approximately 23km NW of Terrace, BC ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
4 ce706316-3687-45f0-a8db-fb63b7000ea5 2022-02-05 Cassiope Peak Approximately 17km NE of Pemberton ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'date': '2022-02-14', 'title': 'Overview pho...
... ... ... ... ... ... ... ... ... ...
489 101c517b-29a4-4c49-8934-f6c56ddd882d 1840-02-01 Château-Richer ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
490 b2e1c50a-1533-4145-a1a2-0befca0154d5 1836-02-09 Quebec more details unknown ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
491 18e8f963-da33-4682-9312-57ca2cc9ad8d 1833-05-24 Carbonear ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'title': 'Carbonear, May 24, 1833', 'source'...
492 083d22df-ed50-4687-b9ab-1649960a0fbe 1825-02-04 Saint-Joseph de Lévis Pointe Lévis ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
493 f498c48a-981d-43cf-ac16-151b8794435c 1782-01-01 Nain ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'title': 'Nain, 1781-2', 'source': 'NFLD Geo...

494 rows × 19 columns

Many incidents include coordinates, but others do not. Most, however, do include a place name. We can use Natural Resources Canada’s Geolocation Service to retrieve coordinates from place names.

# geocode locations without coordinates
import urllib.parse  # for quote(); made explicit rather than relying on urllib.request

def geolocate(location, province):
    url = "http://geogratis.gc.ca/services/geolocation/en/locate?q={},%20{}"
    # percent-encode the place name so spaces and accents are valid in a URL
    req = urllib.request.Request(url.format(urllib.parse.quote(location),province))
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    if (len(result)==0):
        return([None,None])
    else:
        # use the first match returned by the service
        return(result[0]['geometry']['coordinates'])

# only query the service if the saved data lacks geocoded coordinates
if "alt_coord" not in incidents.columns:
    incidents["alt_coord"] = [
        geolocate(incidents.location[i], incidents.location_province[i])
        for i in incidents.index
    ]
    incidents.to_csv(incidentsfile)
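
To see what the geocoding helper does without contacting the service, the sketch below separates its two steps: building the query URL and reading the first match from a response. The helper names `build_query` and `first_coordinates` and the sample response are hypothetical. The response only imitates the nesting that the code above indexes into, and its coordinates are made up.

```python
import urllib.parse

BASE = "http://geogratis.gc.ca/services/geolocation/en/locate?q={},%20{}"


def build_query(location, province):
    """Percent-encode the place name and insert it into the query URL."""
    return BASE.format(urllib.parse.quote(location), province)


def first_coordinates(result):
    """Return the coordinates of the first match, or [None, None] if none."""
    if len(result) == 0:
        return [None, None]
    return result[0]["geometry"]["coordinates"]


url = build_query("Mt. Kenney", "BC")
print(url)

# Made-up response with the same nesting as the one parsed above
fake_result = [{"geometry": {"coordinates": [-128.9, 54.7]}}]
print(first_coordinates(fake_result))
print(first_coordinates([]))
```

The later code swaps the two entries of `alt_coord` before using them as latitude and longitude, which suggests that the service returns longitude first. That ordering is assumed in the fake response.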

Now that we have incident data, let’s create some figures.

# clean up activity names
incidents.group_activity.unique()
array(['Mountaineering', 'Lift Skiing Open', 'Snowmobiling',
       'Mechanized Skiing', 'skiing/snowboarding', 'Skiing',
       'Skiing/Snowboarding', 'Snow Biking', 'Snowshoeing',
       'Snowboarding', 'Backcountry Skiing', 'Ice Climbing',
       'Ski touring', 'Heliskiing', 'Snowshoeing & Hiking', 'Work',
       'Other Recreational', 'Out-of-bounds Skiing',
       'At Outdoor Worksite', 'Lift Skiing Closed', 'Hunting/Fishing',
       'Out-of-Bounds Skiing', 'Control Work', 'Inside Building',
       'Car/Truck on Road', 'Inside Car/Truck on Road', 'Unknown',
       'Outside Building'], dtype=object)
# map variant labels to a smaller set of activity names
activity_names = {
    "Ski touring": "Backcountry Skiing",
    "Out-of-Bounds Skiing": "Backcountry Skiing",
    "Lift Skiing Closed": "Backcountry Skiing",
    "Skiing": "Backcountry Skiing",
    "Snowshoeing": "Snowshoeing & Hiking",
    "Snowshoeing and Hiking": "Snowshoeing & Hiking",
    "Mechanized Skiing": "Heli or Cat Skiing",
    "Heliskiing": "Heli or Cat Skiing",
    "At Outdoor Worksite": "Work",
    "Control Work": "Work",
    "Hunting/Fishing": "Other Recreational",
    "Inside Car/Truck on Road": "Car/Truck/Building",
    "Car/Truck on Road": "Car/Truck/Building",
    "Inside Building": "Car/Truck/Building",
    "Outside Building": "Car/Truck/Building",
}
incidents.group_activity = incidents.group_activity.replace(activity_names)


incidents.group_activity.unique()
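
The list of unique values above contains labels that differ only in capitalization, such as 'skiing/snowboarding' and 'Skiing/Snowboarding', or 'Out-of-bounds Skiing' and 'Out-of-Bounds Skiing'. One possible way to merge such variants is to look labels up by their lowercase form. The sketch below is not the lecture's cleaning step: `normalize_activity` and the small `canonical` mapping are hypothetical and cover only a few labels.

```python
def normalize_activity(name, mapping):
    """Map a label to its canonical name, ignoring case and outer spaces."""
    key = name.strip().lower()
    # Labels without an entry are returned unchanged (apart from stripping)
    return mapping.get(key, name.strip())


# Hypothetical subset of a mapping; keys are lowercase on purpose
canonical = {
    "ski touring": "Backcountry Skiing",
    "out-of-bounds skiing": "Backcountry Skiing",
    "skiing/snowboarding": "Skiing/Snowboarding",
}

labels = [
    "Ski touring",
    "Out-of-Bounds Skiing",
    "Out-of-bounds Skiing",
    "skiing/snowboarding",
    "Snowmobiling",
]
cleaned = [normalize_activity(x, canonical) for x in labels]
print(cleaned)
```

With a pandas column, the same function could be applied through `Series.apply`. Whether the case variants should be merged at all is a judgment about the data, so treat this as one option rather than a correction.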

fig, ax = plt.subplots(1,2, sharey=True, figsize=(12,4))
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
incidents.groupby(['group_activity']).id.count().plot(kind='bar', title="Incidents by Activity", ax=ax[0])
incidents.groupby(['group_activity']).num_fatal.sum().plot(kind='bar', title="Deaths by Activity", ax=ax[1], color=colors[1])
ax[0].set_xlabel(None)
ax[1].set_xlabel(None);
[Figure: bar charts titled "Incidents by Activity" and "Deaths by Activity"]
incidents["date"] = pd.to_datetime(incidents.ob_date)
incidents["year"] = incidents.date.apply(lambda x: x.year)
incidents.date = incidents.date.dt.date
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
f = incidents.groupby(["year"]).num_fatal.sum()
n = incidents.groupby(["year"]).id.count()
yearstart=1950
f=f[f.index>yearstart]
n=n[n.index>yearstart]
fig,ax = plt.subplots(1,1,figsize=(12,4))
n.plot(ax=ax)
f.plot(ax=ax)
ax.set_ylabel("Count")
ax.annotate("Incidents", (2010, 4), color=colors[0])
ax.annotate("Deaths", (2011, 15), color=colors[1]);
[Figure: yearly counts of incidents and deaths after 1950]

Mapping Incidents

Since the incident data includes coordinates, we might as well make a map too. Unfortunately, some latitudes and longitudes contain obvious errors. Here, we try to fix them.

import re

# fix errors in latitude, longitude
latlon = incidents.location_coords
def makenumeric(cstr):
    if cstr is None:
        return([None,None])
    elif (type(cstr)==str):
        return([float(s) for s in re.findall(r'-?\d+\.?\d*',cstr)])
    else:
        return(cstr)

latlon = latlon.apply(makenumeric)

def good_lat(lat):
    return(lat >= 41.6 and lat <= 83.12) # min & max for Canada

def good_lon(lon):
    return(lon >= -141 and lon<= -52.6)

def fixlatlon(c):
    if (len(c)<2 or type(c[0])!=float or type(c[1])!=float):
        c = [None, None]
        return(c)
    lat = c[0]
    lon = c[1]
    if not good_lat(lat) and good_lat(lon):
        tmp = lat
        lat = lon
        lon = tmp
    if not good_lon(lon) and good_lon(-lon):
        lon = -lon
    if not good_lon(lon) and good_lon(lat):
        tmp = lat
        lat = lon
        lon = tmp
    if not good_lon(lon) and good_lon(-lat):
        tmp = -lat
        lat = lon
        lon = tmp
    if not good_lat(lat) or not good_lon(lon):
        c[0] = None
        c[1] = None
    else:
        c[0] = lat
        c[1] = lon
    return(c)

incidents["latlon"] = latlon.apply(fixlatlon)
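def fill_from_geocode(c, a):
    # c: cleaned [lat, lon] from the incident report
    # a: geocoded coordinates, possibly stored as a string after a CSV round trip
    if (type(a)==str):
        a = [float(s) for s in re.findall(r'-?\d+\.?\d*',a)]
    if len(a) <2:
        a = [None,None]
    # keep c when it holds valid floats; otherwise use the geocoded pair, swapped to [lat, lon]
    return([a[1],a[0]] if type(c[0])!=float else c)
incidents["latlon_filled"]=[fill_from_geocode(c,a) for c,a in zip(incidents["latlon"],incidents["alt_coord"])]
nmiss = sum([a[0] is None for a in incidents.latlon_filled])
n = len(incidents.latlon_filled)
print("{} of {} incidents have latitude & longitude".format(n-nmiss, n))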
333 of 494 incidents have latitude & longitude
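
To make the repair logic easier to follow, here is a compact sketch of the same idea applied to a few made-up inputs. It is a simplified variant, not the lecture's function: instead of a sequence of conditional swaps, it tries a short list of candidate arrangements (original, swapped, and both with the second value negated) and keeps the first one that falls inside the bounds used above. The names `parse_pair`, `in_range`, and `repair` are hypothetical.

```python
import re

LAT_RANGE = (41.6, 83.12)   # same bounds as good_lat above
LON_RANGE = (-141, -52.6)   # same bounds as good_lon above


def in_range(x, bounds):
    return bounds[0] <= x <= bounds[1]


def parse_pair(text):
    """Extract all numbers from a string such as '[51.4, -116.5]'."""
    return [float(s) for s in re.findall(r"-?\d+\.?\d*", text)]


def repair(pair):
    """Return [lat, lon] inside the bounds, or [None, None] if no candidate fits."""
    if len(pair) < 2:
        return [None, None]
    a, b = pair[0], pair[1]
    # original order, swapped, sign-flipped longitude, swapped and sign-flipped
    candidates = [(a, b), (b, a), (a, -b), (b, -a)]
    for lat, lon in candidates:
        if in_range(lat, LAT_RANGE) and in_range(lon, LON_RANGE):
            return [lat, lon]
    return [None, None]


# Made-up inputs covering the error types handled above
samples = ["[51.4, -116.5]", "[-116.5, 51.4]", "[51.4, 116.5]", "[0, 0]", "[None, None]"]
repaired = [repair(parse_pair(s)) for s in samples]
for s, r in zip(samples, repaired):
    print(s, "->", r)
```

The first three inputs all resolve to the same point, while the last two cannot be repaired and are left missing.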
# download forecast region definitions
# req = urllib.request.Request("https://www.avalanche.ca/api/forecasts")
# The above link doesn't work since COVID-19 lockdown. Currently we use an old cached version instead
#req = ("https://web.archive.org/web/20150319031605if_/http://www.avalanche.ca/api/forecasts")
#with urllib.request.urlopen(req) as response:
#    forecastregions = json.loads(response.read().decode('utf-8'))
req = "https://faculty.arts.ubc.ca/pschrimpf/forecast-regions2015.json"
with urllib.request.urlopen(req) as response:
    regions2015 = json.loads(response.read().decode('utf-8'))

req = "https://faculty.arts.ubc.ca/pschrimpf/forecast-regions2019.json"
with urllib.request.urlopen(req) as response:
    regions2019 = json.loads(response.read().decode('utf-8'))

# start from the 2019 regions and add any 2015 regions that are missing from them
forecastregions = regions2019
ids = [r['id'] for r in forecastregions['features']]
for r in regions2015['features']:
    if r['id'] not in ids:
        forecastregions['features'].append(r)
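
Note that `forecastregions = regions2019` binds a second name to the same dictionary, so appending to `forecastregions['features']` also changes `regions2019`. A sketch of a merge that leaves both inputs untouched is given below. It assumes the inputs are GeoJSON feature collections, as the `features` key suggests; the function name `merge_features` and the toy region ids are made up.

```python
def merge_features(primary, secondary):
    """Return a new feature collection with all features of `primary`, plus
    those of `secondary` whose id does not already appear in `primary`."""
    seen = {f["id"] for f in primary["features"]}
    merged = list(primary["features"])  # shallow copy, inputs stay unchanged
    for f in secondary["features"]:
        if f["id"] not in seen:
            merged.append(f)
            seen.add(f["id"])
    return {"type": "FeatureCollection", "features": merged}


# Toy collections with made-up ids
new_regions = {"type": "FeatureCollection",
               "features": [{"id": "region-a"}, {"id": "region-b"}]}
old_regions = {"type": "FeatureCollection",
               "features": [{"id": "region-b"}, {"id": "region-c"}]}

combined = merge_features(new_regions, old_regions)
print([f["id"] for f in combined["features"]])
print(len(new_regions["features"]))
```

Using a set for the ids keeps each membership check cheap, and building a new list means the original collections can still be used separately afterwards.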

You may have to uncomment the second line below if folium is not installed.

# Map forecast regions and incidents
#!pip install --user folium
import folium
import matplotlib

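# matplotlib.cm.get_cmap was removed in matplotlib 3.9; the colormap registry replaces it
cmap = matplotlib.colormaps['Set1']
# Recent folium releases may no longer provide the 'Stamen Terrain' tiles;
# if this call fails, try tiles='OpenStreetMap' instead
fmap = folium.Map(location=[60, -98],
                            zoom_start=3,
                            tiles='Stamen Terrain')
# req still holds the URL of the 2019 forecast regions, so this reloads them
with urllib.request.urlopen(req) as response:
    regions_tmp = json.loads(response.read().decode('utf-8'))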
folium.GeoJson(regions_tmp,
               tooltip=folium.GeoJsonTooltip(fields=["name"], aliases=[""]),
               highlight_function=lambda x: { 'weight': 10},
              style_function=lambda x: {'weight':1}).add_to(fmap)
activities = incidents.group_activity.unique()
for i in incidents.index:
    if incidents.latlon_filled[i][0] is not None and  incidents.latlon_filled[i][1] is not None:
        cindex=[j for j,x in enumerate(activities) if x==incidents.group_activity[i]][0]
        txt = "{}, {}<br>{} deaths"
        txt = txt.format(incidents.group_activity[i],
                        incidents.ob_date[i],
                        incidents.num_fatal[i]
                        )
        pop = folium.Popup(incidents.comment[i], parse_html=True, max_width=400)
        folium.CircleMarker(incidents.latlon_filled[i],
                      tooltip=txt,
                      popup=pop,
                      color=matplotlib.colors.to_hex(cmap(cindex)), fill=True, radius=5).add_to(fmap)
fmap