Working with Text#


Outcomes

  • Use text as features for classification

  • Understand latent topic analysis

  • Use folium to create an interactive map

  • Request and combine JSON data from a web server

# Uncomment following line to install on colab
#! pip install fiona geopandas xgboost gensim folium pyLDAvis descartes

Introduction#

Many data sources contain both numerical data and text.

We can use text to create features for any of the prediction methods that we have discussed.

Doing so requires encoding text into some numerical representation.

A good encoding preserves the meaning of the original text, while keeping dimensionality manageable.

In this lecture, we will learn how to work with text through an application — predicting fatalities from avalanche forecasts.

Avalanches#

Snow avalanches are a hazard in the mountains. Avalanches can be partially predicted based on snow conditions, weather, and terrain.

Avalanche Canada produces daily avalanche forecasts for various Canadian mountainous regions.

These forecasts consist of 1-5 danger ratings for each of three elevation bands, as well as textual descriptions of recent avalanche observations, the snowpack, and the weather.

Avalanche Canada also maintains a list of fatal avalanche incidents.

In this lecture, we will attempt to predict fatal incidents from the text of avalanche forecasts.

Since fatal incidents are rare, this prediction task will be quite difficult.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


%matplotlib inline

Data#

Avalanche Canada has an unstable JSON API.

The API seems to be largely tailored to displaying information on the various Avalanche Canada websites, which does not make it easy to obtain large amounts of data.

Nonetheless, getting information from the API is easier than scraping the website.

Generally, whenever you’re considering scraping a website, you should first check whether the site has an API available.

Incident Data#

# Get data on avalanche forecasts and incidents from Avalanche Canada
# Avalanche Canada has an unstable public api
# https://github.com/avalanche-canada/ac-web
# Since API might change, this code might break
import json
import os
import urllib.request
import pandas as pd
import time
import requests
import io
import zipfile
import warnings


# Incidents
url = "http://incidents.avalanche.ca/public/incidents/?format=json"
req = urllib.request.Request(url)
with urllib.request.urlopen(req) as response:
    result = json.loads(response.read().decode('utf-8'))
incident_list = result["results"]
# the API paginates results; follow the "next" links until there are none
while result["next"] is not None:
    req = urllib.request.Request(result["next"])
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    incident_list = incident_list + result["results"]
incidents_brief = pd.DataFrame.from_dict(incident_list, orient="columns")
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
incidents_brief
id date location location_province group_activity num_involved num_injured num_fatal
0 6754acba-56cb-46c2-9ce6-66ee4e1f17c4 2024-01-27 Hasler Recreation Area BC Snowmobiling 3.0 0.0 1
1 761a8854-758d-4a16-8d90-75d1b999ef02 2023-11-11 Ranger Creek AB Ice Climbing 2.0 0.0 1
2 6dad9a7f-e732-43cf-8058-5fcac24c7041 2023-04-22 Lake Louise West Bowl AB Lift Skiing Closed 2.0 1.0 1
3 a3968149-3ee5-4360-8efb-ed8bc3f331c4 2023-04-15 Thunderwater Lake BC Snowmobiling 2.0 0.0 1
4 913e84e6-4785-409e-859e-f6456b2fb27c 2023-04-11 Treaty Creek BC Mechanized Skiing 5.0 1.0 1
... ... ... ... ... ... ... ... ...
500 101c517b-29a4-4c49-8934-f6c56ddd882d 1840-02-01 Château-Richer QC Unknown NaN NaN 1
501 b2e1c50a-1533-4145-a1a2-0befca0154d5 1836-02-09 Quebec QC Unknown NaN NaN 1
502 18e8f963-da33-4682-9312-57ca2cc9ad8d 1833-05-24 Carbonear NL Unknown NaN 0.0 1
503 083d22df-ed50-4687-b9ab-1649960a0fbe 1825-02-04 Saint-Joseph de Lévis QC Inside Building NaN NaN 5
504 f498c48a-981d-43cf-ac16-151b8794435c 1782-01-01 Nain NL Unknown NaN NaN 22

505 rows × 8 columns

# We can get more information about these incidents e.g. "https://www.avalanche.ca/incidents/37d909e4-c6de-43f1-8416-57a34cd48255"
# this information is also available through the API
def get_incident_details(incident_id):
    url = "http://incidents.avalanche.ca/public/incidents/{}?format=json".format(incident_id)
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    return(result)


incidentsfile = "https://datascience.quantecon.org/assets/data/avalanche_incidents.csv"
localfile = "avalanche_incidents.csv"

# To avoid loading the Avalanche Canada servers, we use a copy of the incident
# details that we saved in the cloud, caching it locally.
if not os.path.isfile(localfile):
    try:
        incidents = pd.read_csv(incidentsfile)
    except Exception:
        # fall back to requesting the details of each incident from the API
        incident_detail_list = incidents_brief.id.apply(get_incident_details).to_list()
        incidents = pd.DataFrame.from_dict(incident_detail_list, orient="columns")
    incidents.to_csv(localfile, index=False)
else:
    incidents = pd.read_csv(localfile)

incidents
id ob_date location location_desc ... weather_comment snowpack_obs snowpack_comment documents
0 6754acba-56cb-46c2-9ce6-66ee4e1f17c4 2024-01-27 Hasler Recreation Area Approximately 57 km SW of Chetwynd ... {'hs': 150, 'hn24': 27, 'hst': None, 'hst_rese... [{'date': '2024-01-29', 'title': 'Hasler Avala...
1 761a8854-758d-4a16-8d90-75d1b999ef02 2023-11-11 Ranger Creek Lone Ranger ice climb, Peter Lougheed Provinci... ... {'hs': 50, 'hn24': 20, 'hst': 20, 'hst_reset':... []
2 6dad9a7f-e732-43cf-8058-5fcac24c7041 2023-04-22 Lake Louise West Bowl Lake Louise Ski Resort ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'date': '2023-04-22', 'title': 'Aerial Photo...
3 a3968149-3ee5-4360-8efb-ed8bc3f331c4 2023-04-15 Thunderwater Lake Approximately 35 km west of Radium Hot Springs ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'date': '2023-04-15', 'title': 'Aerial photo...
4 913e84e6-4785-409e-859e-f6456b2fb27c 2023-04-11 Treaty Creek Approximately 70 km north of Stewart ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
... ... ... ... ... ... ... ... ... ...
500 101c517b-29a4-4c49-8934-f6c56ddd882d 1840-02-01 Château-Richer ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
501 b2e1c50a-1533-4145-a1a2-0befca0154d5 1836-02-09 Quebec more details unknown ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
502 18e8f963-da33-4682-9312-57ca2cc9ad8d 1833-05-24 Carbonear ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'title': 'Carbonear, May 24, 1833', 'source'...
503 083d22df-ed50-4687-b9ab-1649960a0fbe 1825-02-04 Saint-Joseph de Lévis Pointe Lévis ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... []
504 f498c48a-981d-43cf-ac16-151b8794435c 1782-01-01 Nain ... {'hs': None, 'hn24': None, 'hst': None, 'hst_r... [{'title': 'Nain, 1781-2', 'source': 'NFLD Geo...

505 rows × 19 columns

Many incidents include coordinates, but others do not.

Most, however, do include a place name. We can use Natural Resources Canada’s Geolocation Service to retrieve coordinates from place names.

# geocode locations without coordinates
def geolocate(location, province):
    url = "http://geogratis.gc.ca/services/geolocation/en/locate?q={},%20{}"
    req = urllib.request.Request(url.format(urllib.parse.quote(location),province))
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    if (len(result)==0):
        return([None,None])
    else:
        return(result[0]['geometry']['coordinates'])
if not "alt_coord" in incidents.columns:
    incidents["alt_coord"] = [
        geolocate(incidents.location[i], incidents.location_province[i])
        for i in incidents.index
    ]
    incidents.to_csv(incidentsfile)

Now that we have incident data, let’s create some figures.

# clean up activity names
incidents.group_activity.unique()
array(['Snowmobiling', 'Ice Climbing', 'Lift Skiing Closed',
       'Mechanized Skiing', 'Out-of-Bounds Skiing/Snowboarding',
       'Backcountry Skiing', 'Mountaineering', 'Lift Skiing Open',
       'Backcountry Skiing/Snowboarding', 'Skiing/Snowboarding',
       'Snow Biking', 'Snowshoeing', 'Snowboarding', 'Ski touring',
       'Skiing', 'Heliskiing', 'Snowshoeing & Hiking', 'Work',
       'Other Recreational', 'Out-of-bounds Skiing',
       'At Outdoor Worksite', 'Hunting/Fishing', 'Out-of-Bounds Skiing',
       'Control Work', 'Inside Building', 'Car/Truck on Road',
       'Inside Car/Truck on Road', 'Unknown', 'Outside Building'],
      dtype=object)
activity_map = {
    "Ski touring": "Backcountry Skiing",
    "Out-of-Bounds Skiing": "Backcountry Skiing",
    "Lift Skiing Closed": "Backcountry Skiing",
    "Skiing": "Backcountry Skiing",
    "Snowshoeing": "Snowshoeing & Hiking",
    "Snowshoeing and Hiking": "Snowshoeing & Hiking",
    "Mechanized Skiing": "Heli or Cat Skiing",
    "Heliskiing": "Heli or Cat Skiing",
    "At Outdoor Worksite": "Work",
    "Control Work": "Work",
    "Hunting/Fishing": "Other Recreational",
    "Inside Car/Truck on Road": "Car/Truck/Building",
    "Car/Truck on Road": "Car/Truck/Building",
    "Inside Building": "Car/Truck/Building",
    "Outside Building": "Car/Truck/Building",
}
incidents.group_activity = incidents.group_activity.replace(activity_map)

incidents.group_activity.unique()

fig, ax = plt.subplots(1,2, sharey=True, figsize=(12,4))
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
incidents.groupby(['group_activity']).id.count().plot(kind='bar', title="Incidents by Activity", ax=ax[0])
incidents.groupby(['group_activity']).num_fatal.sum().plot(kind='bar', title="Deaths by Activity", ax=ax[1], color=colors[1])
ax[0].set_xlabel(None)
ax[1].set_xlabel(None);
[Figure: bar charts of "Incidents by Activity" and "Deaths by Activity"]
incidents["date"] = pd.to_datetime(incidents.ob_date)
incidents["year"] = incidents.date.apply(lambda x: x.year)
incidents.date = incidents.date.dt.date
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
f = incidents.groupby(["year"]).num_fatal.sum()
n = incidents.groupby(["year"]).id.count()
yearstart=1950
f=f[f.index>yearstart]
n=n[n.index>yearstart]
fig,ax = plt.subplots(1,1,figsize=(12,4))
n.plot(ax=ax)
f.plot(ax=ax)
ax.set_ylabel("Count")
ax.annotate("Incidents", (2010, 4), color=colors[0])
ax.annotate("Deaths", (2011, 15), color=colors[1]);
[Figure: counts of incidents and deaths per year since 1950]

Mapping Incidents#

Since the incident data includes coordinates, we might as well make a map too.

Unfortunately, some latitudes and longitudes contain obvious errors. Here, we try to fix them.

import re

# fix errors in latitude, longitude
latlon = incidents.location_coords
def makenumeric(cstr):
    if cstr is None:
        return([None,None])
    elif (type(cstr)==str):
        return([float(s) for s in re.findall(r'-?\d+\.?\d*',cstr)])
    else:
        return(cstr)

latlon = latlon.apply(makenumeric)

def good_lat(lat):
    return(lat >= 41.6 and lat <= 83.12) # min & max for Canada

def good_lon(lon):
    return(lon >= -141 and lon<= -52.6)

def fixlatlon(c):
    # heuristics: swap coordinates that appear reversed, flip the sign of
    # longitudes recorded as positive, and drop anything still outside Canada
    if (len(c)<2 or type(c[0])!=float or type(c[1])!=float):
        c = [None, None]
        return(c)
    lat = c[0]
    lon = c[1]
    if not good_lat(lat) and good_lat(lon):
        tmp = lat
        lat = lon
        lon = tmp
    if not good_lon(lon) and good_lon(-lon):
        lon = -lon
    if not good_lon(lon) and good_lon(lat):
        tmp = lat
        lat = lon
        lon = tmp
    if not good_lon(lon) and good_lon(-lat):
        tmp = -lat
        lat = lon
        lon = tmp
    if not good_lat(lat) or not good_lon(lon):
        c[0] = None
        c[1] = None
    else:
        c[0] = lat
        c[1] = lon
    return(c)

incidents["latlon"] = latlon.apply(fixlatlon)
def foo(c, a):
    if (type(a)==str):
        a = [float(s) for s in re.findall(r'-?\d+\.?\d*',a)]
    if len(a) <2:
        a = [None,None]
    return([a[1],a[0]] if type(c[0])!=float else c)
incidents["latlon_filled"]=[foo(c,a) for c,a in zip(incidents["latlon"],incidents["alt_coord"])]
nmiss = sum([a[0]==None for a in incidents.latlon_filled])
n = len(incidents.latlon_filled)
print("{} of {} incidents have latitude & longitude".format(n-nmiss, n))
135 of 505 incidents have latitude & longitude
# download forecast region definitions
# The live endpoint (https://www.avalanche.ca/api/forecasts) stopped working
# during the COVID-19 lockdown, so we use cached copies of the region
# definitions instead.
req = "https://faculty.arts.ubc.ca/pschrimpf/forecast-regions2015.json"
with urllib.request.urlopen(req) as response:
    regions2015 = json.loads(response.read().decode('utf-8'))

req = "https://faculty.arts.ubc.ca/pschrimpf/forecast-regions2019.json"
with urllib.request.urlopen(req) as response:
    regions2019 = json.loads(response.read().decode('utf-8'))

forecastregions = regions2019
# add any 2015 regions that no longer exist in the 2019 definitions
ids = [r['id'] for r in forecastregions['features']]
for r in regions2015['features']:
    if r['id'] not in ids:
        forecastregions['features'].append(r)

You may have to uncomment the second line below if folium is not installed.

# Map forecast regions and incidents
#!pip install --user folium
import folium
import matplotlib

cmap = matplotlib.colormaps['Set1']
fmap = folium.Map(location=[60, -98],
                  zoom_start=3,
                  tiles='Stamen Terrain')
# re-fetch the 2019 region definitions for the map, since forecastregions
# was modified above to also include the old 2015 regions
with urllib.request.urlopen(req) as response:
    regions_tmp = json.loads(response.read().decode('utf-8'))
folium.GeoJson(regions_tmp,
               tooltip=folium.GeoJsonTooltip(fields=["name"], aliases=[""]),
               highlight_function=lambda x: {'weight': 10},
               style_function=lambda x: {'weight': 1}).add_to(fmap)
activities = incidents.group_activity.unique()
for i in incidents.index:
    if incidents.latlon_filled[i][0] is not None and  incidents.latlon_filled[i][1] is not None:
        cindex=[j for j,x in enumerate(activities) if x==incidents.group_activity[i]][0]
        txt = "{}, {}<br>{} deaths"
        txt = txt.format(incidents.group_activity[i],
                        incidents.ob_date[i],
                        incidents.num_fatal[i]
                        )
        pop = folium.Popup(incidents.comment[i], parse_html=True, max_width=400)
        folium.CircleMarker(incidents.latlon_filled[i],
                      tooltip=txt,
                      popup=pop,
                      color=matplotlib.colors.to_hex(cmap(cindex)), fill=True, radius=5).add_to(fmap)
fmap
[Interactive map of forecast regions and incidents]

Take a moment to click around the map and read about some of the incidents.

Between presenting this information on a map and the list on https://www.avalanche.ca/incidents/, which do you prefer and why?

Matching Incidents to Regions#

Later, we will want to match incidents to forecasts, so let’s find the closest region to each incident.

Note that distance here will be in units of latitude, longitude (or whatever coordinate system we use). At the equator, a distance of 1 is approximately 60 nautical miles.

Since lines of longitude get closer together away from the equator, a degree of longitude spans less distance the further north you go, so degree-based distances become increasingly distorted (east-west separations are overstated relative to north-south ones).

This is not much of a problem if we’re just finding the nearest region, but if we care about accurate distances, we should re-project the latitude and longitude into a different coordinate system.
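
Here is a minimal sketch of such a re-projection, using pyproj (installed alongside geopandas). The choice of EPSG:3347 (Statistics Canada Lambert) is our assumption of a reasonable Canada-wide projected coordinate system.

from pyproj import Transformer

# transform (longitude, latitude) in degrees into metres, using a projection
# suited to Canada-wide data (EPSG:3347, Statistics Canada Lambert)
to_lambert = Transformer.from_crs("EPSG:4326", "EPSG:3347", always_xy=True)

# two hypothetical points 0.1 degrees of latitude apart
x1, y1 = to_lambert.transform(-118.2, 51.0)
x2, y2 = to_lambert.transform(-118.2, 51.1)
print(((x2 - x1)**2 + (y2 - y1)**2)**0.5)  # roughly 11,000 metres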

# Match incidents to nearest forecast regions.
from shapely.geometry import Point, Polygon, shape

def distances(latlon):
    # distance (in degrees) from a point to every forecast region
    point = Point(latlon[1], latlon[0])
    df = pd.DataFrame.from_dict([{'id': feature['id'],
                                  'distance': shape(feature['geometry']).distance(point)} for
                                 feature in forecastregions['features']])
    return(df)

def nearest_region(latlon):
    if latlon[0] is None:
        return(None)
    d = distances(latlon)
    return(d.id[d.distance.idxmin()])

incidents['nearest_region'] = incidents.latlon_filled.apply(nearest_region)
incidents['nearest_distance'] = incidents.latlon_filled.apply(lambda x: None if x[0] is None else distances(x).distance.min())
incidents
id ob_date location location_desc ... latlon latlon_filled nearest_region nearest_distance
0 6754acba-56cb-46c2-9ce6-66ee4e1f17c4 2024-01-27 Hasler Recreation Area Approximately 57 km SW of Chetwynd ... [55.37941, -122.33844] [55.37941, -122.33844] north-rockies 0.000000
1 761a8854-758d-4a16-8d90-75d1b999ef02 2023-11-11 Ranger Creek Lone Ranger ice climb, Peter Lougheed Provinci... ... [50.763413, -115.291472] [50.763413, -115.291472] kananaskis 0.000000
2 6dad9a7f-e732-43cf-8058-5fcac24c7041 2023-04-22 Lake Louise West Bowl Lake Louise Ski Resort ... [51.467515, -116.145756] [51.467515, -116.145756] banff-yoho-kootenay 0.000000
3 a3968149-3ee5-4360-8efb-ed8bc3f331c4 2023-04-15 Thunderwater Lake Approximately 35 km west of Radium Hot Springs ... [50.65355, -116.60074] [50.65355, -116.60074] purcells 0.000000
4 913e84e6-4785-409e-859e-f6456b2fb27c 2023-04-11 Treaty Creek Approximately 70 km north of Stewart ... [56.557108, -130.012342] [56.557108, -130.012342] northwest-coastal 0.000000
... ... ... ... ... ... ... ... ... ...
500 101c517b-29a4-4c49-8934-f6c56ddd882d 1840-02-01 Château-Richer ... [None, None] [None, None] None NaN
501 b2e1c50a-1533-4145-a1a2-0befca0154d5 1836-02-09 Quebec more details unknown ... [None, None] [None, None] None NaN
502 18e8f963-da33-4682-9312-57ca2cc9ad8d 1833-05-24 Carbonear ... [47.733333587646484, -53.21666717529297] [47.733333587646484, -53.21666717529297] chic-chocs 12.827272
503 083d22df-ed50-4687-b9ab-1649960a0fbe 1825-02-04 Saint-Joseph de Lévis Pointe Lévis ... [None, None] [None, None] None NaN
504 f498c48a-981d-43cf-ac16-151b8794435c 1782-01-01 Nain ... [56.53333282470703, -61.68333435058594] [56.53333282470703, -61.68333435058594] chic-chocs 8.633647

505 rows × 26 columns

Forecast Data#

We’ll now download all forecasts for all regions since November 2011 (roughly the earliest data available).

We can only request one forecast at a time, so this takes many hours to download.

To make this process run more quickly for readers, we ran the code ourselves and then stored the data in the cloud.

The function below will fetch all the forecasts from the cloud storage location and save them to a folder named avalanche_forecasts.

def download_cached_forecasts():
    # download the zipped file and unzip it here
    url = "https://datascience.quantecon.org/assets/data/avalanche_forecasts.zip?raw=true"
    with requests.get(url) as res:
        if not res.ok:
            raise ValueError("failed to download the cached forecasts")
        with zipfile.ZipFile(io.BytesIO(res.content)) as z:
            for f in z.namelist():
                if os.path.isfile(f) and z.getinfo(f).file_size < os.stat(f).st_size:
                    warnings.warn(f"File {f} exists and is larger than the version in the cache. Not replacing it.")
                else:
                    z.extract(f)

download_cached_forecasts()

The code below is what we initially ran to obtain all the forecasts.

You will notice that this code checks to see whether the files can be found in the avalanche_forecasts directory (they can if you ran the download_cached_forecasts above!) and will only download them if they aren’t found.

You can experiment with this caching by deleting one or more files from the avalanche_forecasts folder and re-running the cells below.

# Functions for downloading forecasts from Avalanche Canada

def get_forecast(date, region):
    url = "https://www.avalanche.ca/api/bulletin-archive/{}/{}.json".format(date.isoformat(),region)
    try:
        req = urllib.request.Request(url)
        with urllib.request.urlopen(req) as response:
            result = json.loads(response.read().decode('utf-8'))
        return(result)
    except Exception:
        # the request failed, e.g. no forecast was archived for this date and region
        return(None)

def get_forecasts(start, end, region):
    day = start
    forecasts = []
    while (day <= end and day < pd.Timestamp.today()):
        #print("working on {}, {}".format(region,day))
        forecasts = forecasts + [get_forecast(day, region)]
        #print("sleeping")
        time.sleep(0.1) # to avoid too much load on Avalanche Canada servers
        day = day + pd.Timedelta(1,"D")
    return(forecasts)

def get_season(year, region):
    start_month = 11
    start_day = 20
    last_month = 5
    last_day = 1
    if (not os.path.isdir("avalanche_forecasts")):
        os.mkdir("avalanche_forecasts")
    seasonfile = "avalanche_forecasts/{}_{}-{}.json".format(region, year, year+1)
    if (not os.path.isfile(seasonfile)):
        startdate = pd.to_datetime("{}-{}-{} 12:00".format(year, start_month, start_day))
        lastdate = pd.to_datetime("{}-{}-{} 12:00".format(year+1, last_month, last_day))
        season = get_forecasts(startdate,lastdate,region)
        with open(seasonfile, 'w') as outfile:
            json.dump(season, outfile, ensure_ascii=False)
    else:
        with open(seasonfile, "rb") as json_data:
            season = json.load(json_data)
    return(season)
forecastlist=[]

for year in range(2011,2019):
    print("working on {}".format(year))
    for region in [region["id"] for region in forecastregions["features"]]:
        forecastlist = forecastlist + get_season(year, region)
working on 2011
working on 2012
working on 2013
working on 2014
working on 2015
working on 2016
working on 2017
working on 2018
# convert to DataFrame and extract some variables
forecasts = pd.DataFrame.from_dict([f for f in forecastlist if f is not None], orient="columns")

forecasts["danger_date"] = forecasts.dangerRatings.apply(lambda r: r[0]["date"])
forecasts["danger_date"] = pd.to_datetime(forecasts.danger_date, utc=True).dt.date
forecasts["danger_alpine"]=forecasts.dangerRatings.apply(lambda r: r[0]["dangerRating"]["alp"])
forecasts["danger_treeline"]=forecasts.dangerRatings.apply(lambda r: r[0]["dangerRating"]["tln"])
forecasts["danger_belowtree"]=forecasts.dangerRatings.apply(lambda r: r[0]["dangerRating"]["btl"])
forecasts.head()
id region dateIssued validUntil ... danger_date danger_alpine danger_treeline danger_belowtree
0 bid_29924 northwest-coastal 2011-11-20T00:49:00.000Z 2011-11-21T00:00:00.000Z ... 2011-11-21 2:Moderate 2:Moderate 1:Low
1 bid_29943 northwest-coastal 2011-11-21T01:00:00.000Z 2011-11-22T00:00:00.000Z ... 2011-11-22 3:Considerable 3:Considerable 2:Moderate
2 bid_29965 northwest-coastal 2011-11-22T01:43:00.000Z 2011-11-23T00:00:00.000Z ... 2011-11-23 4:High 4:High 3:Considerable
3 bid_30080 northwest-coastal 2011-11-23T01:38:00.000Z 2011-11-24T00:00:00.000Z ... 2011-11-24 4:High 4:High 3:Considerable
4 bid_30358 northwest-coastal 2011-11-24T00:55:00.000Z 2011-11-25T00:00:00.000Z ... 2011-11-25 4:High 4:High 3:Considerable

5 rows × 18 columns

# merge incidents to forecasts
adf = pd.merge(forecasts, incidents, how="left",
               left_on=["region","danger_date"],
               right_on=["nearest_region","date"],
              indicator=True)
adf["incident"] = adf._merge=="both"
print("There were {} incidents matched with forecasts data. These occured on {}% of day-regions with forecasts".format(adf.incident.sum(),adf.incident.mean()*100))
There were 38 incidents matched with forecasts data. These occured on 0.2598290598290598% of day-regions with forecasts
import seaborn as sns
ratings=sorted(adf.danger_alpine.unique())
ava_colors = ["#52BA4A", "#FFF300", "#F79218", "#EF1C29", "#1A1A1A", "#BFBFBF"]
for x in ["danger_alpine", "danger_treeline", "danger_belowtree"]:
    fig=sns.catplot(x=x, kind="count",col="incident", order=ratings, data=adf, sharey=False,
                    palette=ava_colors, height=3, aspect=2)
    plt.subplots_adjust(top=0.9)
    fig.fig.suptitle(x.replace("danger_",""))
    display(fig)
[Figures: counts of each danger rating (alpine, treeline, below treeline) for day-regions with and without incidents]

Predicting Incidents from Text#

Preprocessing#

The first step when using text as data is to pre-process the text.

In preprocessing, we will:

  1. Clean: Remove unwanted punctuation and non-text characters.

  2. Tokenize: Break sentences down into words.

  3. Remove “stopwords”: Eliminate common words that provide little information, like “a” and “the”.

  4. Lemmatize words: Reduce words to their dictionary “lemma”, e.g. “snowing” and “snowed” both become “snow” (verb).

from bs4 import BeautifulSoup
import nltk
import string
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Remove stopwords (the, a, is, etc)
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords=stopwords.union(set(string.punctuation))
# Lemmatize words e.g. snowed and snowing are both snow (verb)
wnl = nltk.WordNetLemmatizer()
def text_prep(txt):
    soup = BeautifulSoup(txt, "lxml")
    [s.extract() for s in soup('style')] # remove css
    txt=soup.text # remove html tags
    txt = txt.lower()
    tokens = nltk.tokenize.word_tokenize(txt)
    tokens = [token for token in tokens if token not in stopwords]
    tokens = [wnl.lemmatize(token) for token in tokens]
    if (len(tokens)==0):
        tokens = ["EMPTYSTRING"]
    return(tokens)

text_prep(forecasts.highlights[1000])
[nltk_data] Downloading package omw-1.4 to /home/runner/nltk_data...
[nltk_data] Downloading package punkt to /home/runner/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /home/runner/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /home/runner/nltk_data...
['avalanche',
 'danger',
 'could',
 'spike',
 'high',
 'slope',
 'getting',
 'baked',
 'sun',
 'avoid',
 'traveling',
 'underneath',
 'area']

Now, let’s apply this to all avalanche summaries.

text_data = [text_prep(txt) for txt in adf.avalancheSummary]
/tmp/ipykernel_2757/1742078185.py:14: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
  soup = BeautifulSoup(txt, "lxml")

Let’s make a bar plot of the most common words.

wf = nltk.FreqDist([word for doc in text_data for word in doc]).most_common(20)
words = [x[0] for x in wf]
cnt = [x[1] for x in wf]

fig, ax = plt.subplots(figsize=(12,4))
ax.bar(range(len(words)), cnt);
ax.set_xticks(range(len(words)));
ax.set_xticklabels(words, rotation='vertical');
ax.set_title('Most common words in avalanche summaries');
ax.set_xlabel('Word');
ax.set_ylabel('Occurrences');
plt.show()
[Figure: Most common words in avalanche summaries]

Feature Engineering#

The “bag of words” approach is the simplest way to convert a collection of processed text documents to a feature matrix.

We view each document as a bag of words, and our feature matrix counts how many times each word appears. This method is called a “bag of words” because we ignore the documents’ word order.
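
To see what this produces, here is a minimal sketch on two made-up documents (the sentences are ours, not from the forecasts):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["new snow fell", "new wind slab on new snow"]
cv = CountVectorizer()
M = cv.fit_transform(docs)
print(cv.get_feature_names_out())  # ['fell' 'new' 'on' 'slab' 'snow' 'wind']
print(M.toarray())                 # [[1 1 0 0 1 0]
                                   #  [0 2 1 1 1 1]]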

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vectorizer = CountVectorizer(max_features=500, min_df=5, max_df=0.7)
text_data = [text_prep(txt) for txt in adf.avalancheSummary]
y = adf.incident
X = vectorizer.fit_transform([' '.join(doc) for doc in text_data])
/tmp/ipykernel_2757/1742078185.py:14: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
  soup = BeautifulSoup(txt, "lxml")

We can also perform more complicated feature engineering.

One extension of the “bag of words” method is to consider counts of pairs or triples of consecutive words. These are called n-grams and can be created by setting the ngram_range argument of CountVectorizer, as in the sketch below.

Another alternative is to account for the fact that common words inherently have higher counts by using term-frequency inverse-document-frequency (see below).
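
A minimal sketch of n-gram features on two made-up documents; ngram_range=(1, 2) requests all unigrams and bigrams:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["wind slab avalanche", "persistent slab avalanche"]
bigrams = CountVectorizer(ngram_range=(1, 2))
M = bigrams.fit_transform(docs)
print(bigrams.get_feature_names_out())
# ['avalanche' 'persistent' 'persistent slab' 'slab' 'slab avalanche'
#  'wind' 'wind slab']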

After creating our feature matrix, we can now apply any classification method to predict incidents.

Naive Bayes Classifier#

A common classifier for text data is the Naive Bayes classifier. This classifier predicts incidents using Bayes’ rule:

\[ P(incident | words) = \frac{P(words|incident) P(incident)}{P(words)} \]

The classifier is naive, though; it assumes that words are independent of one another, conditional on whether an incident occurred:

\[ P(words|incident) = \prod_{w \in words} P(w|incident) \]

Although this assumption is implausible for text, the Naive Bayes classifier can be computed extremely quickly and sometimes performs quite well.
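
To make the formula concrete, here is a hand computation with made-up probabilities (the numbers below are purely illustrative, not estimated from the avalanche data):

# P(incident | "persistent", "slab") via Bayes' rule with the naive
# independence assumption; all probabilities here are invented
p_incident = 0.01                  # incidents are rare
p_words_incident = 0.002 * 0.03    # P(persistent|incident) * P(slab|incident)
p_words_no = 0.001 * 0.01          # the same word probabilities given no incident
p_words = p_words_incident * p_incident + p_words_no * (1 - p_incident)
print(p_words_incident * p_incident / p_words)  # about 0.057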

from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=124)
from sklearn import naive_bayes
classifier = naive_bayes.MultinomialNB()
classifier.fit(Xtrain,ytrain)
np.mean(classifier.predict(Xtest)==ytest)
0.9927073837739289
from sklearn import metrics
print(metrics.confusion_matrix(ytest, classifier.predict(Xtest)))
print(metrics.classification_report(ytest, classifier.predict(Xtest)))
[[4355   21]
 [  11    1]]
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      4376
        True       0.05      0.08      0.06        12

    accuracy                           0.99      4388
   macro avg       0.52      0.54      0.53      4388
weighted avg       0.99      0.99      0.99      4388
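
We can also inspect which words shift the prediction toward an incident. A minimal sketch, assuming the classifier and vectorizer fit above; feature_log_prob_ holds the per-class log word probabilities:

# words most indicative of an incident, by log-probability ratio
# (classes_ is [False, True], so index 1 is the incident class)
words = vectorizer.get_feature_names_out()
log_ratio = classifier.feature_log_prob_[1] - classifier.feature_log_prob_[0]
top = np.argsort(log_ratio)[-10:]
print([words[i] for i in top])
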
# print text with highest predicted probabilities
phat=classifier.predict_proba(X)[:,1]
def remove_html(txt):
    soup = BeautifulSoup(txt, "lxml")
    [s.extract() for s in soup('style')] # remove css
    return(soup.text)
docs = [remove_html(txt) for txt in adf.avalancheSummary]
txt_high = [(_,x) for _, x in sorted(zip(phat,docs), key=lambda pair: pair[0],reverse=True)]
txt_high[:10]
/tmp/ipykernel_2757/4160257397.py:4: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
  soup = BeautifulSoup(txt, "lxml")
[(0.9984412700946458,
  "In the dogtooth backcountry, observers reported a size 2.5 avalanche with a crown 1m+ in depth, that was likely human triggered.  Remote triggering of avalanches continues to be reported from the region.  Explosive control work once again produced large avalanches to size 3.5 on all aspects with crowns ranging from 40 - 120 in depth.I've left the previous narratives in from earlier this because they help to illustrate the nature of the current hazard. From Thursday:  Lots of avalanche activity to report from the region.  In the north Moderate to Strong NW winds at ridge top continued to load lee slopes resulting in a natural avalanche cycle to size 2.5 with crown depths 20 - 60cm in depth.  Just south of Golden explosive control work produced spectacular results with avalanches to size 3.5 & crown depths 100 - 160cm in depth.  All aspects at & above treeline were involved with the majority of the failures occurring on the early Feb. surface hoar.  In the central portion of the region a natural avalanche cycle to size 3.5 was observed on all aspects and elevations.  At this point it's not surprising that the early Feb SH was to blame.  As skies cleared Wednesdays, observers in the south of the region were able to get their eyes on a widespread avalanche cycle that occurred Monday & Tuesday producing avalanches to size 3.  A party of skiers accidently triggered a size 2 avalanche in the Panorama backcountry.  (Not within the resort boundaries)  The avalanche was on a 35 degree slope, which is the average pitch of an intermediate/advanced run at a ski area, in other words, not very steep.  The avalanche's crown measured 150 cm.  This continues the trend from a very active natural avalanche cycle that reached it's height on Monday and is ongoing.From Monday/Tuesday, Size 3 avalanches were common, with some size 4. Timber fell.  One remotely triggered avalanche triggered from 300m away caught my eye - it speaks to the propagation potential of the surface hoar now deeply buried. If this layer is triggered the avalanche will be big; this layer can be triggered from below so be mindful of overhead hazard. This layer will not go away in the next few days. Also, a sledder triggered a size 3 avalanche in the Quartz Creek area and while the machine is reportedly totaled, the rider is okay.There are two great observations from professionals working in the region recently.  The first one can be found on the MCR, it talks about the widespread & sensitive nature of the SH in the Northern Purcells.  You can start here and work your way into the March observations: http://www.acmg.ca/mcr/archives.asp The other is in the form of a youtube video which also deals with the surface hoar, highlighting the potential for remote triggering and large propagations when this layer fails.  Check out the shooting cracks in the video.  Another 50cm + of snow has fallen since this video was shot. http://www.youtube.com/watch?v=UmSJkS3SSbA&feature=youtu.be"),
 (0.9981895199766527,
  'On Friday avalanches ran naturally to size 2, while explosive control work produced avalanches up to size 2.5. Check out the Mountain Information Network for more details about a human triggered size 2 on a NW facing slope along the Bonnington Traverse. On Thursday reports included numerous natural avalanches up to Size 3 in response to heavy loading from snow, wind and rain. Most of the natural avalanches were storm and wind slabs, but a few persistent slabs also ran naturally on the late February persistent weak layer. Explosives and other artificial triggers (i.e. snow cats) produced additional Size 2-3 persistent slab activity, with remote and sympathetic triggers as well as 50-100 cm thick slab releasing from the impact of the dropped charge, before it exploded.'),
 (0.994350299143326,
  'On Thursday control work produced numerous storm slab avalanches on all aspects to size 1.5. A MIN report from Thursday http://bit.ly/2lnYiMC indicates significant slab development at and above treeline. Reports from Wednesday include several rider and explosive triggered storm slab avalanches up to size 1.5 and natural soft wind slab avalanches up to size 2. Of note was a remotely triggered size 1.5 windslab. This slab was triggered from 7m away with a 30cm crown, running on facets overlying a crust that formed mid-February.'),
 (0.9875405939427593,
  'Explosive control work on Friday produced only a few size 1 and isolated size 2 avalanches failing in the recent storm snow or running on the recently buried January 15th crust. On Thursday there were reports of natural avalanches up to size 2.5, while ski cuts and explosive control work produced numerous size 1-1.5, and up to size 2, storm slab avalanches 20-50 cm deep on predominantly northerly aspect, running far and wide within the recent storm snow.Wednesday ski cutting and explosive control work produced numerous easily triggered size 1 storm slab results up to 40 cm deep running on the January 15th crust. And on Tuesday, explosive control work and ski cutting produced numerous size 1-1.5 storm slab releases on leeward aspects at treeline and into the alpine, running on the January 15th crust. '),
 (0.9818750863986755,
  'On Monday explosives control produced storm and persistent slab avalanches to size 3.5 which occurred on all aspects at treeline and above. On the same day a snowboarder triggered a size 2.5 storm slab avalanche on Boulder Mtn near Revelstoke. Although the individual went for a nasty ride, no injuries were reported. In the southwest corner of the region a vehicle remotely triggered a slab avalanche from a distance of 100m on a southeast aspect. It is unknown if the failure plane was the crust under the recent storm snow or the February 10th interface.On Tuesday avalanches continued to run naturally and with explosive assistance to size 3. The region continues to produce large and spooky remote triggered avalanches.  '),
 (0.9770914555252335,
  'Thursday, explosive control produced storm slab results up to size 2 between 1700-2100 m. And on Wednesday ski cuts produced only small results up to size 1 in pockets of wind slab up to 20 cm deep.On Tuesday a snow cat remotely (from a distance) triggered a size 3-3.5 persistent slab avalanche that stepped down from 80-250 cm and ran to ground in the lower start zone on a south aspect at 2100 m. Explosive control work also produced cornice releases up to size 2.5, but failed to produce any slab results on the slopes below. Monday natural wind slab activity up to size 1.5 was reported on steep, north-east facing features. Ski cuts and explosive control work produced several storm slab results (size 1.5-2) on north-east through south-east aspects between 1400-1600 m where the late January crust is present.On the weekend the east facing Mt. Corrigan slidepath produced a very large, natural avalanche. The avalanche is estimated to be a size 4.0, and it took out mature timber in the path as it overran the Flathead FSR south of Corbin.  '),
 (0.9766288363811061,
  'On Sunday a skier was involved in an avalanche on an east facing treeline feature at Kootenay Pass. Preliminary information puts this incident close to what the guidebook calls Missile Ridge Northeast 2. Further details are not available at this time.On Saturday we received two reports of what might be the first large avalanches failing on the mid-December interface. The first was initiated by explosive control work, the size 2.5 avalanche ran on a 35 to 40 degree slope that was southeast through southwest facing at treeline. The second avalanche was a size 3.0 that released naturally on a 30 degree east facing slope between 1900 and 1400 m. This crown was up to 75 cm in depth. Control work produced numerous other storm slabs to size 2 on northwest, north, northeast, southeast, south and southwest aspects. Natural avalanches to size 2 were reported on north facing features at 2100 m.'),
 (0.9672798071054715,
  'On Friday the Monashees were a bit more reactive due to heavier snow fall amounts, natural, skier controlled, and accidentally triggered storm slab avalanches were reported up to size 2.5. In the Selkirks natural cornice falls and skier controlled storm slabs up to size 2.0 were reported. On Thursday we had reports of natural storm slab avalanches up to size 1.5 and natural cornice falls up to size 2.0. On Wednesday natural slab avalanches were reported up to size 2.0 in the alpine on northerly aspects. Natural avalanches up to size 3.0 were reported on Tuesday from alpine elevations on east thru south aspects. Skier accidental and skier remote avalanches were also reported from Tuesday up to size 2.0 that failed at the crust interface; on surface hoar in the case of the remotes. On Monday we had reports of natural storm slab avalanches up to size 3.0 in the northern Selkirks. Some of these avalanches released additional slab avalanches in the track of the slide that added to the size and depth of the debris. '),
 (0.9567939418743828,
  "Observations from Thursday are limited due to poor visibility and travel during the storm. There was likely natural avalanche activity that would have been most prevalent at upper elevations. Explosive control work produced storm slab avalanches to size 1.5 Thursday morning, but that was before the bulk of the snow fell. We received a great MIN report early Thursday afternoon that talked about shooting cracks in the storm slab that were traveling for several meters. More details here.There was a rather anomalous size 3.5 that was seen on the Cheakamus Glacier from Whistler March 15th. We don't have details on the failure plane of this avalanche, but it may have run on the mid-February crust."),
 (0.9565156329843935,
  "Storm slab avalanches continue to be reported running naturally. Most of the storm snow avalanches are size 1.5-2.0, however one was reported to be size 3.0 where the storm slab is now 80 cm thick. Explosives control released some large avalanches down to the early February persistent weak layer. There were also a couple of accidentally triggered storm slab avalanches, one that buried a sledder on a northwest aspect in the alpine. Neighboring forecast regions have reported large avalanches initiating in the new storm snow, then stepping down to deeper layers, some\xa0 running full path to the ground.\xa0 The South Columbia's reported a size 4.5 on a south aspect than ran more that 1500 vertical metres.")]
# print text with lowest predicted probabilities
txt_low = [(_,x) for _, x in sorted(zip(phat,docs), key=lambda pair: pair[0])]
txt_low[:10]
[(2.3009168291128568e-20,
  'An early report from Wednesday includes a natural size 3 cornice triggered avalanche and a natural size 3 wet slab avalanche at 2200-2400 m elevation. On Tuesday, natural avalanche activity was reported in many parts of the region. Most of this activity was wet slabs up to size 3 and loose wet avalanches up to size 2 in response to the high elevation rain. A natural size 2.5 persistent slab avalanche was also observed on an east and southeast aspect slope at 2200 m in the area south of Nelson which failed on a layer down 50 cm. Ski cutting on Tuesday morning triggered several storm slab up to size 1.5 and explosives triggered numerous wet slab avalanches up to size 2.5 in the afternoon. On Thursday, if the sun comes out in full force, it will further destabilize an already warm snowpack. Natural loose wet and wet slab avalanches remain possible on steep sun exposed slopes. Large persistent slab avalanches also remain a serious concern until the snowpack has time to recover from the recent rain and warming. In the high alpine, recent storm snow will remain touchy and may fail naturally with solar triggers. '),
 (4.058276673819625e-17,
  'No new avalanches were reported on Friday. On Thursday, storm slabs up to size 3.5 were observed in the north of the region failing up to 100 cm deep. A size 2.5 wind slab was observed on a northeast aspect at 2300 m which failed down 100 cm and ran 1 km. Numerous solar triggered loose wet and loose dry avalanches were also reported. There have been no recent avalanches reported from the Coquihalla area or south of the region.On Sunday, the recent storm snow is expected to remain reactive to human triggering at higher elevations, especially in wind loaded terrain. Natural solar triggered sluffing is expected from steep sun exposed slopes. Natural wind slab avalanches and cornice releases are also possible when the sun is at its strongest.'),
 (4.8942451469271114e-17,
  'Observations were limited on Tuesday but reports from the north of the region include isolated natural wet avalanches and ski cutting triggering size 1 loose wet avalanches. On Monday, numerous loose wet avalanches up to size 2 were observed in the Coquihalla area. In the north of the region on Monday, three natural size 1.5-3 cornice releases were observed on north and northeast aspects. Also in the north, a skier triggered a size 1 storm slab in a loaded alpine feature and a skier triggered a size 1 persistent slab on an east aspect at 1800 m which released down 40 cm on the late February surface hoar layer.On Thursday, the new snow should be burying a widespread crust and is expected to form new wind slabs in exposed terrain. It has become difficult to trigger the February weak layers but there is still a chance that smaller wind slab avalanches or a cornice fall could still step down and release a persistent slab avalanche. At lower elevation, continued rain may result in wet sluffing from steep terrain features.'),
 (4.8942451469271114e-17,
  'Observations were limited on Tuesday but reports from the north of the region include isolated natural wet avalanches and ski cutting triggering size 1 loose wet avalanches. On Monday, numerous loose wet avalanches up to size 2 were observed in the Coquihalla area. In the north of the region on Monday, three natural size 1.5-3 cornice releases were observed on north and northeast aspects. Also in the north, a skier triggered a size 1 storm slab in a loaded alpine feature and a skier triggered a size 1 persistent slab on an east aspect at 1800 m which released down 40 cm on the late February surface hoar layer.On Thursday, the new snow should be burying a widespread crust and is expected to form new wind slabs in exposed terrain. It has become difficult to trigger the February weak layers but there is still a chance that smaller wind slab avalanches or a cornice fall could still step down and release a persistent slab avalanche. At lower elevation, continued rain may result in wet sluffing from steep terrain features.'),
 (9.575168819112755e-17,
  'On Friday, a natural size 2 wet slab was observed on an unsupported feature at 2250 m elevation on southerly aspects which stepped down to old weak layers. Explosives triggered a couple cornices size 2-2.5. On Thursday, several natural size 2 storm slabs and wet slabs were observed from steep sun exposed slopes as well as widespread loose wet avalanches. A natural size 2.5 persistent slab avalanche was observed which likely released on the mid-February weak layer. Natural cornices up to size 1.5 and explosive triggered cornices up to size 2.5 were also reported. A MIN post from Thursday shows a great example of a snowmobile triggered storm slab avalanche on steep feature. Click here for more details.On Sunday, the recent storm snow is expected to remain reactive to human triggering, especially in wind loaded terrain. Natural solar triggered sluffing is expected from steep sun exposed slopes. Natural storm slab avalanches and cornice releases are also possible when the sun is shining.'),
 (1.0384016363823241e-16,
  'On Saturday, one natural size 2 slab was observed on a south aspect at 1900 m in the Duffey area and another smaller slab was reported on a steep wind-loaded feature in the Coquihalla area. Dry loose avalanches were reported in steep terrain in the Coquihalla area and wet loose avalanches were observed throughout the region in steep south-facing terrain.On Friday, wind slabs were sensitive to skier triggering to size 2 on north and northeast facing slopes around 2000 m. These slabs were 10 to 25 m wide with crowns averaging 40 cm in depth. On Thursday wind loaded features produced natural avalanches to size 2.5. Several natural wind slab avalanches to size 2 were observed on steep north facing alpine terrain. Control work produced avalanches to size 2 in steep unsupported terrain. '),
 (2.9859425366006045e-16,
  'Decreasing traffic in the mountains has led to limited avalanche observations over the past few days, however a large (size 2) recent cornice fall from was observed in the north of the region on Sunday. It failed to trigger any slab.A few natural storm and wind slab releases were observed on Thursday, ranging from size 1-2.5. One of these was a size 2 slab triggered by a loose wet avalanche while the 2.5 wind slab was triggered by natural ice fall on a north aspect at 2200 metres. It featured a 45 cm crown fracture.Reports from Wednesday included an observation of a natural size 2 cornice release from a north aspect in the north of the region. Numerous natural loose wet avalanches were also observed on sun-exposed aspects in the Monashees.Observations from just over a week ago showed a pattern of heightened cornice failure activity.Looking forward, a period of increasing warming, full sun, and warm overnight temperatures will be maintaining elevated chances of cornice and loose wet avalanche activity.'),
 (2.9859425366006045e-16,
  'Decreasing traffic in the mountains has led to limited avalanche observations over the past few days, however a large (size 2) recent cornice fall from was observed in the north of the region on Sunday. It failed to trigger any slab.A few natural storm and wind slab releases were observed on Thursday, ranging from size 1-2.5. One of these was a size 2 slab triggered by a loose wet avalanche while the 2.5 wind slab was triggered by natural ice fall on a north aspect at 2200 metres. It featured a 45 cm crown fracture.Reports from Wednesday included an observation of a natural size 2 cornice release from a north aspect in the north of the region. Numerous natural loose wet avalanches were also observed on sun-exposed aspects in the Monashees.Observations from just over a week ago showed a pattern of heightened cornice failure activity.Looking forward, a period of increasing warming, full sun, and warm overnight temperatures will be maintaining elevated chances of cornice and loose wet avalanche activity.'),
 (3.8263358658374586e-16,
  'On Tuesday, two natural size 1.5 wet slabs were observed on a north aspect at 1150 m in the Whistler area. Natural and skier triggered loose wet avalanches were also reported. On Monday, natural storm slab avalanches up to size 1.5 were observed west of Bralone. In the Whistler area, explosives were triggering storm slabs up to size 2 and a skier triggered a size 1.5 storm slab on a north aspect at 2100 m elevation. Most of this storm slab activity had slab thickness of 20-30 cm but one was 80 cm thick in a wind loaded pocket. A natural cornice release was also reported on a northeast aspect at 2300 m. On Thursday, the new snow will bury the widespread surface crust and is expected to form new wind slabs in exposed terrain. It has become difficult to trigger the mid-February crust/facet layer but there is still a chance that smaller wind slab avalanches or cornices could still step down and release a persistent slab avalanche. At lower elevation, rain may result in wet sluffing from steep terrain features.'),
 (4.25181734625419e-16,
  'On Wednesday, numerous slabs up to size 2.5 were reported on solar aspects in the far north of the region.\xa0 Some of these releases were over 2m deep on old weak layers.\xa0 In the south, the activity consisted of loose wet avalanches up to size 2.5 and cornice releases up to size 2.5.\xa0 THe cornice releases were not triggering slabs in the southern part of the region.\xa0 On Tuesday, a skier triggered a size 1.5 wet slab avalanche on a steep roll on an east aspect at 1400m elevation. The slab was 10-30cm thick and slide on a crust. A natural size 3.5 was reported north of Stewart on a north aspect at 2200m. This was 3-4m thick and failed on the ground on a steep glaciated feature. Numerous natural cornice failures and loose wet avalanches were reported up to size 2. The continued warming, sun exposure, and limited overnight recovery means natural cornice releases, deep persistent slab releases, and loose sluffing are expected to continue on Friday. Very large avalanches will remain a serious concern until the region gets some sustained periods of cooling,')]

Exercise

See exercise 1 in the exercise list.

Predicting deaths from forecast text is very difficult because deaths are so rare.

A prediction exercise more likely to succeed would be to predict the avalanche rating from the forecast text. However, doing so would be a rather artificial task with little practical use.

Another alternative would be to gather more data on non-fatal avalanches.

Avalanche Canada also has user-submitted “Mountain Information Network” reports. These reports include observations of natural avalanches and information on non-fatal avalanche incidents.

Since the data is user-submitted, it is messy and more difficult to work with. Nonetheless, working with it would be good practice and could lead to some insights.

Unsupervised Learning#

The regression and classification methods that we have seen so far are examples of supervised learning — we are trying to predict an observed outcome.

In unsupervised learning, we do not have an observed outcome to predict. Instead, we try to find informative patterns in the data.

Unsupervised learning can be particularly useful with text data. We will look at two related techniques for topic modeling, which attempt to extract distinct topics from a collection of text documents.

Latent Semantic Analysis#

Latent semantic analysis is used by some search engines to rank the similarities among documents.

Latent semantic analysis begins with a term-document matrix, \(X\), which has one row per document and one column per term; its \(i,j\)th entry measures how often term \(j\) appears in document \(i\).

This could be the same bag of words feature matrix we constructed above, or it could be some other measure. For this example, we will use the term-frequency inverse-document-frequency representation:

\[ x^{tfidf}_{ij} = \frac{\text{occurrences of term } j \text{ in document } i}{\text{length of document } i} \log \left(\frac{\text{number of documents}}{\text{number of documents containing term } j}\right) \]

Given a term-document matrix, \(X\), latent semantic analysis computes a lower-rank approximation to \(X\) through the singular value decomposition.

This lower-rank approximation can potentially be interpreted, or used in place of \(X\) by other learning algorithms. In other contexts, similar decompositions are referred to as principal components analysis or factor models.
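
To see what a lower-rank approximation looks like, here is a minimal numpy sketch on a small random matrix (not the forecast data): keeping only the k largest singular values gives the best rank-k approximation in the Frobenius norm.

# rank-2 SVD approximation of a random 5x4 matrix
rng = np.random.default_rng(0)
M = rng.random((5, 4))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(M - M_k))  # approximation error in the Frobenius norm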

# LSA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
tfidf_vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True)
X = tfidf_vectorizer.fit_transform([' '.join(doc) for doc in text_data])
svd_model = TruncatedSVD(n_components=10, algorithm='randomized', n_iter=100, random_state=122)
svd_model.fit(X)
TruncatedSVD(n_components=10, n_iter=100, random_state=122)

Here, we have computed a rank 10 approximation to the tf-idf matrix.

We can see how much of the original matrix’s variance our 10 components reproduce. We can also look at how the terms contribute to each of the 10 components.

print(svd_model.explained_variance_ratio_)
print(svd_model.explained_variance_ratio_.cumsum())
terms = tfidf_vectorizer.get_feature_names_out() 
comp_label=[]
for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:7]
    print("Topic "+str(i)+": ")
    message = ""
    for t in sorted_terms:
        message = message + "{:.2f} * {} + ".format(t[1],t[0])
    print(message)
    comp_label.append(message)
[0.0138138  0.03554883 0.03357888 0.01687638 0.01410824 0.01372995
 0.01263796 0.01076844 0.01056224 0.00848542]
[0.0138138  0.04936263 0.08294151 0.0998179  0.11392614 0.12765609
 0.14029406 0.15106249 0.16162474 0.17011015]
Topic 0: 
0.38 * avalanche + 0.27 * size + 0.26 * slab + 0.25 * reported + 0.21 * new + 0.18 * triggered + 0.17 * wind + 
Topic 1: 
0.69 * new + 0.38 * reported + 0.24 * avalanche + 0.11 * observation + 0.10 * activity + 0.09 * observed + 0.09 * region + 
Topic 2: 
1.00 * emptystring + 0.00 * size + 0.00 * slab + 0.00 * triggered + 0.00 * aspect + 0.00 * wind + 0.00 * skier + 
Topic 3: 
0.43 * recent + 0.41 * observation + 0.25 * please + 0.25 * information + 0.25 * mountain + 0.22 * network + 0.18 * region + 
Topic 4: 
0.43 * reported + 0.41 * recent + 0.21 * loose + 0.16 * wet + 0.13 * avalanche + 0.11 * steep + 0.10 * solar + 
Topic 5: 
0.39 * loose + 0.32 * wet + 0.28 * observed + 0.27 * steep + 0.22 * terrain + 0.17 * solar + 0.15 * dry + 
Topic 6: 
0.35 * activity + 0.21 * recent + 0.20 * may + 0.19 * snow + 0.16 * storm + 0.14 * likely + 0.13 * rain + 
Topic 7: 
0.42 * wind + 0.24 * terrain + 0.23 * recent + 0.23 * slab + 0.18 * steep + 0.17 * small + 0.15 * snow + 
Topic 8: 
0.43 * observed + 0.34 * recent + 0.32 * report + 0.29 * activity + 0.21 * today + 0.12 * nothing + 0.12 * avalanche + 
Topic 9: 
0.43 * activity + 0.40 * report + 0.16 * natural + 0.09 * size + 0.09 * new + 0.06 * monday + 0.05 * please + 

Finally, we can attempt to visualize the components.

The LSA reduced the dimensionality of the document representation from thousands of tf-idf terms to ten components. While ten is more manageable than thousands, it is still too many dimensions to visualize effectively. t-SNE is a technique to reduce dimensionality further. It is a nonlinear transformation from many dimensions down to two that attempts to preserve the clustering structure of the original space. In other words, if a set of documents is closely clustered in the 10 dimensional LSA space, then those documents will also be close together in the 2 dimensional t-SNE representation.

lsa_topic_matrix = svd_model.transform(X)
from sklearn.manifold import TSNE
nplot = 2000 # reduce the size of the data to speed computation and make the plot less cluttered
lsa_topic_sample = lsa_topic_matrix[np.random.choice(lsa_topic_matrix.shape[0], nplot, replace=False)]
tsne_lsa_model = TSNE(n_components=2, perplexity=50, learning_rate=500,
                      n_iter=1000, verbose=10, random_state=0, angle=0.75)
tsne_lsa_vectors = tsne_lsa_model.fit_transform(lsa_topic_sample)
[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 2000 samples in 0.002s...
[t-SNE] Computed neighbors for 2000 samples in 0.126s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2000
[t-SNE] Computed conditional probabilities for sample 2000 / 2000
[t-SNE] Mean sigma: 0.000000
[t-SNE] Computed conditional probabilities in 0.081s
[t-SNE] Iteration 50: error = 71.2528305, gradient norm = 0.0649942 (50 iterations in 0.408s)
[t-SNE] Iteration 100: error = 71.1580353, gradient norm = 0.0637116 (50 iterations in 0.298s)
[t-SNE] Iteration 150: error = 71.2348251, gradient norm = 0.0666469 (50 iterations in 0.309s)
[t-SNE] Iteration 200: error = 71.3684845, gradient norm = 0.0714017 (50 iterations in 0.281s)
[t-SNE] Iteration 250: error = 71.3397064, gradient norm = 0.0670542 (50 iterations in 0.294s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 71.339706
[t-SNE] Iteration 300: error = 1.4851254, gradient norm = 0.0032509 (50 iterations in 0.238s)
[t-SNE] Iteration 350: error = 1.3660066, gradient norm = 0.0023993 (50 iterations in 0.241s)
[t-SNE] Iteration 400: error = 1.3270653, gradient norm = 0.0018702 (50 iterations in 0.241s)
[t-SNE] Iteration 450: error = 1.3099629, gradient norm = 0.0010042 (50 iterations in 0.240s)
[t-SNE] Iteration 500: error = 1.2990619, gradient norm = 0.0008214 (50 iterations in 0.243s)
[t-SNE] Iteration 550: error = 1.2897600, gradient norm = 0.0008391 (50 iterations in 0.248s)
[t-SNE] Iteration 600: error = 1.2823085, gradient norm = 0.0008226 (50 iterations in 0.244s)
[t-SNE] Iteration 650: error = 1.2753701, gradient norm = 0.0007648 (50 iterations in 0.247s)
[t-SNE] Iteration 700: error = 1.2710071, gradient norm = 0.0006577 (50 iterations in 0.247s)
[t-SNE] Iteration 750: error = 1.2671975, gradient norm = 0.0006257 (50 iterations in 0.251s)
[t-SNE] Iteration 800: error = 1.2644715, gradient norm = 0.0006654 (50 iterations in 0.248s)
[t-SNE] Iteration 850: error = 1.2614040, gradient norm = 0.0005994 (50 iterations in 0.254s)
[t-SNE] Iteration 900: error = 1.2583418, gradient norm = 0.0005160 (50 iterations in 0.251s)
[t-SNE] Iteration 950: error = 1.2559053, gradient norm = 0.0005859 (50 iterations in 0.256s)
[t-SNE] Iteration 1000: error = 1.2524540, gradient norm = 0.0007028 (50 iterations in 0.253s)
[t-SNE] KL divergence after 1000 iterations: 1.252454

The t-SNE model creates a nonlinear projection from our 10 dimensional LSA topics onto two dimensional space. It can be useful for visualizing high-dimensional data. One word of caution: the output of the t-SNE model can depend heavily on the parameters of the algorithm. Failure to see clear clusters in a t-SNE visualization could mean either that the original data was not clustered in the higher dimensional space or that the t-SNE parameters were chosen poorly.
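One quick robustness check is to rerun t-SNE at several perplexity values and compare the embeddings; clusters that appear and disappear as the perplexity changes should not be over-interpreted. Here is a minimal sketch, reusing the lsa_topic_sample array from above (the three perplexity values are arbitrary illustrative choices):

from sklearn.manifold import TSNE

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perp in zip(axes, [5, 30, 100]):
    # re-embed the same sample at a different perplexity
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(lsa_topic_sample)
    ax.scatter(emb[:, 0], emb[:, 1], s=3, alpha=0.5)
    ax.set_title("perplexity = " + str(perp))
fig.tight_layout()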

import matplotlib
cmap = matplotlib.colormaps['Paired']  # registry lookup; replaces the deprecated matplotlib.cm.get_cmap
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
n_topics = len(svd_model.components_)
# color each document by the LSA topic with the largest weight
lsa_keys = np.argmax(lsa_topic_sample, axis=1)
ax[0].scatter(x=tsne_lsa_vectors[:,0], y=tsne_lsa_vectors[:,1], color=[cmap(i) for i in lsa_keys], alpha=0.8)
bbox_props = dict(boxstyle="round4,pad=0.1", lw=0.2, fc="white")
for i in range(n_topics):
    # label each topic at the centroid of its documents
    m = tsne_lsa_vectors[lsa_keys==i, :].mean(axis=0)
    ax[0].text(m[0], m[1], str(i), ha="center", va="center",
               size=15, color=cmap(i),
               bbox=bbox_props)
    ax[1].text(0, 1-(i+1)*1/(n_topics+1), "Topic " + str(i) + " : " + comp_label[i],
               ha="left", va="center", color=cmap(i))
ax[1].axis('off')
fig.tight_layout()
[Figure: t-SNE embedding of forecasts colored by their dominant LSA topic (left), with the topic legend (right)]

From this plot, we can immediately see two things. First, most documents are closest to topic 0. Second, most topics are not well-separated.

Exercise

See exercise 2 in the exercise list.

Latent Dirichlet Allocation#

Latent Dirichlet allocation (LDA) produces output similar to latent semantic analysis, but it often yields more interpretable topics. The statistical theory underlying LSA is built on continuous \(X\) features. LDA uses similar ideas, but models text as the discrete data it is: each document is a mixture of topics, and each topic is a distribution over words.
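In the standard formulation, each document \(d\) draws a vector of topic shares \(\theta_d\), each topic \(k\) draws a distribution over words \(\beta_k\), and each word in a document is generated by first drawing a topic and then drawing a word from that topic:

\[ \theta_d \sim \text{Dirichlet}(\alpha), \quad \beta_k \sim \text{Dirichlet}(\eta), \quad z_{dn} \sim \text{Categorical}(\theta_d), \quad w_{dn} \sim \text{Categorical}(\beta_{z_{dn}}) \]

Estimation recovers the topic-word distributions \(\beta_k\) and the per-document topic shares \(\theta_d\) from the observed words.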

# LDA
import gensim
# gensim works with a list of lists of tokens
text_data = [text_prep(txt) for txt in forecasts.avalancheSummary]
# convert to bag of words
dictionary = gensim.corpora.Dictionary(text_data)
bow_data = [dictionary.doc2bow(text) for text in text_data]
# fit a 5 topic LDA model, making 15 passes over the corpus
ldamodel = gensim.models.ldamodel.LdaModel(bow_data, num_topics=5, id2word=dictionary, passes=15)
topics = ldamodel.print_topics(num_words=10)
for topic in topics:
    print(topic)
(0, '0.062*"size" + 0.055*"slab" + 0.048*"avalanche" + 0.033*"triggered" + 0.028*"aspect" + 0.021*"wind" + 0.020*"natural" + 0.019*"storm" + 0.018*"reported" + 0.017*"skier"')
(1, '0.077*"avalanche" + 0.035*"new" + 0.030*"slab" + 0.028*"loose" + 0.028*"activity" + 0.027*"reported" + 0.025*"wet" + 0.024*"wind" + 0.022*"snow" + 0.019*"storm"')
(2, '0.051*"avalanche" + 0.036*"layer" + 0.032*"persistent" + 0.025*"weak" + 0.020*"slab" + 0.020*"snowpack" + 0.019*"min" + 0.018*"large" + 0.017*"report" + 0.013*"observation"')
(3, '0.059*"avalanche" + 0.031*"size" + 0.020*"natural" + 0.014*"aspect" + 0.013*"large" + 0.012*"cycle" + 0.011*"report" + 0.010*"3" + 0.010*"occurred" + 0.010*"region"')
(4, '0.031*"snow" + 0.023*"avalanche" + 0.019*"slab" + 0.016*"freezing" + 0.015*"season" + 0.012*"level" + 0.012*"year" + 0.012*"good" + 0.011*"start" + 0.010*"time"')
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
# interactive visualization of topic distances and the top terms per topic
lda_display = pyLDAvis.gensim_models.prepare(ldamodel, bow_data, dictionary)
lda_display

Exercise

See exercise 3 in the exercise list.

Exercise

See exercise 4 in the exercise list.

Exercises#

Exercise 1#

Use another classification method to predict incidents. Check whether your method outperforms the Naive Bayes classifier.

(back to text)

Exercise 2#

Apply LSA to the weather or snowpack descriptions. Can you notice any patterns?

(back to text)

Exercise 3#

Apply LDA to the weather or snowpack descriptions. Can you notice any patterns?

(back to text)

Exercise 4#

Use the reduced rank representation of text from LSA or LDA as a feature matrix to predict avalanche incidents. Compare the performance with the bag of words feature matrix.

(back to text)