Machine Learning (a primer)

(Portland Python, March 2008) (john melesky)

In a nutshell

Take facts, turn them into knowledge, algorithmically.

Discovering things

Also known as "unsupervised learning", it's what you do when you have a whole lot of unstructured data you know little about.

Spellcheck, Google-style

The problem: check the spelling of things that aren't in the dictionary

Indigo Montoya
Inigo Montana
Inigo Montoya
Neego Montoya
Inigo Mantoya

Spellcheck

Numbers we have include: the number of times each query is made, and the distance between queries (e.g., Levenshtein edit distance)

When a new query comes in, find the most common query within a short distance and suggest it.

And that's it.

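A minimal sketch of the idea in Python (the query log and its counts are invented for illustration):

  def levenshtein(a, b):
      """Classic dynamic-programming edit distance."""
      prev = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          curr = [i]
          for j, cb in enumerate(b, 1):
              cost = 0 if ca == cb else 1
              curr.append(min(prev[j] + 1,          # deletion
                              curr[j - 1] + 1,      # insertion
                              prev[j - 1] + cost))  # substitution
          prev = curr
      return prev[-1]

  # Hypothetical query log: query -> number of times it was made.
  query_counts = {
      "inigo montoya": 12000,
      "indigo montoya": 300,
      "inigo montana": 150,
  }

  def suggest(query, max_distance=2):
      """Suggest the most common known query within a short distance."""
      nearby = [(count, q) for q, count in query_counts.items()
                if levenshtein(query, q) <= max_distance]
      return max(nearby)[1] if nearby else query

  print(suggest("inigo mantoya"))  # -> 'inigo montoya'
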
Clustering

Problem: given a big pile of documents, figure out what different categories there are.

Solution: (a whole lot of) simple (high-dimensional) geometry

Technique: k-Means Clustering

  1. Pick some (k) random points in your vector space.
  2. Assign each document to its nearest point.
  3. Move each point to the mean of the documents assigned to it.
  4. Lather, rinse, repeat (steps 2 and 3; sketched below).
  5. Voila! Slow-cooked category discovery.

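In code, the loop might look like this (a minimal sketch, assuming each document has already been turned into a vector of numbers):

  import random

  def distance_sq(a, b):
      return sum((x - y) ** 2 for x, y in zip(a, b))

  def mean(vectors):
      return [sum(dim) / float(len(vectors)) for dim in zip(*vectors)]

  def kmeans(docs, k, iterations=20):
      # 1. Pick k random points in the vector space.
      centers = random.sample(docs, k)
      for _ in range(iterations):
          # 2. Assign each document to its nearest point.
          clusters = [[] for _ in range(k)]
          for doc in docs:
              nearest = min(range(k), key=lambda i: distance_sq(doc, centers[i]))
              clusters[nearest].append(doc)
          # 3. Move each point to the mean of its documents
          #    (keep the old point if a cluster came up empty).
          centers = [mean(cluster) if cluster else centers[i]
                     for i, cluster in enumerate(clusters)]
      return centers, clusters
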
Supervised Learning

When you already know something about your data, and you want to apply that knowledge to new, less-known data.

Classification

You have 100 documents in two different categories. Predict the category for the next 5000 documents.

Technique: Nearest Neighbor

  1. Plot your knowns
  2. Figure out the closest known to your unknown (geometrically)
  3. Give your unknown that neighbor's category (see the sketch below)

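A minimal sketch (the vectors and categories here are made up):

  def distance_sq(a, b):
      return sum((x - y) ** 2 for x, y in zip(a, b))

  def classify(unknown, knowns):
      """knowns is a list of (vector, category) pairs."""
      vector, category = min(knowns, key=lambda kc: distance_sq(unknown, kc[0]))
      return category

  knowns = [([1.0, 0.2], "spam"), ([0.1, 0.9], "ham"), ([0.9, 0.3], "spam")]
  print(classify([0.2, 0.8], knowns))  # -> 'ham'
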
Technique: Linear Separation

  1. Plot your knowns
  2. Figure out a line separating the categories (one way is sketched below)
  3. Use that line to classify the unknowns

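A minimal sketch, using the perceptron rule as one (assumed) way to find the line, for 2-D points labelled +1 or -1:

  def train(points, epochs=100, rate=0.1):
      """points is a list of ((x, y), label) pairs, with label +1 or -1.
      Returns weights (w0, w1, w2) for the line w0 + w1*x + w2*y = 0."""
      w = [0.0, 0.0, 0.0]
      for _ in range(epochs):
          for (x, y), label in points:
              predicted = 1 if w[0] + w[1] * x + w[2] * y > 0 else -1
              if predicted != label:
                  # Misclassified: nudge the line toward this point.
                  w[0] += rate * label
                  w[1] += rate * label * x
                  w[2] += rate * label * y
      return w

  def classify(w, x, y):
      return 1 if w[0] + w[1] * x + w[2] * y > 0 else -1

(The perceptron only finds a line when one exists, which is exactly where the next slide picks up.)
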
Non-linearly separable data

Sometimes no straight line can separate the categories, no matter where you draw it.

Technique: Support Vector Machines

The rough idea: map your data into a higher-dimensional space where a linear separator does exist.

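Not a real SVM, just the key idea in code: these 1-D points (invented for illustration) can't be split by any single threshold, but mapping x -> (x, x**2) lifts them into 2-D, where a horizontal line separates them:

  points = [(-2.0, "far"), (-1.5, "far"), (-0.5, "near"),
            (0.3, "near"), (1.2, "far"), (2.1, "far")]

  # Lift each point into two dimensions.
  mapped = [((x, x * x), label) for x, label in points]

  # In the lifted space, the line x2 == 1 separates the categories.
  for (x1, x2), label in mapped:
      predicted = "far" if x2 > 1 else "near"
      assert predicted == label
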
Technique: Naive Bayesian Classifiers

Not geometric, but statistical.

Bayes' Theorem

Future probabilities derived from prior probabilities

If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?

(hint: it's not 95%)

Answer: Depends on how many people use drugs.

If the rate of drug use is 1%, then we have:

              test positive    test negative
  users       95% of 1%        5% of 1%
  non-users   5% of 99%        95% of 99%

Rate of positive results: 0.95% + 4.95% == 5.9%

Share of those positives that are correct: 0.95% / 5.9% == 16.1%

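The same arithmetic in Python, following Bayes' theorem, P(user | positive) = P(positive | user) * P(user) / P(positive), and assuming the 95% accuracy holds for users and non-users alike:

  accuracy = 0.95
  drug_use_rate = 0.01

  true_positives = accuracy * drug_use_rate               # 95% of 1%  == 0.95%
  false_positives = (1 - accuracy) * (1 - drug_use_rate)  # 5% of 99%  == 4.95%
  all_positives = true_positives + false_positives        # 5.9%

  print(true_positives / all_positives)  # -> 0.1610..., about 16.1%
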
The basic process

  1. Look at your data, figure out a good numeric representation
  2. Turn your data into numbers (usually vectors of numbers; see the sketch after this list)
  3. Run your algorithms
  4. Profit! (or Fun!)

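A minimal sketch of step 2 for text, using word counts (a bag-of-words representation; the documents are made up):

  def vectorize(docs):
      vocabulary = sorted(set(word for doc in docs for word in doc.split()))
      index = dict((word, i) for i, word in enumerate(vocabulary))
      vectors = []
      for doc in docs:
          counts = [0] * len(vocabulary)
          for word in doc.split():
              counts[index[word]] += 1
          vectors.append(counts)
      return vocabulary, vectors

  vocab, vectors = vectorize(["the cat sat", "the cat ate the rat"])
  print(vocab)    # ['ate', 'cat', 'rat', 'sat', 'the']
  print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
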
Figuring out a good representation