Machine Learning (a primer)

(Portland Python, March 2008) (john melesky)

In a nutshell

Take facts, turn them into knowledge, algorithmically.

Discovering things

Also known as "unsupervised learning", it's what you do when you have a whole lot of unstructured data you know little about.

Spellcheck, Google-style

The problem: check the spelling of things that aren't in the dictionary

Indigo Montoya
Inigo Montana
Inigo Montoya
Neego Montoya
Inigo Mantoya

Spellcheck

Numbers we have include: the number of times each query is made, and the distance between queries (e.g., Levenshtein edit distance)

When a new query comes in, find the most common query within a short distance and suggest it.

And that's it.

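A minimal sketch of the idea in Python (the query log and its counts are invented for illustration):

  def levenshtein(a, b):
      """Classic dynamic-programming edit distance."""
      prev = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          curr = [i]
          for j, cb in enumerate(b, 1):
              cost = 0 if ca == cb else 1
              curr.append(min(prev[j] + 1,          # deletion
                              curr[j - 1] + 1,      # insertion
                              prev[j - 1] + cost))  # substitution
          prev = curr
      return prev[-1]

  # Hypothetical query log: query -> number of times it was made.
  query_counts = {
      "inigo montoya": 12000,
      "indigo montoya": 300,
      "inigo montana": 150,
  }

  def suggest(query, max_distance=2):
      """Suggest the most common known query within a short distance."""
      nearby = [(count, q) for q, count in query_counts.items()
                if levenshtein(query, q) <= max_distance]
      return max(nearby)[1] if nearby else query

  print(suggest("inigo mantoya"))  # -> 'inigo montoya'
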
Clustering

Problem: given a big pile of documents, figure out what different categories there are.

Solution: (a whole lot of) simple (high-dimensional) geometry

Technique: k-Means Clustering

  1. Pick some (k) random points in your vector space.
  2. Assign each document to its nearest point.
  3. Move each point to the mean of the documents assigned to it.
  4. Lather, rinse, repeat (steps 2 and 3; sketched below).
  5. Voila! Slow-cooked category discovery.

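In code, the loop might look like this (a minimal sketch, assuming each document has already been turned into a vector of numbers):

  import random

  def distance_sq(a, b):
      return sum((x - y) ** 2 for x, y in zip(a, b))

  def mean(vectors):
      return [sum(dim) / float(len(vectors)) for dim in zip(*vectors)]

  def kmeans(docs, k, iterations=20):
      # 1. Pick k random points in the vector space.
      centers = random.sample(docs, k)
      for _ in range(iterations):
          # 2. Assign each document to its nearest point.
          clusters = [[] for _ in range(k)]
          for doc in docs:
              nearest = min(range(k), key=lambda i: distance_sq(doc, centers[i]))
              clusters[nearest].append(doc)
          # 3. Move each point to the mean of its documents
          #    (keep the old point if a cluster came up empty).
          centers = [mean(cluster) if cluster else centers[i]
                     for i, cluster in enumerate(clusters)]
      return centers, clusters
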
Supervised Learning

When you already know something about your data, and you want to apply that knowledge to new, less-known data.

Classification

You have 100 documents in two different categories. Predict the category for the next 5000 documents.

Technique: Nearest Neighbor

  1. Plot your knowns
  2. Figure out the closest known to your unknown (geometrically)
  3. Give your unknown that neighbor's category (see the sketch below)

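A minimal sketch (the vectors and categories here are made up):

  def distance_sq(a, b):
      return sum((x - y) ** 2 for x, y in zip(a, b))

  def classify(unknown, knowns):
      """knowns is a list of (vector, category) pairs."""
      vector, category = min(knowns, key=lambda kc: distance_sq(unknown, kc[0]))
      return category

  knowns = [([1.0, 0.2], "spam"), ([0.1, 0.9], "ham"), ([0.9, 0.3], "spam")]
  print(classify([0.2, 0.8], knowns))  # -> 'ham'
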
Technique: Linear Separation

  1. Plot your knowns
  2. Figure out a line separating the categories (one way is sketched below)
  3. Use that line to classify the unknowns

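A minimal sketch, using the perceptron rule as one (assumed) way to find the line, for 2-D points labelled +1 or -1:

  def train(points, epochs=100, rate=0.1):
      """points is a list of ((x, y), label) pairs, with label +1 or -1.
      Returns weights (w0, w1, w2) for the line w0 + w1*x + w2*y = 0."""
      w = [0.0, 0.0, 0.0]
      for _ in range(epochs):
          for (x, y), label in points:
              predicted = 1 if w[0] + w[1] * x + w[2] * y > 0 else -1
              if predicted != label:
                  # Misclassified: nudge the line toward this point.
                  w[0] += rate * label
                  w[1] += rate * label * x
                  w[2] += rate * label * y
      return w

  def classify(w, x, y):
      return 1 if w[0] + w[1] * x + w[2] * y > 0 else -1

(The perceptron only finds a line when one exists, which is exactly where the next slide picks up.)
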
Non-linearly separable data

Sometimes no straight line can separate the categories, no matter where you draw it.

Technique: Support Vector Machines

The rough idea: map your data into a higher-dimensional space where a linear separator does exist.

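Not a real SVM, just the key idea in code: these 1-D points (invented for illustration) can't be split by any single threshold, but mapping x -> (x, x**2) lifts them into 2-D, where a horizontal line separates them:

  points = [(-2.0, "far"), (-1.5, "far"), (-0.5, "near"),
            (0.3, "near"), (1.2, "far"), (2.1, "far")]

  # Lift each point into two dimensions.
  mapped = [((x, x * x), label) for x, label in points]

  # In the lifted space, the line x2 == 1 separates the categories.
  for (x1, x2), label in mapped:
      predicted = "far" if x2 > 1 else "near"
      assert predicted == label
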
Technique: Naive Bayesian Classifiers

Not geometric, but statistical.

Bayes' Theorem

Future probabilities derived from prior probabilities

If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?

(hint: it's not 95%)

Answer: Depends on how many people use drugs.

If the rate of drug use is 1%, then we have:

              test positive    test negative
  users       95% of 1%        5% of 1%
  non-users   5% of 99%        95% of 99%

Rate of positive results: 0.95% + 4.95% == 5.9%

Share of those positives that are correct: 0.95% / 5.9% == 16.1%

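The same arithmetic in Python, following Bayes' theorem, P(user | positive) = P(positive | user) * P(user) / P(positive), and assuming the 95% accuracy holds for users and non-users alike:

  accuracy = 0.95
  drug_use_rate = 0.01

  true_positives = accuracy * drug_use_rate               # 95% of 1%  == 0.95%
  false_positives = (1 - accuracy) * (1 - drug_use_rate)  # 5% of 99%  == 4.95%
  all_positives = true_positives + false_positives        # 5.9%

  print(true_positives / all_positives)  # -> 0.1610..., about 16.1%
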
The basic process

  1. Look at your data, figure out a good numeric representation
  2. Turn your data into numbers (usually vectors of numbers; see the sketch after this list)
  3. Run your algorithms
  4. Profit! (or Fun!)

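A minimal sketch of step 2 for text, using word counts (a bag-of-words representation; the documents are made up):

  def vectorize(docs):
      vocabulary = sorted(set(word for doc in docs for word in doc.split()))
      index = dict((word, i) for i, word in enumerate(vocabulary))
      vectors = []
      for doc in docs:
          counts = [0] * len(vocabulary)
          for word in doc.split():
              counts[index[word]] += 1
          vectors.append(counts)
      return vocabulary, vectors

  vocab, vectors = vectorize(["the cat sat", "the cat ate the rat"])
  print(vocab)    # ['ate', 'cat', 'rat', 'sat', 'the']
  print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
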
Figuring out a good representation