<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html> <head>
<title>Machine Learning for Fun and Profit</title>
<style>
.slide {
border: 2px solid #000066;
background-color: #CCCCFF;
position: absolute;
padding: 5%;
width: 85%;
height: 80%;
}
.red {
background-color: #FF8888;
}
</style>
<script src="scripts/jquery-1.2.3.js" type="text/javascript"></script>
<script src="scripts/slideshow.js" type="text/javascript"></script>
</head>
<body>
<div class='slide'>
<h1>Machine Learning (a primer)</h1>
<p>(Portland Python, March 2008) (john melesky)</p>
</div>
<div class='slide'>
<h1>In a nutshell</h1>
<p>Take facts, turn them into knowledge.</p>
</div>
<div class='slide'>
<h1>In a nutshell</h1>
<p>Take facts, turn them into knowledge, algorithmically.</p>
</div>
<div class='slide'>
<h1>Discovering things</h1>
<p>Also known as "unsupervised learning", it's what you do when you have a whole lot of unstructured data you know little about.</p>
</div>
<div class='slide'>
<h1>Spellcheck, Google-style</h1>
<p>The problem: check the spelling of things that aren't in the dictionary</p>
</div>
<div class='slide'>
<h1>Spellcheck, Google-style</h1>
<p>The problem: check the spelling of things that aren't in the dictionary</p>
Indigo Montoya<br/>
Inigo Montana<br/>
Inigo Montoya<br/>
Neego Montoya<br/>
Inigo Mantoya<br/>
</div>
<div class='slide'>
<h1>Spellcheck</h1>
<p>Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)</p>
</div>
<div class='slide'>
<h1>Spellcheck</h1>
<p>Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)</p>
<p>When a new query comes in, find the most common query within a short distance and suggest it.</p>
</div>
<div class='slide'>
<h1>Spellcheck</h1>
<p>Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)</p>
<p>When a new query comes in, find the most common query within a short distance and suggest it.</p>
<p>And that's it.</p>
</div>
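<div class='slide'>
<h1>Spellcheck (a sketch)</h1>
<p>The approach above fits in a few lines of Python. The query log (<code>query_counts</code>) is invented for illustration; the distance function is the standard dynamic-programming Levenshtein edit distance.</p>

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def suggest(query, query_counts, max_dist=2):
    # Among past queries a short edit distance away,
    # suggest the one that was made most often.
    nearby = [(count, q) for q, count in query_counts.items()
              if levenshtein(query, q) <= max_dist]
    return max(nearby)[1] if nearby else query

query_counts = {"inigo montoya": 120, "indigo montoya": 3,
                "inigo montana": 2, "neego montoya": 1}
print(suggest("inigo mantoya", query_counts))  # inigo montoya
```
</div>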
<div class='slide'>
<h1>Clustering</h1>
<p>Problem: given a big pile of documents, figure out what different categories there are.</p>
</div>
<div class='slide'>
<h1>Clustering</h1>
<p>Solution: simple geometry</p>
</div>
<div class='slide'>
<h1>Clustering</h1>
<p>Solution: simple (high-dimensional) geometry</p>
</div>
<div class='slide'>
<h1>Clustering</h1>
<p>Solution: (a whole lot of) simple (high-dimensional) geometry</p>
</div>
<div class='slide'>
<h1>Technique: k-Means Clustering</h1>
<ol>
<li>Pick some (k) random points in your vector space.</li>
<li>For each document, figure out the nearest point.</li>
<li>Move each point to the mean of the documents assigned to it.</li>
<li>Lather, rinse, repeat until the points stop moving.</li>
<li>Voila! Slow-cooked category discovery</li>
</ol>
</div>
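<div class='slide'>
<h1>k-Means (a sketch)</h1>
<p>Those steps in plain Python, no libraries. Documents are assumed to already be vectors (tuples of numbers); the names here are invented for illustration.</p>

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(docs, k, iterations=10):
    # Step 1: pick k random documents as starting points.
    centers = random.sample(docs, k)
    for _ in range(iterations):
        # Step 2: assign each document to its nearest point.
        clusters = [[] for _ in range(k)]
        for doc in docs:
            nearest = min(range(k), key=lambda i: dist2(doc, centers[i]))
            clusters[nearest].append(doc)
        # Step 3: move each point to the mean of its documents.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = tuple(sum(vals) / len(cluster)
                                   for vals in zip(*cluster))
    return centers
```

<p>Each final center sits in the "middle" of one discovered category.</p>
</div>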
<div class='slide'>
<h1>Supervised Learning</h1>
<p>When you already know something about your data, and you want to apply that knowledge to more data you know less about</p>
</div>
<div class='slide'>
<h1>Classification</h1>
<p>You have 100 documents in two different categories. Predict the category for the next 5000 documents.</p>
</div>
<div class='slide'>
<h1>Technique: Nearest Neighbor</h1>
<ol>
<li>Plot your knowns</li>
<li>Figure out the closest known to your unknown (geometrically)</li>
</ol>
</div>
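<div class='slide'>
<h1>Nearest Neighbor (a sketch)</h1>
<p>The two steps above, as Python. The labeled points and category names are invented for illustration.</p>

```python
def dist2(a, b):
    # Squared Euclidean distance; fine for comparisons.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(unknown, knowns):
    # knowns is a list of (vector, category) pairs.
    # The geometrically closest known point lends its category.
    vector, category = min(knowns, key=lambda vc: dist2(unknown, vc[0]))
    return category

knowns = [((0.0, 0.0), "spam"), ((1.0, 1.0), "spam"),
          ((9.0, 9.0), "ham"), ((10.0, 10.0), "ham")]
print(classify((8.0, 8.5), knowns))  # ham
```
</div>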
<div class='slide'>
<h1>Technique: Linear Separation</h1>
<ol>
<li>Plot your knowns</li>
<li>Figure out a line separating the categories</li>
<li>Use that line to classify the unknowns</li>
</ol>
</div>
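<div class='slide'>
<h1>Linear Separation (a sketch)</h1>
<p>The slide doesn't name an algorithm for step 2; the perceptron is one classic way to find such a line. A sketch on invented, linearly separable data:</p>

```python
def perceptron(points, epochs=100):
    # points: list of ((x, y), label) pairs, labels +1 or -1.
    # Learns w1, w2, b so that sign(w1*x + w2*y + b) matches the
    # labels; w1*x + w2*y + b == 0 is the separating line.
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x, y), label in points:
            if label * (w1 * x + w2 * y + b) <= 0:  # misclassified
                w1 += label * x
                w2 += label * y
                b += label
    return w1, w2, b

points = [((0.0, 0.0), -1), ((1.0, 0.0), -1),
          ((3.0, 3.0), +1), ((4.0, 4.0), +1)]
w1, w2, b = perceptron(points)
# Step 3: classify an unknown by which side of the line it lands on.
side = w1 * 2.5 + w2 * 2.5 + b  # positive, so the +1 category
```
</div>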
<div class='slide'>
<h1>Linear Separation: Step 1</h1>
<img src="media/basedata.png" />
</div>
<div class='slide'>
<h1>Linear Separation: Step 2</h1>
<img src="media/cleansep.png" />
</div>
<div class='slide'>
<h1>Non-linearly separable data</h1>
<img src="media/badset1.png" />
</div>
<div class='slide'>
<h1>Non-linearly separable data</h1>
<img src="media/badset2.png" />
</div>
<div class='slide'>
<h1>Non-linearly separable data</h1>
<img src="media/badset3.png" />
</div>
<div class='slide'>
<h1>Light-bulb jokes</h1>
</div>
<div class='slide'>
<h1>Technique: Support Vector Machines</h1>
</div>
<div class='slide'>
<h1>Technique: Support Vector Machines</h1>
<img height='450' src="media/dsc01228-02-h.jpg" />
</div>
<div class='slide'>
<h1>Technique: Naive Bayesian Classifiers</h1>
<p>Not geometric, but statistical.</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Future probabilities derived from prior probabilities</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?</p>
<p>(hint: it's not 95%)</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Answer: Depends on how many people use drugs.</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Answer: Depends on how many people use drugs.</p>
<p>If the rate of drug use is 1%, then we have:</p>
<center>
<table border="1">
<tr><th></th><th>test positive</th><th>test negative</th></tr>
<tr><th>users</th><td>95% of 1%</td><td>5% of 1%</td></tr>
<tr><th>non-users</th><td>5% of 99%</td><td>95% of 99%</td></tr>
</table>
</center>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Answer: Depends on how many people use drugs.</p>
<p>Share of people who test positive: 0.95% + 4.95% == 5.9%</p>
<p>Share of positive results that are <i>correct</i>: 0.95% / 5.9% == 16.1%</p>
</div>
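<div class='slide'>
<h1>Bayes' Theorem (the arithmetic)</h1>
<p>The same arithmetic in Python, assuming a 1% rate of drug use and a test that is 95% accurate in both directions:</p>

```python
def p_user_given_positive(prior, accuracy):
    # Bayes' theorem:
    #   P(user | positive) = P(positive | user) * P(user) / P(positive)
    true_positives = accuracy * prior                # 95% of 1%
    false_positives = (1 - accuracy) * (1 - prior)   # 5% of 99%
    return true_positives / (true_positives + false_positives)

p = p_user_given_positive(prior=0.01, accuracy=0.95)
print("%.1f%%" % (p * 100))  # 16.1%
```
</div>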
<div class='slide'>
<h1></h1>
</div>
<div class='slide'>
<h1>The basic process</h1>
<ol>
<li>Look at your data, figure out a good numeric representation</li>
<li>Turn your data into numbers (usually vectors of numbers)</li>
<li>Run your algorithms</li>
<li>Profit! (or Fun!)</li>
</ol>
</div>
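<div class='slide'>
<h1>Turning text into numbers (a sketch)</h1>
<p>For documents, step 2 usually means something like bag-of-words counts: one vector slot per vocabulary word. A minimal sketch (the sample documents are invented):</p>

```python
def vectorize(documents):
    # Build a shared vocabulary, then turn each document
    # into a vector of word counts over that vocabulary.
    vocab = sorted(set(word for doc in documents
                       for word in doc.lower().split()))
    index = dict((word, i) for i, word in enumerate(vocab))
    vectors = []
    for doc in documents:
        counts = [0] * len(vocab)
        for word in doc.lower().split():
            counts[index[word]] += 1
        vectors.append(counts)
    return vocab, vectors

docs = ["my name is Inigo Montoya", "hello my name is Bob"]
vocab, vectors = vectorize(docs)
```

<p>The resulting vectors are exactly what the clustering and classification techniques above take as input.</p>
</div>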
<div class='slide'>
<h1>Figuring out a good representation</h1>
</div>
<div class='slide'>
<h1></h1>
</div>
</body></html>