<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html> <head>
<title>Machine Learning for Fun and Profit</title>
<style>
.slide {
border: 2px solid #000066;
background-color: #CCCCFF;
position: absolute;
padding: 5%;
width: 85%;
height: 80%;
}
.red {
background-color: #FF8888;
}
</style>
<script src="scripts/jquery-1.2.3.js" type="text/javascript"></script>
<script src="scripts/slideshow.js" type="text/javascript"></script>
</head>
<body>
<div class='slide'>
<h1>Machine Learning (a primer)</h1>
<p>(Portland Python, March 2008) (john melesky)</p>
</div>
<div class='slide'>
<h1>In a nutshell</h1>
<p>Take facts, turn them into knowledge.</p>
</div>
<div class='slide'>
<h1>In a nutshell</h1>
<p>Take facts, turn them into knowledge, algorithmically.</p>
</div>
<div class='slide'>
<h1>Discovering things</h1>
<p>Also known as "unsupervised learning", it's what you do when you have a whole lot of unstructured data you know little about.</p>
</div>
<div class='slide'>
<h1>Spellcheck, Google-style</h1>
<p>The problem: check the spelling of things that aren't in the dictionary</p>
</div>
<div class='slide'>
<h1>Spellcheck, Google-style</h1>
<p>The problem: check the spelling of things that aren't in the dictionary</p>
Indigo Montoya<br/>
Inigo Montana<br/>
Inigo Montoya<br/>
Neego Montoya<br/>
Inigo Mantoya<br/>
</div>
<div class='slide'>
<h1>Spellcheck</h1>
<p>Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)</p>
</div>
<div class='slide'>
<h1>Spellcheck</h1>
<p>Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)</p>
<p>When a new query comes in, find the most common query within a short distance and suggest it.</p>
</div>
<div class='slide'>
<h1>Spellcheck</h1>
<p>Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)</p>
<p>When a new query comes in, find the most common query within a short distance and suggest it.</p>
<p>And that's it.</p>
</div>
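<div class='slide'>
<h1>Spellcheck (a sketch)</h1>
<p>The approach above fits in a few lines of Python. The query log (<code>query_counts</code>) is invented for illustration; the distance function is the standard dynamic-programming Levenshtein edit distance.</p>

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def suggest(query, query_counts, max_dist=2):
    # Among past queries a short edit distance away,
    # suggest the one that was made most often.
    nearby = [(count, q) for q, count in query_counts.items()
              if levenshtein(query, q) <= max_dist]
    return max(nearby)[1] if nearby else query

query_counts = {"inigo montoya": 120, "indigo montoya": 3,
                "inigo montana": 2, "neego montoya": 1}
print(suggest("inigo mantoya", query_counts))  # inigo montoya
```
</div>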
<div class='slide'>
<h1>Clustering</h1>
<p>Problem: given a big pile of documents, figure out what different categories there are.</p>
</div>
<div class='slide'>
<h1>Clustering</h1>
<p>Solution: simple geometry</p>
</div>
<div class='slide'>
<h1>Clustering</h1>
<p>Solution: simple (high-dimensional) geometry</p>
</div>
<div class='slide'>
<h1>Clustering</h1>
<p>Solution: (a whole lot of) simple (high-dimensional) geometry</p>
</div>
<div class='slide'>
<h1>Technique: k-Means Clustering</h1>
<ol>
<li>Pick some (k) random points in your vector space.</li>
<li>For each document, figure out the nearest point.</li>
<li>Move each point to the mean of the documents assigned to it.</li>
<li>Lather, rinse, repeat until the points stop moving.</li>
<li>Voila! Slow-cooked category discovery</li>
</ol>
</div>
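<div class='slide'>
<h1>k-Means (a sketch)</h1>
<p>Those steps in plain Python, no libraries. Documents are assumed to already be vectors (tuples of numbers); the names here are invented for illustration.</p>

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(docs, k, iterations=10):
    # Step 1: pick k random documents as starting points.
    centers = random.sample(docs, k)
    for _ in range(iterations):
        # Step 2: assign each document to its nearest point.
        clusters = [[] for _ in range(k)]
        for doc in docs:
            nearest = min(range(k), key=lambda i: dist2(doc, centers[i]))
            clusters[nearest].append(doc)
        # Step 3: move each point to the mean of its documents.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = tuple(sum(vals) / len(cluster)
                                   for vals in zip(*cluster))
    return centers
```

<p>Each final center sits in the "middle" of one discovered category.</p>
</div>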
<div class='slide'>
<h1>Supervised Learning</h1>
<p>When you already know something about your data, and you want to apply that knowledge to more data you know less about</p>
</div>
<div class='slide'>
<h1>Classification</h1>
<p>You have 100 documents in two different categories. Predict the category for the next 5000 documents.</p>
</div>
<div class='slide'>
<h1>Technique: Nearest Neighbor</h1>
<ol>
<li>Plot your knowns</li>
<li>Figure out the closest known to your unknown (geometrically)</li>
</ol>
</div>
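<div class='slide'>
<h1>Nearest Neighbor (a sketch)</h1>
<p>The two steps above, as Python. The labeled points and category names are invented for illustration.</p>

```python
def dist2(a, b):
    # Squared Euclidean distance; fine for comparisons.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(unknown, knowns):
    # knowns is a list of (vector, category) pairs.
    # The geometrically closest known point lends its category.
    vector, category = min(knowns, key=lambda vc: dist2(unknown, vc[0]))
    return category

knowns = [((0.0, 0.0), "spam"), ((1.0, 1.0), "spam"),
          ((9.0, 9.0), "ham"), ((10.0, 10.0), "ham")]
print(classify((8.0, 8.5), knowns))  # ham
```
</div>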
<div class='slide'>
<h1>Technique: Linear Separation</h1>
<ol>
<li>Plot your knowns</li>
<li>Figure out a line separating the categories</li>
<li>Use that line to classify the unknowns</li>
</ol>
</div>
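<div class='slide'>
<h1>Linear Separation (a sketch)</h1>
<p>The slide doesn't name an algorithm for step 2; the perceptron is one classic way to find such a line. A sketch on invented, linearly separable data:</p>

```python
def perceptron(points, epochs=100):
    # points: list of ((x, y), label) pairs, labels +1 or -1.
    # Learns w1, w2, b so that sign(w1*x + w2*y + b) matches the
    # labels; w1*x + w2*y + b == 0 is the separating line.
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x, y), label in points:
            if label * (w1 * x + w2 * y + b) <= 0:  # misclassified
                w1 += label * x
                w2 += label * y
                b += label
    return w1, w2, b

points = [((0.0, 0.0), -1), ((1.0, 0.0), -1),
          ((3.0, 3.0), +1), ((4.0, 4.0), +1)]
w1, w2, b = perceptron(points)
# Step 3: classify an unknown by which side of the line it lands on.
side = w1 * 2.5 + w2 * 2.5 + b  # positive, so the +1 category
```
</div>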
<div class='slide'>
<h1>Linear Separation: Step 1</h1>
<img src="media/basedata.png" />
</div>
<div class='slide'>
<h1>Linear Separation: Step 2</h1>
<img src="media/cleansep.png" />
</div>
<div class='slide'>
<h1>Non-linearly separable data</h1>
<img src="media/badset1.png" />
</div>
<div class='slide'>
<h1>Non-linearly separable data</h1>
<img src="media/badset2.png" />
</div>
<div class='slide'>
<h1>Non-linearly separable data</h1>
<img src="media/badset3.png" />
</div>
<div class='slide'>
<h1>Light-bulb jokes</h1>
</div>
<div class='slide'>
<h1>Technique: Support Vector Machines</h1>
</div>
<div class='slide'>
<h1>Technique: Support Vector Machines</h1>
<img height='450' src="media/dsc01228-02-h.jpg" />
</div>
<div class='slide'>
<h1>Technique: Naive Bayesian Classifiers</h1>
<p>Not geometric, but statistical.</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Future probabilities derived from prior probabilities</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?</p>
<p>(hint: it's not 95%)</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Answer: Depends on how many people use drugs.</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Answer: Depends on how many people use drugs.</p>
<p>If the rate of drug use is 1%, then we have:</p>
<center>
<table border="1">
<tr><th></th><th>test positive</th><th>test negative</th></tr>
<tr><th>users</th><td>95% of 1%</td><td>5% of 1%</td></tr>
<tr><th>non-users</th><td>5% of 99%</td><td>95% of 99%</td></tr>
</table>
</center>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Answer: Depends on how many people use drugs.</p>
<p>Share of people who test positive: 0.95% + 4.95% == 5.9%</p>
<p>Share of positive results that are <i>correct</i>: 0.95% / 5.9% == 16.1%</p>
</div>
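<div class='slide'>
<h1>Bayes' Theorem (the arithmetic)</h1>
<p>The same arithmetic in Python, assuming a 1% rate of drug use and a test that is 95% accurate in both directions:</p>

```python
def p_user_given_positive(prior, accuracy):
    # Bayes' theorem:
    #   P(user | positive) = P(positive | user) * P(user) / P(positive)
    true_positives = accuracy * prior                # 95% of 1%
    false_positives = (1 - accuracy) * (1 - prior)   # 5% of 99%
    return true_positives / (true_positives + false_positives)

p = p_user_given_positive(prior=0.01, accuracy=0.95)
print("%.1f%%" % (p * 100))  # 16.1%
```
</div>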
<div class='slide'>
<h1></h1>
</div>
<div class='slide'>
<h1>The basic process</h1>
<ol>
<li>Look at your data, figure out a good numeric representation</li>
<li>Turn your data into numbers (usually vectors of numbers)</li>
<li>Run your algorithms</li>
<li>Profit! (or Fun!)</li>
</ol>
</div>
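<div class='slide'>
<h1>Turning text into numbers (a sketch)</h1>
<p>For documents, step 2 usually means something like bag-of-words counts: one vector slot per vocabulary word. A minimal sketch (the sample documents are invented):</p>

```python
def vectorize(documents):
    # Build a shared vocabulary, then turn each document
    # into a vector of word counts over that vocabulary.
    vocab = sorted(set(word for doc in documents
                       for word in doc.lower().split()))
    index = dict((word, i) for i, word in enumerate(vocab))
    vectors = []
    for doc in documents:
        counts = [0] * len(vocab)
        for word in doc.lower().split():
            counts[index[word]] += 1
        vectors.append(counts)
    return vocab, vectors

docs = ["my name is Inigo Montoya", "hello my name is Bob"]
vocab, vectors = vectorize(docs)
```

<p>The resulting vectors are exactly what the clustering and classification techniques above take as input.</p>
</div>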
<div class='slide'>
<h1>Figuring out a good representation</h1>
</div>
<div class='slide'>
<h1></h1>
</div>
</body></html>