<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html> <head>
<title>Machine Learning for Fun and Profit</title>
<style>
.slide {
border: 2px solid #000066;
background-color: #CCCCFF;
position: absolute;
padding: 5%;
width: 85%;
height: 80%;
}
.red {
background-color: #FF8888;
}
</style>
<script src="scripts/jquery-1.2.3.js" type="text/javascript"></script>
<script src="scripts/slideshow.js" type="text/javascript"></script>
</head>
<body>
<div class='slide'>
<h1>Machine Learning (a primer)</h1>
<p>(Portland Python, March 2008) (john melesky)</p>
</div>
<div class='slide'>
<h1>In a nutshell</h1>
<p>Take facts, turn them into knowledge.</p>
</div>
<div class='slide'>
<h1>In a nutshell</h1>
<p>Take facts, turn them into knowledge, algorithmically.</p>
</div>
<div class='slide'>
<h1>Discovering things</h1>
<p>Also known as "unsupervised learning", it's what you do when you have a whole lot of unstructured data you know little about.</p>
</div>
<div class='slide'>
<h1>Spellcheck, Google-style</h1>
<p>The problem: check the spelling of things that aren't in the dictionary</p>
</div>
<div class='slide'>
<h1>Spellcheck, Google-style</h1>
<p>The problem: check the spelling of things that aren't in the dictionary</p>
Indigo Montoya<br/>
Inigo Montana<br/>
Inigo Montoya<br/>
Neego Montoya<br/>
Inigo Mantoya<br/>
</div>
<div class='slide'>
<h1>Spellcheck</h1>
<p>Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)</p>
</div>
<div class='slide'>
<h1>Spellcheck</h1>
<p>Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)</p>
<p>When a new query comes in, find the most common query within a short distance and suggest it.</p>
</div>
<div class='slide'>
<h1>Spellcheck</h1>
<p>Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)</p>
<p>When a new query comes in, find the most common query within a short distance and suggest it.</p>
<p>And that's it.</p>
</div>
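<div class='slide'>
<h1>Spellcheck (sketch)</h1>
<p>A toy Python version of the approach above. The query counts and the distance cutoff are made up for illustration:</p>
<pre>

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(query, counts, max_dist=2):
    # Most common known query within a short edit distance of the new one.
    near = [q for q in counts if levenshtein(query, q) <= max_dist]
    return max(near, key=counts.get) if near else None

# Hypothetical query log: query -> number of times seen.
counts = {"inigo montoya": 820, "indigo montoya": 41, "inigo montana": 9}
print(suggest("inigo mantoya", counts))   # -> inigo montoya
```

</pre>
</div>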
<div class='slide'>
<h1>Clustering</h1>
<p>Problem: given a big pile of documents, figure out what different categories there are.</p>
</div>
<div class='slide'>
<h1>Clustering</h1>
<p>Solution: simple geometry</p>
</div>
<div class='slide'>
<h1>Clustering</h1>
<p>Solution: simple (high-dimensional) geometry</p>
</div>
<div class='slide'>
<h1>Clustering</h1>
<p>Solution: (a whole lot of) simple (high-dimensional) geometry</p>
</div>
<div class='slide'>
<h1>Technique: k-Means Clustering</h1>
<ol>
<li>Pick some (k) random points in your vector space.</li>
<li>For each document, figure out the nearest point.</li>
<li>Move each point to the center of the documents assigned to it.</li>
<li>Lather, rinse, repeat.</li>
<li>Voila! Slow-cooked category discovery</li>
</ol>
</div>
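<div class='slide'>
<h1>k-Means (sketch)</h1>
<p>A minimal Python sketch of those steps, using 2-d points instead of document vectors for brevity (the round count and sample data are made up):</p>
<pre>

```python
import random

def kmeans(points, k, rounds=20, centers=None):
    # 1. pick some (k) random points in the space
    if centers is None:
        centers = random.sample(points, k)
    for _ in range(rounds):                      # lather, rinse, repeat
        clusters = [[] for _ in range(k)]
        for p in points:                         # 2. nearest center for each point
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # 3. move each center to the mean of its assigned points
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, 2)
```

</pre>
</div>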
<div class='slide'>
<h1>Supervised Learning</h1>
<p>When you already know something about your data, and you want to apply that knowledge to more, less-known data</p>
</div>
<div class='slide'>
<h1>Classification</h1>
<p>You have 100 documents in two different categories. Predict the category for the next 5000 documents.</p>
</div>
<div class='slide'>
<h1>Technique: Nearest Neighbor</h1>
<ol>
<li>Plot your knowns</li>
<li>Figure out the closest known to your unknown (geometrically)</li>
</ol>
</div>
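<div class='slide'>
<h1>Nearest Neighbor (sketch)</h1>
<p>Those two steps in a few lines of Python; the vectors and labels below are invented for illustration:</p>
<pre>

```python
def nearest_neighbor(unknown, knowns):
    # knowns: list of (vector, label) pairs.
    # Return the label of the geometrically closest known vector.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    vector, label = min(knowns, key=lambda kl: dist2(unknown, kl[0]))
    return label

knowns = [((0, 0), "spam"), ((1, 1), "spam"), ((9, 9), "ham")]
print(nearest_neighbor((8, 8), knowns))   # -> ham
```

</pre>
</div>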
<div class='slide'>
<h1>Technique: Linear Separation</h1>
<ol>
<li>Plot your knowns</li>
<li>Figure out a line separating the categories</li>
<li>Use that line to classify the unknowns</li>
</ol>
</div>
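<div class='slide'>
<h1>Linear Separation (sketch)</h1>
<p>One classic way to find such a line (not the only one) is the perceptron rule; a sketch for 2-d points with labels +1/-1, with made-up training data:</p>
<pre>

```python
def train_perceptron(data, rounds=100):
    # data: list of ((x, y), label) with label in {+1, -1}.
    # Learns w, b such that sign(w . p + b) separates the classes
    # (assuming the data is linearly separable).
    w, b = [0.0, 0.0], 0.0
    for _ in range(rounds):
        for (x, y), label in data:
            if label * (w[0] * x + w[1] * y + b) <= 0:   # misclassified
                w[0] += label * x
                w[1] += label * y
                b += label
    return w, b

def classify(point, w, b):
    x, y = point
    return 1 if w[0] * x + w[1] * y + b > 0 else -1

data = [((0, 0), -1), ((0, 1), -1), ((3, 3), 1), ((4, 3), 1)]
w, b = train_perceptron(data)
print(classify((5, 5), w, b))   # -> 1
```

</pre>
</div>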
<div class='slide'>
<h1>Linear Separation: Step 1</h1>
<img src="media/basedata.png" />
</div>
<div class='slide'>
<h1>Linear Separation: Step 2</h1>
<img src="media/cleansep.png" />
</div>
<div class='slide'>
<h1>Non-linearly separable data</h1>
<img src="media/badset1.png" />
</div>
<div class='slide'>
<h1>Non-linearly separable data</h1>
<img src="media/badset2.png" />
</div>
<div class='slide'>
<h1>Non-linearly separable data</h1>
<img src="media/badset3.png" />
</div>
<div class='slide'>
<h1>Light-bulb jokes</h1>
</div>
<div class='slide'>
<h1>Technique: Support Vector Machines</h1>
</div>
<div class='slide'>
<h1>Technique: Support Vector Machines</h1>
<img height='450' src="media/dsc01228-02-h.jpg" />
</div>
<div class='slide'>
<h1>Technique: Naive Bayesian Classifiers</h1>
<p>Not geometric, but statistical.</p>
</div>
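<div class='slide'>
<h1>Naive Bayes (sketch)</h1>
<p>A toy word-count version of such a classifier. The training documents are invented, and add-one smoothing is used to avoid zero probabilities:</p>
<pre>

```python
from collections import Counter
from math import log

def train(docs):
    # docs: list of (list_of_words, label)
    word_counts = {}            # label -> Counter of words
    label_counts = Counter()    # label -> number of documents
    for words, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
    return word_counts, label_counts

def classify(words, word_counts, label_counts):
    vocab = {w for c in word_counts.values() for w in c}
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        # log P(label) + sum of log P(word | label), add-one smoothed
        score = log(label_counts[label] / total)
        denom = sum(counts.values()) + len(vocab)
        for w in words:
            score += log((counts[w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("cheap pills now".split(), "spam"),
        ("meeting notes attached".split(), "ham")]
wc, lc = train(docs)
print(classify("cheap pills".split(), wc, lc))   # -> spam
```

</pre>
</div>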
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Future probabilities derived from prior probabilities</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?</p>
<p>(hint: it's not 95%)</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Answer: Depends on how many people use drugs.</p>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Answer: Depends on how many people use drugs.</p>
<p>If the rate of drug use is 1%, then we have:</p>
<center>
<table border=1>
<tr><th></th><th>test positive</th><th>test negative</th></tr>
<tr><th>users</th><td>95% of 1%</td><td>5% of 1%</td></tr>
<tr><th>non-users</th><td>5% of 99%</td><td>95% of 99%</td></tr>
</table>
</center>
</div>
<div class='slide'>
<h1>Bayes' Theorem</h1>
<p>Answer: Depends on how many people use drugs.</p>
<p>Share of positive results: 0.95% + 4.95% == 5.9% of everyone</p>
<p>Share of positive results that are <i>correct</i>: 0.95% / 5.9% == 16.1%</p>
</div>
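<div class='slide'>
<h1>Bayes' Theorem (checking the arithmetic)</h1>
<p>The numbers on the previous slide can be checked directly; this is just Bayes' theorem with a 1% rate of use and a 95%-accurate test:</p>
<pre>

```python
p_user = 0.01
accuracy = 0.95

true_pos = accuracy * p_user                # users who test positive: 0.95%
false_pos = (1 - accuracy) * (1 - p_user)   # non-users who test positive: 4.95%

p_positive = true_pos + false_pos           # 5.9% of everyone tests positive
p_user_given_positive = true_pos / p_positive   # ~16.1% of positives are correct
```

</pre>
</div>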
<div class='slide'>
<h1></h1>
</div>
<div class='slide'>
<h1>The basic process</h1>
<ol>
<li>Look at your data, figure out a good numeric representation</li>
<li>Turn your data into numbers (usually vectors of numbers)</li>
<li>Run your algorithms</li>
<li>Profit! (or Fun!)</li>
</ol>
</div>
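<div class='slide'>
<h1>Turning documents into numbers (sketch)</h1>
<p>For text, one common (though not the only) numeric representation is a bag-of-words vector: one dimension per distinct word, value = how often it appears. A sketch with made-up documents:</p>
<pre>

```python
def vectorize(docs):
    # Build a shared vocabulary, then one count vector per document.
    vocab = sorted({w for doc in docs for w in doc.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for w in doc.split():
            v[index[w]] += 1
        vectors.append(v)
    return vocab, vectors

vocab, vecs = vectorize(["the cat sat", "the cat and the hat"])
# vocab dimensions: and, cat, hat, sat, the
```

</pre>
</div>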
<div class='slide'>
<h1>Figuring out a good representation</h1>
</div>
<div class='slide'>
<h1></h1>
</div>
</body></html>