
# Write your own Bayesian Classifier!
John Melesky
(Open Source Bridge, June 2009)
---
# What's a Bayesian Classifier?
---
# What's a Bayesian Classifier?
Something which classifies based on:
1. Information about past categorizations
2. Bayesian statistics (Bayes' Theorem)
---
# What's Bayes' Theorem?
Let's check [Wikipedia](http://phaedrusdeinus.org/Bayes'_theorem.html).
---
# Derrr....
---
# An example: random drug testing
3% of the population are using Zopadrine.
We have a drug test with a 98% accuracy rate.
---
# An example: random drug testing
3% of the population are using Zopadrine.
We have a drug test with a 98% accuracy rate.
Bob is tested, and the result is positive. How likely is it that Bob uses Zopadrine?
---
# Break it down
Let's assume a population of 10000 people.
---
# Break it down
3% are users.
<table border=1>
<tr><td></td><td>Population</td></tr>
<tr><td>Clean</td><td>9700</td></tr>
<tr><td>Users</td><td>300</td></tr>
<tr><td>Total</td><td>10000</td></tr>
</table>
---
# Break it down
The test is 98% accurate.
<table border=1>
<tr><td></td><td>Population</td><td>Test negative</td><td>Test positive</td></tr>
<tr><td>Clean</td><td>9700</td><td>9506</td><td>194</td></tr>
<tr><td>Users</td><td>300</td><td>6</td><td>294</td></tr>
<tr><td>Total</td><td>10000</td><td>9512</td><td>488</td></tr>
</table>
---
# Break it down
Bob is tested, and the result is positive. How likely is it that Bob uses Zopadrine?
<table border=1>
<tr><td></td><td>Population</td><td>Test negative</td><td>Test positive</td></tr>
<tr><td>Clean</td><td>9700</td><td>9506</td><td>194</td></tr>
<tr><td>Users</td><td>300</td><td>6</td><td bgcolor="#ff6666">294</td></tr>
<tr><td>Total</td><td>10000</td><td>9512</td><td bgcolor="#ff6666">488</td></tr>
</table>
---
# Break it down
294 / 488 = 60.24%
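The whole table fits in a few lines of Python (one of the languages we'll see later); the numbers are the same ones from above:

```python
# Reproduce the drug-test table: 10,000 people, 3% users, 98% accurate test.
population = 10000
users = int(population * 0.03)               # 300
clean = population - users                   # 9700

true_positives = int(users * 0.98)           # users who test positive: 294
false_positives = clean - int(clean * 0.98)  # clean people who test positive: 194

all_positives = true_positives + false_positives  # 488
print(true_positives / all_positives)             # ~0.6024
```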
---
# Back to Bayes' Theorem
![Bayes' Theorem](img/bayes.png)
---
# Back to Bayes' Theorem
<table>
<tr><td>P = probability</td><td rowspan="4"><img src="img/bayes.png"></td></tr>
<tr><td>A = "is a user"</td></tr>
<tr><td>B = "tests positive"</td></tr>
<tr><td>x|y = x, given y</td></tr>
</table>
---
# Back to Bayes' Theorem
<table>
<tr><td>P(A) = probability of being a user</td><td rowspan="4"><img src="img/bayes.png"></td></tr>
<tr><td>P(B|A) = probability of testing positive, given being a user</td></tr>
<tr><td>P(B) = probability of testing positive</td></tr>
<tr><td>P(A|B) = probability Bob's a user</td></tr>
</table>
---
# Back to Bayes' Theorem
<table>
<tr><td>P(A) = 3%</td><td rowspan="4"><img src="img/bayes.png"></td></tr>
<tr><td>P(B|A) = probability of testing positive, given being a user</td></tr>
<tr><td>P(B) = probability of testing positive</td></tr>
<tr><td>P(A|B) = probability Bob's a user</td></tr>
</table>
---
# Back to Bayes' Theorem
<table>
<tr><td>P(A) = 3%</td><td rowspan="4"><img src="img/bayes.png"></td></tr>
<tr><td>P(B|A) = 98%</td></tr>
<tr><td>P(B) = probability of testing positive</td></tr>
<tr><td>P(A|B) = probability Bob's a user</td></tr>
</table>
---
# Back to the numbers
<table border=1>
<tr><td></td><td>Population</td><td>Test negative</td><td>Test positive</td></tr>
<tr><td>Clean</td><td>9700</td><td>9506</td><td>194</td></tr>
<tr><td>Users</td><td>300</td><td>6</td><td>294</td></tr>
<tr><td>Total</td><td bgcolor="#ff6666">10000</td><td>9512</td><td bgcolor="#ff6666">488</td></tr>
</table>
---
# Back to Bayes' Theorem
<table>
<tr><td>P(A) = 3%</td><td rowspan="4"><img src="img/bayes.png"></td></tr>
<tr><td>P(B|A) = 98%</td></tr>
<tr><td>P(B) = 4.88%</td></tr>
<tr><td>P(A|B) = probability Bob's a user</td></tr>
</table>
---
# Back to Bayes' Theorem
<table>
<tr><td>P(A) = 3%</td><td rowspan="4"><img src="img/bayes.png"></td></tr>
<tr><td>P(B|A) = 98%</td></tr>
<tr><td>P(B) = 4.88%</td></tr>
<tr><td>P(A|B) = (98% * 3%)/4.88% = 60.24%</td></tr>
</table>
---
# This works with population numbers, too
P(A) = 300
P(B|A) = 9800
P(B) = 488
P(A|B) = (9800 * 300) / 488 = 6024, i.e. 60.24% (everything scaled per 10,000)
Which is useful for reasons we'll see later.
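A quick Python sanity check: the common scale factor cancels, so probabilities and per-10,000 counts give the same answer (`bayes` here is the same one-liner the next slides write in several languages):

```python
def bayes(p_a, p_b, p_b_a):
    return (p_b_a * p_a) / p_b

as_probabilities = bayes(0.03, 0.0488, 0.98)
as_counts = bayes(300, 488, 9800) / 10000   # per-10,000 counts, rescaled
print(as_probabilities, as_counts)          # both ~0.6024
```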
---
# Bayes' Theorem, in code
My examples are going to be in Perl.

    sub bayes {
        my ($p_a, $p_b, $p_b_a) = @_;
        my $p_a_b = ($p_b_a * $p_a) / $p_b;
        return $p_a_b;
    }
---
# Bayes' Theorem, in code
But you could just as easily work in Python.

    def bayes(p_a, p_b, p_b_a):
        return (p_b_a * p_a) / p_b
---
# Bayes' Theorem, in code
Or Java

    public static Double bayes(Double p_a, Double p_b, Double p_b_a) {
        Double p_a_b = (p_b_a * p_a) / p_b;
        return p_a_b;
    }
---
# Bayes' Theorem, in code
Or SML

    fun bayes (p_a, p_b, p_b_a) = (p_b_a * p_a) / p_b
---
# Bayes' Theorem, in code
Or Erlang

    bayes(P_A, P_B, P_B_A) ->
        (P_B_A * P_A) / P_B.
---
# Bayes' Theorem, in code
Or Haskell

    bayes p_a p_b p_b_a = (p_b_a * p_a) / p_b
---
# Bayes' Theorem, in code
Or Scheme

    (define (bayes p_a p_b p_b_a)
      (/ (* p_b_a p_a) p_b))
---
# Bayes' Theorem, in code
LOLCODE, anyone? Befunge? Unlambda?
If it supports floating point operations, you're set.
---
# How does that make a classifier?
A = "is spam"
B = "contains the string 'viagra'"
What's P(A|B)?
---
# What do we need for a classifier?
1. We need to tokenize our training set
2. Then build a model
3. Then test that model
4. Then apply that model to new data
---
# What do we need for a classifier?
1. **We need to tokenize our training set**
2. Then build a model
3. Then test that model
4. Then apply that model to new data
---
# Tokenizing your training set
*Fancy* Perl

    sub tokenize {
        my $contents = shift;
        my %tokens = map { $_ => 1 } split(/\s+/, $contents);
        return %tokens;
    }
---
# Tokenizing your training set

    sub tokenize_file {
        my $filename = shift;
        open(my $fh, '<', $filename) or die "Can't open $filename: $!";
        my $contents = do { local $/; <$fh> };  # slurp the whole file
        close($fh);
        return tokenize($contents);
    }
---
# Tokenizing your training set
This is the "bag of words" model.
For each category (spam, not spam), we need to know how many documents in the training set contain a given word.
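The same counting, sketched in Python with toy documents (`count_tokens` is a hypothetical helper, playing the role the Perl loops play below):

```python
def tokenize(contents):
    # bag of words: a document is just the set of words in it
    return set(contents.split())

def count_tokens(documents):
    # for each token, count how many documents contain it
    counts = {}
    for doc in documents:
        for token in tokenize(doc):
            counts[token] = counts.get(token, 0) + 1
    return counts

spam_tokens = count_tokens(["buy viagra now", "viagra viagra cheap"])
notspam_tokens = count_tokens(["lunch at noon", "buy lunch for me"])
print(spam_tokens["viagra"])  # 2 -- in both spam documents, counted once each
```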
---
# Tokenizing your training set

    my %spam_tokens = ();
    my %notspam_tokens = ();

    foreach my $file (@spam_files) {
        my %tokens = tokenize_file($file);
        %spam_tokens = combine_hash(\%spam_tokens, \%tokens);
    }

    foreach my $file (@notspam_files) {
        my %tokens = tokenize_file($file);
        %notspam_tokens = combine_hash(\%notspam_tokens, \%tokens);
    }
---
# Tokenizing your training set

    sub combine_hash {
        my ($hash1, $hash2) = @_;
        my %resulthash = %{ $hash1 };
        foreach my $key (keys(%{ $hash2 })) {
            if ($resulthash{$key}) {
                $resulthash{$key} += $hash2->{$key};
            } else {
                $resulthash{$key} = $hash2->{$key};
            }
        }
        return %resulthash;
    }
---
# What do we need for a classifier?
1. We need to tokenize our training set
2. **Then build a model**
3. Then test that model
4. Then apply that model to new data
---
# Build a model

    my %total_tokens = combine_hash(\%spam_tokens, \%notspam_tokens);

    my $total_spam_files = scalar(@spam_files);
    my $total_notspam_files = scalar(@notspam_files);
    my $total_files = $total_spam_files + $total_notspam_files;

    my $probability_spam = $total_spam_files / $total_files;
    my $probability_notspam = $total_notspam_files / $total_files;
---
# Build a model
In this case, our model is just a bunch of numbers.
---
# Build a model
In this case, our model is just a bunch of numbers.
(a little secret: it's *all* a bunch of numbers)
---
# What do we need for a classifier?
1. We need to tokenize our training set
2. Then build a model
3. **Then test that model**
4. Then apply that model to new data
---
# \*cough\* \*cough\*
---
# What do we need for a classifier?
1. We need to tokenize our training set
2. Then build a model
3. Then test that model
4. **Then apply that model to new data**
---
# Apply that model to new data

    my $vocab_size = scalar(keys(%total_tokens));
    my $spam_accumulator = 1;
    my $notspam_accumulator = 1;

    my %test_tokens = tokenize_file($test_file);
    foreach my $token (keys(%test_tokens)) {
        if (exists($total_tokens{$token})) {
            # add-one smoothing: a rare token can't zero out a score
            my $p_t_s = (($spam_tokens{$token} || 0) + 1) /
                        ($total_spam_files + $vocab_size);
            $spam_accumulator = $spam_accumulator * $p_t_s;
            my $p_t_ns = (($notspam_tokens{$token} || 0) + 1) /
                         ($total_notspam_files + $vocab_size);
            $notspam_accumulator = $notspam_accumulator * $p_t_ns;
        }
    }
---
# Apply that model to new data

    # P(B) is the same for both categories, so it cancels when we
    # normalize below; passing 1 keeps bayes() happy.
    my $score_spam = bayes( $probability_spam, 1, $spam_accumulator );
    my $score_notspam = bayes( $probability_notspam, 1, $notspam_accumulator );

    my $likelihood_spam = $score_spam / ($score_spam + $score_notspam);
    my $likelihood_notspam = $score_notspam / ($score_spam + $score_notspam);

    printf("likelihood of spam email: %0.2f %%\n", ($likelihood_spam * 100));
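The whole pipeline, squeezed into a toy Python sketch (invented training data; `train` and `likelihood_spam` are hypothetical names, but the add-one smoothing and normalization are the same as above):

```python
def train(spam_docs, notspam_docs):
    def doc_freq(docs):
        # how many documents contain each token
        freq = {}
        for doc in docs:
            for tok in set(doc.split()):
                freq[tok] = freq.get(tok, 0) + 1
        return freq
    spam, notspam = doc_freq(spam_docs), doc_freq(notspam_docs)
    return {
        "spam": spam, "notspam": notspam,
        "vocab": set(spam) | set(notspam),
        "n_spam": len(spam_docs), "n_notspam": len(notspam_docs),
        "p_spam": len(spam_docs) / (len(spam_docs) + len(notspam_docs)),
    }

def likelihood_spam(model, doc):
    # start from the priors, multiply in smoothed per-token probabilities
    score_s, score_n = model["p_spam"], 1 - model["p_spam"]
    v = len(model["vocab"])
    for tok in set(doc.split()):
        if tok in model["vocab"]:
            score_s *= (model["spam"].get(tok, 0) + 1) / (model["n_spam"] + v)
            score_n *= (model["notspam"].get(tok, 0) + 1) / (model["n_notspam"] + v)
    # P(B) cancels in the normalization
    return score_s / (score_s + score_n)

model = train(["buy viagra now", "cheap viagra deal"],
              ["lunch at noon", "meeting at two"])
print(likelihood_spam(model, "viagra deal"))  # ~0.86: looks spammy
```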
---
# Boom
---
# What sucks?
---
# What sucks?
- Our tokenization
---
# What sucks?
- Our tokenization
- Our memory limitations
---
# What sucks?
- Our tokenization
- Our memory limitations
- Saving/loading models
---
# Improve memory use
### Limit the number of tokens
We want to use the tokens with the highest information values. That means tokens that are predominantly in one category but not the other.
---
# Improve memory use
### Limit the number of tokens
We want to use the tokens with the highest information values. That means tokens that are predominantly in one category but not the other.
There are a bunch of ways to calculate this, though the big one is Information Gain.
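A sketch of Information Gain in Python: the drop in category entropy when you split the documents on whether they contain a token (toy counts; `information_gain` is a hypothetical helper, not a library function):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(n_spam, n_notspam, spam_with, notspam_with):
    """Entropy drop from splitting the docs on presence of one token."""
    total = n_spam + n_notspam
    with_tok = spam_with + notspam_with
    without = total - with_tok
    before = entropy([n_spam, n_notspam])
    after = (with_tok / total) * entropy([spam_with, notspam_with]) \
          + (without / total) * entropy([n_spam - spam_with,
                                         n_notspam - notspam_with])
    return before - after

# "viagra": in 80 of 100 spam docs, 1 of 100 non-spam docs -- keep it
print(information_gain(100, 100, 80, 1))
# "the": in 90 of 100 spam docs, 91 of 100 non-spam docs -- nearly useless
print(information_gain(100, 100, 90, 91))
```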
---
# Improve tokenization, simple stuff
- Weed out punctuation
- Weed out stopwords
- normaLize CASE
- Strip out markup
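A rough Python pass at those cleanups (the stopword list and the markup regex are toy stand-ins, not production-ready):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "is"}  # toy list

def clean_tokenize(contents):
    contents = re.sub(r"<[^>]+>", " ", contents)     # strip markup (crudely)
    contents = contents.lower()                      # normalize case
    words = re.findall(r"[a-z0-9']+", contents)      # drop punctuation
    return {w for w in words if w not in STOPWORDS}  # drop stopwords

print(clean_tokenize("<b>The</b> Viagra, and the SPAM!"))
```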
---
# Improve tokenization, advanced stuff
### Stemming
"wrestling", "wrestler", "wrestled", and "wrestle" are all the same word concept.
Pros: fewer tokens, related tokens match
Cons: some words are hard to stem correctly (e.g. "cactus")
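A deliberately naive suffix-stripper in Python, just to show the shape of the idea; real stemmers (e.g. the Porter stemmer) use ordered, conditional rule sets:

```python
def naive_stem(word):
    # Strip one common suffix, longest first -- purely illustrative.
    for suffix in ("ling", "ler", "led", "ing", "le", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in
       ["wrestling", "wrestler", "wrestled", "wrestle"]])  # all "wrest"
print(naive_stem("cactus"))  # "cactu" -- exactly the kind of word that's hard
```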
---
# Improve tokenization, advanced stuff
### Include bigrams
Bigrams are token pairs. For example, "open source", "ron paul", "twitter addict".
Pros: we start distinguishing between Star Wars and astronomy wars
Cons: our memory use balloons
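A minimal bigram extractor in Python (in practice you'd union these with the single-word tokens):

```python
def bigrams(contents):
    # pair each word with its neighbor: "star wars", "wars is", ...
    words = contents.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}

print(sorted(bigrams("Star Wars is not astronomy")))
```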
---
# Improve tokenization, advanced stuff
### Use numbers
Instead of binary (word x is in doc y), we store frequencies (word x appears z times in doc y).
Pros: damage from weak associations is reduced; easier to find the important words in a document
Cons: the math becomes more complex; in many cases, accuracy doesn't actually increase
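In Python, the switch is basically `set` versus `collections.Counter`:

```python
from collections import Counter

words = "cheap cheap cheap viagra meeting".split()

binary = set(words)      # did the word appear at all?
freqs = Counter(words)   # how many times did it appear?

print(sorted(binary))    # ['cheap', 'meeting', 'viagra']
print(freqs["cheap"])    # 3 -- repetition now counts for something
```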
---
# Improve tokenization, advanced stuff
### Use non-token features
Sometimes we want to use non-textual attributes of documents. For example, length of document, percent of capital letters.
---
# Improve tokenization, advanced stuff
### Use non-token features
Sometimes we want to use non-textual attributes of documents. For example, length of document, percent of capital letters.
We can also grab structural information, like the sender, or subject line, and treat them differently. Or whether the word appears early or late in the document.
---
# Improve tokenization, advanced stuff
### Use non-token features
Sometimes we want to use non-textual attributes of documents. For example, length of document, percent of capital letters.
We can also grab structural information, like the sender, or subject line, and treat them differently. Or whether the word appears early or late in the document.
Pros: a little can go a long way
Cons: selecting these can be a dark art, or an incredible memory burden.
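A hypothetical Python feature extractor along those lines (the feature names, and the idea of prefixing subject-line tokens, are invented for illustration):

```python
def extra_features(document, subject=""):
    # Non-token features: length and SHOUTING percentage, plus
    # subject-line tokens kept distinct from body tokens.
    letters = [c for c in document if c.isalpha()]
    pct_caps = sum(c.isupper() for c in letters) / len(letters) if letters else 0
    features = {"doc_length": len(document), "pct_caps": pct_caps}
    for token in subject.lower().split():
        features["subject:" + token] = 1
    return features

print(extra_features("BUY NOW while supplies last", subject="Act fast"))
```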
---
# Which leads us to
---
# Which leads us to
Tokenization == Vectorization
---
# In other words
Our documents are all just vectors of numbers.
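A tiny Python sketch of that claim: fix a vocabulary order, and every document becomes a 0/1 vector (toy vocabulary):

```python
# Fix a vocabulary order, and every document becomes a vector.
vocabulary = ["buy", "cheap", "lunch", "meeting", "viagra"]

def vectorize(document):
    tokens = set(document.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

print(vectorize("Buy cheap viagra"))   # [1, 1, 0, 0, 1]
print(vectorize("Lunch meeting"))      # [0, 0, 1, 1, 0]
```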
---
# Or even
Our documents are all just points in a high-dimensional Cartesian space.
---
# Vectors of numbers
This concept opens up a whole world of statistical methods for categorization, including decision trees, linear separations, and support vector machines.
---
# Points in space
And this opens up a whole different world of geometric methods for categorization and information manipulation, including k-nearest-neighbor classification and various clustering algorithms.
---
# Alright
It's been a long trip. Any questions?
---
# Thanks
Thanks for coming. Thanks to OS Bridge for having me.