(LPW '07) (john melesky)
This is an hour talk, squeezed into 30 minutes, so let's get going.
(oh, i'm available for freelance work in machine learning/natural language processing)
Q: How many Londoners does it take to change a lightbulb?
A: 127
If this is true, that means that 8.415 million Londoners (99%!) are changing lightbulbs right now.
By contrast, only 1% of the rest of the world are currently changing lightbulbs.
If you're changing a lightbulb right now, what's the likelihood you're a Londoner?
(hint: the answer is not 99%)
If you look up Bayes' theorem on Wikipedia, you'll see something like this:
P(A|B) = P(B|A) P(A) / P(B)
Translated, roughly:
# P(A|B) = P(B|A) * P(A) / P(B)
sub bayes {
    my ($p_a, $p_b, $p_b_a) = @_;
    my $p_a_b = ($p_b_a * $p_a) / $p_b;
    return $p_a_b;
}
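To tie this back to the lightbulb question, here is a minimal sketch that plugs numbers into the same formula. The 99% and 1% figures come from the slides; the population figures (about 8.5 million Londoners, about 6.6 billion people worldwide) are assumptions added for illustration, as are the variable names.

```perl
use strict;
use warnings;

# Same formula as bayes() above, repeated so this sketch runs on its own.
sub bayes {
    my ($p_a, $p_b, $p_b_a) = @_;
    return ($p_b_a * $p_a) / $p_b;
}

# ASSUMED population figures (not from the talk).
my $p_londoner = 8.5e6 / 6.6e9;

# From the slides: 99% of Londoners and 1% of everyone else are changing a bulb.
my $p_bulb_given_londoner = 0.99;
my $p_bulb_given_other    = 0.01;

# P(B): the overall chance that a random person is changing a bulb.
my $p_bulb = $p_bulb_given_londoner * $p_londoner
           + $p_bulb_given_other * (1 - $p_londoner);

my $p_londoner_given_bulb = bayes($p_londoner, $p_bulb, $p_bulb_given_londoner);
printf("P(Londoner | changing a bulb) = %.1f%%\n", $p_londoner_given_bulb * 100);
```

Under those assumed populations the answer comes out at roughly 11%, nowhere near 99%, because Londoners are such a small share of everyone.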
# Split text on whitespace and return a hash of the distinct tokens (token => 1).
sub tokenize {
    my $contents = shift;
    my %tokens = map { $_ => 1 } split(/\s+/, $contents);
    return %tokens;
}
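A quick sketch of what tokenize hands back, using a made-up sentence (the sample text is an assumption, not from the talk). Note that each token is recorded once per message, however often it appears.

```perl
use strict;
use warnings;

# Same tokenizer as above, repeated so this sketch runs on its own.
sub tokenize {
    my $contents = shift;
    my %tokens = map { $_ => 1 } split(/\s+/, $contents);
    return %tokens;
}

# "meeting" appears twice in the text but only once in the result.
my %tokens = tokenize("meeting at noon about the meeting agenda");
print join(", ", sort keys %tokens), "\n";
# prints: about, agenda, at, meeting, noon, the
```

One thing to watch: with the /\s+/ pattern, text that starts with whitespace yields an empty-string token. Using split(' ', $contents) instead would skip leading whitespace.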
# For each class, count how many training files contain each token.
my %work_tokens = ();
my %notwork_tokens = ();

foreach my $file (@work_files) {
    my %tokens = tokenize_file("training_set/" . $file);
    %work_tokens = combine_hash(\%work_tokens, \%tokens);
}

foreach my $file (@notwork_files) {
    my %tokens = tokenize_file("training_set/" . $file);
    %notwork_tokens = combine_hash(\%notwork_tokens, \%tokens);
}

# Every token seen anywhere in the training set.
my %total_tokens = combine_hash(\%work_tokens, \%notwork_tokens);
# Merge two hashes of counts, adding the values of any keys they share.
sub combine_hash {
    my ($hash1, $hash2) = @_;
    my %resulthash = %{ $hash1 };
    foreach my $key (keys(%{ $hash2 })) {
        if ($resulthash{$key}) {
            $resulthash{$key} += $hash2->{$key};
        } else {
            $resulthash{$key} = $hash2->{$key};
        }
    }
    return %resulthash;
}
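As a rough illustration of how the counts build up, here is combine_hash applied to two hypothetical messages (the token hashes are invented for the example). Since tokenize marks each token once per file, the merged value is the number of files containing that token.

```perl
use strict;
use warnings;

# Same merge as above, repeated so this sketch runs on its own.
sub combine_hash {
    my ($hash1, $hash2) = @_;
    my %resulthash = %{ $hash1 };
    foreach my $key (keys(%{ $hash2 })) {
        if ($resulthash{$key}) {
            $resulthash{$key} += $hash2->{$key};
        } else {
            $resulthash{$key} = $hash2->{$key};
        }
    }
    return %resulthash;
}

# HYPOTHETICAL token sets for two messages.
my %mail1 = (budget => 1, meeting => 1);
my %mail2 = (meeting => 1, lunch => 1);

my %counts = combine_hash(\%mail1, \%mail2);
print "$_: $counts{$_}\n" for sort keys %counts;
# prints:
# budget: 1
# lunch: 1
# meeting: 2
```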
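# Read a whole file and tokenize its contents.
sub tokenize_file {
    my $filename = shift;
    my $contents = '';
    # Three-argument open with a lexical handle; fail loudly if the file can't be read.
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    read($fh, $contents, -s $fh);
    close($fh);
    return tokenize($contents);
}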
# Prior probability of each class: its share of the training files.
my $total_work_files = scalar(@work_files);
my $total_notwork_files = scalar(@notwork_files);
my $total_files = $total_work_files + $total_notwork_files;

my $probability_work = $total_work_files / $total_files;
my $probability_notwork = $total_notwork_files / $total_files;
Wait a minute ...
What is P(B|A), when you have more than one B?
For that matter, what is P(B), when you have more than one B?
P(B1|A) P(B2|A) ... P(Bn|A)
Let's, um, ignore that for now.
Trust me, it will work out.
# %total_tokens (every token seen in training) was built above.
my $work_accumulator = 1;
my $notwork_accumulator = 1;

# Note: the scalar $total_tokens (distinct tokens in the test message)
# is a different variable from the hash %total_tokens.
my $total_tokens = scalar(keys(%test_tokens));

foreach my $token (keys(%test_tokens)) {
    # Skip tokens that never appeared in the training set.
    if (exists($total_tokens{$token})) {
        # Add one to each count so an unseen token never yields a zero probability.
        my $p_t_w = (($work_tokens{$token} || 0) + 1) /
                    ($total_work_files + $total_tokens);
        $work_accumulator = $work_accumulator * $p_t_w;

        my $p_t_nw = (($notwork_tokens{$token} || 0) + 1) /
                     ($total_notwork_files + $total_tokens);
        $notwork_accumulator = $notwork_accumulator * $p_t_nw;
    }
}
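The "+ 1" in the per-token probability is doing real work: without it, a single token that never showed up in one class would multiply that class's accumulator down to zero. A minimal sketch with hypothetical counts (the numbers and the smoothed helper are made up for illustration):

```perl
use strict;
use warnings;

# HYPOTHETICAL training data: 10 work files, with per-token file counts.
my %work_tokens      = (meeting => 8, budget => 5);
my $total_work_files = 10;
my $total_tokens     = 3;   # distinct tokens in the message being scored

# Same add-one estimate as the loop above, wrapped in a helper for the demo.
sub smoothed {
    my ($count) = @_;
    return (($count || 0) + 1) / ($total_work_files + $total_tokens);
}

my $p_seen   = smoothed($work_tokens{meeting});   # (8 + 1) / 13
my $p_unseen = smoothed($work_tokens{lottery});   # (0 + 1) / 13, not zero

printf("seen: %.4f, unseen: %.4f\n", $p_seen, $p_unseen);
```

The unseen token still drags the score down, but it can no longer veto the whole class on its own.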
# $total_tokens is only a stand-in for P(B) here: the same value is used
# for both classes, so it cancels when the scores are normalized below.
my $score_work = bayes($probability_work, $total_tokens, $work_accumulator);
my $score_notwork = bayes($probability_notwork, $total_tokens, $notwork_accumulator);

# Normalize so the two likelihoods sum to 1.
my $likelihood_work = $score_work / ($score_work + $score_notwork);
my $likelihood_notwork = $score_notwork / ($score_work + $score_notwork);

printf("likelihood of work email: %0.2f %%\n", ($likelihood_work * 100));
printf("likelihood of notwork email: %0.2f %%\n", ($likelihood_notwork * 100));
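This is where the "trust me, it will work out" pays off. Both scores are divided by the same P(B), so whatever value goes in that slot drops out once the scores are normalized against each other. A small sketch with made-up priors and accumulators (all numbers and the normalized_work helper are hypothetical):

```perl
use strict;
use warnings;

# Same formula as bayes() above, repeated so this sketch runs on its own.
sub bayes {
    my ($p_a, $p_b, $p_b_a) = @_;
    return ($p_b_a * $p_a) / $p_b;
}

# HYPOTHETICAL priors and per-class token products.
my ($probability_work, $probability_notwork) = (0.6, 0.4);
my ($work_accumulator, $notwork_accumulator) = (0.002, 0.0005);

# Normalized likelihood of "work" for an arbitrary stand-in value of P(B).
sub normalized_work {
    my ($p_b) = @_;
    my $score_work    = bayes($probability_work,    $p_b, $work_accumulator);
    my $score_notwork = bayes($probability_notwork, $p_b, $notwork_accumulator);
    return $score_work / ($score_work + $score_notwork);
}

# Three very different stand-ins for P(B) give the same answer.
printf("P(B) = %s -> %.4f\n", $_, normalized_work($_)) for (1, 7, 0.001);
```

Each line reports the same likelihood (6/7, about 0.8571, for these numbers), which is why the code can get away with passing the token count where P(B) belongs.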