There was a Gmail filter. And it was good.
Or it was decent anyway. What you're looking at above is Food-Bot 1.0. In Fall 2009, out of sheer desperation for free food, I created a Gmail account and subscribed to several thousand mailing lists on campus. Using a giant filter, I deleted everything not containing a food-related word. The inbox was then left with emails that were of interest to me (potentially containing free food). This method actually ended up being pretty effective, as it led me to finding lots of free food. But being a student in computer science, I couldn't help but ask myself, "Can we do better?"
And so I set out to solve the ultimate problem, a problem that,
if solved effectively, could revolutionize the lives of thousands of college students across the country. As I began formalizing and exploring the problem, I realized
it is far less simple than it might first appear, and not unlike a famous challenging problem in computer science. This page explains various aspects of the free food problem as well as various strategies for solving it.
problem statement
Given an arbitrary document, d, determine whether d contains information about a free
food event, and if so, return an array of correctly-associated information about each event (date/time, location, and food type).
That's not so bad, right? After all, it's pretty easy for us humans to identify free food events when we see them in emails.
Can it really be that hard to write a program that solves this problem?
Let's break apart the problem to see what all needs to be solved in order to accomplish this goal.
An important aspect of this problem is the correct classification of an arbitrary document as either free food or non-free food. The obvious solution
to this problem would be to hard-code various properties that you deem telling of the correct classification (for example, we know "pizza" and "free"
are both words that will generally have a high occurrence in free food documents, and words like "donation" and "sell" will have a higher occurrence in
non-free food documents). But can we do better?
My solution uses basic ideas from AI, especially the idea of Maximum Likelihood Estimation. The program classifies documents using a method which is almost exactly
like the method many spam filters use, called
Naive Bayes Classification.
Basically, we treat the document as an array of words, assume independence of words, and determine the probability of the document belonging to a particular category (in our case, free food or non-free food) based on learned data. Because of my Gmail account, I have a huge database of emails to use for the training set. I can gather the information I need from this training set to be able to classify a new document of unknown classification. In our model, we're going to use two different parameters to help us classify the document: the words themselves, and also the source of the event text (email address or URL).
Let:

= Free Food category

= Non-Free Food category

= number of documents used in training

= number of "free food words" used in training

= number of "free food sources" used in training

= number of words in document j

= source of document (email or URL)

= set of all documents given by source s
d = an arbitrary document of text
1() is the indicator function (Returns 1 if the condition specified is met, 0 otherwise).
Using Bayes' rule:
Because we’re using Maximum Likelihood Estimation, we’re only interested in
relative probabilities for classification purposes, and so the denominator can be ignored. So, we have:
Unrolling conditional probability:
Because the document is simply an array of
n words,
d = [
w0,
w1,
w2, ... ,
wn], and we'll assume that the occurrence of each word in the document is an independent event, we can multiply the probability of each of these words occurring together:
To find probability of word
i occurring given that the document belongs to the sets
FF and
s, we need to go to our training set and count the number of occurrences of word
i in all of the documents classified as
FF and obtained from source s over the number of occurrences of all trained words in all documents classified as
FF and obtained from source
s.
And so:
To find the probability of a document coming from source
s given that the document is classified as
FF, we count the number of documents observed from source
s over the number of occurrences of all sources for all documents classified as
FF.
And so:
The probability of a document being classified as free food is just the number of observed documents that were classified as free food over total number of observed documents:
And so:
And:
When the program receives an arbitrary document of text, it uses the databases of word probabilities to apply the above equations, and (with a few slight modifications) compares the resulting likelihood of the document belonging to
FF with the likelihood of the document belonging to
NFF, and classifies the document as belonging to the category which had the higher likelihood.
If a word being considered does not occur in any of the learned word databases, a penalty of (1/total number of words in that database)*factor is used as the probability of that word occurring given that category. In order to avoid learning words that are probably insignificant and might cause undesirable results due to the limited size of the training set, I made a further approximation by only counting these probabilities on a variety of words I have selected from having seen a number of documents, and words that appear to be at least remotely significant.
Ok, so let's say we now we know that the document contains information about free food. All that's left is to take the document and strip out all of the necessary data.
Easy right? It would be. If English wasn't so unpredictable. The problem is made extremely complex
by the simple fact that we can make very little assumptions about the way people will choose to communicate information about a free food event. For example, people
can say 5 o'clock, 5:00, 5pm, 5, or even five. And they all mean that there's free food going down at 5:00 PM. What's a poor food-bot to do??
While I don't intend to disclose all of my event-parsing techniques, I will make known a few of the interesting techniques I used. There are essentially two steps in associating
information that is relevant to particular event after the classification phase is over. The first step is finding all potentially relevant information in the document.
The second step is to associate some of the potentially relevant information that seems to be describing the same free food event.
The first step is basically accomplished by feeding (hehe) the word array of the document into several subroutines that find all dates, times, locations, classification words (words
that might be useful in determining if a particular event is one with free food or not), and
food types within a document. Each subroutine then uses arrays of regular expressions to determine where in the document any of these items occur. In date/time and location
logic, I have a large series of logic blocks covering various cases of how a date, time, or location might be expressed. The blocks of logic can be quite complicated in some
cases.
For the food and classification word-parsing logic, I use arrays of arrays of regular expressions. The reason for this is that in order to maximize chances of correctly
locating a food or classification word in the document, it is necessary to be very precise. This necessitates a mechanism to match a series of consecutive words, each matching a particular
regular expression. For example, for the food 'ice cream' it isn't enough to go through each word in the document and look for 'ice', but
it is necessary to look for 'ice' followed by 'cream', and so for each food term there must be an array of regular expressions expressing what to match for on each consecutive
word in order for it to be considered a food term.
OK, so we have found all of the locations, dates, times, and classification words in the document. How do we know how to pair these pieces of information up in order to create
an accurate event description? Because of the interesting complication of
the English language permitting few assumptions about the way knowledge is expressed, I mostly rely on euclidean distances of words within a document to make assumptions
about which information belongs with which event. By this I mean words that are "close" to the event get paired with that event, regardless of whether they occur before or
after the event.
But you may be wondering how does one even
begin finding the events in the first place? This question is also non-trivial, and has caused me to change my algorithm design multiple times.
I initially created events around each food word that was found in the document, and then paired the other information to these events, but before long it became obvious that this
was a bad idea, since many free food documents have many food words that are all describing the same event (so using this method you are left with the incorrect conclusion that
there are many free food events taking place, when in fact it was just one event with many kinds of food).
I remember when I thought of the solution I currently use: it was while I was in Seattle, literally
right after a day of interviews with Microsoft, that I was struck with inspiration (maybe because of all those
coding interview questions): why not start with dates? The
reason for this is that in most cases people only mention the date of the event
once during the event description. When people mention dates more than once I can
use various techniques to determine if they are the same event, and if so merge them. After coding this new idea, I found that this was actually a far more effective way to start
finding the event in the document!
The next step is to merge all of the other information to these events. I start by pairing words to events that are closer in the same
paragraph, and then pair the
remaining unpaired words to events that are closer in the entire document. By looking at Euclidean distance within a paragraph first, documents that have lots of event descriptions
that are very close to other event descriptions will be paired much more effectively (because in these cases people typically delimit event descriptions with paragraph breaks).
After the information is paired, the array of potential free food events is passed to the classifier routine which will go through each event and decide if it's free food using
the calculations described above. Each event that is classified as free food is inserted into the database, and displayed on the homepage for you to see and go to!
The final thing I will briefly discuss is something I'm sure you're all wondering about. How do I compute something like "awkwardness?" I mean, how could you
possibly take an arbitrary document and accurately compute the
exact value of how awkward it will be for someone to just show up to it? My response is that I used a highly advanced and strictly proprietary technique to compute the precise awkwardness value for each event rounded to the nearest micro-awkwardian. And I
definitely did
not just hash words that probably indicate an event will be awkward to a corresponding awkwardness value and take the average of all the awkwardness terms found associated with that event. I just disclosed my strictly proprietary algorithm didn't I? *face palm*
I hope this has given you a better understanding and appreciation of some of what happens behind the scenes with food-bot. Enjoy the website, and I hope this effort will prove useful in finding you lots of free food in the near future!