Theory

in the beginning
There was a Gmail filter. And it was good.


Or it was decent anyway. What you're looking at above is Food-Bot 1.0. In Fall 2009, out of sheer desperation for free food, I created a Gmail account and subscribed to several thousand mailing lists on campus. Using a giant filter, I deleted everything not containing a food-related word. The inbox was then left with emails that were of interest to me (potentially containing free food). This method actually ended up being pretty effective, as it led me to finding lots of free food. But being a student in computer science, I couldn't help but ask myself, "Can we do better?"
the free food problem
And so I set out to solve the ultimate problem, a problem that, if solved effectively, could revolutionize the lives of thousands of college students across the country. As I began formalizing and exploring the problem, I realized it is far less simple than it might first appear, and not unlike a famous challenging problem in computer science. This page explains various aspects of the free food problem as well as various strategies for solving it.


problem statement
Given an arbitrary document, d, determine whether d contains information about a free food event, and if so, return an array of correctly-associated information about each event (date/time, location, and food type).

That's not so bad, right? After all, it's pretty easy for us humans to identify free food events when we see them in emails. Can it really be that hard to write a program that solves this problem?

Let's break apart the problem to see what all needs to be solved in order to accomplish this goal.



classification theory

An important aspect of this problem is the correct classification of an arbitrary document as either free food or non-free food. The obvious solution to this problem would be to hard-code various properties that you deem telling of the correct classification (for example, we know "pizza" and "free" are both words that will generally have a high occurrence in free food documents, and words like "donation" and "sell" will have a higher occurrence in non-free food documents). But can we do better?

My solution uses basic ideas from AI, especially the idea of Maximum Likelihood Estimation. The program classifies documents using a method which is almost exactly like the method many spam filters use, called Naive Bayes Classification. Basically, we treat the document as an array of words, assume independence of words, and determine the probability of the document belonging to a particular category (in our case, free food or non-free food) based on learned data. Because of my Gmail account, I have a huge database of emails to use for the training set. I can gather the information I need from this training set to be able to classify a new document of unknown classification. In our model, we're going to use two different parameters to help us classify the document: the words themselves, and also the source of the event text (email address or URL).

Let:
= Free Food category
= Non-Free Food category
= number of documents used in training
= number of "free food words" used in training
= number of "free food sources" used in training
= number of words in document j
= source of document (email or URL)
= set of all documents given by source s

d = an arbitrary document of text

1() is the indicator function (Returns 1 if the condition specified is met, 0 otherwise).


Using Bayes' rule:


Because we’re using Maximum Likelihood Estimation, we’re only interested in relative probabilities for classification purposes, and so the denominator can be ignored. So, we have:



Unrolling conditional probability:



Because the document is simply an array of n words, d = [w0, w1, w2, ... , wn], and we'll assume that the occurrence of each word in the document is an independent event, we can multiply the probability of each of these words occurring together:



To find probability of word i occurring given that the document belongs to the sets FF and s, we need to go to our training set and count the number of occurrences of word i in all of the documents classified as FF and obtained from source s over the number of occurrences of all trained words in all documents classified as FF and obtained from source s.



And so:



To find the probability of a document coming from source s given that the document is classified as FF, we count the number of documents observed from source s over the number of occurrences of all sources for all documents classified as FF.



And so:



The probability of a document being classified as free food is just the number of observed documents that were classified as free food over total number of observed documents:



And so:



And:



When the program receives an arbitrary document of text, it uses the databases of word probabilities to apply the above equations, and (with a few slight modifications) compares the resulting likelihood of the document belonging to FF with the likelihood of the document belonging to NFF, and classifies the document as belonging to the category which had the higher likelihood.

If a word being considered does not occur in any of the learned word databases, a penalty of (1/total number of words in that database)*factor is used as the probability of that word occurring given that category. In order to avoid learning words that are probably insignificant and might cause undesirable results due to the limited size of the training set, I made a further approximation by only counting these probabilities on a variety of words I have selected from having seen a number of documents, and words that appear to be at least remotely significant.




event-parsing theory

Ok, so let's say we now we know that the document contains information about free food. All that's left is to take the document and strip out all of the necessary data. Easy right? It would be. If English wasn't so unpredictable. The problem is made extremely complex by the simple fact that we can make very little assumptions about the way people will choose to communicate information about a free food event. For example, people can say 5 o'clock, 5:00, 5pm, 5, or even five. And they all mean that there's free food going down at 5:00 PM. What's a poor food-bot to do??

While I don't intend to disclose all of my event-parsing techniques, I will make known a few of the interesting techniques I used. There are essentially two steps in associating information that is relevant to particular event after the classification phase is over. The first step is finding all potentially relevant information in the document. The second step is to associate some of the potentially relevant information that seems to be describing the same free food event.

The first step is basically accomplished by feeding (hehe) the word array of the document into several subroutines that find all dates, times, locations, classification words (words that might be useful in determining if a particular event is one with free food or not), and food types within a document. Each subroutine then uses arrays of regular expressions to determine where in the document any of these items occur. In date/time and location logic, I have a large series of logic blocks covering various cases of how a date, time, or location might be expressed. The blocks of logic can be quite complicated in some cases.

For the food and classification word-parsing logic, I use arrays of arrays of regular expressions. The reason for this is that in order to maximize chances of correctly locating a food or classification word in the document, it is necessary to be very precise. This necessitates a mechanism to match a series of consecutive words, each matching a particular regular expression. For example, for the food 'ice cream' it isn't enough to go through each word in the document and look for 'ice', but it is necessary to look for 'ice' followed by 'cream', and so for each food term there must be an array of regular expressions expressing what to match for on each consecutive word in order for it to be considered a food term.

OK, so we have found all of the locations, dates, times, and classification words in the document. How do we know how to pair these pieces of information up in order to create an accurate event description? Because of the interesting complication of the English language permitting few assumptions about the way knowledge is expressed, I mostly rely on euclidean distances of words within a document to make assumptions about which information belongs with which event. By this I mean words that are "close" to the event get paired with that event, regardless of whether they occur before or after the event.

But you may be wondering how does one even begin finding the events in the first place? This question is also non-trivial, and has caused me to change my algorithm design multiple times. I initially created events around each food word that was found in the document, and then paired the other information to these events, but before long it became obvious that this was a bad idea, since many free food documents have many food words that are all describing the same event (so using this method you are left with the incorrect conclusion that there are many free food events taking place, when in fact it was just one event with many kinds of food).

I remember when I thought of the solution I currently use: it was while I was in Seattle, literally right after a day of interviews with Microsoft, that I was struck with inspiration (maybe because of all those coding interview questions): why not start with dates? The reason for this is that in most cases people only mention the date of the event once during the event description. When people mention dates more than once I can use various techniques to determine if they are the same event, and if so merge them. After coding this new idea, I found that this was actually a far more effective way to start finding the event in the document!

The next step is to merge all of the other information to these events. I start by pairing words to events that are closer in the same paragraph, and then pair the remaining unpaired words to events that are closer in the entire document. By looking at Euclidean distance within a paragraph first, documents that have lots of event descriptions that are very close to other event descriptions will be paired much more effectively (because in these cases people typically delimit event descriptions with paragraph breaks).



After the information is paired, the array of potential free food events is passed to the classifier routine which will go through each event and decide if it's free food using the calculations described above. Each event that is classified as free food is inserted into the database, and displayed on the homepage for you to see and go to!

The final thing I will briefly discuss is something I'm sure you're all wondering about. How do I compute something like "awkwardness?" I mean, how could you possibly take an arbitrary document and accurately compute the exact value of how awkward it will be for someone to just show up to it? My response is that I used a highly advanced and strictly proprietary technique to compute the precise awkwardness value for each event rounded to the nearest micro-awkwardian. And I definitely did not just hash words that probably indicate an event will be awkward to a corresponding awkwardness value and take the average of all the awkwardness terms found associated with that event. I just disclosed my strictly proprietary algorithm didn't I? *face palm*

I hope this has given you a better understanding and appreciation of some of what happens behind the scenes with food-bot. Enjoy the website, and I hope this effort will prove useful in finding you lots of free food in the near future!