As a data scientist here at Loggly, my most recent work has been on optimization and performance. Our complex data pipeline has a lot of moving parts and an infrastructure that needs to be managed so that we can make the most out of our resources. While I have spent most of my time on performance-related tasks, I also have a few side projects that I work on with various teams at Loggly, but there is one activity I particularly like, which is participating in hackathons.
One of the advantages of working in San Francisco is the access we have to a wide variety of meetups. When it comes to building our skill set, meetups are like going to the gym, whereas a hackathon is like doing a half-marathon. That’s where you actually get to learn about yourself, get to know your limits, and perform at a deeper level within a time constraint – all in the context of a team.
Deciding on a Hackathon
There is a hackathon in San Francisco almost every week. Earlier this year, I participated in DeveloperWeek’s hackathon, pushing myself during the weekend to solve one of the hackathon challenges available at the event. The one that looked most interesting was “Building a Fraud Detector”.
The task was to build not only the machine learning model but also a web app to determine whether a web request came from a bot, capable of handling 1,000 requests per second and replying within 100 milliseconds.
Bot Detectors Have Multiple Use Cases
When you think of bot detection, advertising use cases may come to mind. For example, advertisers want to know that real people, not bots, are watching their ads. Likewise, publishers and especially video service providers don’t want to waste infrastructure costs to serve bots. Bots are a nuisance for a wide variety of cloud-centric businesses.
Here at Loggly, we also want to prevent people from abusing our API service with bot-like queries, for example when someone leaves an automated job created for testing running and forgets to stop it. These kinds of queries consume infrastructure resources without benefiting anyone.
My One-Person Hackathon Grows into a Team Effort
So I went to DeveloperWeek, held about six blocks from the Loggly office in downtown San Francisco. After I signed up for the Fraud Detector challenge, I decided to work out of our office because it was much more comfortable and had a lot more food!
There I was in the office trying to solve my challenge, thinking, as a data scientist, "How on earth could I build a high-performance web service? That's not my specialty." Then my colleague Vincent happened to stop by. When he heard what I was trying to do, he decided it would be a fun way to spend his weekend, so we became a team of two. I was glad to have his expertise in building web applications with REST APIs, and in applying concepts I wasn't as familiar with.
Why Hack in Python?
We could have used any language of choice for the hackathon, but Python had quite a few advantages:
- It can be used across a stack, from the web application to the underlying machine learning model. We were able to build the entire fraud detector in Python.
- It has a REPL (Read-Eval-Print Loop), which works like an interactive session. That's one of the greatest beauties of Python, especially for analytics: it lets you load your data, look at it, and interact with it.
Our Process for Creating a Fraud Detector
So how did we hack together a fraud detector in a weekend?
First, we created a set of functions to parse the fields from a log file provided by the sponsors, in which events were labeled as fraudulent or not. The fields are standard in many log files: IP address, user agent, version of the operating system, referrer, etc.
The art here consists of transforming fields into useful predictors, or "features", which we did by exploring and visualizing the parsed data to find signals in it. For example, we observed that the length of the referrer field was correlated with the probability of the event coming from a bot. We then incorporated those features into a model, which starts as a matrix with each feature as one of its columns and each log event as a row.
Here is an example of a log line:
220.127.116.11 Bot+1.0 http%3A%2F%2Fwww.recipecorner.com%2Frecipe%2Farticle%2Fdrinks%2Falcohol%2Fbon-appetit-checks-mustard-lover-seth-meyers false
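A parsing function for such a line can be sketched as follows. The field layout (IP address, user agent, URL-encoded referrer, true/false label) is an assumption based on the sample line above, and the function name is hypothetical:

```python
from urllib.parse import unquote

def parse_log_line(line):
    """Split a whitespace-delimited log line into its fields.

    Assumed layout, based on the sample line: IP address, user agent,
    URL-encoded referrer, and a true/false fraud label.
    """
    ip, user_agent, referrer, label = line.strip().split()
    return {
        "ip": ip,
        "user_agent": user_agent.replace("+", " "),
        "referrer": unquote(referrer),  # decode %3A%2F%2F into ://
        "is_bot": label == "true",
    }

sample = ("220.127.116.11 Bot+1.0 "
          "http%3A%2F%2Fwww.recipecorner.com%2Frecipe%2Farticle%2Fdrinks"
          "%2Falcohol%2Fbon-appetit-checks-mustard-lover-seth-meyers false")
event = parse_log_line(sample)
```

Each parsed event then becomes one row of the feature matrix described above.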
To make this parsed data useful in the type of model we created, technically known as a Support Vector Machine (SVM), we had to categorize all of the strings and give them numerical values. We ended up with a matrix of numerical values, which we then normalized so that each column had a mean of zero and a similar magnitude. Python has a very neat machine learning library that has been improving by leaps and bounds: scikit-learn.
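In scikit-learn terms, the scale-then-classify step can be sketched like this. The feature values below are toy numbers invented for illustration, not the actual hackathon data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy feature matrix: each row is a log event, each column a feature
# (e.g. referrer length, a categorical field encoded as an integer).
X = np.array([[12, 0], [150, 1], [14, 0], [160, 1],
              [11, 0], [155, 1], [13, 0], [148, 1]], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # 1 = bot, 0 = human

# StandardScaler gives every feature zero mean and unit variance,
# which an SVM needs in order to weight features comparably.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X, y)
```

A fitted pipeline like this can then classify a new event with `model.predict`.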
After several iterations on the test data set, I felt I had a model that stood a good chance against real data. The next challenge was to deploy this model as a fast-performing web service with a REST endpoint so that we could meet the criteria of the challenge. We "pickled" the model into a binary object and used the same library I had used to create the model to parse log events as they came in. The only difference is that training data also requires parsing a status field, whereas with real data, the status is exactly what we are trying to predict.
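The pickling step itself is short. Here is a minimal sketch, with a tiny stand-in pipeline in place of the real trained model; in our setup the serialized bytes were written to a file that each serving process loaded once at startup and kept in memory:

```python
import pickle

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the fitted pipeline; the real model was trained on the log data.
model = make_pipeline(StandardScaler(), SVC())
model.fit([[0.0], [1.0], [0.1], [0.9]], [0, 1, 0, 1])

# "Pickling" serializes the trained model to bytes (pickle.dump would
# write them straight to a file instead).
blob = pickle.dumps(model)

# At serving time, each process restores the model from the bytes.
restored = pickle.loads(blob)
```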
When building the REST endpoint, I first thought about Flask, which is a common Python web framework. However, it didn’t perform at the required scale. My Loggly colleague Ivan Tam had recommended Tornado in the past, but I hadn’t had a chance to use it. At this point, I had Sunday morning to learn about Tornado. Fortunately, it’s easy to use, but the time pressure was mounting. Vincent helped me configure it properly.
Tornado worked out well, but it works best when run as separate processes, because Python's Global Interpreter Lock prevents a single process from executing threads in parallel. This approach meant that we would have several instances of Python running and would need something between Python and the end user to round-robin each request to one of the Python processes. So we added Nginx to our architecture. In our application, Nginx was just routing, acting as a load-balancing server that called the 16 Python processes running Tornado in a round-robin fashion. Each process was:
- Loading the model and keeping it in memory
- Classifying each of the requests
- Responding with a header indicating whether or not the request came from a bot
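A minimal sketch of one such Tornado process might look like the following. The handler name, port, and the stand-in `classify` function are hypothetical; the real service loaded the pickled SVM and built the full feature vector for each request:

```python
import tornado.ioloop
import tornado.web

def classify(referrer: str) -> bool:
    """Stand-in for the pickled model; flags suspiciously short referrers.
    The real classifier consumed the full feature vector."""
    return len(referrer) < 20

class BotDetectorHandler(tornado.web.RequestHandler):
    def get(self):
        referrer = self.get_argument("referrer", default="")
        is_bot = classify(referrer)
        # Reply with a header indicating the verdict, per the challenge spec.
        self.set_header("X-Is-Bot", "true" if is_bot else "false")
        self.write({"is_bot": is_bot})

def make_app():
    return tornado.web.Application([(r"/detect", BotDetectorHandler)])

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)  # Nginx would round-robin across one port per process
    tornado.ioloop.IOLoop.current().start()
```

Because each process keeps the model in memory, classification is just a function call, which is what makes the 100-millisecond budget feasible.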
Code samples can be found in my GitHub account at https://github.com/MauricioRoman/FraudDetector
A Nod to Tableau
I found it a lot easier to create my fraud detection model by first visualizing the data, and Tableau is my favorite visualization tool. In this case, I exported the data to a comma-separated values (CSV) file and looked for "features": a technical term for the fields, native to the data or derived from it, that could contain signals for the model. In this particular case, I found that the following pieces of data were significant in detecting bots:
- The URL, and whether that URL is found in Alexa’s top million websites
- The length of the parameters in the URL
- The IP address, broken into four components that were independent variables
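Extracting those features can be sketched as below. The function name is hypothetical, and the Alexa top-million lookup is represented by a small placeholder set rather than the real list:

```python
from urllib.parse import unquote, urlparse

# Placeholder for the real Alexa top-million list of domains.
ALEXA_TOP = {"www.recipecorner.com"}

def extract_features(ip, referrer):
    """Turn the raw fields into a numeric feature vector."""
    parsed = urlparse(unquote(referrer))
    # Split the IP into four octets, treated as independent variables.
    octets = [int(part) for part in ip.split(".")]
    return octets + [
        1 if parsed.netloc in ALEXA_TOP else 0,   # URL in Alexa's top million?
        len(parsed.path) + len(parsed.query),     # length of the URL parameters
    ]

features = extract_features(
    "220.127.116.11",
    "http%3A%2F%2Fwww.recipecorner.com%2Frecipe",
)
```

Stacking one such vector per log event yields the feature matrix that the SVM was trained on.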
Logging Made the Debugging Process Much Easier
One of the nice things about my job is that I work closely with developers, so it’s inevitable that some of their best practices have rubbed off on me. My work with another colleague, Vinh Nguyen, has really helped me understand not only how to structure Python production code but also how to use logging to aid in debugging. When I first started coding in Python, I could write code fairly quickly. However, when I looked at my code two weeks later, it was hard for me to remember exactly what it did. With the best practices I have learned from my colleagues, including logging, I find that my code is easier to maintain.
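The pattern I picked up is small but pays off quickly: one named logger per module, configured once at startup, with exceptions logged alongside their tracebacks. A minimal sketch (the function and event names are hypothetical):

```python
import logging

# Configure logging once at program start; every module-level logger
# created with getLogger(__name__) inherits this format and level.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
log = logging.getLogger(__name__)

def classify_event(event):
    log.debug("classifying event from %s", event.get("ip"))
    try:
        # ... feature extraction and model prediction would go here ...
        return False
    except Exception:
        # log.exception records the message plus the full traceback
        log.exception("failed to classify event: %r", event)
        raise

classify_event({"ip": "220.127.116.11"})
```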
The Hackathon Results
There were eight teams participating in the hackathon. At some point late in the weekend we were winning, but another team beat us out at the end, with a model based on a naïve Bayes predictor. It was a really exciting experience, just like a half-marathon with a really close finish.
I encourage all of you to participate in some hackathons if you get the opportunity. Push yourselves to the limit and push Python to the limit. And don’t forget to use logging to make life much easier for yourself!
How to Find Hackathons?
If you are interested in participating in a Hackathon in San Francisco, check out this calendar. If you are coming from out of town, let us know in advance and we will help you!