Using ACORA to process hundreds of stopwords at once

“80% of data analysis is spent cleaning data, 20% of it is spent complaining about cleaning data” – Chang She and BigDataBorat

This is one of the best quotes I heard at PyData 2013. When dealing with huge amounts of data, often only a fraction of it is relevant to one's analysis, and cleaning it can be a total pain. But it's also an essential stage, so let's make it as painless as possible.

One example is gigantic log files. Say we're dealing with a multi-terabyte Apache log file as follows:
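The original sample isn't reproduced here; a hypothetical couple of lines in Apache's combined log format might look like:

```
66.249.66.1 - - [12/May/2013:06:25:31 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
10.0.0.7 - - [12/May/2013:06:25:32 +0000] "GET /favicon.ico HTTP/1.1" 404 209 "-" "curl/7.29.0"
```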

This is useful data with thousands of lines, and we'd like to analyze it using the big file processing script I mentioned before. However, there are certain lines we're not concerned with, so we can write a simple conditional:
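The original snippet isn't shown; a minimal sketch of such a filter (the stopword `'favicon.ico'` and the `keep_lines` helper are illustrative, not from the original post):

```python
def keep_lines(lines):
    """Yield only the lines that don't contain the unwanted keyword."""
    for line in lines:
        if 'favicon.ico' in line:
            continue  # skip requests irrelevant to the analysis
        yield line

sample = [
    'GET /index.html 200',
    'GET /favicon.ico 404',
]
print(list(keep_lines(sample)))  # -> ['GET /index.html 200']
```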

What if you have 2 things that you don’t want in each line?

What if you have 3 things that you don’t want in each line?
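The naive extension keeps a list of stopwords and scans the line once per keyword — a sketch (the stopwords themselves are illustrative):

```python
STOPWORDS = ['favicon.ico', 'robots.txt', 'Googlebot']

def is_clean(line):
    """Return True if the line contains none of the stopwords.

    Each keyword triggers its own scan of the line, so the cost is
    roughly O(len(line) * len(STOPWORDS)).
    """
    for word in STOPWORDS:
        if word in line:
            return False
    return True

print(is_clean('GET /index.html 200'))   # True
print(is_clean('GET /favicon.ico 404'))  # False
```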

But this is getting super inefficient and a bit silly. Each extra keyword requires yet another pass through the line, so the cost grows linearly with the number of stopwords — with this code, basically everything is a worst-case scenario.

Bring on ACORA!

ACORA is Stefan Behnel’s library based on the Aho-Corasick string matching algorithm. Without diving too deep into the maths behind it, it compiles all your stopwords into a single automaton — think of it as one über-stopword — so a single scan over a log-file line checks for every stopword at once. For example:

But how do we integrate this into the line scanner from before? Just like this!

We’ve replaced the entire multiple stopword matching for-loop with a single ACORA matcher.

A note on performance

ACORA is fantastic, but performance may dip if there are only a few stopwords or only a few lines, since building the automaton has an up-front cost. It performs best when you have about 20+ stopwords and at least 1000 or so log file lines to scan through.