“80% of data analysis is spent cleaning data, 20% of it is spent complaining about cleaning data” – Chang She and BigDataBorat
This is one of the best quotes I heard at PyData 2013. When dealing with huge amounts of data, often only a fraction of it is relevant to the analysis at hand, and cleaning it can be a total pain. But it's also an essential stage, so let's make it as painless as possible.
One example is gigantic log files. Say we're dealing with a multi-terabyte Apache log file like this:
```
127.0.0.1 - - [17/Jul/2013:07:22:38 -0500] "GET /static/js/bootstrap-dropdown.js HTTP/1.1" 304 - "http://127.0.0.1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36"
127.0.0.1 - - [17/Jul/2013:07:22:38 -0500] "GET /static/js/bootstrap-scrollspy.js HTTP/1.1" 304 - "http://127.0.0.1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36"
127.0.0.1 - - [17/Jul/2013:07:22:38 -0500] "GET /static/js/bootstrap-tab.js HTTP/1.1" 304 - "http://127.0.0.1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36"
127.0.0.1 - - [17/Jul/2013:07:22:38 -0500] "GET /static/js/bootstrap-tooltip.js HTTP/1.1" 304 - "http://127.0.0.1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36"
127.0.0.1 - - [17/Jul/2013:07:22:38 -0500] "GET /static/js/bootstrap-popover.js HTTP/1.1" 304 - "http://127.0.0.1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36"
127.0.0.1 - - [17/Jul/2013:07:22:38 -0500] "GET /static/js/bootstrap-button.js HTTP/1.1" 304 - "http://127.0.0.1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36"
127.0.0.1 - - [17/Jul/2013:07:22:38 -0500] "GET /static/js/bootstrap-collapse.js HTTP/1.1" 304 - "http://127.0.0.1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36"
127.0.0.1 - - [17/Jul/2013:07:22:38 -0500] "GET /static/js/bootstrap-carousel.js HTTP/1.1" 304 - "http://127.0.0.1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36"
127.0.0.1 - - [17/Jul/2013:07:22:38 -0500] "GET /static/js/bootstrap-typeahead.js HTTP/1.1" 304 - "http://127.0.0.1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36"
127.0.0.1 - - [17/Jul/2013:07:22:39 -0500] "GET /favicon.ico HTTP/1.1" 404 4025 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36"
```
This is useful data with thousands of lines, and we'd like to analyze it using the big-file processing script I mentioned before. However, there are certain lines we're not concerned with, so we can write a simple conditional:
```python
with open('mylogfile.txt', 'r') as f:
    for line in f:
        if "some_keyword" not in line:
            do_something()
```
What if you have 2 things that you don’t want in each line?
```python
with open('mylogfile.txt', 'r') as f:
    for line in f:
        if ("some_keyword" not in line) and ("other_keyword" not in line):
            do_something()
```
What if you have 3 things that you don’t want in each line?
```python
with open('mylogfile.txt', 'r') as f:
    for line in f:
        line_is_ok = True
        for stopword in ["some_keyword", "other_keyword", "nope"]:
            if stopword in line:
                line_is_ok = False
                break  # exit the loop
        if line_is_ok:  # only process line if line_is_ok hasn't been tampered with
            do_something()
```
But this is getting inefficient and a bit silly. Each `in` check is a separate scan of the line, so every extra keyword means yet another full pass over it. Worse, a clean line (the common case) matches no stopword, so every check runs to completion: with this code, basically everything is a worst-case scenario.
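As an aside, the flag-and-break loop above can be written more compactly with `any()`, which short-circuits the same way (the helper name here is mine, not from the original script). It's tidier, but it doesn't fix the underlying cost: each stopword still triggers its own scan of the line.

```python
stopwords = ["some_keyword", "other_keyword", "nope"]

def line_is_ok(line):
    # any() stops at the first hit, but a clean line still
    # pays for one full scan per stopword
    return not any(stopword in line for stopword in stopwords)

print(line_is_ok('127.0.0.1 - - "GET /index.html HTTP/1.1" 200'))       # True
print(line_is_ok('127.0.0.1 - - "GET /some_keyword.js HTTP/1.1" 304'))  # False
```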
Bring on ACORA!
ACORA is Stefan Behnel's library based on the Aho-Corasick string matching algorithm. Without diving too deep into the maths behind it: it compiles all your stopwords into a single matching automaton (a kind of über-stopword), so one scan over a log-file line checks for every stopword at once. For example:
```python
from acora import AcoraBuilder

# create a list of stopwords - these are the entries
# that we'd like to ignore
stopwords = ['.js', '.ico', '.css', '.jpg', ...]

# enter them into a new AcoraBuilder
builder = AcoraBuilder()
for stopword in stopwords:
    builder.add(stopword)

# now let it construct the uber-stopword
stopword_matcher = builder.build()

# finditer() returns a generator of matches, so we can
# stop iterating as soon as it finds something
for match in stopword_matcher.finditer(line_from_log_file):
    print("Match found: {0}".format(match))
```
But how do we integrate this into the line scanner from before? Just like this!
```python
with open('mylogfile.txt', 'r') as f:
    for line in f:
        line_is_ok = True
        for match in stopword_matcher.finditer(line):
            line_is_ok = False
            break  # one match is enough to reject the line
        if line_is_ok:
            do_something()
```
We've replaced the entire multiple-stopword matching loop with a single ACORA matcher.
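A further aside (not from the original post): if adding a dependency isn't an option, the standard library's `re` module gets you a similar single-call check by compiling the stopwords into one alternation. Aho-Corasick has stronger worst-case guarantees than Python's backtracking regex engine, but for plain literal keywords this is a reasonable fallback; a minimal sketch with illustrative stopwords:

```python
import re

stopwords = ['.js', '.ico', '.css', '.jpg']

# re.escape keeps the literal dots from acting as regex wildcards
stopword_re = re.compile('|'.join(re.escape(s) for s in stopwords))

def line_is_ok(line):
    # one search() call tests every stopword against the line
    return stopword_re.search(line) is None
```

Because the alternation is compiled once, the per-line cost is a single `search()` rather than one substring scan per stopword.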
A note on performance
ACORA is fantastic, but building the automaton has some overhead, so performance may dip if you have only a few stopwords or only a few lines to scan. In my experience it performs best once you have roughly 20+ stopwords and at least a thousand or so log-file lines to scan through.