Monthly Archives: July 2013

Using ACORA to process hundreds of stopwords at once

“80% of data analysis is spent cleaning data, 20% of it is spent complaining about cleaning data” – Chang She and BigDataBorat

This is one of the best quotes I heard at PyData 2013. When dealing with huge amounts of data, often only a fraction of it is relevant to one’s analysis, and cleaning it can be a total pain. But this is also an essential stage, so let’s make it as painless as possible.

One example is gigantic log files. Say we’re dealing with multi-terabyte Apache log files like the following:

This is useful data with thousands of lines, and we’d like to analyze it using the big file processing script I mentioned before. However, there are certain lines that we’re not concerned about – so we can write a simple conditional:
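A minimal sketch of such a conditional, assuming the log is read line by line and that `robots.txt` is the string to skip. The function name `keep_lines`, the keyword and the two sample entries are made up for illustration and are not taken from the original script:

```python
import io


def keep_lines(lines, unwanted):
    """Yield only the lines that do not contain the unwanted substring."""
    for line in lines:
        if unwanted not in line:
            yield line


# Stand-in for a real log file; both entries are invented for this sketch.
sample_log = io.StringIO(
    '192.0.2.1 - - "GET /index.html HTTP/1.1" 200\n'
    '192.0.2.2 - - "GET /robots.txt HTTP/1.1" 200\n'
)

kept = list(keep_lines(sample_log, "robots.txt"))
```

Because `keep_lines` is a generator, it never holds more than one line in memory, which matters once the file no longer fits in RAM.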

What if you have 2 things that you don’t want in each line?

What if you have 3 things that you don’t want in each line?
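For any number of unwanted strings, the conditional generalises to a loop. A rough sketch, assuming the unwanted strings live in a list called `stopwords` (the names and the sample lines are hypothetical):

```python
def contains_any(line, stopwords):
    """Return True if any stopword occurs in the line."""
    for word in stopwords:
        # Each `in` test is a separate scan of the line.
        if word in line:
            return True
    return False


def keep_lines(lines, stopwords):
    """Return the lines that contain none of the stopwords."""
    return [line for line in lines if not contains_any(line, stopwords)]


# Invented sample data, for illustration only.
stopwords = ["robots.txt", "favicon.ico", "/static/"]
lines = [
    "GET /index.html",
    "GET /favicon.ico",
    "GET /static/site.css",
]

kept = keep_lines(lines, stopwords)
```

Note that a line containing no stopword at all is scanned once per stopword before it is kept.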

But this is getting super inefficient and a bit silly. Each extra keyword requires yet another pass through the line, and every line we actually want to keep is a worst case: it has to be checked against every keyword before we know it’s clean.

Bring on ACORA!

ACORA is Stefan Behnel’s library based on the Aho-Corasick string matching algorithm. Without diving too deep into the maths behind it, it compiles all your stopwords into a single über-stopword, so one scan over a log-file line checks for every stopword at once. For example:
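To make the idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick matcher. It is not ACORA's code or API (ACORA is a compiled extension and far faster); the function names `build_automaton` and `find_all` and the sample stopwords are my own, for illustration only:

```python
from collections import deque


def build_automaton(keywords):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> keywords that end at this state

    # Phase 1: insert every keyword into a trie.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Phase 2: breadth-first pass that fills in the failure links.
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Scan text once and return (keyword, start_index) for every match."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((word, i - len(word) + 1))
    return matches


# Hypothetical stopwords and log fragment.
automaton = build_automaton(["robots.txt", "favicon.ico", "bot"])
matches = find_all("GET /robots.txt", automaton)
```

The line is read character by character exactly once, no matter how many stopwords were compiled in. With ACORA itself, as I understand its documentation, the equivalent steps are building a matcher with `AcoraBuilder(...).build()` and calling its `findall` method on each line.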

But how do we integrate this into the line scanner from before? Just like this!

We’ve replaced the entire multiple stopword matching for-loop with a single ACORA matcher.

A note on performance

ACORA is fantastic, but it may not pay off if there are only a few stopwords or only a few lines, since the matcher has to be built before it can be used. It performs best when you have 20 or more stopwords and at least 1,000 or so log file lines to scan through.

Extracting TLDs (top-level domains) – and weird quirks that will upset you

I’ve been using John Kurkowski’s excellent Python domain extraction library “tldextract” recently. It can extract the domain name from a URL very easily, for example:

Why is this useful?

This has many applications – for example, if you want to create a summary of the top domains linking to your site, you might have a very large list of referring URLs:

And you could write some simple code to output the domain:

And use the word frequency calculator from my previous post to compile a list of the top referring domains! See that I’ve modified line 10 to instead add the domain as the key:

Which returns:

Why can’t you just split by fullstops at the third slash and take what’s before?

This is what I tried to do at the start:
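A sketch of that naive approach, under one plausible reading of it: take the text between the second and third slash as the host, split it on fullstops and keep the last two parts. The function name `naive_domain` and the sample URLs are hypothetical, and the code assumes every URL starts with a scheme such as `http://`:

```python
def naive_domain(url):
    """Guess the domain from the last two dot-separated labels of the host."""
    host = url.split("/")[2]   # text between the second and third slash
    return ".".join(host.split(".")[-2:])


simple = naive_domain("http://www.example.com/page")
tricky = naive_domain("http://www.bbc.co.uk/news")
```

Here `simple` comes out as `example.com`, which is what we want, but `tricky` comes out as `co.uk`, which is not a domain anyone owns.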

But the domain name system is a miasma of top-level domains (e.g. .com), second-level ones (e.g. .gov.uk), standard subdomains (e.g. i.imgur.com) and people with too many fullstops (e.g. www.dnr.state.oh.us), so this quickly gets tricky and it becomes impossible to accommodate everything. So tldextract actually maintains a local copy of Mozilla’s Public Suffix List on your system, downloaded from:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

It then basically looks for the longest entry in that list that matches the end of the URL’s hostname. Very nice!
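The matching idea can be sketched with the standard library alone. This is not tldextract's implementation: `SUFFIXES` is a tiny hand-picked set standing in for the real list, `split_host` is a made-up name, and the sketch ignores the wildcard and exception rules that the real list contains. It also assumes each URL includes a scheme, since `urlsplit` needs one to find the hostname:

```python
from urllib.parse import urlsplit

# A tiny stand-in for the full suffix list (illustration only).
SUFFIXES = {"com", "org", "uk", "co.uk", "gov.uk"}


def split_host(url, suffixes=SUFFIXES):
    """Return (subdomain, domain, suffix) using the longest matching suffix."""
    labels = urlsplit(url).hostname.split(".")
    # Try the longest candidate first, so "co.uk" wins over "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[:i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    # No known suffix: hand the whole host back as the domain.
    return "", ".".join(labels), ""


bbc = split_host("http://forums.bbc.co.uk/news")
imgur = split_host("http://i.imgur.com/picture.png")
```

Everything hinges on what is in the suffix set: add or remove an entry and the same hostname splits differently.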

So what’s the problem mentioned in the title?

Unfortunately, the caveat of using Mozilla’s list is that you get some seemingly odd behavior. A bunch of sites and companies have requested that their subdomains be treated as TLDs, and these are included in the list, from Amazon:

To DynDNS stuff:

And more… So you’ll trip up if you put in something like:

Rather than the expected “.com” as the TLD.

Succinct way to build a frequency table of a Python iterable

This is an interesting and often-tackled task in programming, and especially prevalent in NLP and data science. Often one has a list of “things” with many repeats and wants to know which ones are the most popular.

Data Examples

or:

Which is the most popular number? Which is the most popular letter?

The Code
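One succinct way to do this with the standard library is `collections.Counter`. This is a sketch rather than necessarily the code originally posted here, and the function name `frequency_table` and the sample string are my own:

```python
from collections import Counter


def frequency_table(items):
    """Return (item, count) pairs sorted from most to least frequent."""
    return Counter(items).most_common()


# Works on any iterable of hashable items: lists, strings, generators.
# The sample string is invented for this sketch.
table = frequency_table("abracadabra")
top_item, top_count = table[0]
```

`most_common()` also accepts a number, so `Counter(items).most_common(3)` gives just the top three.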

For our data, we now get:

So 5 is the most popular item, with 9 appearances.

So the space is the most popular, followed by a, e and t.