Category Archives: top-level domains

Extracting TLDs (top-level domains) – and weird quirks that will upset you

I’ve been using John Kurkowski‘s excellent Python domain extraction library “tldextract” recently. TLDextract can extract the domain name from a URL very easily, for example:

Why is this useful?

This has many applications – for example, if you want to create a summary of the top domains linking to your site, you might have a very large list of referring URLs:

And you could write some simple code to output the domain:

And use the word frequency calculator from my previous post to compile a list of the top referring domains! See that I’ve modified line 10 to instead add the domain as the key:

Which returns:

Why can’t you just split by fullstops at the third slash and take what’s before?

This is what I tried to do at the start:

But since the domain name system is a miasma of top level (e.g. .com), second level (e.g. .gov.uk), standard sub domains (e.g. i.imgur.com) and people with too many fullstops (e.g. www.dnr.state.oh.us) this becomes much more tricky and it becomes impossible to accommodate for everything. So TLDextract actually maintains a local copy of Mozilla’s list of ICANN domains on your system, downloaded from: 

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

And basically finds matches on the ends of URLs from that. Very nice!

So what’s the problem mentioned in the title?

Unfortunately, the caveat of using Mozilla’s list is that you get some seemingly odd behavior. There are a bunch of sites and companies who have requested that their subdomains are TLDs, and are included in the list, from Amazon:

To DynDNS stuff:

And more… So you’ll trip up if you put in something like:

Rather than the expected “.com” as the tld.