
WNYC Radio: “Are Hackathons Worth It?”

I was recently contacted by Jeff Coltin, a journalist at WNYC Radio, who asked me to participate in a show about hackathons in NYC.

He featured a snippet from our conversation, specifically about problems that the hacker community could solve. I said (vaguely accurate transcription):

“…There are so many problems that hackathons could fix. I think some big issues at the moment in the media, things like the NSA spying scandals and stuff like that. I think one thing the tech community has slightly failed to do is to make encryption really easy. There’s a sort-of inverse relationship between simplicity and security, so the more secure an app, often the more inconvenient it is to use. So we have things like TOR, extra-long passwords (TOR slows down your connection a lot), VPNs – and a lot of very secure services are incompatible with mainstream services. So this level of security and privacy that users want or need is just so inconvenient to achieve, it’s really up to the hacker community to make them much easier to use…”

There have been efforts such as Cryptocat, but its adoption rate still needs to grow. HTTPS is probably the best example of seamless encryption, but even this often fails when people either ignore the warnings or are at a loss as to what to do when the browser flags an HTTPS certificate as invalid.

Cryptography is an incredibly tough field of Computer Science, so creating reliably secure apps is hard. Educating yourself about it can require a fairly superhuman effort, and I have a lot of respect for people who contribute modules in this field to PyPI. I’m hoping to start the Crypto course on Coursera once I have some more free time, but beating the security-simplicity inverse relationship I mentioned is certainly easier said than done.

Teaching Python at Harvard with Software Carpentry


Mike teaching Hamlet in Python. Photo copyright Chris Erdmann: https://twitter.com/libcce/status/371281901191196672

I’m part of an organization called Software Carpentry in NYC, which uses volunteers to teach programming at varying levels to universities, large governmental organizations and other interested groups of people. I previously taught at Columbia, and this past weekend a workshop was held at Harvard, organized by Chris Erdmann, the head librarian at the Harvard-Smithsonian Center for Astrophysics.

Before Software Carpentry, my teaching experience was limited to explaining aspects of programming to friends and family, plus part of a year spent teaching English and French to children and adults in Japan. Teaching is hard. It’s very easy to be critical of a teacher – I’ve often found myself doing so without thinking about the effort and stress behind conveying a complex concept to a group of students, all with varying backgrounds and motivations. I’ve come up with a few conclusions about how to optimize teaching style from my last two SWC events:

Saturday’s Teacher line-up

Things that worked well

  • Humor. Mike sprinkled his tutorial with funny anecdotes which kept the class very lively.
  • Relevant and interesting subject matter. Hamlet was a good choice, as was the theme of cheating at Scrabble, given the librarian-oriented audience. The dictionary brought up several amusing entries for searches like:  grep ".*s.*s.*s.*s.*s.*s" words | less
  • Adding anecdotes to save people googling things. I reckon a large part of any programmer’s activity is simply finding someone who’s done what you want to do before and slightly modifying things – or connecting up the building blocks. So after talking about the benefits of append() vs concatenating with plus signs like first+second, I mentioned things like deque() and format() (see the short sketch after this list).
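A quick sketch of those asides (a minimal illustration of my own, not code from the session):

    words = []
    words.append('Hamlet')        # append() modifies the list in place
    words = words + ['Ophelia']   # + builds a brand-new list every time

    from collections import deque
    recent = deque(maxlen=3)      # deque(): fast appends and pops at both ends

    print('{0} has {1} letters'.format('Hamlet', len('Hamlet')))   # format() vs '+'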

Things to remember for next time

  • Typing SLOWLY. I work a lot with MongoDB, so I end up typing from pymongo import Connection; c = Connection()  20+ times a day into the terminal. I can type that so fast it looks bewildering to newcomers.
  • Using a high contrast terminal with large font and dimmed lights, to make it super easy to see from the back of the room.

What can advanced programmers get out of teaching such basic things?

  • You’ll learn a lot from the other instructors and from students’ questions
  • Community involvement is a great asset on your resume and shows potential employers that you have the ability/drive to train future co-workers
  • It helps to have on-hand analogies and anecdotes, developed while teaching, for explaining technical matters to non-technical people, whether socially or in business.
  • You’ll meet many like-minded people and it feels great to get involved in the community.

What did I learn?

  • The requests library. I normally use urllib2 to grab HTML from web pages. urllib2, it turns out, is just a lower-level, more extensible library for HTTP requests, while requests makes the common cases far simpler, as shown in this Stack Overflow explanation.
  • More about Git. I use SVN at work and thus don’t really submit anything to github. Git is HARD. Erik was an excellent instructor and calmly went from the basics right through to the minutiae of things like .gitignore and diff.
  • What “immutable” really means. I hear this thrown around quite a lot, and it just means that an object can’t be changed in place once created. E.g. myString.split() doesn’t alter myString – it returns a new list, because strings are immutable. Very simple (see the sketch after this list).
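A minimal sketch of those last two points (example.com is just a placeholder URL):

    # the same GET request: urllib2 (lower-level) vs requests (simpler)
    import urllib2
    html = urllib2.urlopen('http://example.com').read()

    import requests
    html = requests.get('http://example.com').text

    # immutability: string methods return new objects; the original is untouched
    s = 'to be or not to be'
    words = s.split()   # ['to', 'be', 'or', 'not', 'to', 'be']
    print(s)            # still 'to be or not to be'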

Review of Data Science for Business (O’Reilly, 2013)

Book cover

I’m currently participating in the O’Reilly Blogger Review Program – where bloggers are given ebooks of recent publications. 

Data Science for Business fits an interesting gap in the market – managers who want to be able to understand what Data Science is, how to recruit Data Scientists or how to manage a data-oriented team. It says it is also for aspiring Data Scientists, but I would probably recommend Andrew Ng’s Machine Learning course and Codecademy’s intro Python course instead if you’re serious about getting your teeth into the field.

Somewhere between an introduction and an encyclopedia, it gives fairly comprehensive overviews of each sub-field, including distinctions that I hadn’t previously thought of so clearly. The authors are mostly unafraid to explain the maths behind the subjects: the book dips into some probability and linear algebra, admittedly with simplified notation. There’s no real mention of implementation (i.e. programming the examples) as one would usually expect from O’Reilly, but most competent readers will at least know what they’re “looking for”, perhaps in terms of packages to install or whether they want to implement a system from scratch. It is certainly written for intelligent professionals and is far from popular science.

Whilst it is very thorough and interesting, it could touch a nerve among Data Scientists: should the manager of a Data Scientist really have to read a book such as this? Surely, in such a position of authority, they should know these techniques already (an extreme example is one footnote that even describes what Facebook is and what it is used for). Such unbalanced hierarchies are often the cause of much unnecessary stress and complication in the workplace – but they are common, so perhaps the book will be useful in exactly that context.

I think, overall, I was hoping for a slightly different book – one with more in-depth case studies of applying existing Data Science knowledge to business scenarios. Nevertheless, it’s an interesting, intelligent guide in an encyclopedic sense and fairly unique in its clarity of explanation and accessibility – I highly doubt I could write a better guide in that respect. Existing Data Scientists will find many clear analogies for explaining their craft to those less technical than themselves, and I reckon that by itself justifies taking a look 🙂

When literal_eval fails: using asterisk notation to read in a datetime object

One great function in Python is the ast (Abstract Syntax Tree) library’s literal_eval. This lets you read in a string version of a Python datatype:
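A minimal sketch (the original snippet was lost; the dictionary here is a made-up example):

    >>> from ast import literal_eval
    >>> literal_eval("{'name': 'mongo', 'count': 42}")
    {'name': 'mongo', 'count': 42}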

Importing a dictionary such as this is similar to parsing JSON using Python’s json.loads decoder. But it also comes with the shortcomings of JSON’s restrictive datatypes, as we can see here when the dictionary contains, for example, a datetime object:
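Something like this happens (the exact error text varies by Python version):

    >>> literal_eval("{'created': datetime.datetime(2013, 8, 10, 21, 46, 52, 638649)}")
    Traceback (most recent call last):
      ...
    ValueError: malformed node or string: <_ast.Call object at 0x...>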

So you might try and write some code to parse the dictionary data-type yourself. This gets very tricky, but eventually you could probably accommodate all common data-types:

But this still doesn’t truly fix our datetime object problem:

Which is where we get to the crux of this post. I thought at first that I could deal with datetime’s formatting by extracting the class  datetime.datetime(2013, 8, 10, 21, 46, 52, 638649) as a tuple by spotting the brackets, then feeding the tuple back into datetime like: 
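Roughly like this (a reconstruction, since the original snippet was lost):

    >>> import datetime
    >>> s = "datetime.datetime(2013, 8, 10, 21, 46, 52, 638649)"
    >>> x = literal_eval(s[s.index('('):])   # spot the opening bracket, grab the tuple
    >>> x
    (2013, 8, 10, 21, 46, 52, 638649)
    >>> datetime.datetime(x)
    TypeError: an integer is required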

But apparently not. The tuple must be extracted – not by a lambda or perhaps list comprehension, but in fact by using asterisk notation:
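Reconstructed in the same vein:

    >>> datetime.datetime(*x)   # * unpacks the tuple into positional arguments
    datetime.datetime(2013, 8, 10, 21, 46, 52, 638649)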

Asterisk ( * ) unpacks an iterable such as x into positional arguments for the function. Simple!

Using ACORA to process hundreds of stopwords at once

“80% of data analysis is spent cleaning data, 20% of it is spent complaining about cleaning data” – Chang She and BigDataBorat

This is one of the best quotes I heard at PyData 2013. When dealing with huge amounts of data, often only a fraction of it is relevant to one’s analysis, and it can be a total pain trying to clean it. But cleaning is also an essential stage, so let’s make it as painless as possible.

One example is gigantic log files. Say we’re dealing with a multi-terabyte Apache log file as follows:
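Here are some hypothetical lines standing in for the lost sample:

    66.249.73.135 - - [10/Aug/2013:21:46:52 -0400] "GET /blog/ HTTP/1.1" 200 5213 "-" "Googlebot/2.1"
    157.55.33.18 - - [10/Aug/2013:21:47:01 -0400] "GET /feed/ HTTP/1.1" 200 1948 "-" "bingbot/2.0"
    98.116.82.4 - - [10/Aug/2013:21:47:12 -0400] "GET /about/ HTTP/1.1" 200 3702 "-" "Mozilla/5.0"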

This is useful data with thousands of lines, and we’d like to analyze it using the big file processing script I mentioned before. However, there are certain lines that you’re not concerned about – so you can write a simple conditional:
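A sketch, where process() stands in for whatever analysis you’re running and the keyword is a hypothetical one from the sample above:

    for line in log_file:
        if 'Googlebot' not in line:   # skip lines containing the one unwanted keyword
            process(line)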

What if you have 2 things that you don’t want in each line?
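You chain on another check (same hypothetical names):

    for line in log_file:
        if 'Googlebot' not in line and 'bingbot' not in line:
            process(line)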

What if you have 3 things that you don’t want in each line?
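And another:

    for line in log_file:
        if 'Googlebot' not in line and 'bingbot' not in line \
                and 'Yandex' not in line:
            process(line)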

But this is getting super inefficient and a bit silly: each extra keyword requires yet another pass through the line, so with this code basically every line is a worst-case scenario.

Bring on ACORA!

ACORA is Stefan Behnel’s library based on the Aho-Corasick string matching algorithm. Without diving too deep into the maths behind it, it basically compiles all your stopwords into a single über-stopword, meaning one scan of this über-stopword over a log-file line checks for every stopword at once. For example:
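A minimal sketch of the acora API, reusing the hypothetical stopwords from above:

    from acora import AcoraBuilder

    builder = AcoraBuilder('Googlebot', 'bingbot', 'Yandex')
    matcher = builder.build()   # compile all the stopwords into one matcher

    # a single pass reports every stopword hit, with its offset in the string
    line = '66.249.73.135 - - "GET /blog/ HTTP/1.1" 200 5213 "-" "Googlebot/2.1"'
    print(matcher.findall(line))   # e.g. [('Googlebot', ...)]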

But how do we integrate this into the line scanner from before? Just like this!
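A sketch under the same assumptions:

    stopwords = ['Googlebot', 'bingbot', 'Yandex']
    matcher = AcoraBuilder(*stopwords).build()

    for line in log_file:
        if not matcher.findall(line):   # no stopword anywhere in the line
            process(line)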

We’ve replaced the entire multiple stopword matching for-loop with a single ACORA matcher.

A note on performance

ACORA is fantastic, but performance may dip if there are only a few stopwords, or only a few lines. It has best performance when you have about 20+ stopwords and at least 1000 or so log file lines to scan through.

Extracting TLDs (top-level domains) – and weird quirks that will upset you

I’ve been using John Kurkowski‘s excellent Python domain extraction library “tldextract” recently. TLDextract can extract the domain name from a URL very easily, for example:
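Along these lines (attribute names as in recent versions of tldextract):

    >>> import tldextract
    >>> tldextract.extract('http://forums.news.cnn.com/')
    ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')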

Why is this useful?

This has many applications – for example, if you want to create a summary of the top domains linking to your site, you might have a very large list of referring URLs:
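Hypothetical stand-ins for the lost sample:

    referrers = ['http://www.reddit.com/r/python/',
                 'http://news.ycombinator.com/',
                 'http://i.imgur.com/example.png',
                 'http://www.reddit.com/r/programming/']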

And you could write some simple code to output the domain:
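For instance:

    for url in referrers:
        print(tldextract.extract(url).domain)
    # reddit, ycombinator, imgur, reddit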

And use the word frequency calculator from my previous post to compile a list of the top referring domains! See that I’ve modified line 10 to instead add the domain as the key:
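A rough sketch of the idea (not the exact code from the earlier post):

    from collections import defaultdict

    frequency = defaultdict(int)
    for url in referrers:
        frequency[tldextract.extract(url).domain] += 1   # the domain is the key

    print(dict(frequency))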

Which returns:
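    # with the hypothetical referrers above:
    {'reddit': 2, 'ycombinator': 1, 'imgur': 1}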

Why can’t you just take everything before the third slash and split it by full stops?

This is what I tried to do at the start:
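A reconstruction of the naive approach:

    def naive_domain(url):
        host = url.split('/')[2]     # everything between the 2nd and 3rd slash
        return host.split('.')[-2]   # assume the domain sits just before the TLD

    naive_domain('http://www.reddit.com/r/python/')   # 'reddit' -- fine
    naive_domain('http://www.dnr.state.oh.us/')       # 'oh' -- wrong!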

But since the domain name system is a miasma of top-level domains (e.g. .com), second-level domains (e.g. .gov.uk), standard subdomains (e.g. i.imgur.com) and hosts with too many full stops (e.g. www.dnr.state.oh.us), this becomes much trickier and it’s impossible to accommodate everything by hand. So TLDextract actually maintains a local copy of Mozilla’s list of ICANN domains on your system, downloaded from:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

And basically finds matches on the ends of URLs from that. Very nice!

So what’s the problem mentioned in the title?

Unfortunately, the caveat of using Mozilla’s list is that you get some seemingly odd behavior. There are a bunch of sites and companies who have requested that their subdomains be treated as TLDs, and these are included in the list – from Amazon:

To DynDNS stuff:

And more… So you’ll trip up if you put in something like:
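For example (mybucket is a hypothetical bucket name; s3.amazonaws.com really is on the list):

    >>> tldextract.extract('http://mybucket.s3.amazonaws.com/')
    ExtractResult(subdomain='', domain='mybucket', suffix='s3.amazonaws.com')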

Rather than the expected “.com” as the tld.

Succinct way to build a frequency table of a Python iterable

This is an interesting and often tackled task in programming, and especially prevalent in NLP and Data Science. Often one has a list of “things” with many repeats and wants to know which ones are the most popular.

Data Examples
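The original snippets were lost, so here are hypothetical stand-ins that behave the same way:

    data = [5, 1, 5, 2, 5, 5, 3, 5, 2, 5, 4, 5, 1, 5, 3, 5]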

or:
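    text = 'the quick brown fox jumps over the lazy dog and eats a banana salad near that tree'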

Which is the most popular number? Which is the most popular letter?

The Code
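A minimal reconstruction using collections.Counter, which does the tallying in a single pass:

    from collections import Counter

    def frequency_table(iterable):
        # tally every item, sorted by descending frequency
        return Counter(iterable).most_common()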

For our data, we now get:
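    >>> frequency_table(data)
    [(5, 9), (1, 2), (2, 2), (3, 2), (4, 1)]   # tie order may vary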

So 5 is the most popular item, with 9 appearances.
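And for the string (top four entries shown):

    >>> frequency_table(text)[:4]
    [(' ', 16), ('a', 11), ('e', 7), ('t', 6)]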

So the space is the most popular, followed by a, e and t.

Really simple multi-processing in Python – and then linking to MongoDB

Multiprocessing in Python is really easy! You can spawn processes by using the Pool()  class.
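A minimal sketch, where square() is a hypothetical worker function:

    from multiprocessing import Pool

    def square(x):
        return x * x

    if __name__ == '__main__':
        pool = Pool(4)                          # spawn 4 worker processes
        print(pool.map(square, [1, 2, 3, 4]))   # -> [1, 4, 9, 16]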

Here, we spawn 4 processes, and use the map() function to send a number to each of them.

This is a trivial example, but gets much more powerful when each process does something like making a remote connection:
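A sketch assuming a local MongoDB and hypothetical database/collection names; each worker opens its own connection, since connections shouldn’t be shared across processes:

    from multiprocessing import Pool
    from pymongo import Connection   # pymongo's older, pre-MongoClient API

    def fetch_count(collection_name):
        c = Connection()   # one connection per worker process
        return c['test'][collection_name].count()

    if __name__ == '__main__':
        pool = Pool(4)
        print(pool.map(fetch_count, ['users', 'logs', 'events', 'stats']))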

Markov Transition Matrix Equilibrium made simple in Python

As a refresher on Markov Models, I’ve been watching Professor Scott E. Page’s excellent videos on YouTube. He does all the computation by hand – and now I’ve written some code to perform it faster.

In video 2, we are given a transition matrix for hypothetical students in a classroom transitioning between alertness and boredom:

                Alert (t)   Bored (t)
    Alert (t+1)    0.8        0.25
    Bored (t+1)    0.2        0.75

This can be represented in Python as:
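Rows are the t+1 states and columns the t states, matching the table (a reconstruction):

    transition = [[0.8, 0.25],   # alert at t+1, given alert / bored at t
                  [0.2, 0.75]]   # bored at t+1, given alert / bored at t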

The vector of students is, again, a simple list:
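Assuming a hypothetical 50/50 starting split between alert and bored:

    students = [0.5, 0.5]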

Let’s calculate one stage:
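A reconstruction, since the original snippet was lost:

    def markov_stage(vector, matrix):
        # multiply the transition matrix by the current state vector
        return [sum(row[i] * vector[i] for i in range(len(vector)))
                for row in matrix]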

Calling this produces:
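    >>> markov_stage(students, transition)
    [0.525, 0.475]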

Or, let’s try to find the equilibrium state by looping  markov_stage until the values basically stop changing (to a certain decimal place accuracy):
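Again a reconstruction; the loop stops once the values are stable well below the reporting precision:

    def markov_equilibrium(vector, matrix, places=2):
        tolerance = 10 ** -(places + 2)
        while True:
            new_vector = markov_stage(vector, matrix)
            if all(abs(old - new) < tolerance
                   for old, new in zip(vector, new_vector)):
                return [round(v, places) for v in new_vector]
            vector = new_vector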

This produces, with an accuracy set at 2 decimal places:
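    >>> markov_equilibrium(students, transition)
    [0.56, 0.44]   # the analytic equilibrium: 5/9 alert, 4/9 bored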

I probably could have written this in NumPy – which would calculate faster using less memory (and has built-in functions for the vector-matrix multiplication) – but it was fun just doing it by hand. I’ll try and extend markov_equilibrium to give some more detailed stats, such as the “churn” mentioned by Prof. Page.

Custom DefaultDicts with Lambdas in Python for creating detailed frequency tables, or anything else

In Python, there is the dictionary datatype. This is basically a look-up table:
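For example, with hypothetical keys and values:

    ages = {'alice': 32, 'bob': 29}
    ages['alice']   # -> 32
    ages['carol']   # -> KeyError!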

Let’s try and create a frequency table for words though:
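A reconstruction of the plain-dict version:

    def word_frequencies(words):
        table = {}
        for word in words:
            try:
                table[word] += 1
            except KeyError:
                table[word] = 1   # first sighting of this word
        return table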

This bad code will eventually return a dictionary with entries like:

thing: how_many_times_it_occurred

However, we have to use this try-except statement in case the key doesn’t exist yet. Defaultdicts get rid of this stage by supplying a blank entry (0, an empty string, an empty list) for missing keys instead, which is really awesome!
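The same function with a defaultdict (a reconstruction):

    from collections import defaultdict

    def word_frequencies(words):
        table = defaultdict(int)   # missing keys start out as int() == 0
        for word in words:
            table[word] += 1       # no try-except needed
        return table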

If the key didn’t already exist in our look-up table, the defaultdict calls int() to create the starting value (0) for the new entry! This defaultdict(int) could be replaced with defaultdict(list) or any other callable.

And now to the crux of the post! We can replace this variable type with a lambda instead, like this:
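A reconstruction of the idea:

    table = defaultdict(lambda: {'count': 0, 'last_seen': None})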

Now, when the key doesn’t exist, the dictionary will create a new dictionary within! So we can bring another metric into our analysis:
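A sketch, where the (word, timestamp) input format is a hypothetical stand-in:

    from collections import defaultdict

    def word_stats(tagged_words):
        table = defaultdict(lambda: {'count': 0, 'last_seen': None})
        for word, timestamp in tagged_words:
            table[word]['count'] += 1              # how many times it occurred
            table[word]['last_seen'] = timestamp   # when it last occurred
        return table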

Now our function will return a dictionary that not only lets you know how many times something occurred, but also when it last occurred! Try it out with the following data:
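Hypothetical stand-in data, since the original sample was lost:

    data = [('python', '2013-08-01'), ('mongo', '2013-08-02'),
            ('python', '2013-08-05'), ('git', '2013-08-07'),
            ('python', '2013-08-10')]

    print(dict(word_stats(data)))
    # {'python': {'count': 3, 'last_seen': '2013-08-10'},
    #  'mongo': {'count': 1, 'last_seen': '2013-08-02'},
    #  'git': {'count': 1, 'last_seen': '2013-08-07'}}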