Category Archives: Python

Avoiding multiple reads with top-level imports

Recently I’ve been working with various applications that require importing large JSON definition files which detail complex application settings. Often, these files are required by multiple auxiliary modules in the codebase. All principles of software engineering point towards importing this sort of file only once, regardless of how many secondary modules it is used in.

My instinctive approach to this would be to have a main handler module read in the file and then pass its contents as a class initialization argument:
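
The original snippet isn't shown here, so this is a minimal sketch of that approach (the file, module and class names are just placeholders):

```python
# handler.py -- a sketch of the "pass the settings in as an argument" approach
import json

from module1 import Class1
from module2 import Class2

# Read the big JSON definition file once, here in the handler...
with open('settings.json') as f:
    settings = json.load(f)

# ...then hand the parsed contents to every class that needs it
c1 = Class1(settings)
c2 = Class2(settings)
```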

The problem with this is that if you have an elaborate import process, and multiple files to import, it could start to look messy. I recently discovered that this multiple initialization argument approach isn’t actually necessary.

In Python, you can actually import the same settings loader module in the two auxiliary modules (module1 and module2), and Python will only load it once:
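
Something along these lines (again a sketch — the settings_loader / module1 / module2 names mirror the ones mentioned above, and the print() is only there to show how often the file actually gets read):

```python
# settings_loader.py -- the file is read once, when the module is first imported
import json

print("Reading settings.json ...")   # demonstrates how many times this runs
with open('settings.json') as f:
    settings = json.load(f)


# module1.py
import settings_loader

class Class1(object):
    def __init__(self):
        self.settings = settings_loader.settings


# module2.py
import settings_loader

class Class2(object):
    def __init__(self):
        self.settings = settings_loader.settings
```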

Now when we test this out in the terminal:
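
With the sketch above, the session looks something like this — the "Reading settings.json ..." line only appears once:

```python
>>> import module1
Reading settings.json ...
>>> import module2
>>>
```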

Despite settings_loader  being imported twice (once by each module), Python only actually loads and runs it once. This is extremely useful, but it could also cause headaches if you genuinely wanted to read the file twice. If so, I would put the settings-reading code inside the __init__()  of each ClassX and instantiate it twice.

Mocking out an API call deep in your code

With any actively developed (Python) coding project, you and your team are going to be running the same set of tests sometimes hundreds of times per week. If there’s an HTTP request to any 3rd-party source in there, this can cause problems. API calls can be expensive, excessive scraping of the same source can cause IP blacklisting, and the calls can simply slow down your whole test process, adding extra baggage to the code deployment process.

To fix this, we can use Python’s mock library. Mock is really useful for creating fake function calls, fake classes and other fake objects which can return fake values. In most cases when testing, you are really just testing how the application parses data rather than the reliability of the 3rd-party service; the API’s response is generally the same. Mock lets you simulate the API’s response and parse its data rather than actually having to make the call each time.

It’s quite tricky to set up, so I thought I would write a tutorial. The setup has a few components, but I’ll try to explain it as well as possible. Let’s say there is a service that provides some useful API response. There’s a site, HTTPBin, set up by Kenneth Reitz to test HTTP libraries, which we will use here. Check out: https://httpbin.org/ip. The content is as follows:
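
The endpoint returns a tiny JSON document containing your own public IP address — roughly this (the address here is obviously just illustrative):

```
{
  "origin": "123.123.123.123"
}
```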

Let’s say our program wants to grab the IP address in the origin field. Yup – a fairly pointless program but this will be analogous to many situations you’ll encounter.

Here’s a totally over-engineered class to get data from this service. When the class is initialized in __init__, it creates a base_url variable pointing to HTTPBin. The main handler is the get_ip function, which simply grabs that field’s content. It first calls api_call, which uses requests.get to fetch the HTTP data.
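
The original class isn’t reproduced here, so below is a reconstruction of what such an over-engineered class might look like. The class name IPGrabber and the `from requests import get` import style are my own assumptions (the latter matters for the patching discussed later):

```python
# my_module.py -- a sketch, deliberately over-engineered
from requests import get   # imported this way so tests can patch "my_module.get"


class IPGrabber(object):

    def __init__(self):
        # base_url points at the HTTPBin service
        self.base_url = "https://httpbin.org/ip"

    def api_call(self, url):
        # The actual HTTP request lives here; .json() decodes the response body
        return get(url).json()

    def get_ip(self):
        # Main handler: pull the "origin" field out of the API response
        return self.api_call(self.base_url)["origin"]
```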

Running this code is simple (in a Python shell):
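
Assuming the sketch above, something like:

```python
>>> from my_module import IPGrabber
>>> IPGrabber().get_ip()
'123.123.123.123'   # your own public IP address
```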

What if we want to mock out requests.get? The Mock module documentation is quite unclear on how to target a specific function deep within a class. It turns out the easiest way to do this is not MagicMock or return_value but instead to use the counter-intuitively named “side_effect” feature. This is the testing module pre-mocking:
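
Again, the original test module isn’t shown, so here is a minimal sketch of what a pre-mocking test might look like (the "valid-ish" check just looks for four dot-separated numeric chunks):

```python
# tests.py -- pre-mocking sketch
import unittest

from my_module import IPGrabber


class TestIPGrabber(unittest.TestCase):

    def test_ip_grabber(self):
        ip = IPGrabber().get_ip()            # real HTTP call to httpbin.org
        parts = ip.split('.')
        self.assertEqual(len(parts), 4)      # four chunks...
        for part in parts:
            self.assertTrue(part.isdigit())  # ...each of which is numeric


if __name__ == '__main__':
    unittest.main()
```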

As you can see, this is a standard set of tests to check that the IP grabber returns a valid-ish IP address. It is run as follows:
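
Something like this (the timing is illustrative — note that the test really does go out over the network):

```
$ python tests.py
.
----------------------------------------------------------------------
Ran 1 test in 0.532s

OK
```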

However, the problem here is that it is going to call the actual API each time you run the tests. To stop this, let’s integrate the mock module:
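
Continuing the sketch from above (fake_get returns a tiny stand-in response object whose json() method hands back the fake data):

```python
# tests.py -- with mock integrated
import unittest

import mock

from my_module import IPGrabber


def fake_get(url):
    """Stand-in for requests.get: no network traffic at all."""
    class FakeResponse(object):
        def json(self):
            return {'origin': '123'}
    return FakeResponse()


class TestIPGrabber(unittest.TestCase):

    @mock.patch('my_module.get', side_effect=fake_get)
    def test_ip_grabber(self, mock_get):
        ip = IPGrabber().get_ip()            # no real HTTP call any more
        parts = ip.split('.')
        self.assertEqual(len(parts), 4)
        for part in parts:
            self.assertTrue(part.isdigit())


if __name__ == '__main__':
    unittest.main()
```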

Here we’ve:

  1. Imported the mock module. Note: if you get an error about “wraps” in the “six” module then it is almost certainly because you have more than one installation of six or mock and one needs to be deleted.
  2. Created a fake function fake_get to replace requests.get. For now it returns an “origin” of just “123”, so you can see how it makes the test fail below.
  3. Added the mock.patch wrapper around the test_ip_grabber function. Very important here is specifying the function name as it is imported in my_module, NOT as it lives in the requests library; i.e. we patch “my_module.get” rather than “requests.get”. The side_effect= argument then says to replace it with whatever function we want.
  4. Added the patched mock object as an extra argument to the test function, since mock.patch passes it in (it appears as mock_get in the sketch above).

Running this, we get:
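
With the fake “123” in place, the output looks roughly like this:

```
$ python tests.py
F
======================================================================
FAIL: test_ip_grabber (__main__.TestIPGrabber)
----------------------------------------------------------------------
Traceback (most recent call last):
  ...
AssertionError: 1 != 4
----------------------------------------------------------------------
Ran 1 test in 0.002s

FAILED (failures=1)
```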

Mock’s side_effect has replaced requests.get 🙂 To make this pass, just replace  return {'origin': '123'} with return {'origin': '123.123.123.123'}  and run again:

Tests pass and zero HTTP traffic! 🙂

Converting an HTML table to an Excel download HTTP Response: A hack for slow OLAP DB connections

Overview

Zenko (“Good fox” in Japanese) is a reporting system (see code on Github here) I’ve created over the last couple of weeks at Mozilla. Basically my non-technical coworkers were getting so frustrated by Tableau (“what the heck is the difference between INNER JOIN and OUTER JOIN?”) that I decided to create a simple dashboard interface for them.

It’s a simple Bootstrap front-end to a database containing campaign stats for sponsored tiles. You can drill down to each tile or client/partner and pivot by things like locale, country and date.

Zenko’s stack (high to low)

A new feature

When loading one of the analyses pages, a table will be shown. My coworker wanted to be able to download the data to Excel. I came up with 4 possible ways to implement this:

  1. Simply rerun the query, format the results as a csv on the backend, save it and window.open() the file location.
  2. Automatically save the data from each analysis server request and periodically clear out old files.
  3. Use a javascript library like ExcelBuilder
  4. Send the data back to the server, format it, and then back to the client via an iframe

Which is the best solution?

  1. This is problematic because our sticking point is the query speed. The Redshift database is an OLAP, column-oriented, append-only database. This means that it is insanely fast to add data to, but quite slow (often 6+ seconds) to query. Yes, it is dealing with billions of rows, so that’s excusable, but it’s not so great in terms of user experience to wait so long. The user doesn’t want to wait another 6 seconds for the analysis to rerun when they have the data already.
  2. This sounds like it could just end up storing a lot of data on the client, but it could work quite well. In terms of security, though, I’m not sure that the data should be lingering on the user’s PC unrequested.
  3. This didn’t work out so well – in Firefox, the file is incorrectly named. In the future, I’d like to name the files according to the parameters of the analysis e.g. <client>-<date>-<country>.xls
  4. This is the weirdest solution, but it works! Flask is running locally so it is actually very fast. There are no huge jQuery/JavaScript complications with file permissions, and the fact that you can manipulate the data easily on the server is nice too.

Solution 4

The process is as follows when the “Download for Excel” button is clicked:

  1. Reference the HTML table using JavaScript and convert it to an array of arrays
  2. Append an iframe to the DOM
  3. Append a form with a POST action and hidden field to the iframe
  4. Insert the table contents into the hidden field’s value
  5. Submit the form
  6. Let Flask receive the POST request and format the information as a CSV
  7. Return an HTTP response with a file attachment containing the CSV

Let’s implement it

There were various ways to do this in jQuery with iterable.each() , but I ran into complications, and simply referencing cells using .children  was much easier.

The (locally running) Flask app will then receive a POST request at /download_excel . Let’s set up the route:
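
The actual route isn’t reproduced here, so this is a sketch of how it could look. The form-field name (table_data) and the row/cell delimiters are assumptions on my part — they just need to match whatever the JavaScript side stringified:

```python
# A sketch of the /download_excel route
import csv
import io

from flask import Flask, Response, request

app = Flask(__name__)


@app.route('/download_excel', methods=['POST'])
def download_excel():
    # The hidden form field holds the table contents as delimited text
    raw = request.form['table_data']
    rows = [row.split(',') for row in raw.split(';')]

    # Re-serialise the rows as a proper CSV
    output = io.StringIO()
    csv.writer(output).writerows(rows)

    # Returning it as an attachment makes the browser pop up a download dialog
    return Response(
        output.getvalue(),
        mimetype='text/csv',
        headers={'Content-Disposition': 'attachment; filename=data.csv'},
    )
```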

Now, when the user clicks the button:

Download Link for Excel

They instantly get:

Download popup

Sorry, I can’t show what it looks like in Excel because the data isn’t public at the moment. All code is however available here on github!

One bizarre thing, however, is that the form doesn’t appear in the inspector (in either Chrome or Firefox):

Invisible in the inspector

Though, you can access it with some fairly lengthy getters:


Future features 

  • The files could be named something more intuitive than data.csv  – perhaps a combination of various things seen in the URL’s query string
  • Accommodate a table wider than 6 rows. This could be done easily by stringifying the array using a different delimiter such as a “###”.
  • Create an .xls file rather than a CSV, if there is any advantage

MongoDB aspirin

Basically this is a list in progress of common errors/tasks/gripes I get when using Mongo. I’ve noted down what usually works. Maybe you’ll find it useful 🙂

Why has my remote mongodb connection been refused?

    • Delete the mongod.lock  file from your main mongodb storage folder and restart [SO]
    • If on Red Hat, check  sestatus  to see whether SELinux is interfering
    • Modify /etc/sysconfig/iptables   to have the correct firewall rules according to the mongodb docs.

How can you iterate through all MongoDB collections in pymongo?

I normally access collections as object attributes, like   conn.Database.Collection.find_one() , but databases and collections can actually be accessed like dictionary keys as well:
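
A small sketch (the method names below are the current pymongo ones — older versions used Connection, database_names() and collection_names() instead):

```python
from pymongo import MongoClient

conn = MongoClient()

for db_name in conn.list_database_names():
    db = conn[db_name]                       # dictionary-style access to a database
    for coll_name in db.list_collection_names():
        coll = db[coll_name]                 # dictionary-style access to a collection
        print(db_name, coll_name, coll.find_one())
```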

Why is mongod terminating whenever I close the shell? Even when using &  at the end

When starting mongod, use mongod --fork (note: fork must be right after the word mongod) and it will start as a background process instead. Or just add fork = true  to your config.

I just created an authenticated database and can’t even use show dbs !

Create a new user with all 4 of the following permissions: userAdminAnyDatabase, readWriteAnyDatabase, dbAdminAnyDatabase, clusterAdmin:
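
The original shell snippet isn’t shown; as a sketch, something like the following should work from pymongo against the admin database (the user name and password are placeholders — in the mongo shell you’d use db.createUser, or db.addUser on very old versions):

```python
from pymongo import MongoClient

client = MongoClient('localhost', 27017)

# Create a superuser in the admin database with all four roles
client.admin.command(
    "createUser", "superuser",
    pwd="a-strong-password",
    roles=[
        "userAdminAnyDatabase",
        "readWriteAnyDatabase",
        "dbAdminAnyDatabase",
        "clusterAdmin",
    ],
)
```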

A quick solution to OrderedDict’s limitations in Python with O(1) index lookups

Background to the Problem

I work regularly with gigantic machine learning datasets. One very versatile format, for use in WEKA, is “ARFF” (Attribute-Relation File Format). This essentially creates a nicely structured, rich CSV file which can easily be used for Logistic Regression, Decision Trees, SVMs etc. To deal with very sparse CSV data, there is a sparse ARFF format that lets users convert sparse lines in each file such as:

f0 f1 f2 f3 fn
1 0 1 0 0

Into a more succinct version where you have a list of features and simply specify each feature’s index and value (if any):

@ATTRIBUTE f0 NUMERIC
@ATTRIBUTE f1 NUMERIC
@ATTRIBUTE f2 NUMERIC
@ATTRIBUTE f3 NUMERIC

@ATTRIBUTE fn NUMERIC
@DATA
{0 1, 2 1}

i.e. {feature-index-zero is 1, feature-index-two is 1}, simply omitting all the zero-values.

The Implementation Problem

This is easy enough if you have, say, 4 features, but what if you have over 1 million features and need to find the index of each one? Searching for a feature in a list is O(n), and if your training data is huge too, then creating the sparse ARFF is going to be hugely inefficient:
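
The original snippet isn’t shown; a minimal sketch of the naive list-based approach might be:

```python
# Naive sketch: list.index() is a linear scan, so this hurts with ~1M features
features = ['f0', 'f1', 'f2', 'f3']          # imagine a million of these

def sparse_arff_row(present_features):
    # Look up each feature's index with an O(n) scan, then format the sparse row
    entries = sorted((features.index(f), 1) for f in present_features)
    return '{' + ', '.join('%d %d' % pair for pair in entries) + '}'

print(sparse_arff_row(['f0', 'f2']))         # {0 1, 2 1}
```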

I thought I could improve this by using an OrderedDict. This is, very simply, a dictionary that maintains the order of its items – so you can pop() items from the end in a stack-like manner. However, after some research on StackOverflow, it turns out that, disappointingly, this doesn’t provide any efficient way to calculate the index of a key:
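
In other words, you still end up doing something linear like this (a sketch):

```python
from collections import OrderedDict

features = OrderedDict(('f%d' % i, None) for i in range(1000000))

# There is no O(1) "what position is this key at?" -- it's back to a linear scan
index_of_key = list(features.keys()).index('f1234')
print(index_of_key)   # 1234, but it took O(n) to find
```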

The solution

What can we do about this? Enter my favorite thing ever, defaultdicts with lambdas:
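
The trick, sketched below: give the defaultdict a factory that returns the dictionary’s current length. The first time an unseen feature is looked up it is automatically assigned the next index, and every later lookup is an ordinary O(1) dict lookup:

```python
from collections import defaultdict

# Each brand-new key gets the next index; existing keys are plain O(1) lookups
feature_index = defaultdict(lambda: len(feature_index))

print(feature_index['f0'])   # 0  (first time seen, index assigned)
print(feature_index['f2'])   # 1
print(feature_index['f0'])   # 0  (already known)
```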

Assigning items values in addition to the index is fairly straightforward with a slightly modified lambda:
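
For example (my own variation — storing an [index, value] pair per feature rather than just the index):

```python
from collections import defaultdict

# Each new key gets [next_index, default_value]; the value can be updated later
feature_index = defaultdict(lambda: [len(feature_index), 0])

feature_index['f0'][1] = 1    # mark f0 as present
feature_index['f2'][1] = 1

print(feature_index['f0'])    # [0, 1]
print(feature_index['f2'])    # [1, 1]
```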

 Limitations

This is a fun fix, but it doesn’t support full dictionary functionality – deleting items won’t reorder the index, and you can’t easily iterate through it in order. However, since creating this ARFF file needs neither deletions nor ordered iteration, that’s not a problem.

Teaching Python at Harvard with Software Carpentry


Mike teaching Hamlet in Python. Photo copyright Chris Erdmann: https://twitter.com/libcce/status/371281901191196672

I’m part of an organization called Software Carpentry in NYC. It uses volunteers to teach programming at varying levels to universities, large governmental organizations and other interested groups of people. I previously taught at Columbia, and this past weekend the workshop was held at Harvard, organized by Chris Erdmann, the head librarian at the Harvard-Smithsonian Center for Astrophysics.

Before Software Carpentry, my teaching experience was limited to explaining aspects of programming to friends and family, as well as part of a year spent teaching English and French to children and adults in Japan. Teaching is hard. It’s very easy to be critical of a teacher – I’ve often found myself being so without thinking about the effort and stress behind conveying a complex concept to a group of students all with varying backgrounds and motivations. I’ve come up with a few conclusions about how to optimize teaching style from my last 2 SWC events:

Saturday’s Teacher line-up

Things that worked well

  • Humor. Mike sprinkled his tutorial with funny anecdotes which kept the class very lively.
  • Relevant and interesting subject matter. Hamlet was a good choice, as was the theme of cheating at Scrabble, given the librarian-oriented audience. The dictionary brought up several amusing entries for searches like:  grep ".*s.*s.*s.*s.*s.*s" words | less
  • Adding anecdotes to save people googling things. I reckon that a large part of any programmer’s activity is simply finding someone who’s done what you want to do before and slightly modifying things – or connecting up the building blocks. So at the end of talking about the benefits of things like append()  vs concatenating with plus signs like first+second , I mentioned things like deque()  and  format() .

Things to remember for next time

  • Typing SLOWLY. I work a lot with MongoDB, so I end up typing from pymongo import Connection; c = Connection()  20+ times a day into the terminal. This has become so fast that things like it can seem bewildering to newcomers.
  • Using a high contrast terminal with large font and dimmed lights, to make it super easy to see from the back of the room.

What can advanced programmers get out of teaching such basic things?

  • You’ll learn a lot from the other instructors and from students’ questions
  • Community involvement is a great asset on your resume and shows potential employers that you have the ability/drive to train future co-workers
  • It helps to have on-hand analogies and anecdotes developed during teaching when explaining technical matters to non-technical people, socially or business-wise.
  • You’ll meet many like minded people and it feels great to get involved in the community.

What did I learn?

  • The requests library. I normally use urllib2 to grab HTML from web pages. urllib2, it turns out, is simply a lower-level, more extensible library for HTTP requests, as shown in this stackoverflow explanation.
  • More about Git. I use SVN at work and thus don’t really submit anything to GitHub. Git is HARD. Erik was an excellent instructor and calmly went from the basics right through to the minutiae of things like .gitignore and diff.
  • What “immutable” really means. I hear this thrown around quite a lot, and it basically just means that an object can’t be changed in place once it has been created. E.g. myString.split()  doesn’t modify myString – it returns a new object, which you have to assign to something if you want to keep it. Very simple.

When literal_eval fails: using asterisk notation to read in a datetime object

One great function in Python is the ast  (Abstract Syntax Tree) library’s literal_eval . This lets you read in a string version of a Python datatype:
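
For example (a small sketch in a Python shell):

```python
>>> from ast import literal_eval
>>> literal_eval("{'name': 'python', 'founded': 1991, 'tags': ['dynamic', 'typed']}")
{'name': 'python', 'founded': 1991, 'tags': ['dynamic', 'typed']}
```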

Importing a dictionary such as this is similar to parsing JSON using Python’s  json.loads decoder. But it also comes with the shortcomings of JSON’s restrictive datatypes, as we can see here when the dictionary contains, for example, a datetime object:
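
Roughly like this (the exact error message varies between Python versions):

```python
>>> from ast import literal_eval
>>> literal_eval("{'ts': datetime.datetime(2013, 8, 10, 21, 46, 52, 638649)}")
Traceback (most recent call last):
  ...
ValueError: malformed node or string: <_ast.Call object at 0x...>
```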

So you might try and write some code to parse the dictionary data-type yourself. This gets very tricky, but eventually you could probably accommodate all the common data-types:

But this still doesn’t truly fix our datetime object problem:

Which is where we get to the crux of this post. I thought at first that I could deal with datetime’s formatting by extracting the arguments of  datetime.datetime(2013, 8, 10, 21, 46, 52, 638649) as a tuple by spotting the brackets, then feeding the tuple back into datetime, like:
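
That is, something like this (the error message is version-dependent, but the call fails either way because datetime wants separate integer arguments, not one tuple):

```python
>>> import datetime
>>> x = (2013, 8, 10, 21, 46, 52, 638649)
>>> datetime.datetime(x)
Traceback (most recent call last):
  ...
TypeError: an integer is required
```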

But apparently not. The tuple must be unpacked – not with a lambda or a list comprehension, but with asterisk notation:
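
Continuing the shell session above:

```python
>>> datetime.datetime(*x)
datetime.datetime(2013, 8, 10, 21, 46, 52, 638649)
```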

Asterisk ( * ) unpacks an iterable such as x into positional arguments for the function. Simple!

Using ACORA to process hundreds of stopwords at once

“80% of data analysis is spent cleaning data, 20% of it is spent complaining about cleaning data” – Chang She and BigDataBorat

This is one of the best quotes I heard at PyData 2013. When dealing with huge amounts of data, usually only a fraction of it is relevant to your analysis, and it can be a total pain trying to clean it. But this is also an essential stage, so let’s make it as painless as possible.

One example is with gigantic log files. Say we’re dealing with multi-terabyte Apache log files, as follows:
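
Something in the standard combined log format — the lines below are invented purely for illustration:

```
66.249.73.135 - - [10/Aug/2013:21:46:52 -0400] "GET /about.html HTTP/1.1" 200 7856 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
10.0.0.14 - - [10/Aug/2013:21:46:53 -0400] "GET /favicon.ico HTTP/1.1" 404 209 "-" "Mozilla/5.0"
```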

This is useful data with thousands of lines, and we’d like to analyze it using the big file processing script I mentioned before. However, there are certain lines that you’re not concerned about – so you can write a simple conditional:
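
A sketch, using 'favicon.ico' as a stand-in for whatever you want to filter out (the original snippet isn’t shown):

```python
# Skip any line containing the one keyword we don't care about
with open('access.log') as logfile:
    for line in logfile:
        if 'favicon.ico' not in line:
            print(line.rstrip())   # ...or whatever your analysis actually does
```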

What if you have 2 things that you don’t want in each line?

What if you have 3 things that you don’t want in each line?
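
You just keep bolting on conditions (again a sketch, with invented keywords):

```python
# Each extra keyword means yet another full scan over the same line
with open('access.log') as logfile:
    for line in logfile:
        if ('favicon.ico' not in line
                and 'robots.txt' not in line
                and '/healthcheck' not in line):
            print(line.rstrip())
```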

But this is getting super inefficient and a bit silly. Each extra keyword requires yet another pass through the line, so with this code basically every line is a worst-case scenario.

Bring on ACORA!

ACORA is Stefan Behnel’s library based on the Aho-Corasick string matching algorithm. Without diving too deep into the maths behind it, it basically compiles all the stopwords you have into a single über-stopword, meaning one scan of this stopword over your log-file line will check for all stopwords. For example:
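
A small sketch of the acora API as I understand it — build a matcher from your keywords once, then scan each string with it:

```python
from acora import AcoraBuilder

# Compile all the stopwords into a single Aho-Corasick automaton
builder = AcoraBuilder('favicon.ico', 'robots.txt', '/healthcheck')
matcher = builder.build()

line = 'GET /favicon.ico HTTP/1.1'
print(matcher.findall(line))   # [('favicon.ico', 5)] -- (keyword, position) pairs
```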

But how do we integrate this into the line scanner from before? Just like this!
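
Continuing the sketch from above:

```python
from acora import AcoraBuilder

stopwords = ['favicon.ico', 'robots.txt', '/healthcheck']
matcher = AcoraBuilder(*stopwords).build()

with open('access.log') as logfile:
    for line in logfile:
        if not matcher.findall(line):   # one pass checks every stopword at once
            print(line.rstrip())
```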

We’ve replaced the entire multiple stopword matching for-loop with a single ACORA matcher.

A note on performance

ACORA is fantastic, but performance may dip if there are only a few stopwords, or only a few lines. It has best performance when you have about 20+ stopwords and at least 1000 or so log file lines to scan through.

Extracting TLDs (top-level domains) – and weird quirks that will upset you

I’ve been using John Kurkowski‘s excellent Python domain extraction library “tldextract” recently. TLDextract can extract the domain name from a URL very easily, for example:
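
A quick sketch in a Python shell (the URL is just an example; note that older versions of tldextract called the suffix attribute tld):

```python
>>> import tldextract
>>> ext = tldextract.extract('http://forums.news.cnn.com/some/page.html')
>>> ext.subdomain, ext.domain, ext.suffix
('forums.news', 'cnn', 'com')
>>> ext.domain
'cnn'
```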

Why is this useful?

This has many applications – for example, if you want to create a summary of the top domains linking to your site, you might have a very large list of referring URLs:

And you could write some simple code to output the domain:
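
Something like this (the referrer URLs are invented for illustration):

```python
import tldextract

referrers = [
    'http://www.reddit.com/r/programming/comments/abc123/',
    'http://t.co/xyz',
    'http://forums.news.cnn.com/discussion/42',
]

for url in referrers:
    ext = tldextract.extract(url)
    print(ext.domain + '.' + ext.suffix)   # reddit.com, t.co, cnn.com
```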

And use the word frequency calculator from my previous post to compile a list of the top referring domains! See that I’ve modified line 10 to instead add the domain as the key:

Which returns:

Why can’t you just take whatever comes before the third slash and split it by fullstops?

This is what I tried to do at the start:
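
Roughly this (a sketch of the naive approach):

```python
# Take the host part (before the third slash) and keep the last two dotted chunks
url = 'http://i.imgur.com/gallery/abc123'
host = url.split('/')[2]                  # 'i.imgur.com'
domain = '.'.join(host.split('.')[-2:])   # 'imgur.com' -- but breaks on e.g. .gov.uk
print(domain)
```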

But since the domain name system is a miasma of top-level domains (e.g. .com), second-level domains (e.g. .gov.uk), standard subdomains (e.g. i.imgur.com) and people with too many fullstops (e.g. www.dnr.state.oh.us), this becomes much more tricky, and it is impossible to accommodate everything yourself. So tldextract actually maintains a local copy of Mozilla’s list of ICANN domains on your system, downloaded from:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

And basically finds matches on the ends of URLs from that. Very nice!

So what’s the problem mentioned in the title?

Unfortunately, the caveat of using Mozilla’s list is that you get some seemingly odd behavior. There are a bunch of sites and companies who have requested that their subdomains be treated as TLDs and are included in the list – from Amazon:

To DynDNS stuff:

And more… So you’ll trip up if you put in something like:
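
For example, something like this — the exact output format depends on your tldextract version, but the suffix comes back as the whole Amazon entry:

```python
>>> import tldextract
>>> ext = tldextract.extract('http://mybucket.s3.amazonaws.com/some/key')
>>> ext.suffix
's3.amazonaws.com'
>>> ext.domain
'mybucket'
```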

Rather than the expected “.com” as the tld.

Succinct way to build a frequency table of a Python iterable

This is an interesting and often tackled task in programming, and especially prevalent in NLP and Data Science. Often one has a list of “things” with many repeats and wants to know which ones are the most popular.

Data Examples

or:

Which is the most popular number? Which is the most popular letter?

The Code
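
The original snippet isn’t shown here; collections.Counter is one succinct way to do it (the data below is just a placeholder, not the post’s original examples):

```python
from collections import Counter

data = [5, 3, 5, 1, 5, 2, 3]   # any iterable works: a list, a string, a generator...
freqs = Counter(data)

# most_common() gives (item, count) pairs, most frequent first
for item, count in freqs.most_common():
    print(item, count)
```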

For our data, we now get:

So 5 is the most popular item, with 9 appearances.

So the space is the most popular, followed by a, e and t.