Avoiding multiple reads with top-level imports

Recently I’ve been working with various applications that require importing large JSON definition files which detail complex application settings. Often, these files are required by multiple auxiliary modules in the codebase. All principles of software engineering point towards importing this sort of file only once, regardless of how many secondary modules it is used in.

My instinctive approach to this would be to have a main handler module read in the file and then pass its contents as a class initialization argument:
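A minimal sketch of that argument-passing approach might look like this. The class names Module1 and Module2 and the settings keys are hypothetical stand-ins, and the JSON is inlined as a string rather than read from a file so that the example runs on its own.

```python
import json

# Inlined stand-in for a large JSON settings file (hypothetical contents).
RAW_SETTINGS = '{"timeout": 30, "retries": 3}'


class Module1:
    """Hypothetical auxiliary class that needs the settings."""

    def __init__(self, settings):
        self.settings = settings


class Module2:
    """A second hypothetical consumer of the same settings."""

    def __init__(self, settings):
        self.settings = settings


def main():
    # The main handler parses the settings exactly once...
    settings = json.loads(RAW_SETTINGS)
    # ...and hands the same dict to every class that needs it.
    return Module1(settings), Module2(settings)
```

Both instances end up holding a reference to the same parsed dict, so the file is only read once, at the cost of threading an extra argument through every constructor.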

The problem with this is that if you have an elaborate import process, and multiple files to import, it could start to look messy. I recently discovered that this multiple initialization argument approach isn’t actually necessary.

In Python, you can actually import the same settings loader module in the two auxiliary modules (module1 and module2), and Python will only load it once:

Now when we test this out in the terminal:

Despite import settings_loader appearing twice, Python only executed the module once: the first import runs it and caches the result in sys.modules, and every later import reuses that cached module. This is extremely useful but could also cause headaches if you actually wanted to load the file twice. If so, I would put the settings loading inside the __init__() of each ClassX and instantiate it twice.
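To see the caching for yourself without setting up real files, here is a rough, self-contained sketch. It writes a throwaway module (the name settings_loader_demo is made up for this example) into a temporary directory, imports it twice, and counts how many times the module body actually ran.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "settings_loader_demo"  # hypothetical module name


def import_twice():
    """Import a throwaway module twice; return (same_object, times_executed)."""
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp, "loads.log")
        # The module body appends a line to a log file every time it executes.
        source = (
            f"with open({str(log)!r}, 'a') as fh:\n"
            "    fh.write('loaded\\n')\n"
            "SETTINGS = {'debug': True}\n"
        )
        Path(tmp, MODULE_NAME + ".py").write_text(source)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            first = importlib.import_module(MODULE_NAME)
            second = importlib.import_module(MODULE_NAME)  # served from sys.modules
            executions = len(log.read_text().splitlines())
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)
        return first is second, executions
```

Both imports hand back the very same module object, and the log only records a single execution. If you ever do need a fresh load, importlib.reload() re-executes a module that has already been imported.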

Mocking out an API call deep in your code

With any actively developed (Python) coding project, you and your team are going to be running the same set of tests sometimes hundreds of times per week. If there’s an HTTP request to any 3rd-party source in there, this can cause problems. API calls can be expensive, excessive scraping of the same source can cause IP blacklisting, and the calls could just slow down your whole test process, adding extra baggage to the code deployment process.

To fix this, we can use Python’s mock library. Mock is really useful for creating fake function calls, fake Classes and other fake objects which can return fake values. In most cases when testing, you are really just testing how the application parses data rather than the reliability of the 3rd party service. The API’s response is generally the same. Mock can let you simulate the API’s response and parse its data rather than actually have to make the call each time.

It’s quite tricky to set up so I thought I would write a tutorial. The situation set up has a few components but I’ll try and explain it as well as possible. Let’s say there is a service that provides some useful API response. There’s a site, HTTPBin, set up by Kenneth Reitz to test HTTP libraries, which we will use here. Check out: https://httpbin.org/ip. The content is as follows:

Let’s say our program wants to grab the IP address in the origin field. Yup – a fairly pointless program but this will be analogous to many situations you’ll encounter.

Here’s a totally over-engineered class to get data from this service. When the class is initialized in __init__, it creates a base_url variable pointing to HTTPBin. The main handler function is the get_ip function, which simply grabs that field’s content. This first makes a call to api_call which uses requests.get to grab that HTTP data.

To run this code, it’s simply (in a Python shell):

What if we want to mock out requests.get? The Mock module documentation is quite unclear on how to target a specific function deep within a class. It turns out the easiest way to do this is not MagicMock or return_value but instead to use the counter-intuitively named “side_effect” feature. This is the testing module pre-mocking:

As you can see, this is a standard set of tests to check that the ip_grabber function returns a valid-ish IP address. It is run as follows:

However, the problem here is that it is going to call the actual API each time you run the tests. To stop this, let’s integrate the mock module:

Here we’ve:

  1. Imported the mock module. Note: if you get an error about “wraps” in the “six” module then it is almost certainly because you have more than one installation of six or mock and one needs to be deleted.
  2. Created a fake function fake_get to replace requests.get with. This actually returns just “123” for now so you can see how it makes the test fail below.
  3. Added the mock.patch wrapper around the test_ip_grabber function. Very important here is specifying the function name as it is imported in my_module, NOT as it appears in the requests library; i.e. we are doing “my_module.get” rather than “requests.get”. The side_effect= then says to replace that with whatever function we want.
  4. Accepted the mock as an extra argument to the test function. mock.patch passes the mock it creates (the one wrapping the fake function given in side_effect) into the decorated function, so the signature has to make room for it.

Running this, we get:

Mock’s side_effect has replaced requests.get 🙂 To make this pass, just replace  return {'origin': '123'} with return {'origin': ''}  and run again:

Tests pass and zero HTTP traffic! 🙂
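For reference, here is a compact, standard-library-only sketch of the same idea. To keep it runnable without installing anything, it swaps requests.get for urllib.request.urlopen, so the class below is my stand-in rather than the exact code from this post; the names IPGrabber, FakeResponse and fake_urlopen are made up for the example, and the IP address comes from a range reserved for documentation.

```python
import json
import urllib.request
from unittest import mock


class IPGrabber:
    """Stand-in for the over-engineered class described above."""

    def __init__(self):
        self.base_url = "https://httpbin.org/ip"

    def api_call(self):
        # The real network call that we do not want to make during tests.
        with urllib.request.urlopen(self.base_url) as response:
            return json.loads(response.read().decode("utf-8"))

    def get_ip(self):
        return self.api_call()["origin"]


class FakeResponse:
    """Mimics just enough of the urlopen() result for api_call()."""

    def __init__(self, payload):
        self._body = json.dumps(payload).encode("utf-8")

    def read(self):
        return self._body

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


def fake_urlopen(url):
    # Canned response: an address from the documentation-only range.
    return FakeResponse({"origin": "203.0.113.7"})


def grab_ip_offline():
    # side_effect makes every call to the patched function run fake_urlopen.
    with mock.patch("urllib.request.urlopen", side_effect=fake_urlopen) as patched:
        ip = IPGrabber().get_ip()
    return ip, patched.call_count
```

The patch target works here because api_call looks up urllib.request.urlopen at call time. With a from-import such as the one in my_module, the name to patch is the one inside the importing module, which is exactly why the test above targets “my_module.get”.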

Fixing a Django/Vagrant error in ifup/ifup-eth

I normally test out a django project using a local vagrant instance. Vagrant creates a virtual machine running the django project which instantly recognizes changes to code. However, I was running a test recently and the VM suddenly started outputting error messages as follows. Note: I’ve redacted some parts of these terminal outputs.

(Scroll to the bottom of this post to skip to the solution)

No translation files? I’ve never heard of them. Thinking it could be a random error, I tried to run the test again and got:

I tried logging out of the instance and got an I/O error:

Attempting to ssh into the VM again is refused:

Sometimes the “turning it off and on again” solution can work, so let’s try vagrant reload:

A timeout error this round. The error seems related to authentication, but it’s not the whole story. The VM is running according to global-status, despite not being correctly set up, which is a bit strange.

I also run a GUI called VirtualBox which is sometimes handy for visualizing all VMs on your laptop. Checking there, it also seems to be present – so global-status wasn’t lying to us. Let’s try halting it:

A forced shutdown this time. So far debugging this error isn’t going well. It is totally strange because the VM was working perfectly beforehand. Vagrant up after this produces the same timeouts:

Let’s try destroying the vagrant instance completely….

And starting from scratch:

That’s a very weird error. Perhaps something failed randomly?

Apparently it still created the VM. SSHing into it gives this:

Bizarrely there are no files! Something still didn’t start up properly.

At this point I was getting a bit frustrated and started to Google the error. The most relevant blog post I could find was Mike Berggren’s solution here: http://mikeberggren.com/post/100289806126/if-up. He fixed this issue and reports: “I’ll spare you the rest of the gory details but suffice it to say, we eventually circumvented that check and ran that same command from the host. It came back with another MAC address claiming ownership“. Does this mean he literally commented out the lines of code that make that check? He may have had a very different problem and I’m no dev-ops expert but perhaps there’s a better solution – maybe that check is there for a reason.

Let’s look back at that network error mentioned before:

Cat the Vagrantfile and it reveals the same IP address. This means that this particular Vagrant VM is always assigned that IP address. If this is the only instance, how is it that something else on this virtual network is using it? Something inside my Mac is conflicting with it.

We can check to see what’s really running (aside from what global-status and VirtualBox say) by checking running processes:

Aha! There are actually 2 of them. Let’s destroy the non-functioning blank VM that we created before and see if anything has changed:

Yup, it has gone:

Use kill -9 [pid] to remove it.

Now let’s try recreating the VM:

It works!


  • I’m still not sure why this happened in the first place. The errors at the start came totally out of the blue – I’d been running that VM for several weeks without issues. Perhaps it was the fact that I’d been running it for so long?
  • Vagrant global-status and VirtualBox don’t seem to report running VMs entirely accurately, so definitely check all running vbox processes using ps and remove any extras that didn’t shut down properly. This reminds me of the --prune option in global-status (https://www.vagrantup.com/docs/cli/global-status.html) which can fix persistent old entries.
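If you find yourself doing this check often, it could be scripted. The sketch below is only an illustration: it assumes the VMs run headless (so the process is called VBoxHeadless) and that ps aux is available, and the function names are mine.

```python
import subprocess


def parse_vbox_pids(ps_output, marker="VBoxHeadless"):
    """Return the PIDs of lines in `ps aux` output that mention the marker."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # In `ps aux` output the PID is the second whitespace-separated column.
        if marker in line and len(fields) > 1 and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids


def running_vbox_pids():
    """Ask ps for all processes and pick out the VirtualBox VM ones."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return parse_vbox_pids(result.stdout)
```

Comparing the number of PIDs returned by running_vbox_pids() with the number of VMs you expect to be running is a quick way to spot a leftover process before reaching for kill.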

Big thanks to Tony (http://blog.tonns.org/) for helping work this issue out!

Extending MongoDB findOne() functionality to findAnother()

Given a new MongoDB instance to work with, it’s often difficult to understand the structure since, unlike SQL, the database is inherently schema-less. A popular initial tool is Variety.js, which finds all possible fields and tells you the types, occurrences and percentages of them in the db.

However, it is often useful to see the contents of the fields as well. For this, you can use findOne(), which returns a single record from the collection (the first match in the collection’s natural order) and pretty-prints it to the terminal. This is great, but sometimes you want to see more than one record to get a better feel for field contents. In this post I’ll show you how I extended Mongo to do this.

(Assuming you’ve already installed mongo and git – I use  brew install mongo from the homebrew package manager and git came with XCode on my Mac)

First off, let’s get a copy of mongo’s code:

This will have created a folder in your Documents, so let’s inspect it:

There are quite a lot of files there, so it will take a while to find where the findOne() function is located. Github’s repository search returns far too many results so let’s use a unix file search instead. The MongoDB server itself is written in C++, but shell helpers such as findOne() are written in JavaScript (and compiled into the shell), so we need to look for a function definition like findOne = function( or function findOne( . Try this search:

The two flags here are -l (lowercase L), which shows filenames, and -R, which searches all subdirectories. The full stop at the end means to search the current directory. You can see it found a javascript file there, “collection.js”. If you open it up in a text editor, the findOne() function is listed:

This code can also be found here on the mongodb github page: https://github.com/mongodb/mongo/blob/master/src/mongo/shell/collection.js#L207

Our function is going to extend findOne instead to find another record. This can be implemented by doing a “find” using exactly the same code, but then skipping a random number of records ahead. The skip amount has to be less than the number of records listed which unfortunately means we have to run the query twice. First to count the number of results, and second to actually skip some amount.

Copy the findOne function and rename it to findAnother, with these lines at the top instead:

  1. Gets a count of the records the query returns (stored in total_records)
  2. Generates a random number in a range from 1 to the count (stored in randomNumber)
  3. Queries again using that number as a skip (stored in cursor)

Generating random numbers in a range is a little obscure in JavaScript but I found a helpful tip on StackOverflow to do it: http://stackoverflow.com/a/7228322 in one line. I’m used to Python’s ultra simple random.randrange().
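Since I mentioned Python, here is the count-then-skip logic sketched in Python. This is not the shell patch itself: the list of dicts stands in for a collection, run_query is a made-up stand-in for find(), and I pick the skip from 0 up to count minus 1 so that it always stays below the number of matching records.

```python
import random
from itertools import islice


def run_query(documents, predicate):
    """Hypothetical stand-in for a find(): lazily yields matching documents."""
    return (doc for doc in documents if predicate(doc))


def find_another(documents, predicate=lambda doc: True, rng=random):
    # First pass: count how many records the query returns.
    total_records = sum(1 for _ in run_query(documents, predicate))
    if total_records == 0:
        return None
    # randrange(n) returns 0 <= skip < n, so the skip never overshoots.
    skip = rng.randrange(total_records)
    # Second pass: run the query again and skip ahead to the chosen record.
    return next(islice(run_query(documents, predicate), skip, None))
```

Calling find_another(docs) repeatedly should return different documents on different calls, and passing a seeded random.Random as rng makes the choice repeatable.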

Let’s test this out first. You’ll notice that all code can be returned in mongo’s javascript shell:

You can actually replace this code live in the terminal, though it won’t be saved once you close it. Try first replacing findOne with a Hello World:

You can test out our new function first in the terminal by copying everything after the first equals. Open a mongoDB shell to your favourite db and type db.my_collection.findOne =  then paste in the function. Try calling it and it should return different results each time.

Let’s patch mongodb now with our function. We have to compile mongo from the source we just downloaded.

  1. Save the collection.js file you just edited with the new findAnother() function
  2. Make sure you have a C++ compiler installed. I’m using gcc, which was installed using  brew install gcc
  3. Check how many cores your processor has. According to Google, my 2.7 GHz Intel i5 has 4 cores.
  4. In the same terminal window (still in the mongo folder), type: scons -j4 mongo and press enter. The 4 here is the number of parallel build jobs, which I set to the number of cores I have, and it seriously speeds things up. scons is a program that handles compilation, and putting mongo at the end of this specifies that we only want to build mongo’s client (the shell).

You’ll see a huge amount of green text as scons checks everything, and then:

It’s compiled! We now have to replace the existing mongo application with our modified version. Let’s do a unix find:

We want to replace the mongo application in /usr/local/Cellar/ (where brew installed it to). Let’s back it up then copy across:

Now, open up a MongoDB shell to your favourite DB:


  • This could be really slow since it has to run the query twice. However, typically I don’t add super taxing queries to findOne().
  • This function may be replaced if you reinstall/upgrade mongo with brew – they release new versions fairly frequently.


  • A function that acts like a python generator, keeping state and cycling forward until it reaches the end of the record set. This would fix any slowness above, but be less random.

Converting an HTML table to an Excel download HTTP Response: A hack for slow OLAP DB connections


Zenko (“Good fox” in Japanese) is a reporting system (see code on Github here) I’ve created over the last couple of weeks at Mozilla. Basically my non-technical coworkers were getting so frustrated by Tableau (“what the heck is the difference between INNER JOIN and OUTER JOIN?”) that I decided to create a simple dashboard interface for them.

It’s a simple bootstrap front-end to a database containing campaign stats for sponsored tiles. You can drill down to each tile or client/partner and pivot by things like locale, country and date.

Zenko’s stack (high to low)

A new feature

When loading one of the analyses pages, a table will be shown. My coworker wanted to be able to download the data to Excel. I came up with 4 possible ways to implement this:

  1. Simply rerun the query, format the results as a csv on the backend, save it and window.open() the file location.
  2. Automatically save the data from each analysis request and periodically clear old files.
  3. Use a javascript library like ExcelBuilder
  4. Send the data back to the server, format it, and then back to the client via an iframe

Which is the best solution?

  1. This is problematic because our sticking point is the query speed. The redshift database is an OLAP Column Oriented database, and append-only. This means that it is insanely fast to add data to, but quite slow (often 6+ seconds) to query. Yes, it is dealing with billions of rows so that’s excusable, but it’s not so great in terms of user experience to wait so long. The user doesn’t want to wait another 6 seconds for the analysis to rerun when they have the data already.
  2. This sounds like it could just end up storing a lot of data on the client, but it could work quite well. In terms of security though, I’m not sure that the data should be lingering on the user’s PC unrequested.
  3. This didn’t work out so well – in Firefox, the file is incorrectly named. In the future, I’d like to name the files according to the parameters of the analysis e.g. <client>-<date>-<country>.xls
  4. This is the weirdest solution, but it works! Flask is running locally so it is actually very fast. There are no huge JQuery/JavaScript complications with file permissions and the fact that you can manipulate the data easily on the server is nice too.

Solution 4

The process is as follows when the “Download for Excel” button is clicked:

  1. Reference the HTML table using JavaScript and convert it to an array of arrays
  2. Append an iframe to the DOM
  3. Append a form with a POST action and hidden field to the iframe
  4. Insert the table contents into the hidden field’s value
  5. Submit the form
  6. Let Flask receive the POST request and format the information as a CSV
  7. Return an HTTP response with a file attachment containing the CSV

Let’s implement it

There were various ways to do this in JQuery with iterable.each()  but I ran into complications and simply referencing cells using .children was much easier.

The (locally running) Flask app will then receive a POST request at /download_excel . Let’s set up the route:
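The core of such a route is just turning the posted table into CSV text and attaching the right headers. Here is a standard-library-only sketch of that part, deliberately without Flask so it runs anywhere. It assumes the hidden field carries the table as a JSON array of arrays, which is my assumption for the example, and the function name is made up.

```python
import csv
import io
import json


def table_to_csv_response(form_value, filename="data.csv"):
    """Turn a JSON array-of-arrays string into (csv_body, response_headers)."""
    rows = json.loads(form_value)
    buffer = io.StringIO()
    # csv.writer takes care of quoting cells that contain commas or quotes.
    csv.writer(buffer).writerows(rows)
    headers = {
        "Content-Type": "text/csv",
        # This header is what makes the browser offer a download dialog.
        "Content-Disposition": f'attachment; filename="{filename}"',
    }
    return buffer.getvalue(), headers
```

Inside a web framework, the returned body and headers would be wrapped in whatever response object the framework provides.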

Now, when the user clicks the button:

Download Link for Excel

They instantly get:

Download popup

Sorry, I can’t show what it looks like in Excel because the data isn’t public at the moment. All code is however available here on github!

One bizarre thing, however, is that the form doesn’t appear in the inspector (in either Chrome or Firefox):

Invisible in the inspector

Though, you can access it with some fairly lengthy getters:


Future features 

  • The files could be named something more intuitive than data.csv  – perhaps a combination of various things seen in the URL’s query string
  • Accommodate for a table wider than 6 rows. This could be done easily by stringifying the array using a different delimiter such as a “###”.
  • Create an .xls file rather than a CSV, if there is any advantage

Using Web Workers in Firefox Extensions

Web Workers allow you to run code in the background in browsers such as Firefox. This is how to build one into a Firefox Extension, which is slightly different from just creating one as normal on a page. The documentation for doing this is basically non-existent, so hopefully you’ll find this useful.

Please make sure you have a development environment set up similar to the one described here in my previous post.

How do workers work?

  • Workers in /data/ are not directly connected to scripts in /lib/
  • However, they can communicate by sending messages to each other
  • These messages are text only, so could contain serialized JSON, but nothing else
  • You’ll notice below that we are basically just slinging messages between two scripts

The code for the worker

Navigate to the /data/ directory and create a file called hello_world.js

Now paste the following in there (new users of vim, press i  to start typing and Esc  followed by :wq  to save):

This says that whenever the worker receives a message from the client, it sends a message back with the word “Hello” prepended.

One note here: In workers, you can’t use the useful function  console.log("message") , instead use  dump("message")

Let’s call the worker from the main code

Let’s navigate back to the /lib/  folder and edit the main.js  file, which is the first thing that runs in the extension.

Paste in the following code:

And run  cfx run . You’ll notice a messy error:

Aha! The key line here is:  ReferenceError: Worker is not defined . This is because Firefox Extensions use something called a ChromeWorker instead. We need to import this in main.js by pasting this at the top:

and changing the line that references the hello_world.js file to call a ChromeWorker instead:

Ok let’s try running it again! Try  cfx run . Wtf another error?!

The key line here is:  Malformed script URI: hello_world.js . This cryptic error is because firefox can’t yet access anything in the /data/  folder. We have to use another part of the SDK to enable access to it.

Open main.js  and put this at the top:

Now we can use the function self.data.url() . When you put a filename as the first argument, it will return a string like  resource://jid1-zmowxggdley0aa-at-jetpack/test/data/whatever_file.js which properly refers to it in the case of extensions. Modify the worker import line as follows:

Now let’s run the extension again using cfx run :

Yay it works! The Worker returned the message “Hello Matthew“.


  • What does this {notation}  mean?

It is shorthand for:

Basically this means that  require("chrome") returns an Object, and we just need the value that is referenced by the key “ChromeWorker”. This is a very succinct way of extracting things from JavaScript Objects that will come in handy in the future.

  • Why is Worker now called ChromeWorker? Are we doing something with Google Chrome?

This is a naming coincidence and nothing to do with Chrome as in the browser. Chrome in this case refers to Firefox Addon internals.

Setting up a development environment for Firefox Extensions

This is the method I use to create simple firefox extensions. This tutorial is a precursor to the next one which is about using Web Workers (i.e. allowing code to run on background threads).

Setting up the environment

We’re going to need the Firefox Addon SDK. This is a collection of python files that will let you run a test (optionally blank) version of Firefox. To download it:

Now extract it and remove the tarball:

Go to the directory and startup the special shell:

Now you can see that the shell has prepended (addon-sdk-1.17) in brackets to the prompt. This means that the window is probably half filled with text so we can reduce that with the command:

Much cleaner! 🙂

Setting up the extension template

Now that we have this special addon-sdk shell, navigate back to your documents and create a new folder for our extension.

This special shell has various useful commands included, which all look like  cfx xyz . For more about them see here. In this case we use  cfx init

Let’s inspect what was created:

  •  lib  contains a file called main.js  which is the main handler file for all extension code
  • data  is empty but can be used to store things like workers (which we will come to later) or large data files
  • test  can contain unit tests (quite hard to set up but useful for test driven development later)
  • package.json  contains metadata about the extension – version number, name of the creator, description, licensing etc

You can start writing code in main.js and it will run in the browser. Once finished, use  cfx run to test it!

See the next tutorial on how to write a firefox extension using web workers!

MongoDB aspirin

Basically this is a list in progress of common errors/tasks/gripes I get when using Mongo. I’ve noted down what usually works. Maybe you’ll find it useful 🙂

Why has my remote mongodb connection been refused?

    • Delete the mongod.lock  file from your main mongodb storage folder and restart [SO]
    • If on redhat, check  sestatus
    • Modify /etc/sysconfig/iptables   to have the correct firewall rules according to the mongodb docs.

How can you iterate through all MongoDB collections in pymongo?

I normally access collections as object attributes like: conn.Database.Collection.find_one() but actually databases and collections can be accessed as keys in a dictionary as well:
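A sketch of what that iteration can look like is below. It relies on pymongo’s list_database_names() and list_collection_names() methods plus the dictionary-style access described above. So that the example runs without a live server, I’ve included two tiny stub classes (StubClient and StubDatabase, both made up) that mimic just those calls; with a real connection you would pass your client object instead.

```python
def iter_collections(client):
    """Yield (database_name, collection_name, collection) for every collection."""
    for db_name in client.list_database_names():
        database = client[db_name]  # dictionary-style database access
        for coll_name in database.list_collection_names():
            yield db_name, coll_name, database[coll_name]


class StubDatabase(dict):
    """Hypothetical stand-in for a database; maps names to collections."""

    def list_collection_names(self):
        return sorted(self)


class StubClient(dict):
    """Hypothetical stand-in for a client; maps names to databases."""

    def list_database_names(self):
        return sorted(self)


def demo():
    client = StubClient(shop=StubDatabase(orders=["o1"], users=["u1"]))
    return [(db, coll) for db, coll, _ in iter_collections(client)]
```

With a real client, each yielded collection object can then be queried as usual, for example with find_one().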

Why is mongod terminating whenever I close the shell? Even when using &  at the end

When starting mongod, use mongod --fork (note: fork must be right after the word mongod) and it will start as a background process instead. Or just add fork = true  to your config.

I just created an authenticated database and can’t even use show dbs !

Create a new user with all 4 of the following permissions: userAdminAnyDatabase, readWriteAnyDatabase, dbAdminAnyDatabase, clusterAdmin:

Shared Variables in Python Multiprocessing to pre-map/reduce

I’ve been using the multiprocessing library in Python quite a bit recently and started using the shared variable functionality. It can change something like this from my previous post:

Into a much nicer:

Thus eliminating the reduce stage. This is especially useful if you have a shared dictionary which you’re updating from multiple servers. There’s another possible shared datatype called Array, which, as it suggests, is a shared array. Note: One pitfall (that I fell for) is thinking that the "i" in Value("i", 0) is the name of the variable. Actually, it’s a typecode which stands for a (signed) integer.
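As a rough illustration of the shared counter pattern (not the exact code from this post), the sketch below has several worker processes adding to one Value. The function names are mine. Note the explicit lock around the increment: counter.value += 1 is a read followed by a write, so without the lock concurrent updates could be lost.

```python
from multiprocessing import Process, Value


def add_many(counter, times):
    """Increment the shared counter `times` times, holding its lock each time."""
    for _ in range(times):
        with counter.get_lock():
            counter.value += 1


def run_demo(workers=4, times=1000):
    # "i" is the typecode for a signed integer; 0 is the initial value.
    counter = Value("i", 0)
    procs = [Process(target=add_many, args=(counter, times)) for _ in range(workers)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    # No reduce step needed: every worker wrote into the same shared integer.
    return counter.value
```

When running this as a script, call run_demo() from under an if __name__ == "__main__": guard, since some platforms start worker processes by re-importing the main module.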

There are other ways to do this, however, each of which has its own trade offs:

# | Solution | Advantages | Disadvantages
1 | Shared file | Easy to implement and access after | Very slow
2 | Shared mongoDB document | Easy to implement | Slow to constantly query for it
3 | Multiprocessing Value/Array (this example) | Very fast, easy to implement | On 1 PC only, can’t be accessed after process is killed
4 | Memcached Shared Value | Networked aspect is useful for big distributed databases, shared.set() function is already available | TCP could slow you down a bit

A quick solution to OrderedDict’s limitations in Python with O(1) index lookups

Background to the Problem

I work regularly with gigantic machine learning datasets. One very versatile format, for use in WEKA, is the “ARFF” (Attribute Relation File Format). This essentially creates a nicely structured, rich CSV file which can easily be used in Logistic Regression, Decision Trees, SVMs etc. In order to solve the problem of very sparse CSV data, there is a sparse ARFF format that lets users convert sparse lines in each file such as:

f0 f1 f2 f3 fn
1 0 1 0 0

Into a more succinct version where you have a list of features and simply specify the feature’s index and value (if any):


{0 1, 2 1}

i.e. {feature-index-zero is 1, feature-index-two is 1}, simply omitting all the zero-values.
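As a small illustration, converting one dense row into that sparse form could be sketched like this (the function name is made up):

```python
def to_sparse_arff(row):
    """Format a dense row as a sparse ARFF data line, omitting zero values."""
    pairs = [f"{index} {value}" for index, value in enumerate(row) if value != 0]
    return "{" + ", ".join(pairs) + "}"
```

For the example row above, to_sparse_arff([1, 0, 1, 0, 0]) gives {0 1, 2 1}.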

The Implementation Problem

This is easy enough if you have, say 4 features, but what if you have over 1 million features and need to find the index of each one? Searching for a feature in a list is O(n), and if your training data is huge too, then creating the sparse ARFF is going to be hugely inefficient:

I thought I could improve this by using an OrderedDict. This is, very simply, a dictionary that maintains the order of its items – so you can pop() items from the end in a stack-like manner. However, after some research on StackOverflow, I was disappointed to find that it doesn’t contain any efficient way to calculate the index of a key:

The solution

What can we do about this? Enter my favorite thing ever, defaultdicts with lambdas:
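A minimal sketch of the trick, with made-up names: the default factory returns the current size of the dictionary, so each new key is assigned the next free index the first time it is looked up, and every later lookup is an ordinary O(1) dictionary access.

```python
from collections import defaultdict

# The lambda runs only for missing keys, before the key is inserted,
# so len(feature_index) is exactly the next unused index.
feature_index = defaultdict(lambda: len(feature_index))


def index_of(feature):
    """Return the feature's index, assigning a new one on first sight."""
    return feature_index[feature]
```

One thing to keep in mind is that merely looking up an unknown key with square brackets inserts it, so use the in operator or .get() when you only want to check whether a feature exists.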

Assigning items values in addition to the index is fairly straightforward with a slightly modified lambda:
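One possible shape for that modified lambda is sketched below; the dictionary layout and names are my own choice for illustration. Each new key gets a small record holding its index (again the dictionary’s size at the time of insertion) alongside a value slot.

```python
from collections import defaultdict

# Each missing key gets a record with the next free index and a default value.
features = defaultdict(lambda: {"index": len(features), "value": 0})


def set_value(name, value):
    """Store a value for the feature, creating its index on first use."""
    entry = features[name]
    entry["value"] = value
    return entry
```

Updating an existing feature changes its value but leaves its index untouched.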


This is a fun fix, but it doesn’t support full dictionary functionality – deleting items won’t reorder the index and you can’t easily iterate through it in order. However, since creating this ARFF file needs no deletions or iteration, that’s not a problem.