Category Archives: Big Data

Converting an HTML table to an Excel download HTTP Response: A hack for slow OLAP DB connections


Zenko (“Good fox” in Japanese) is a reporting system (see code on Github here) I’ve created over the last couple of weeks at Mozilla. Basically my non-technical coworkers were getting so frustrated by Tableau (“what the heck is the difference between INNER JOIN and OUTER JOIN?”) that I decided to create a simple dashboard interface for them.

Its a simple bootstrap front-end to a database containing campaign stats for sponsored tiles. You can drill down to each tile or client/partner and pivot by things like locale, country and date.

Zenko’s stack (high to low)

A new feature

When loading one of the analyses pages, a table will be shown. My coworker wanted to be able to download the data to Excel. I came up with 4 possible ways to implement this:

  1. Simply rerun the query, format the results as a csv on the backend, save it and the file location.
  2. Automatically save the data from each analysis request server request and periodically clear old files.
  3. Use a javascript library like ExcelBuilder
  4. Send the data back to the server, format it, and then back to the client via an iframe

Which is the best solution?

  1. This is problematic because our sticking point is the query speed. The redshift database is an OLAP Column Oriented database, and append-only. This means that it is insanely fast to add data to, but quite slow (often 6+ seconds) to query. Yes, it is dealing with billions of rows so excusable, but its not so great in terms of user experience to wait so long.The user doesn’t want to wait another 6 seconds for the analysis to rerun when they have the data already.
  2. This sounds like it could just end up storing a lot of data on the client, but it could work quite well. In terms of security though, I’m not sure that the data should be lingering on the user’s PC unrequested though.
  3. This didn’t work out so well – in Firefox, the file is incorrectly named. In the future, I’d like to name the files according to the parameters of the analysis e.g. <client>-<date>-<country>.xls
  4. This is the weirdest solution, but it works! Flask is running locally so it is actually very fast. There are no huge JQuery/JavaScript complications with file permissions and the fact that you can manipulate the data easily on the server is nice too.

Solution 4

The process is as follows when the “Download for Excel” button is clicked:

  1. Reference the HTML table using JavaScript and convert it to an array of arrays
  2. Append an iframe to the DOM
  3. Append a form with a POST action and hidden field to the iframe
  4. Insert the table contents into the hidden field’s value
  5. Submit the form
  6. Let Flask receive the POST request and format the information as a CSV
  7. Return an HTTP response with a file attachment containing the CSV

Let’s implement it

There were various ways to do this in JQuery with iterable.each()  but I ran into complications and simply referencing cells using .children was much easier.

The (locally running) Flask will then recieve a POST request at /download_excel . Let’s set up the route:

Now, when the user clicks the button:

Download Link for Excel

They instantly get:

Download popup

Sorry, I can’t show what it looks like in Excel because the data isn’t public at the moment. All code is however available here on github!

One bizarre thing, however, is that the form doesn’t appear in the inspector (in either Chrome or Firefox):

Invisible in the inspector

Though, you can access it with some fairly lengthy getters:


Future features 

  • The files could be named something more intuitive than data.csv  – perhaps a combination of various things seen in the URL’s query string
  • Accommodate for a table wider than 6 rows. This could be done easily by stringifying the array using a different delimiter such as a “###”.
  • Create an .xls file rather than a CSV, if there is any advantage

Shared Variables in Python Multiprocessing to pre-map/reduce

I’ve been using the multiprocessing library in Python quite a bit recently and started using the shared variable functionality. It can change something like this from my previous post:

Into a much nicer:

Thus eliminating the reduce stage. This is especially useful if you have a shared dictionary which you’re updating from multiple servers. There’s another possible shared datatype called Array, which, as it suggests, is a shared array. Note: One pitfall (that I fell for) is thinking that the "i"  in Value("i", 0)  is the name of the variable. Actually, its a typecode which stands for “integer”.

There are other ways to do this, however, each of which has its own trade offs:

# Solution Advantages Disadvantages
1 Shared file Easy to implement and access after Very slow
2 Shared mongoDB document Easy to implement Slow to constantly query for it
3 Multiprocessing Value/Array (this example) Very fast, easy to implement On 1 PC only, can’t be accessed after process is killed
4 Memcached Shared Value Networked aspect is useful for big distributed databases, shared.set() function is already available TCP could slow you down a bit

Using ACORA to process hundreds of stopwords at once

“80% of data analysis is spent cleaning data, 20% of it is spent complaining about cleaning data” – Chang She and BigDataBorat

This is one of the best quotes I heard at PyData 2013. When dealing with huge amounts of data, often only a fraction of it is usually relevant to one’s analysis and it can be a total pain trying to clean it. But this is also an essential stage, so let’s make it as painless as possible.

One example is with gigantic log files. Say we’re dealing with a multi-terabyte apache log files as follows:

This is useful data with thousands of lines, and we’d like to analyze it using the big file processing script I mentioned before. However, there are certain lines that you’re not concerned about – so you can write a simple conditional:

What if you have 2 things that you don’t want in each line?

What if you have 3 things that you don’t want in each line?

But this is getting super inefficient and a bit silly. Each extra keyword requires yet another pass through the line. With this code basically everything is a worst case scenario.

Bring on ACORA!

ACORA is Stefan Behnel’s library based on the Aho-Corasick string matching algorithm. Without diving too deep into the maths behind it, it basically compiles all the stopwords you have into a single über-stopword, meaning one scan of this stopword over your log-file line will check for all stopwords. For example:

But how do we integrate this into the line scanner from before? Just like this!

We’ve replaced the entire multiple stopword matching for-loop with a single ACORA matcher.

A note on performance

ACORA is fantastic, but performance may dip if there are only a few stopwords, or only a few lines. It has best performance when you have about 20+ stopwords and at least 1000 or so log file lines to scan through.

Reading really large files in Python

There are many ways to read files in python, which is kinda un-intuitive and against the PEP-8 guidelines. However, this is the best:

It creates a generator which only reads one line at a time into memory. I’ve used this code to process multiple-gigabyte files with ease. Using  with is useful since it automatically closes the file after the code indented below it has finished.

Often people make the mistake of using:

  • for line in  – this loads the entire file at once and reads per character
  • for line in f.readlines()  – this loads the entire file into a list in memory