What comprises a great Data team?

Data teams are very hard to hire. Engineers are low in supply and high in demand, resumés are difficult to vet without time-consuming phone/coding screens, and often without FAANG’s resources it can be hard to find people as passionate as you are about your product. As the hiring manager, you may be lucky enough to afford the services of a professional recruiter. However, often the recruiters send you completely unsuitable candidates.

A very friendly recruiter I know in NYC and I had coffee the other day to discuss these industry issues. We agreed that it would be useful if we defined what makes up a great data team and what the roles actually mean.

Who should I hire first? And how do they work together?

  • Data Analyst
    • Your goal in hiring a data team is probably business insights and getting to a point at which your actions are largely data (rather than intuition) driven. For this, a competent Data Analyst is your starting point. Thankfully, Data Analysts are in plentiful supply. New grads with a passion for digging for insights and a quantitative or semi-quantitative (e.g. Econ) background shouldn’t be too tough to find. On the high end you can find those with experience in SQL and also relatively new BI tools such as Looker. If you have the budget, 75% of the engineering tasks that they will encounter can be solved with plug-and-play pipeline tools such as Segment. Most standard business insights – “what happened yesterday/last month?”, “what’s an underserved demographic in my customer base?” can be answered without an expensive Data Scientist.
  • Data Engineer
    • At some point, your databases and queries are going to become too slow. You’ll also want to scrape a data source or use an API that no-one has built a connector for in Segment. Perhaps you’ll want some custom lead scoring for sales, there’ll be a bunch of security issues or there’s some giant set of queries that you want to have refreshed every morning and notify you in Slack. Not having an programming background, your Data Analyst will feel out of their depth. This is when you need to hire a Data Engineer. Their expertise really lies around getting your datasets to the point at which they can be analyzed. If your tables are getting 50 million events per month then queries will start to take hours overnight and tables may even lock up. The Data Engineer can alleviate this – they will be able to optimize the indexes of your database’s tables for fast lookups, create materialized views of helpful aggregates that refresh every morning, connect custom APIs to your data warehouse and generally go above and beyond that which is covered by plug-and-play tools. 
  • Data Scientist
    • Now that you’re getting standard business insights, and getting them at scale from interesting sources you’re going to almost certainly be curious about the siren song of AI. What if you could predict which customer will churn/convert? What if you could cluster your customer base by behavior? This sort of next-level insight is all possible with a competent Data Scientist. They will use Machine Learning libraries in Python to try to automatically classify your customers and actions in your system. They can predict the future (with certain degrees of accuracy) and tell you stories about your data. This will involve using the large datasets that the Data Engineer has provided. Once these insights have been surfaced, they will have a trained model which can be saved to disk and refreshed. But what if this needs to be refreshed on a daily basis, and those insights added to a table that the Data Analyst or even Salesperson can read? Then you need to go back to the Data Engineer and have them set up a backend system to do this.
  • Data Science Engineer, Machine Learning Engineer
    • Sometimes, your (possibly more old-school) Data Engineer will be 100% pipeline and database admin focused and not particularly experienced at implementing the Python-based models your Data Scientist came up with. For this issue, you need a Data Science Engineer or Machine Learning Engineer. Their expertise is in deploying, scaling and refreshing the model that the Data Scientist came up with. The Data Scientist should be focussing entirely on probabilities, tweaking model parameters and confidence scores; your Data Analyst should be focussing on the higher level narrative, and your DS/ML Engineer can now take the model and make sure it delivers your insights quickly, and with clean, fresh, correct data.
  • Visualization Engineer
    • What if Looker doesn’t have the charts you want? What if you need to visualize a network? What if your data has 16 dimensions and you’ve exhausted all the color, size and shape options in your scatter plot? Then you need the very specialized role of Visualization Engineer to build you custom visualizations and think of better ways to surface the insights that the Data Analyst struggled with.
  • Business Intelligence Engineer
    • This is an ambiguous title – they could be a Data Science Engineer, or Data Engineer.

The Roles

Title Responsbilities Tools/Technologies
Data Analyst  Basic business insights Looker/Tableau, SQL, Excel, Redash
Data Scientist More complicated business predictions Jupyter Notebooks, Python, Scikit-Learn, Numpy, Scipy, Statsmodels
Data Engineer Getting “Big Data” to the point at which it can be analyzed as well as connecting custom data sources SQL, Python, Scrapy, Spacy, Selenium, Kafka, Airflow, Luigi, Spark, AWS, Redshift
Data Science Engineer or Machine Learning Engineer Implementing the Data Scientist’s models at scale SQL, Python, Scikit-Learn, Numpy, Scipy, AWS, TensorFlow
Visualization Engineer (Rare) Making beautiful graphics to augment everyone’s work Looker/Tableau, Javascript, D3.js, Highcharts, HTML

But why does everyone just focus on Data Science?

Data Science is a buzzword that some use to represent anyone who fits in the intersection of Data Engineering and Analytics. One of the biggest mistakes companies make is in hiring too many Data Scientists and not enough Data Engineers. Data Scientists are getting easier to hire recently as many former Mathematicians, Chemists, Statisticians and other quantitative grads have found they can either easily rebrand themselves or attend a 3-month Bootcamp and get hired in a junior role. However, their models and their insights can be limited as they don’t know how to scale/deploy them; and they will spend much of their time cleaning the datasets rather than deeply concentrating on the message and predictions hidden within. Of course, as someone who mostly now focuses on Data Engineering I am biased, but I would say that each Data Scientist should be paired with at least 1 Data Engineer. The reason I actually moved into Data Engineering from Data Science was really out of necessity for my own work.

I hope this guide was successful in explaining today’s data teams. Please comment or email me if you have any ideas about how to improve it.

Open sourcing DIMA – a handy tool for database monitoring

Data at Frame.io

At a growing startup like Frame.io, we have a huge volume of data coming in — over 67 million events per month. Efficient analysis is crucial to remain the market leader, and we use a whole host of tools:

  • GUIs like LookerRedash and PSequel
  • Internal analytics and query parallelization microservices
  • Custom AI slackbots we’ve built for spam/intrusion detection and sales lead spotting

This means that our data warehouse is being constantly hit by tons of queries — both on Cron schedules and from employees exploring our datasets.

Volume issues

Managing this volume of queries can be tough.

  • PostgreSQL tables can lock up — trying to swap/delete a table while someone else is querying it, for example, will cause them to lock up. Ideally, analytics tables should only ever be incrementally updated, but sometimes design constraints prevent this.
  • SQL GUIs can randomly time-out and its useful to quickly inspect the DB to see if they are still running or not.
  • Using a combination of screen and psql to run queries in the background requires monitoring. Dima is great for that.
  • An accidental SELECT without a LIMIT on a multi-million row table can crash a SQL GUI.
  • Sometimes a big query just isn’t what you wanted. Psequel doesn’t have a cancel button, and hosted GUIs like Redash and Looker have Cancel buttons that often become unresponsive for long running queries.


To help deal with this, I built a small command-line tool called dima. This originally stood for “Database Integrity Monitoring & Analysis”, but now its just really “dima”. It’s very straightforward and has turned out to be super useful, so I got the OK to open source it.

Dima is a Python script that basically lets you quickly summarize or inspect the pg_stat_activity internal table in Postgres via the psycopg2 library. If you want to kill a query it will call pg_terminate on the PID.

Check out https://github.com/Frameio/homebrew-dima.

Install using pip install dima-db or brew tap Frameio/homebrew-dima && brew install dima


dima — a summary of running queries

dima show [PID] —inspect a specific running query

dima rm [-f] Some PID or Keyword — kills query/queries

How we attribute signups at Frame.io

Attributing 800M clicks to 600K users

At Frame.io we use Segment to power our app usage analytics. Every action in both our app and on our site can be logged by making a call with an event name such as “signed-up” or “viewed-video”.

This data is essential for our A/B testing and Analytics. Combined with Segment’s great integrations with services like Clearbit, we can enrich profiles with company information, and then trace the first time a user accessed any of our services. This data lets us determine which blog authors drive the highest LTV customers, which prospect audience in YouTube generates the least sticky users, etc.

We leverage a variety of different channels for growth — drip email campaigns, Facebook, YouTube, Twitter, Google AdWords and they all have separate platforms, some of which don’t sync into our data warehouse. A system is needed to homogenize this information so we can analyze client, channel, campaign and medium all in one place.

Attribution is really about having a 360-degree view of a company’s market fit, and its ability to succinctly explain its service to the correct audience.

Planning an attribution system

Conceptually Attribution is quite simple — just find the first result in your database for the user’s ID and extract a UTM right? However, there’s a ton of considerations before you can get to that single row of data that shows where a user came from:

  • Even with great indexing, searching for thousands of users per day in millions of rows, and maintaining the partitions and indexes behind those rows can be a big load on the database.
  • When users first came to the site, they didn’t have user_ids — but rather anonymous_ids assigned by Segment. A user ID is only associated with a user once the user creates an account and the “identify” call is made with both an anonymous_id and a user_id. Sometimes we spot a user’s anonymous_id from over a year ago. Once we have a user’s global ID (assigned from our system), then we have to backfill all of the anonymous sessions prior to our ID being assigned.
  • We have to sessionize every set of user actions and aggregate attributes about their first/last session before signup. Sessionization is the process of aggregating discrete user events into runs of page loads with no more than a 20 minute gap in between. This enables us to do things like analyze the overall intent of a visit and attribute it in bulk. Re-sessionizing involves huge nested window functions and the alias table, as often the same user has several historical anonymous_ids to merge.
  • We will want to maintain some quick lookup tables for this to be useful — user session counts, useful sessions, maybe some aggregates to see how we are doing in a visualization tool like Looker.
  • Note also that Segment actually has a native attribution system with a snippet you can install; but it can’t look back to before you installed it — and we have data going back to 2014.

Implementation Stages

The system design settled on the following 8 query groups/stages:

  1. Find the latest timestamp in the existing sessionization data
    1. Get all the latest visit data from the Segment tracks table
    2. Merge this data with the most recent monthly partition, with some code to check that the columns are still the same (new fields could have been added).
  2. Sessionize the most recent partition (there could be overlap from the last 24 hours so make sure the gap is covered)
    1. Update the aliases table with any newly identified anon/user pairs.
    2. Go back through all historical sessions and re-alias — i.e swap the anonymous id for any newly identified user_ids .
    3. Re-sessionize all historical sessions, since adjacent sessions may have historically different anon IDs but now the same user_id
  3. Add an extra column to indicate if the user signed up in a session
    1. Merge the resulting sessions tables into a single all_sessions table
    2. Create a lookup table for useful sessions — that is the first session, and the signup session only per user
  4. Create a useful attribution table with things like UTM parameters pre-extracted and session counts
  5. Re-index any tables and check that the partitions are in order (especially over a new-month gap).

Table structure


  • aliasing
    • known_pairs
  • sessions
    • attribution
    • all_sessions
    • useful_sessions
  • sessions_stages
    • all_tracks_2018_01
    • all_tracks_2018_01_aliased
    • all_tracks_2018_01_re_merged
    • all_tracks_2018_01_signup
    • …etc
  • partitioning
    • app_tracks_2018_01
    • app_tracks_2018_02
    • app_tracks_2018_03
    • …etc

Columns in the final sessions.attribution table (many are just quick lookups for easy aggregation, and the table is heavily indexed):

  • alias_id
  • session_count_before_signup
  • signup_timestamp
  • blog_url
  • first_page_ever_visited
  • first touch (i.e. first interaction with any frame.io property)
    • timestamp
    • source, campaign, medium, referrer
  • last touch (i.e. last interaction before signup)
    • timestamp
    • source, campaign, medium, referrer
  • first attributed touch (as above, but with paid referrer)
    • timestamp
    • source, campaign, medium, referrer
  • last attributed touch (as above, but with paid referrer)
    • timestamp
    • source, campaign, medium, referrer

Deployment and monitoring

The system took about 1.5 months to get working well, and is about 800 lines of code (Python, SQL, Shell and YAML).

Cloud computing ecosystems such as AWS, Google Cloud and Azure make deploying microservices like this straightforward. The code is Dockerized (on Alpine) and deployed on an ECS Fargate cluster. New builds are processed by Codepipeline, which monitors the attribution Github repo. The underlying database is Aurora Postgres.

This system takes about 4 hours to run every morning and can come across difficulties such as underlying table slowness or unavailability. To mitigate this, each query has a 2 hour retry system.

Performance is logged in AWS Cloudwatch, synced to Datadog. If any query totally fails after retries then a Pagerduty ping is sent to my phone.

The results

  • Cool stats — 71% of our users signed up in their first session!
  • 800 million actions have been taken on Frame.io (row count of our Segment tracks table partitions combined 🤯) and we now have over 600,000 users.
  • We no longer have to dig into specific platforms (e.g. Facebook Business, Google Analytics) but instead have all our analysis in one place and can visualize it in Looker!
  • We can easily calculate the average LTV per campaign or channel with a simple SQL group-by.
  • We can gauge the stickiness of a source — e.g. do our YouTube ads give us customers who last for a long time or quickly churn?
  • As a gut-check, we ask users to report where they heard about Frame.io (50% direct, 14% YouTube, 10% Facebook), and these numbers match up perfectly with our “First Touch” column.

Where do we go from here?

  • This attribution system is only the piping behind the actual analytics. It gets harder when we have to make more philosophical decisions. How can we decide how to attribute “multi-touch” users? E.g. if a user has a first-touch on one blog article and a last touch several months later on a different one, then who gets the credit?
  • We are closely monitoring the performance right now, and it runs in about 4 hours. However our traffic is increasing by a huge amount every month — how much longer will this last? Thankfully the system is easily parallelizable apart from the last stages.

 With thanks to Kyle Gesuelli and Ammon Brown for proof reading, edits and suggestions. I originally posted on the Frame.io blog at: https://medium.com/frame-io-engineering/marketing-attribution-at-frame-io-8c2dbde14b37

Avoiding multiple reads with top-level imports

Recently I’ve been working with various applications that require importing large JSON definition files which detail complex application settings. Often, these files are required by multiple auxiliary modules in the codebase. All principles of software engineering point towards importing this sort of file only once, regardless of how many secondary modules it is used in.

My instinctive approach to this would be to have a main handler module read in the file and then pass its contents as a class initialization argument:

The problem with this is that if you have an elaborate import process, and multiple files to import, it could start to look messy. I recently discovered that this multiple initialization argument approach isn’t actually necessary.

In Python, you can actually import the same settings loader module in the two auxiliary modules (module1 and module2), and python will only load it once:

Now when we test this out in the terminal:

Despite calling import settings_loader  twice, Python actually only called it once. This is extremely useful but also could cause headaches if you actually wanted to import the file twice. If so, then I would include the settings importer inside the __init__()  of each ClassX and instantiate it twice.

Mocking out an API call deep in your code

With any actively developed (python) coding project, you’re and your team are going to be running the same set of tests sometimes hundreds of times per week. If there’s an HTTP request to any 3rd-party source in there, this can cause problems. API calls can be expense, excessive scraping of the same source can cause IP blacklisting and the calls could just slow down your whole test process, adding extra baggage to the code deployment process.

To fix this, we can use Python’s mock library. Mock is really useful for creating fake function calls, fake Classes and other fake objects which can return fake values. In most cases when testing, you are really just testing how the application parses data rather than the reliability of the 3rd party service. The API’s response is generally the same. Mock can let you simulate the API’s response and parse its data rather than actually have to make the call each time.

It’s quite tricky to set up so I thought I would write a tutorial. The situation set up has a few components but I’ll try and explain it as well as possible. Let’s say there is a service that provides some useful API response. There’s a site, HTTPBin, set up by Kenneth Reitz to test HTTP libraries, which we will use here. Check out: https://httpbin.org/ip. The content is as follows:

Let’s say our program wants to grab the IP address in the origin field. Yup – a fairly pointless program but this will be analogous to many situations you’ll encounter.

Here’s a totally over-engineered class to get data from this service. When the class is initialized in __init__, it creates a base_url variable pointing to HTTPBin. The main handler function is the get_ip function, which simply grabs that field’s content. This first makes a call to api_call which uses requests.get to grab that HTTP data.

To run this code its simply (in a Python Shell):

What if we want to mock out requests.get? The Mock module documentation is quite unclear on how to target a specific function deep within a class. It turns out the easiest way to do this is not MagicMock or return_value but instead to use the counter-intuitively named “side_effect” feature. This is the testing module pre-mocking:

As you can see, this is a standard set of tests to check that the ip_grabber function returns a valid-ish IP address. It is run as follows:

However, the problem here is that it is going to call the actual API each time you run the tests. To stop this, let’s integrate the mock module:

Here we’ve:

  1. Imported the mock module. Note: if you get an error about “wraps” in the “six” module then it is almost certainly because you have more than one installation of six or mock and one needs to be deleted.
  2. Create a fake function fake_get to replace requests.get with. This actually returns just “123” for now so you can see how it makes the test fail below.
  3. Added the mock.patch wrapper around the test_ip_grabber function. Very important here is specifying the function name as it is imported in my_module NOT as it appears in the Python standard library; i.e. we are doing “my_module.get” rather than “requests.get”. The side_effect= then says to replace that with whatever function we want.
  4. The fake function specified by side effect must now be added as an argument to the function.

Running this, we get:

Mock’s side_effect has replaced requests.get 🙂 To make this pass, just replace  return {'origin': '123'} with return {'origin': ''}  and run again:

Tests pass and zero HTTP traffic! 🙂

Fixing a Django/Vagrant error in ifup/ifup-eth

I normally test out a django project using a local vagrant instance. Vagrant creates a virtual machine running the django project which instantly recognizes changes to code. However, I was running a test recently and the VM suddenly started outputting error messages as follows. Note: I’ve redacted some parts of these terminal outputs.

(Scroll to the bottom of this post to skip to the solution)

No translation files? I’ve never heard of them. Thinking it could be a random error, I tried to run the test again and got:

I tried logging out of the instance and got an I/O error:

Attempting to ssh into the VM again is refused:

Sometimes the “turning it off and on again” solution can work, so let’s try vagrant reload:

A timeout error this round. The error seems related to authentication, but it’s not the whole story. The VM is running according to global-status, despite not being correctly set up, which is a bit strange.

I also run a GUI called VirtualBox which is sometimes handy for visualizing all VMs on your laptop. Checking there, it also seems to be present – so global-status wasn’t lying to us. Let’s try halting it:

A forced shutdown this time. So far debugging this error isn’t going well. It is totally strange because the VM was working perfectly beforehand. Vagrant up after this produces the same timeouts:

Let’s try destroying the vagrant instance completely….

And starting from scratch:

That’s a very weird error. Perhaps something failed randomly?

Apparently it still created the VM. SSHing into it gives this:

Bizarrely there are no files! Something still didn’t startup properly.

At this point I was getting a bit frustrated and started to Google the error. The most relevant blog post I could find was Mike Berggren’s solution here: http://mikeberggren.com/post/100289806126/if-up. He fixed this issue and reports: “I’ll spare you the rest of the gory details but suffice it to say, we eventually circumvented that check and ran that same command from the host. It came back with another MAC address claiming ownership“. Does this mean he literally commented out the lines of code that make that check? He may have had a very different problem and I’m no dev-ops expert but perhaps there’s a better solution – maybe that check is there for a reason.

Let’s look back at that network error mentioned before:

Cat the vagrant file and it reveals the same IP address. This means that this particular vagrant is always assigned that IP address. If this is the only instance, how is it that something else on this virtual network is using it? Something inside my mac is conflicting with it.

We can check to see what’s really running (aside from what global-status and VirtualBox say) by checking running processes:

Aha! There are actually 2 of them. Let’s destroy the non-functioning blank VM that we created before and see if anything has changed:

Yup, it has gone:

Use kill -9 [pid] to remove it.

Now let’s try recreating the VM:

It works!


  • I’m still not sure why this happened in the first place. The errors at the start came totally out of the blue – I’d been running that VM for several weeks without issues. Perhaps it was the fact that I’d been running it for so long?
  • Vagrant global-status and VirtualBox seem to not entirely accurately report running VMs so definitely check all running vbox processes using ps and remove any extras that didn’t shut down properly. This reminds me of the –prune option in global-status (https://www.vagrantup.com/docs/cli/global-status.html) which can fix persistent old entries.

Big thanks to Tony (http://blog.tonns.org/) for helping work this issue out!

Extending MongoDB findOne() functionality to findAnother()

Given a new MongoDB instance to work with, its often difficult to understand the structure since, unlike SQL, the database is inherently schema-less. A popular initial tool is Variety.js, which finds all possible fields and tells you the types, occurrences and percentages of them in the db.

However, it is often useful to see the contents of the fields as well. For this, you can use findOne(), which grabs what seems to be the last inserted record from the collection and pretty-prints it to the terminal. This is great, but sometimes you want to see more than one record to get a better feel for field contents. In this post I’ll show you how I extended Mongo to do this.

(Assuming you’ve already installed mongo and git – I use  brew install mongo from the homebrew package manager and git came with XCode on my Mac)

First off, let’s get a copy of mongo’s code:

This will have created a folder in your Documents, so let’s inspect it:

There’s quite a lot of files there, it will take a while to find where the findOne() function is located. Github’s repository search returns far too many results so let’s use a unix file search instead. MongoDB is written in JavaScript (and compiled) so we need to look for a function definition like findOne = function( or function findOne( . Try this search:

The two flags here are -l  (lowercase L) which shows filenames, and R  which searches all subdirectories. The fullstop at the end means to search the current directory. You can see it found a javascript file there, “collection.js”. If you open it up in a text editor, the findOne() function is listed:

This code can also be found here on the mongodb github page: https://github.com/mongodb/mongo/blob/master/src/mongo/shell/collection.js#L207

Our function is going to extend findOne instead to find another record. This can be implemented by doing a “find” using exactly the same code, but then skipping a random number of records ahead. The skip amount has to be less than the number of records listed which unfortunately means we have to run the query twice. First to count the number of results, and second to actually skip some amount.

Copy the findOne function and rename it to findAnother, with these lines at the top instead:

  1. Gets a count of the records the query returns (stored in total_records)
  2. Generates a random number in a range from 1 to the count (stored in randomNumber)
  3. Queries again using that number as a skip (stored in cursor)

Generating random numbers in a range is a little obscure in JavaScript but I found a helpful tip on StackOverflow to do it: http://stackoverflow.com/a/7228322 in one line. I’m used to Python’s ultra simple random.randrange().

Let’s test this out first. You’ll notice that all code can be returned in mongo’s javascript shell:

You can actually replace this code live in the terminal, though it won’t be saved once you close it. Try first replacing findOne with a Hello World:

You can test out our new function first in the terminal by copying everything after the first equals. Open a mongoDB shell to your favourite db and type db.my_collection.findOne =  then paste in the function. Try calling it and it should return different results each time.

Let’s patch mongodb now with our function. We have to compile mongo from the source we just downloaded.

  1. Save the collection.js file you just edited with the new findAnother() function
  2. Make sure you have a C++ compiler installed. I’m using gcc, which was installed using  brew install gcc
  3. Check how many cores your processor has. According to Google, my 2.7 GHz Intel i5 has 4 cores.
  4. In the same terminal window (still in the mongo folder), type: scons -j4 mongo  and press enter. The 4  here is the number of cores I have, and seriously speeds things up. scons  is a program that handles compilation, and putting mongo  at the end of this specifies that we only want to patch mongo’s client.

You’ll see a huge amount of green text as scons checks everything, and then:

Its compiled! We now have to replace the existing mongo application with our modified version. Let’s do a unix find:

We want to replace the mongo application in /usr/local/Cellar/ (where brew installed it to). Let’s back it up then copy across:

Now, open up a MongoDB shell to your favourite DB:


  • This could be really slow since it has to run the query twice. However, typically I don’t add super taxing queries to findOne().
  • This function may be replaced if you reinstall/upgrade mongo with brew – they release new versions fairly frequently.


  • A function that acts like a python generator, keeping state and cycling forward until it reaches the end of the record set. This would fix any slowness above, but be less random.

Converting an HTML table to an Excel download HTTP Response: A hack for slow OLAP DB connections


Zenko (“Good fox” in Japanese) is a reporting system (see code on Github here) I’ve created over the last couple of weeks at Mozilla. Basically my non-technical coworkers were getting so frustrated by Tableau (“what the heck is the difference between INNER JOIN and OUTER JOIN?”) that I decided to create a simple dashboard interface for them.

Its a simple bootstrap front-end to a database containing campaign stats for sponsored tiles. You can drill down to each tile or client/partner and pivot by things like locale, country and date.

Zenko’s stack (high to low)

A new feature

When loading one of the analyses pages, a table will be shown. My coworker wanted to be able to download the data to Excel. I came up with 4 possible ways to implement this:

  1. Simply rerun the query, format the results as a csv on the backend, save it and window.open() the file location.
  2. Automatically save the data from each analysis request server request and periodically clear old files.
  3. Use a javascript library like ExcelBuilder
  4. Send the data back to the server, format it, and then back to the client via an iframe

Which is the best solution?

  1. This is problematic because our sticking point is the query speed. The redshift database is an OLAP Column Oriented database, and append-only. This means that it is insanely fast to add data to, but quite slow (often 6+ seconds) to query. Yes, it is dealing with billions of rows so excusable, but its not so great in terms of user experience to wait so long.The user doesn’t want to wait another 6 seconds for the analysis to rerun when they have the data already.
  2. This sounds like it could just end up storing a lot of data on the client, but it could work quite well. In terms of security though, I’m not sure that the data should be lingering on the user’s PC unrequested though.
  3. This didn’t work out so well – in Firefox, the file is incorrectly named. In the future, I’d like to name the files according to the parameters of the analysis e.g. <client>-<date>-<country>.xls
  4. This is the weirdest solution, but it works! Flask is running locally so it is actually very fast. There are no huge JQuery/JavaScript complications with file permissions and the fact that you can manipulate the data easily on the server is nice too.

Solution 4

The process is as follows when the “Download for Excel” button is clicked:

  1. Reference the HTML table using JavaScript and convert it to an array of arrays
  2. Append an iframe to the DOM
  3. Append a form with a POST action and hidden field to the iframe
  4. Insert the table contents into the hidden field’s value
  5. Submit the form
  6. Let Flask receive the POST request and format the information as a CSV
  7. Return an HTTP response with a file attachment containing the CSV

Let’s implement it

There were various ways to do this in JQuery with iterable.each()  but I ran into complications and simply referencing cells using .children was much easier.

The (locally running) Flask will then recieve a POST request at /download_excel . Let’s set up the route:

Now, when the user clicks the button:

Download Link for Excel

They instantly get:

Download popup

Sorry, I can’t show what it looks like in Excel because the data isn’t public at the moment. All code is however available here on github!

One bizarre thing, however, is that the form doesn’t appear in the inspector (in either Chrome or Firefox):

Invisible in the inspector

Though, you can access it with some fairly lengthy getters:


Future features 

  • The files could be named something more intuitive than data.csv  – perhaps a combination of various things seen in the URL’s query string
  • Accommodate for a table wider than 6 rows. This could be done easily by stringifying the array using a different delimiter such as a “###”.
  • Create an .xls file rather than a CSV, if there is any advantage

Using Web Workers in Firefox Extensions

Web Workers allow you to run code in the background in browsers such as Firefox. This is how to build one into a Firefox Extension, which is slightly different than from just creating one as normal on a page. The documentation for doing this is basically non-existent, so hopefully you’ll find this useful.

Please make sure you have a development environment set up similar to the one described here in my previous post.

How do workers work?

  • Workers in /data/ are not directly connected to scripts in /lib/
  • However, they can communicate by sending messages to each other
  • These messages are text only, so could contain serialized JSON, but nothing else
  • You’ll notice below that we are basically just slinging messages between two scripts

The code for the worker

Navigate to the /data/ directory and create a file called hello_world.js

Now paste the following in there (new users of vim, press i  to start typing and Esc  followed by :wq  to save):

This says that whenever the worker receives a message from the client, then send a message back with the word “Hello” prepended.

One note here: In workers, you can’t use the useful function  console.log("message") , instead use  dump("message")

Let’s call the worker from the main code

Let’s navigate back to the /lib/  folder and edit the main.js  file, which is the first thing that runs in the extension.

Paste in the following code:

And run  cfx run . You’ll notice a messy error:

Aha! The key line here is:  ReferenceError: Worker is not defined . This is because Firefox Extensions use something called a ChromeWorker instead. We need to import this in main.js by pasting this at the top:

and changing the line that references the hello_world.js file to call a ChromeWorker instead:

Ok let’s try running it again! Try  cfx run . Wtf another error?!

The key line here is:  Malformed script URI: hello_world.js . This cryptic error is because firefox can’t yet access anything in the /data/  folder. We have to use another part of the SDK to enable access to it.

Open main.js  and put this at the top:

Now we can use the function self.data.url() . When you put a filename as the first argument, it will return a string like  resource://jid1-zmowxggdley0aa-at-jetpack/test/data/whatever_file.js which properly refers to it in the case of extensions. Modify the worker import line as follows:

Now let’s run the extension again using cfx run :

Yay it works! The Worker returned the message “Hello Matthew“.


  • What does this {notation}  mean?

It is shorthand for:

Basically this means that  require("chrome") returns an Object, and we just need the value that is referenced by the key “ChromeWorker”. This is a very succinct way of extracting things from JavaScript Objects that will come in handy in the future.

  • Why is Worker now called ChromeWorker? Are we doing something with Google Chrome?

This is a naming coincidence and nothing to do with Chrome as in the browser. Chrome in this case refers to Firefox Addon internals.

Setting up a development environment for Firefox Extensions.

This is the method I use to create simple firefox extensions. This tutorial is a precursor to the next one which is about using Web Workers (i.e. allowing code to run on background threads).

Setting up the environment

We’re going to need the Firefox Addon SDK. This is a collection of python files that will let you run a test (optionally blank) version of Firefox. To download it:

Now extract it and remove the tarball:

Go to the directory and startup the special shell:

Now you can see that the shell has prepended (addon-sdk-1.17) in brackets to the prompt. This means that the window is probably half filled with text so we can reduce that with the command:

Much cleaner! 🙂

Setting up the extension template

Now that we have this special addon-sdk shell, navigate back to your documents and create a new folder for our extension.

This special shell has various useful commands included, which all look like  cfx xyz . For more about them see here. In this case we use  cfx init

Let’s inspect what was created:

  •  lib  contains a file called main.js  which is the main handler file for all extension code
  • data  is empty but can be used to store things like workers (which we will come to later) or large data files
  • test  can contain unit tests (quite hard to set up but useful for test driven development later)
  • package.json  contains metadata about the extension – version number, name of the creator, description, licensing etc

You can start writing code in main.js and it will run in the browser. Once finished, use  cfx run to test it!

See the next tutorial on how to write a firefox extension using web workers!