Category Archives: Uncategorized

What comprises a great Data team?

Data teams are very hard to hire. Engineers are low in supply and high in demand, resumés are difficult to vet without time-consuming phone/coding screens, and often without FAANG’s resources it can be hard to find people as passionate as you are about your product. As the hiring manager, you may be lucky enough to afford the services of a professional recruiter. However, often the recruiters send you completely unsuitable candidates.

A very friendly recruiter I know in NYC and I had coffee the other day to discuss these industry issues. We agreed that it would be useful if we defined what makes up a great data team and what the roles actually mean.

Who should I hire first? And how do they work together?

  • Data Analyst
    • Your goal in hiring a data team is probably business insights and getting to a point at which your actions are largely data (rather than intuition) driven. For this, a competent Data Analyst is your starting point. Thankfully, Data Analysts are in plentiful supply. New grads with a passion for digging for insights and a quantitative or semi-quantitative (e.g. Econ) background shouldn’t be too tough to find. On the high end you can find those with experience in SQL and also relatively new BI tools such as Looker. If you have the budget, 75% of the engineering tasks that they will encounter can be solved with plug-and-play pipeline tools such as Segment. Most standard business insights – “what happened yesterday/last month?”, “what’s an underserved demographic in my customer base?” can be answered without an expensive Data Scientist.
  • Data Engineer
    • At some point, your databases and queries are going to become too slow. You’ll also want to scrape a data source or use an API that no one has built a connector for in Segment. Perhaps you’ll want some custom lead scoring for sales, there’ll be a bunch of security issues, or there’s some giant set of queries that you want refreshed every morning with a notification in Slack. Not having a programming background, your Data Analyst will feel out of their depth. This is when you need to hire a Data Engineer. Their expertise lies in getting your datasets to the point at which they can be analyzed. If your tables are getting 50 million events per month, then queries will start to take hours overnight and tables may even lock up. The Data Engineer can alleviate this – they will be able to optimize the indexes of your database’s tables for fast lookups, create materialized views of helpful aggregates that refresh every morning, connect custom APIs to your data warehouse and generally go above and beyond what is covered by plug-and-play tools.
  • Data Scientist
    • Now that you’re getting standard business insights, and getting them at scale from interesting sources you’re going to almost certainly be curious about the siren song of AI. What if you could predict which customer will churn/convert? What if you could cluster your customer base by behavior? This sort of next-level insight is all possible with a competent Data Scientist. They will use Machine Learning libraries in Python to try to automatically classify your customers and actions in your system. They can predict the future (with certain degrees of accuracy) and tell you stories about your data. This will involve using the large datasets that the Data Engineer has provided. Once these insights have been surfaced, they will have a trained model which can be saved to disk and refreshed. But what if this needs to be refreshed on a daily basis, and those insights added to a table that the Data Analyst or even Salesperson can read? Then you need to go back to the Data Engineer and have them set up a backend system to do this.
  • Data Science Engineer, Machine Learning Engineer
    • Sometimes, your (possibly more old-school) Data Engineer will be 100% pipeline and database admin focused and not particularly experienced at implementing the Python-based models your Data Scientist came up with. For this issue, you need a Data Science Engineer or Machine Learning Engineer. Their expertise is in deploying, scaling and refreshing the model that the Data Scientist came up with. The Data Scientist should be focusing entirely on probabilities, tweaking model parameters and confidence scores; your Data Analyst should be focusing on the higher-level narrative, and your DS/ML Engineer can now take the model and make sure it delivers your insights quickly, and with clean, fresh, correct data.
  • Visualization Engineer
    • What if Looker doesn’t have the charts you want? What if you need to visualize a network? What if your data has 16 dimensions and you’ve exhausted all the color, size and shape options in your scatter plot? Then you need the very specialized role of Visualization Engineer to build you custom visualizations and think of better ways to surface the insights that the Data Analyst struggled with.
  • Business Intelligence Engineer
    • This is an ambiguous title – they could be a Data Science Engineer, or Data Engineer.

The Roles

Title | Responsibilities | Tools/Technologies
Data Analyst | Basic business insights | Looker/Tableau, SQL, Excel, Redash
Data Scientist | More complicated business predictions | Jupyter Notebooks, Python, Scikit-Learn, Numpy, Scipy, Statsmodels
Data Engineer | Getting “Big Data” to the point at which it can be analyzed, as well as connecting custom data sources | SQL, Python, Scrapy, Spacy, Selenium, Kafka, Airflow, Luigi, Spark, AWS, Redshift
Data Science Engineer or Machine Learning Engineer | Implementing the Data Scientist’s models at scale | SQL, Python, Scikit-Learn, Numpy, Scipy, AWS, TensorFlow
Visualization Engineer (rare) | Making beautiful graphics to augment everyone’s work | Looker/Tableau, Javascript, D3.js, Highcharts, HTML

But why does everyone just focus on Data Science?

Data Science is a buzzword that some use to represent anyone who fits in the intersection of Data Engineering and Analytics. One of the biggest mistakes companies make is hiring too many Data Scientists and not enough Data Engineers. Data Scientists have become easier to hire recently, as many former Mathematicians, Chemists, Statisticians and other quantitative grads have found they can either easily rebrand themselves or attend a 3-month bootcamp and get hired in a junior role. However, their models and their insights can be limited as they don’t know how to scale or deploy them, and they will spend much of their time cleaning datasets rather than deeply concentrating on the message and predictions hidden within. Of course, as someone who now focuses mostly on Data Engineering I am biased, but I would say that each Data Scientist should be paired with at least one Data Engineer. The reason I actually moved into Data Engineering from Data Science was really out of necessity for my own work.

I hope this guide was successful in explaining today’s data teams. Please comment or email me if you have any ideas about how to improve it.

Open sourcing DIMA – a handy tool for database monitoring

Data at

At a growing startup like, we have a huge volume of data coming in — over 67 million events per month. Efficient analysis is crucial to remain the market leader, and we use a whole host of tools:

  • GUIs like Looker, Redash and PSequel
  • Internal analytics and query parallelization microservices
  • Custom AI slackbots we’ve built for spam/intrusion detection and sales lead spotting

This means that our data warehouse is being constantly hit by tons of queries — both on Cron schedules and from employees exploring our datasets.

Volume issues

Managing this volume of queries can be tough.

  • PostgreSQL tables can lock up — trying to swap or delete a table while someone else is querying it, for example, will cause a lock. Ideally, analytics tables should only ever be incrementally updated, but sometimes design constraints prevent this.
  • SQL GUIs can time out at random, and it’s useful to quickly inspect the DB to see if queries are still running or not.
  • Using a combination of screen and psql to run queries in the background requires monitoring. Dima is great for that.
  • An accidental SELECT without a LIMIT on a multi-million row table can crash a SQL GUI.
  • Sometimes a big query just isn’t what you wanted. PSequel doesn’t have a cancel button, and hosted GUIs like Redash and Looker have cancel buttons that often become unresponsive for long-running queries.


To help deal with this, I built a small command-line tool called dima. This originally stood for “Database Integrity Monitoring & Analysis”, but now it’s just “dima”. It’s very straightforward and has turned out to be super useful, so I got the OK to open source it.

Dima is a Python script that lets you quickly summarize or inspect Postgres’s internal pg_stat_activity table via the psycopg2 library. If you want to kill a query, it calls pg_terminate_backend on the PID.
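In essence, it boils down to something like this (a minimal sketch, not the actual source; the query text and function names here are my assumptions):

```python
# Sketch of dima's core: summarize pg_stat_activity and build the SQL
# needed to terminate a backend by PID. Names are illustrative.

SUMMARY_SQL = """
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;
"""

def kill_sql(pid):
    # pg_terminate_backend() is the built-in Postgres function for
    # cancelling and disconnecting a backend by PID.
    return "SELECT pg_terminate_backend(%d);" % pid

def show_running(dsn):
    import psycopg2  # imported lazily so the helpers above need no driver
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SUMMARY_SQL)
        for pid, state, runtime, query in cur.fetchall():
            print(pid, state, runtime, query[:60])
```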

Check out

Install using pip install dima-db or brew tap Frameio/homebrew-dima && brew install dima


  • dima — a summary of running queries
  • dima show [PID] — inspect a specific running query
  • dima rm [-f] [PID or keyword] — kill matching queries

How we attribute signups at

Attributing 800M clicks to 600K users

At we use Segment to power our app usage analytics. Every action in both our app and on our site can be logged by making a call with an event name such as “signed-up” or “viewed-video”.

This data is essential for our A/B testing and Analytics. Combined with Segment’s great integrations with services like Clearbit, we can enrich profiles with company information, and then trace the first time a user accessed any of our services. This data lets us determine which blog authors drive the highest LTV customers, which prospect audience in YouTube generates the least sticky users, etc.

We leverage a variety of different channels for growth — drip email campaigns, Facebook, YouTube, Twitter, Google AdWords and they all have separate platforms, some of which don’t sync into our data warehouse. A system is needed to homogenize this information so we can analyze client, channel, campaign and medium all in one place.

Attribution is really about having a 360-degree view of a company’s market fit, and its ability to succinctly explain its service to the correct audience.

Planning an attribution system

Conceptually, attribution is quite simple — just find the first result in your database for the user’s ID and extract a UTM, right? However, there are a ton of considerations before you can get to that single row of data that shows where a user came from:

  • Even with great indexing, searching for thousands of users per day in millions of rows, and maintaining the partitions and indexes behind those rows can be a big load on the database.
  • When users first came to the site, they didn’t have user_ids — but rather anonymous_ids assigned by Segment. A user ID is only associated with a user once the user creates an account and the “identify” call is made with both an anonymous_id and a user_id. Sometimes we spot a user’s anonymous_id from over a year ago. Once we have a user’s global ID (assigned from our system), then we have to backfill all of the anonymous sessions prior to our ID being assigned.
  • We have to sessionize every set of user actions and aggregate attributes about their first/last session before signup. Sessionization is the process of aggregating discrete user events into runs of page loads with no more than a 20 minute gap in between. This enables us to do things like analyze the overall intent of a visit and attribute it in bulk. Re-sessionizing involves huge nested window functions and the alias table, as often the same user has several historical anonymous_ids to merge.
  • We will want to maintain some quick lookup tables for this to be useful — user session counts, useful sessions, maybe some aggregates to see how we are doing in a visualization tool like Looker.
  • Note also that Segment actually has a native attribution system with a snippet you can install; but it can’t look back to before you installed it — and we have data going back to 2014.
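To make the 20-minute sessionization rule above concrete, here is a toy Python sketch (the real system does this with SQL window functions over the tracks tables; the shape of the input is illustrative):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=20)

def sessionize(timestamps):
    """Group a user's event timestamps into sessions: runs of events
    with no more than a 20-minute gap between consecutive events."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])    # gap too large: start a new session
    return sessions

events = [
    datetime(2018, 1, 1, 9, 0),
    datetime(2018, 1, 1, 9, 10),  # 10-minute gap: same session
    datetime(2018, 1, 1, 11, 0),  # > 20-minute gap: new session
]
# sessionize(events) yields two sessions: the two morning events, then one
```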

Implementation Stages

The system design settled on the following 8 query groups/stages:

  1. Find the latest timestamp in the existing sessionization data
    1. Get all the latest visit data from the Segment tracks table
    2. Merge this data with the most recent monthly partition, with some code to check that the columns are still the same (new fields could have been added).
  2. Sessionize the most recent partition (there could be overlap from the last 24 hours so make sure the gap is covered)
    1. Update the aliases table with any newly identified anon/user pairs.
    2. Go back through all historical sessions and re-alias — i.e. swap the anonymous_id for any newly identified user_id.
    3. Re-sessionize all historical sessions, since adjacent sessions may have historically different anon IDs but now the same user_id
  3. Add an extra column to indicate if the user signed up in a session
    1. Merge the resulting sessions tables into a single all_sessions table
    2. Create a lookup table for useful sessions — that is the first session, and the signup session only per user
  4. Create a useful attribution table with things like UTM parameters pre-extracted and session counts
  5. Re-index any tables and check that the partitions are in order (especially over a new-month gap).
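The re-aliasing step above can be illustrated with a toy Python version (field names are hypothetical; the production version is a SQL join against the aliases table):

```python
def realias(events, aliases):
    """Swap each event's anonymous_id for a known user_id, if the
    aliases mapping (anonymous_id -> user_id) has identified it."""
    return [
        {**event,
         "alias_id": aliases.get(event["anonymous_id"],
                                 event["anonymous_id"])}
        for event in events
    ]

aliases = {"anon-123": "user-42"}
events = [{"anonymous_id": "anon-123", "page": "/pricing"},
          {"anonymous_id": "anon-999", "page": "/blog"}]
# realias(events, aliases): the first event is stamped with user-42,
# the second keeps its anonymous_id until an identify call arrives
```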

Table structure


  • aliasing
    • known_pairs
  • sessions
    • attribution
    • all_sessions
    • useful_sessions
  • sessions_stages
    • all_tracks_2018_01
    • all_tracks_2018_01_aliased
    • all_tracks_2018_01_re_merged
    • all_tracks_2018_01_signup
    • …etc
  • partitioning
    • app_tracks_2018_01
    • app_tracks_2018_02
    • app_tracks_2018_03
    • …etc

Columns in the final sessions.attribution table (many are just quick lookups for easy aggregation, and the table is heavily indexed):

  • alias_id
  • session_count_before_signup
  • signup_timestamp
  • blog_url
  • first_page_ever_visited
  • first touch (i.e. first interaction with any property)
    • timestamp
    • source, campaign, medium, referrer
  • last touch (i.e. last interaction before signup)
    • timestamp
    • source, campaign, medium, referrer
  • first attributed touch (as above, but with paid referrer)
    • timestamp
    • source, campaign, medium, referrer
  • last attributed touch (as above, but with paid referrer)
    • timestamp
    • source, campaign, medium, referrer

Deployment and monitoring

The system took about 1.5 months to get working well, and is about 800 lines of code (Python, SQL, Shell and YAML).

Cloud computing ecosystems such as AWS, Google Cloud and Azure make deploying microservices like this straightforward. The code is Dockerized (on Alpine) and deployed on an ECS Fargate cluster. New builds are processed by CodePipeline, which monitors the attribution GitHub repo. The underlying database is Aurora Postgres.

This system takes about 4 hours to run every morning and can run into difficulties such as slow or unavailable underlying tables. To mitigate this, each query has a 2-hour retry system.

Performance is logged in AWS CloudWatch, synced to Datadog. If any query totally fails after retries, a PagerDuty ping is sent to my phone.

The results

  • Cool stats — 71% of our users signed up in their first session!
  • 800 million actions have been taken on (row count of our Segment tracks table partitions combined 🤯) and we now have over 600,000 users.
  • We no longer have to dig into specific platforms (e.g. Facebook Business, Google Analytics) but instead have all our analysis in one place and can visualize it in Looker!
  • We can easily calculate the average LTV per campaign or channel with a simple SQL group-by.
  • We can gauge the stickiness of a source — e.g. do our YouTube ads give us customers who last for a long time or quickly churn?
  • As a gut-check, we ask users to report where they heard about (50% direct, 14% YouTube, 10% Facebook), and these numbers match up perfectly with our “First Touch” column.

Where do we go from here?

  • This attribution system is only the piping behind the actual analytics. It gets harder when we have to make more philosophical decisions. How can we decide how to attribute “multi-touch” users? E.g. if a user has a first-touch on one blog article and a last touch several months later on a different one, then who gets the credit?
  • We are closely monitoring the performance right now, and it runs in about 4 hours. However our traffic is increasing by a huge amount every month — how much longer will this last? Thankfully the system is easily parallelizable apart from the last stages.

With thanks to Kyle Gesuelli and Ammon Brown for proofreading, edits and suggestions. I originally posted on the blog at:

Setting up a development environment for Firefox Extensions.

This is the method I use to create simple Firefox extensions. This tutorial is a precursor to the next one, which is about using Web Workers (i.e. allowing code to run on background threads).

Setting up the environment

We’re going to need the Firefox Addon SDK. This is a collection of Python files that lets you run a test (optionally blank) version of Firefox. To download it:
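(Reconstructed command: the Add-on SDK has since been retired by Mozilla, so this historical URL is an assumption based on the 1.17 release naming and may no longer resolve.)

```shell
# Best guess at the historical download location for the 1.17 tarball
wget https://ftp.mozilla.org/pub/mozilla.org/labs/jetpack/addon-sdk-1.17.tar.gz \
  || echo "Download failed: the SDK is archived; find a mirror of addon-sdk-1.17.tar.gz"
```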

Now extract it and remove the tarball:
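(Reconstructed commands, assuming the 1.17 tarball from the previous step is in the current directory:)

```shell
tar -xzf addon-sdk-1.17.tar.gz   # unpack into ./addon-sdk-1.17
rm addon-sdk-1.17.tar.gz         # the tarball is no longer needed
```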

Go to the directory and start up the special shell:

Now you can see that the shell has prepended (addon-sdk-1.17) in brackets to the prompt. Activating it also prints a lot of text, so the window is probably half filled; we can tidy that up with the command:
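(Presumably the standard screen-clearing command; the original snippet is lost:)

```shell
clear
```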

Much cleaner! 🙂

Setting up the extension template

Now that we have this special addon-sdk shell, navigate back to your documents and create a new folder for our extension.
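(Reconstructed; the folder name my-extension is just a placeholder:)

```shell
mkdir -p ~/Documents/my-extension   # any project folder will do
cd ~/Documents/my-extension
```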

This special shell has various useful commands included, which all look like cfx xyz. For more about them, see here. In this case we use cfx init.

Let’s inspect what was created:

  • lib contains a file called main.js, which is the main handler file for all extension code
  • data is empty, but can be used to store things like workers (which we will come to later) or large data files
  • test can contain unit tests (quite hard to set up, but useful for test-driven development later)
  • package.json contains metadata about the extension – version number, name of the creator, description, licensing, etc.

You can start writing code in main.js and it will run in the browser. Once finished, use cfx run to test it!

See the next tutorial on how to write a Firefox extension using web workers!

A quick solution to OrderedDict’s limitations in Python with O(1) index lookups

Background to the Problem

I work regularly with gigantic machine learning datasets. One very versatile format, for use in WEKA, is “ARFF” (Attribute Relation File Format). This essentially creates a nicely structured, rich CSV file which can easily be used in Logistic Regression, Decision Trees, SVMs, etc. To handle very sparse CSV data, there is a sparse ARFF format that lets users convert sparse lines such as:

f0 f1 f2 f3 fn
1 0 1 0 0

Into a more succinct version where you have a list of features and simply specify each feature’s index and value (if any):


{0 1, 2 1}

i.e. {feature-index-zero is 1, feature-index-two is 1}, simply omitting all the zero-values.

The Implementation Problem

This is easy enough if you have, say, 4 features, but what if you have over 1 million features and need to find the index of each one? Searching for a feature in a list is O(n), and if your training data is huge too, then creating the sparse ARFF is going to be hugely inefficient:
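Something like this naive approach (a reconstruction, since the original snippet is lost) pays two O(n) scans per feature:

```python
features = []

def feature_index(feature):
    """Return the ARFF column index of a feature, appending new ones."""
    if feature not in features:      # O(n) membership scan
        features.append(feature)
    return features.index(feature)   # another O(n) scan
```

With a million features and millions of training rows, those linear scans dominate the whole conversion.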

I thought I could improve this by using an OrderedDict. This is, very simply, a dictionary that maintains the order of its items – so you can pop() items from the end in a stack-like manner. However, after some research on StackOverflow, it turns out that, disappointingly, OrderedDict doesn’t provide any efficient way to calculate the index of a key:
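With an OrderedDict, finding a key's position still means materializing and scanning the keys (again a reconstruction of the lost snippet):

```python
from collections import OrderedDict

features = OrderedDict()
for name in ["f0", "f1", "f2"]:
    features[name] = True

# There is no OrderedDict method for "index of key"; you have to build
# a list of the keys and scan it, which is O(n) in time and memory.
index_of_f2 = list(features.keys()).index("f2")
```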

The solution

What can we do about this? Enter my favorite thing ever, defaultdicts with lambdas:
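The trick looks like this (a reconstruction of the elided snippet):

```python
from collections import defaultdict

# The lambda closes over the dict itself: looking up a missing key calls
# the factory, which assigns the key the next index (the dict's current
# size). Every lookup after that is an ordinary O(1) dict access.
feature_index = defaultdict(lambda: len(feature_index))

feature_index["f0"]   # 0  (new key: assigned index 0)
feature_index["f2"]   # 1  (new key: assigned index 1)
feature_index["f0"]   # 0  (existing key: plain O(1) lookup)
```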

Assigning items values in addition to the index is fairly straightforward with a slightly modified lambda:
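For example (my guess at the modified lambda, since the original snippet is elided):

```python
from collections import defaultdict

# Each new key gets [index, value]: the index comes from the dict's
# current size, and the value slot starts at 0.
features = defaultdict(lambda: [len(features), 0])

features["f0"][1] = 1    # feature f0 is present in this instance
features["f2"][1] = 1    # feature f2 is present

# Emit a sparse-ARFF-style line from the (index, value) pairs:
sparse_line = "{" + ", ".join(
    "%d %d" % (idx, val) for idx, val in sorted(features.values())
) + "}"
# here f0 and f2 were assigned indices 0 and 1, giving "{0 1, 1 1}"
```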


This is a fun fix, but it doesn’t support full dictionary functionality – deleting items won’t reorder the index, and you can’t easily iterate through it in order. However, since creating this ARFF file requires neither deletions nor ordered iteration, that’s not a problem.

WNYC Radio: “Are Hackathons Worth It?”

I was recently contacted by Jeff Coltin, a journalist at WNYC Radio, who asked me to participate in a show about hackathons in NYC.

He featured a snippet from our conversation, specifically about problems that the hacker community could solve. I said (vaguely accurate transcription):

“…There are so many problems that hackathons could fix. I think some big issues at the moment in the media, things like the NSA spying scandals and stuff like that. I think one thing the tech community has slightly failed to do is to make encryption really easy. There’s a sort-of inverse relationship between simplicity and security, so the more secure an app, often the more inconvenient it is to use. So we have things like TOR, extra-long passwords (TOR slows down your connection a lot), VPNs and a lot of very secure services are incompatible with mainstream services. So this level of security and privacy that users want or need is just so inconvenient to achieve its really up to the hacker community to make them much easier to use…”

There have been efforts such as Cryptocat, but its adoption rate still needs to grow. HTTPS is probably the best example of seamless encryption, but it often fails when people either ignore, or are at a loss as to what to do with, HTTPS certificates flagged as invalid by the browser.

Cryptography is an incredibly tough field of Computer Science, so creating reliably secure apps is hard. Educating oneself about this can require a fairly super-human effort and I have a lot of respect for people who contribute modules in this field to PyPI. I’m hoping to start the Crypto course on Coursera once I have some more free time, but beating the security-simplicity inverse relationship I mentioned is certainly easier said than done.

Hackathon Report – Greener Neighbor – Neighborhood Green-ness rankings in NYC



In July 2012 I attended and won the Judge’s Pick at a Hackathon called “Reinvent Green“, organized by at NYU Poly. There I met and teamed up with Anton Granik, a great graphic designer and digital director.

ReinventGreen Hackathon Contestants Presenting GreenerNeighbor

Idea – “Green, Greener, Greenest”!

Our idea was to create a site which could promote competitiveness between boroughs, zip codes or city blocks to be the greenest. I imported all the data we needed from NYC.gov – electrical consumption, trees planted, etc. – into a MySQL database and indexed it by zip code. Then Anton designed a front end, and I created some Django functionality to let users explore an individual zip code, or see a heat map and ranking of the top zip codes. It took about 15 hours to build.

What could make this idea turn into a real service?

Fresh, reliable data is the biggest problem for this service. We were using NYC OpenData’s datasets, such as “Electric Consumption by Zip Code 2010“. This data contains numerous duplicates and, at the time, was 2 years old. For this to work, we would have needed regularly refreshed (perhaps monthly) data. The site is currently offline to conserve processing power on my server, but if you want to take a look, tweet at me and I’ll re-enable it 🙂


The hackathon was featured on Channel 25, NYC Digital, TheNextWeb and in the Huffington Post.

Meeting Mayor Bloomberg