
How we attribute signups at

Attributing 800M clicks to 600K users

At we use Segment to power our app usage analytics. Every action in both our app and on our site can be logged by making a call with an event name such as “signed-up” or “viewed-video”.

This data is essential for our A/B testing and analytics. Combined with Segment’s great integrations with services like Clearbit, we can enrich profiles with company information and trace the first time a user accessed any of our services. This lets us determine which blog authors drive the highest-LTV customers, which prospect audience on YouTube generates the least sticky users, and so on.

We leverage a variety of channels for growth — drip email campaigns, Facebook, YouTube, Twitter, Google AdWords — and they all have separate platforms, some of which don’t sync into our data warehouse. We needed a system to homogenize this information so we can analyze client, channel, campaign and medium all in one place.

Attribution is really about having a 360-degree view of a company’s market fit, and its ability to succinctly explain its service to the correct audience.

Planning an attribution system

Conceptually, attribution is quite simple — just find the first result in your database for the user’s ID and extract a UTM, right? However, there are a ton of considerations before you can get to that single row of data that shows where a user came from:

  • Even with great indexing, searching for thousands of users per day in millions of rows, and maintaining the partitions and indexes behind those rows can be a big load on the database.
  • When users first came to the site, they didn’t have user_ids — but rather anonymous_ids assigned by Segment. A user ID is only associated with a user once the user creates an account and the “identify” call is made with both an anonymous_id and a user_id. Sometimes we spot a user’s anonymous_id from over a year ago. Once we have a user’s global ID (assigned by our system), we have to backfill all of the anonymous sessions from before that ID was assigned.
  • We have to sessionize every set of user actions and aggregate attributes about their first/last session before signup. Sessionization is the process of aggregating discrete user events into runs of page loads with no more than a 20 minute gap in between. This enables us to do things like analyze the overall intent of a visit and attribute it in bulk. Re-sessionizing involves huge nested window functions and the alias table, as often the same user has several historical anonymous_ids to merge.
  • We will want to maintain some quick lookup tables for this to be useful — user session counts, useful sessions, maybe some aggregates to see how we are doing in a visualization tool like Looker.
  • Note also that Segment actually has a native attribution system with a snippet you can install; but it can’t look back to before you installed it — and we have data going back to 2014.
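The 20-minute-gap rule described above can be sketched in a few lines of Python (field names and the tuple shape here are illustrative, not our actual schema):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=20)

def sessionize(events):
    """Group one user's (timestamp, page) events into sessions:
    a new session starts whenever the gap since the previous
    event exceeds 20 minutes."""
    sessions, current, last_ts = [], [], None
    for ts, page in sorted(events):
        if last_ts is not None and ts - last_ts > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append((ts, page))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

events = [
    (datetime(2019, 1, 1, 9, 0), "/"),
    (datetime(2019, 1, 1, 9, 5), "/pricing"),   # 5 min gap: same session
    (datetime(2019, 1, 1, 10, 0), "/signup"),   # 55 min gap: new session
]
# sessionize(events) yields two sessions: two page loads, then one.
```

In production this logic lives in SQL as the nested window functions mentioned above, running over the monthly partitions rather than in Python.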

Implementation Stages

The system design settled on the following query groups/stages:

  1. Find the latest timestamp in the existing sessionization data
    1. Get all the latest visit data from the Segment tracks table
    2. Merge this data with the most recent monthly partition, with some code to check that the columns are still the same (new fields could have been added).
  2. Sessionize the most recent partition (there could be overlap from the last 24 hours, so make sure the gap is covered)
    1. Update the aliases table with any newly identified anon/user pairs.
    2. Go back through all historical sessions and re-alias — i.e. swap the anonymous_id for any newly identified user_ids.
    3. Re-sessionize all historical sessions, since adjacent sessions may historically have had different anonymous_ids but now share the same user_id
  3. Add an extra column to indicate if the user signed up in a session
    1. Merge the resulting sessions tables into a single all_sessions table
    2. Create a lookup table for useful sessions — that is the first session, and the signup session only per user
  4. Create a useful attribution table with things like UTM parameters pre-extracted and session counts
  5. Re-index any tables and check that the partitions are in order (especially over a new-month gap).

Table structure


  • aliasing
    • known_pairs
  • sessions
    • attribution
    • all_sessions
    • useful_sessions
  • sessions_stages
    • all_tracks_2018_01
    • all_tracks_2018_01_aliased
    • all_tracks_2018_01_re_merged
    • all_tracks_2018_01_signup
    • …etc
  • partitioning
    • app_tracks_2018_01
    • app_tracks_2018_02
    • app_tracks_2018_03
    • …etc

Columns in the final sessions.attribution table (many are just quick lookups for easy aggregation, and the table is heavily indexed):

  • alias_id
  • session_count_before_signup
  • signup_timestamp
  • blog_url
  • first_page_ever_visited
  • first touch (i.e. first interaction with any property)
    • timestamp
    • source, campaign, medium, referrer
  • last touch (i.e. last interaction before signup)
    • timestamp
    • source, campaign, medium, referrer
  • first attributed touch (as above, but with paid referrer)
    • timestamp
    • source, campaign, medium, referrer
  • last attributed touch (as above, but with paid referrer)
    • timestamp
    • source, campaign, medium, referrer
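Given a user’s sessions ordered by timestamp, the four touch groups above reduce to simple selections. A sketch, assuming each session row carries a precomputed boolean flag (here called `paid`) marking whether it arrived via a paid referrer:

```python
def touch_columns(sessions):
    """sessions: the user's sessions ordered by timestamp, each a dict
    with timestamp/source/campaign/medium/referrer plus a 'paid' flag.
    Returns the four touch rows for the attribution table."""
    attributed = [s for s in sessions if s.get("paid")]
    return {
        "first_touch": sessions[0],
        "last_touch": sessions[-1],
        "first_attributed_touch": attributed[0] if attributed else None,
        "last_attributed_touch": attributed[-1] if attributed else None,
    }
```

For a user whose sessions came from a blog post, then an AdWords click, then a direct visit, the first touch is the blog post, the last touch is the direct visit, and both attributed touches are the AdWords click.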

Deployment and monitoring

The system took about 1.5 months to get working well, and is about 800 lines of code (Python, SQL, Shell and YAML).

Cloud computing ecosystems such as AWS, Google Cloud and Azure make deploying microservices like this straightforward. The code is Dockerized (on Alpine) and deployed on an ECS Fargate cluster. New builds are processed by CodePipeline, which monitors the attribution GitHub repo. The underlying database is Aurora Postgres.

This system takes about 4 hours to run every morning and can run into difficulties such as slowness or unavailability in the underlying tables. To mitigate this, each query retries for up to 2 hours.
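The retry behaviour can be sketched as a wrapper that keeps re-running a query until it succeeds or the time budget is exhausted (the retry interval and catch-all exception here are illustrative choices, not our exact configuration):

```python
import time

def run_with_retry(query_fn, budget_seconds=2 * 60 * 60, wait_seconds=60):
    """Re-run query_fn until it succeeds or the budget runs out;
    re-raise the last error so monitoring can page us."""
    deadline = time.monotonic() + budget_seconds
    while True:
        try:
            return query_fn()
        except Exception:
            # Give up (and let the alert fire) if another wait
            # would blow past the 2-hour deadline.
            if time.monotonic() + wait_seconds > deadline:
                raise
            time.sleep(wait_seconds)
```

A query that fails twice with a transient "table busy" error and then succeeds returns normally; one that keeps failing past the budget raises, which is what triggers the page described below.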

Performance is logged in AWS CloudWatch and synced to Datadog. If any query fails outright after retries, a PagerDuty alert is sent to my phone.

The results

  • Cool stats — 71% of our users signed up in their first session!
  • 800 million actions have been taken on (the combined row count of our Segment tracks table partitions 🤯), and we now have over 600,000 users.
  • We no longer have to dig into specific platforms (e.g. Facebook Business, Google Analytics) but instead have all our analysis in one place and can visualize it in Looker!
  • We can easily calculate the average LTV per campaign or channel with a simple SQL group-by.
  • We can gauge the stickiness of a source — e.g. do our YouTube ads give us customers who last for a long time or quickly churn?
  • As a gut-check, we ask users to report where they heard about (50% direct, 14% YouTube, 10% Facebook), and these numbers match up perfectly with our “First Touch” column.

Where do we go from here?

  • This attribution system is only the piping behind the actual analytics. It gets harder when we have to make more philosophical decisions. How can we decide how to attribute “multi-touch” users? E.g. if a user has a first-touch on one blog article and a last touch several months later on a different one, then who gets the credit?
  • We are closely monitoring the performance right now, and it runs in about 4 hours. However our traffic is increasing by a huge amount every month — how much longer will this last? Thankfully the system is easily parallelizable apart from the last stages.

With thanks to Kyle Gesuelli and Ammon Brown for proofreading, edits and suggestions. I originally posted on the blog at: