STOP Losing Your Historical Google Analytics Data When Moving to a Hit/Event Data Warehouse!

Google Analytics data collected before Segment was installed isn’t available – what can you do?

Are you hesitant about using Segment because you will lose your Google Analytics data when you move your reporting to Redshift, BigQuery, etc.?
DON’T BE – let me tell you why…

The idea for this post began when I researched Segment’s available data sources.
I was wondering about the use cases where Segment is implemented to gather data. People must either run an overlap period, using both Segment and e.g. Google Analytics at the same time, or they’re simply losing their historical data. Both scenarios would bother me!

So, I browsed through Segment’s available data sources in order to find a Google Analytics source connector – allowing me to get started with the data already collected in my Google Analytics property. To my surprise, no such source exists in Segment.

At least 273 people have up-voted this source as a desired new feature. That alone is reason enough for me to write this post. 🙂


I know the text says “To find or request new Destination, use…” – but the results are actually for Sources, since Segment does offer Google Analytics as a destination. And note that the highlighted tab says “Sources”. Now that we have established that I’m sane, let’s move on. 🙂

Why not push the old data through Segment if I can?

What do I mean by “old data”, or historical data? Here is my benchmark:
I mean data that was already collected into your Google Analytics property before you decided to introduce Segment to your tool stack.
Is it possible to fetch this data and push it to Segment, you ask? Yes – that’s exactly what I’m telling you.

Multiple scenarios come to mind, but I’ll focus on these two:
Scenario 1: I’m moving away from Google Analytics to some other tool – using Segment – and want to bring my historical data with me.
Scenario 2: I want my Google Analytics/tracking data in a data warehouse, but I also need the data collected before I introduced Segment to my stack.

Raw data – NOT aggregates

To solve either of the two scenarios you need to get your hands on the original (or similar) hits sent to Google Analytics.
This means we need the data as a fact table – one row per hit – just like in this screenshot.

The above screenshot is taken from the Google BigQuery console. The data is produced using Scitylana’s solution.
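To make the fact-table idea concrete, here is what one row could look like as a plain record. The field names are my own hypothetical illustration, not Scitylana’s actual schema – the point is simply one record per hit, with nothing pre-aggregated.

```javascript
// One raw hit as a plain record. Field names are hypothetical,
// not Scitylana's actual schema; the key property is that nothing
// is aggregated – every hit is its own row in the fact table.
const exampleHit = {
  clientId: '1547831694.1583932800', // the GA client id identifying the visitor
  hitTimestamp: '2020-03-11T13:20:00Z', // when this individual hit occurred
  hitType: 'PAGEVIEW',
  pagePath: '/pricing',
  trafficSource: 'google',
};
```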

Scitylana + AWS S3 + Segment

Life as a data engineer or analyst today is more or less like being a LEGO builder. The bricks needed to make Google Analytics data available as a data source in Segment are the following.

  1. Scitylana
    • can move raw hit/event Google Analytics data to AWS S3
  2. Segment
    • can ingest AWS S3 data files
  3. AWS S3
    • can trigger an AWS Lambda function on specific events in your S3 bucket.

Since I will provide a ready-made AWS Lambda function to accomplish steps 2 and 3, the task is all about stacking the bricks in the right order.
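To make the stacking concrete, here is a minimal sketch of the two pure pieces such a Lambda function needs: pulling the bucket/key out of the S3 event, and mapping one raw hit to a Segment track payload. This is my own illustration, not the contents of the ready-made .zip, and the hit field names are assumptions.

```javascript
// Sketch only – not the code inside the ready-made .zip;
// hit field names are assumed for illustration.

// Step 1: pull bucket and key out of the S3 "object created" event.
function objectFromEvent(event) {
  const rec = event.Records[0].s3;
  return {
    bucket: rec.bucket.name,
    // S3 URL-encodes object keys in event notifications.
    key: decodeURIComponent(rec.object.key.replace(/\+/g, ' ')),
  };
}

// Step 2: map one raw GA hit row to a Segment track payload.
function toTrackPayload(hit) {
  return {
    anonymousId: hit.clientId,
    event: hit.hitType === 'PAGEVIEW' ? 'Page Viewed' : hit.eventAction,
    timestamp: hit.hitTimestamp,
    properties: { path: hit.pagePath, source: hit.trafficSource },
  };
}
```

A real handler would then stream the new object from S3 line by line and POST each payload to Segment’s HTTP tracking API, batching where possible.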

Setup

I’ll walk you through the steps involved. Before we start you need the following at hand:

  1. A new AWS S3 bucket
  2. A Scitylana account (just create a trial)
  3. A Segment account

Set up a bucket in S3 to receive data

Create a new empty S3 bucket to receive the Google Analytics data from Scitylana.
In this post I’m using the name scitylana-segment-blogpost.
The name is important, since Scitylana will be granted read and write access to this bucket.

Set up an Amazon S3 source in Segment

Add S3 Source

  1. Go to Segment.com
  2. Login
  3. Click Sources
  4. Click Add Source
  5. Search for S3
  6. Select Amazon S3

Get the Write Key

  1. Click Connect
  2. Click Add Source
  3. Copy the Write Key for later use

Set up the AWS Lambda function

  1. Go to AWS Lambda functions
  2. Click Create function

Basic function settings

  1. Name the function scitylana-segment
  2. Choose Node.js 12.x runtime
  3. We are using the default setting for the execution role. We will add S3 read access to the role later.
  4. Click Create function

Trigger on new objects in S3

  1. Select S3 in the list of triggers
  2. Choose the bucket we created in a previous step. In my case, scitylana-segment-blogpost
  3. Select All object create events
  4. Set Suffix to .txt
  5. Click Add
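The .txt suffix means the function only fires for Scitylana’s data files. Even so, a handler can cheaply double-check the keys it receives – a defensive pattern I’m sketching here, not something the trigger requires:

```javascript
// Defensive check: only proceed if every key in the event is a .txt
// data file (the S3 trigger suffix should already guarantee this).
function isDataFileEvent(event) {
  return event.Records.every((rec) => rec.s3.object.key.endsWith('.txt'));
}
```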

Upload code

  1. Download lambda .zip file
  2. Click Upload
  3. Select aws_scitylana2segment_lambda.zip

Function timeout

  1. Set Timeout to 3 min 0 sec

Update Role with S3 permission

  1. Click the View the scitylana-segm… link; this opens a new tab with IAM
  2. Click Attach policies
  3. Type S3 in the search box, as in the screenshot
  4. Check AmazonS3ReadOnlyAccess
  5. Click Attach policy
  6. Go back to the Lambda function tab

Segment write key

  1. Enter write_key as the key name
  2. Paste the Write Key from the Segment section above as the value

Finish up

  1. Scroll to the top of the page
  2. Click Save

Use Scitylana to send Google Analytics data to S3

Move Google Analytics data to the bucket to trigger the AWS function

We use Scitylana to extract data from Google Analytics, process it into a cleaner, more streamlined raw dataset, and finally load it into S3.
First, set up a free Scitylana account.

Set up Scitylana

  1. Go to the front page, enter your email and click “Get your data now”
  2. Create account

Set up the Source

  1. Click Connect, then click the green button
  2. Follow the on-screen Google instructions
  3. When you are redirected back to Scitylana, select your Google Analytics property
  4. Select a View
  5. The green text indicates that the Google Analytics view is compatible and that you shouldn’t run into API quota issues when Scitylana generates the data set.

Set up the Destination

  1. Click Amazon S3 as the destination
  2. Follow steps 1, 2 and 3 in the UI
  3. Click Save
  4. When the first day of data has been extracted, you will get an email notification
  5. Roughly 30 minutes later (this varies a lot, depending on how much data you have in your Google Analytics property) you will get another email telling you that data has arrived in your S3 bucket

That’s that

The Scitylana trial includes 30 days of historical data plus a daily refresh for 14 days. The 30 days will be sent, day by day, to Segment; then the process stops.
You can buy more data – just write us in the chat and we will help you.

Segment has a free tier, which isn’t quite enough in the long run.
S3 is very cheap and has a big free tier that will take you far.

Thanks for hanging in and reading the full post. I hope it was useful.
All feedback is very welcome!
