Clickstream event collectors are applications that let you collect raw clickstream data from the front-end of an application. There are multiple reasons to rely on these event collectors, and setting them up isn’t that complex.
There is both an overlap and a complementary value between these solutions. Snowplow and Divolte are event collectors: they collect the raw data needed for deep analysis. Raw data can be used to increase the depth of analysis; it can, for instance, be leveraged for conversion rate optimization (CRO), where it allows for a granular analysis of the different customer touch points using click path analysis.
Exporting raw clickstream data is a feature offered in Google Analytics 360, but not in the free version. If you are only looking at acquiring the 360 version for this feature, going the Snowplow route might turn out to be more cost-effective.
The second complementary value lies in using them to potentially bypass some of the tracking restrictions of ad blockers. Since these are open-source solutions that need to be self-hosted, you can easily set them up on your own domain.
Setting them up on your own domain allows you to bypass domain blacklists, and modifying the tracking script, or serving a different one, lets it avoid checksum detection. Further customizing the tracking script’s name allows it to get past ad blockers that look for “track” or “analytics” in script names.
These solutions allow the data to be ingested as a data stream, and by creating a small application it is possible to push this data back to Google Analytics or other analytics tools.
The clickstream collectors’ ability to push data to a message broker such as Kafka allows them to be included as an integral part of an application. The types of applications that could rely on this data range from real-time reporting and real-time marketing triggers to real-time (prediction) model evaluation.
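As an illustration, a small consumer application could read the collected events from the broker and replay page views into Google Analytics. The sketch below is only illustrative: it assumes the kafka-python client and the legacy Universal Analytics Measurement Protocol, and the topic name, property ID and event fields are made up for the example.

```python
import json
import requests
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic name and GA property ID, adjust to your own setup
TOPIC = "clickstream-events"
GA_PROPERTY_ID = "UA-XXXXXXX-1"

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event.get("event_type") != "pageview":
        continue
    # Replay the page view through the (legacy) Measurement Protocol
    requests.post(
        "https://www.google-analytics.com/collect",
        data={
            "v": "1",                   # protocol version
            "tid": GA_PROPERTY_ID,      # tracking/property ID
            "cid": event["client_id"],  # anonymous client identifier
            "t": "pageview",            # hit type
            "dp": event["page_path"],   # document path
        },
        timeout=5,
    )
```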
Tracking Script: The tracking script’s role is to capture the different actions performed by users browsing the website and to push these events to the clickstream collector API.
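The tracking script itself is normally a small piece of JavaScript, but the payload it emits is easy to mimic. The snippet below, a sketch with a made-up collector URL and event schema, simulates the kind of event a tracking script would POST, which is handy for testing the collector without a browser.

```python
import time
import uuid
import requests

# Hypothetical collector endpoint and event fields, for testing only
COLLECTOR_URL = "https://collector.example.com/events"

event = {
    "event_id": str(uuid.uuid4()),
    "event_type": "pageview",
    "client_id": "test-client-123",
    "page_path": "/products/blue-widget",
    "referrer": "https://www.google.com/",
    "timestamp": int(time.time() * 1000),
}

response = requests.post(COLLECTOR_URL, json=event, timeout=5)
response.raise_for_status()
```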
Clickstream collector API: A clickstream collector API is merely a receiving endpoint that might perform 1) request authorization and 2) schema validation before pushing the data to a message broker for ingestion.
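Snowplow and Divolte ship their own collectors, but a minimal sketch of such an endpoint helps show what happens behind it. The example below assumes Flask, the jsonschema library and kafka-python, with a deliberately simplified schema and illustrative topic name.

```python
import json
from flask import Flask, request, jsonify
from jsonschema import validate, ValidationError
from kafka import KafkaProducer

app = Flask(__name__)

# Minimal event schema; a real collector would enforce a much stricter one
EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "event_type": {"type": "string"},
        "client_id": {"type": "string"},
        "page_path": {"type": "string"},
    },
    "required": ["event_type", "client_id"],
}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

@app.route("/events", methods=["POST"])
def collect():
    event = request.get_json(force=True)
    try:
        validate(instance=event, schema=EVENT_SCHEMA)  # schema validation
    except ValidationError as err:
        return jsonify({"error": err.message}), 400
    producer.send("clickstream-events", value=event)   # hand off to the broker
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```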
Message broker: A message broker allows for the asynchronous processing of the data. One of the most popular message brokers for this kind of data is Apache Kafka. Applications can directly consume the data stream to compute real-time aggregates or to filter the stream.
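Filtering the stream can be as simple as a small consumer/producer pair; the sketch below assumes kafka-python and keeps only purchase events, republishing them to a dedicated topic (topic names are placeholders).

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Keep only purchase events and publish them to their own topic
for message in consumer:
    event = message.value
    if event.get("event_type") == "purchase":
        producer.send("purchase-events", value=event)
```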
Data Sink: A data sink takes the incoming data from the message broker and pushes it to the storage layer. This is usually an S3 bucket on AWS, a Data Lake Storage account on Azure, or plain HDFS files.
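In practice an off-the-shelf connector (Kafka Connect, for instance) would handle this, but a naive sink is easy to picture. The sketch below buffers events from Kafka and writes them to S3 as newline-delimited JSON using boto3; the bucket, topic and batch size are placeholders.

```python
import json
import time
import boto3
from kafka import KafkaConsumer

BUCKET = "my-clickstream-bucket"  # placeholder bucket name
BATCH_SIZE = 500

s3 = boto3.client("s3")
consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= BATCH_SIZE:
        # Write the batch as newline-delimited JSON, partitioned by date
        key = f"clickstream/{time.strftime('%Y/%m/%d')}/{int(time.time())}.json"
        body = "\n".join(json.dumps(event) for event in batch)
        s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
        batch = []
```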
Storage Layer: The storage layer provides long-term storage for the incoming data. Most compute engines in the Hadoop ecosystem, such as Presto or Spark, and their cloud equivalents, such as AWS Athena, are able to query files on bucket storage.
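Once the files land on the storage layer they can be queried directly. The sketch below uses PySpark and assumes the cluster is already configured with S3 access and that the sink above wrote newline-delimited JSON under the placeholder prefix shown.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream-exploration").getOrCreate()

# Read the newline-delimited JSON files written by the data sink
events = spark.read.json("s3a://my-clickstream-bucket/clickstream/")
events.createOrReplaceTempView("events")

# Example: most viewed pages
top_pages = spark.sql("""
    SELECT page_path, COUNT(*) AS views
    FROM events
    WHERE event_type = 'pageview'
    GROUP BY page_path
    ORDER BY views DESC
    LIMIT 20
""")
top_pages.show()
```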
At their core, these solutions can be set up as containerized applications; both Snowplow and Divolte provide Docker images. These can be deployed on a VM, on a container service such as Azure Container Instances, or on Kubernetes or Docker Swarm. The containers should sit behind a load balancer for autoscaling and interact with a message queue. The docker-compose file provided by Divolte, for example, bootstraps a Kafka instance.
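For a quick local trial, the collector container can be started with a couple of commands. The sketch below uses the Docker SDK for Python and assumes the divolte/divolte-collector image and its default port 8290; verify both against the project’s documentation.

```python
import docker  # pip install docker

client = docker.from_env()

# Pull and run the collector container, exposing its HTTP endpoint locally.
# Image name and port are assumptions; check the Divolte documentation.
container = client.containers.run(
    "divolte/divolte-collector",
    detach=True,
    ports={"8290/tcp": 8290},
    name="divolte-collector",
)
print(container.status)
```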
From a front-end perspective, there are different methods for implementing these clickstream loggers.
Using a clickstream event collector can bring a lot of benefits in terms of how your data is tracked, the granularity of the data available, and the ability to make the data available in real time to applications. There are open-source solutions for this that can easily be deployed as Docker containers.