Create a Clickstream event collector in 10 minutes using Azure Logic App
By Julien Kervizic

I recently wrote about using clickstream event collectors, such as Snowplow or Divolte, to power deeper and more reliable analytics. It is, however, possible to create your own clickstream event collector in a few clicks using Microsoft’s cloud.

Components overview

Besides the tracking script, the Azure stack can handle all the functions of a clickstream collector with three components: Logic App, EventHub, and DataLake Storage.

1. Tracking Script: A tracking script is a piece of JavaScript that is downloaded by the browser and that tracks the different events and sends them to the clickstream collector.

2. Logic App: The role of the Azure Logic App is only to capture the incoming message and push it to an event hub “topic”.

3. EventHub: EventHub hosts the events for real-time processing. A specific setting, called data capture, exports the data to Azure Blob Storage or DataLake Storage.

4. DataLake Storage: Provides long-term storage for the data pushed to EventHub.

Once the data is in DataLake Storage, it is possible to query it with U-SQL in Azure Data Lake Analytics.

Benefits and drawbacks of the solution

There are a few pros and cons to using this type of serverless solution to capture clickstream data.

  • Pro: No need for a load balancer, autoscaling setup or “application maintenance”
  • Pro: Simple to set up
  • Pro: More control over where your data is captured than with GA360 (cloud hosting), which matters particularly for companies in which AWS and GCP aren’t welcome
  • Pro/Con: Pricing is per event, which can be cheap for low event volumes but can end up much more expensive than GA360 when a large amount of data is collected
  • Con: More difficult to test than pure code
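
The per-event pricing trade-off in the list above can be made concrete with a small calculation. The sketch below compares a per-execution cost against a flat fee and finds the event volume at which the two break even. Every number passed to these functions is a placeholder supplied by the reader; no actual Azure or GA360 price is assumed here, and the number of billed executions per event depends on how the logic app is built.

```python
def monthly_cost(events_per_month, price_per_execution, executions_per_event=2):
    """Cost of a per-execution pricing model for a given event volume.

    executions_per_event defaults to 2 on the assumption that each event
    triggers one request step and one send-to-event-hub step.
    """
    return events_per_month * executions_per_event * price_per_execution


def break_even_events(flat_monthly_fee, price_per_execution, executions_per_event=2):
    """Event volume at which per-execution pricing matches a flat monthly fee."""
    return flat_monthly_fee / (executions_per_event * price_per_execution)


# Usage with placeholder prices (not real vendor pricing):
# monthly_cost(events_per_month=5_000_000, price_per_execution=0.00001)
# break_even_events(flat_monthly_fee=10_000, price_per_execution=0.00001)
```

Below the break-even volume the per-event model is the cheaper option; above it, a flat-fee alternative wins.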

Setting up the clickstream collector

The clickstream collector can be set up in three steps: first setting up the storage layer (blob or DataLake), then setting up EventHub with data capture, and finally creating a logic app that sends events to the event hub.

1. The first step is to create a blob storage or a data lake storage. This is where the data will be hosted in the end.

2. The second step is to create an event hub with data capture turned on. This exports the ingested data to blob/data lake storage at specific (configurable) intervals.

3. The third step is to create the logic app. The logic app needs only two components: an HTTP request trigger that receives the event, and an action that sends it to the event hub.

The HTTP request logic app component provides an option for schema validation. The basic logic app event schema that we are using in this example is provided below:
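
As an illustration of what such a schema could contain, the sketch below writes a minimal event schema in JSON Schema form as a Python dictionary, together with a small hand-rolled check. The field names (`event_type`, `user_id`, `url`, `timestamp`) are assumptions chosen for the example, not the schema used in the original setup, and the check covers only required fields and top-level types.

```python
# Hypothetical event schema in JSON Schema form; the field names are
# illustrative assumptions, not the schema from the original example.
EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "event_type": {"type": "string"},
        "user_id": {"type": "string"},
        "url": {"type": "string"},
        "timestamp": {"type": "string"},
    },
    "required": ["event_type", "url", "timestamp"],
}

# Maps the JSON Schema type names used above to Python types.
_PY_TYPES = {"string": str, "object": dict}


def validation_errors(event, schema=EVENT_SCHEMA):
    """Return a list of problems; an empty list means the event passes.

    Only required fields and top-level types are checked, which is a small
    subset of what a full JSON Schema validator does.
    """
    if not isinstance(event, dict):
        return ["event is not an object"]
    errors = []
    for name in schema["required"]:
        if name not in event:
            errors.append(f"missing required field: {name}")
    for name, rule in schema["properties"].items():
        if name in event and not isinstance(event[name], _PY_TYPES[rule["type"]]):
            errors.append(f"wrong type for field: {name}")
    return errors
```

Serializing `EVENT_SCHEMA` with `json.dumps` gives the JSON text that could be supplied to the request component, and running `validation_errors` locally is one way to test events before sending them.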

Testing the flow

The following Python code can push data to the logic app, where the URI variable is the hosted HTTP endpoint provided by the logic app. Note that to be able to push to the event hub, the content needs to be encoded in base64.
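
A minimal sketch of such a test client, using only the standard library, could look like this. The `URI` value is a placeholder, and the `content` field name is an assumption that has to match whatever request schema the logic app is configured with.

```python
import base64
import json
import urllib.request

# Placeholder: replace with the HTTP POST URL shown by the logic app.
URI = "https://example.invalid/logic-app-endpoint"


def build_request(uri, event):
    """Build a POST request whose JSON body carries the event as base64."""
    encoded = base64.b64encode(json.dumps(event).encode("utf-8")).decode("ascii")
    # "content" is an assumed field name; it must match the request schema
    # configured in the logic app.
    body = json.dumps({"content": encoded}).encode("utf-8")
    return urllib.request.Request(
        uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event(uri, event):
    """Send the event and return the HTTP status code of the response."""
    with urllib.request.urlopen(build_request(uri, event)) as response:
        return response.status


# Usage, once URI points at a real endpoint:
# send_event(URI, {"event_type": "page_view", "url": "https://example.com/"})
```

Keeping `build_request` separate from `send_event` makes it possible to inspect the encoded payload without making a network call.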

A tracking script can be created in JavaScript in the same manner; an example JavaScript implementation is shown below.

If the logic app has been able to push the data to the event hub, the run should show as succeeded within the UI, as below:

After the data capture interval has elapsed, a new file should appear in the data storage.

Using an Avro reader, we can read the content of the file and see how the data is stored:
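
Once an Avro reader has turned the file into records, the event payload still has to be decoded. The helper below is a sketch that assumes each record is a dictionary with the payload stored as bytes under a `Body` key, which is how Event Hubs data capture lays out its Avro records. Whether the payload arrives as plain JSON or still as base64 text depends on how the logic app forwards it, so the function tries both; that fallback is an assumption, not something the original setup confirms.

```python
import base64
import binascii
import json


def decode_body(record):
    """Decode the Body of one captured record into a Python object.

    `record` is assumed to be a dict as yielded by an Avro reader, with the
    event payload stored as bytes under "Body".
    """
    raw = record["Body"]
    try:
        return json.loads(raw)
    except ValueError:
        # Assumption: the payload may still be base64 text if it was not
        # decoded before being written to the event hub.
        try:
            return json.loads(base64.b64decode(raw, validate=True))
        except (binascii.Error, ValueError) as exc:
            raise ValueError("Body is neither JSON nor base64-encoded JSON") from exc


# Usage with a record obtained from an Avro reader:
# event = decode_body(record)
```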

Making it production-ready …

There are a few things that can be done to make this production-ready:

  • Authorizing and restricting access: It might be worth restricting access to authorized senders only. Here is a tutorial on how to do this using Azure API Management.
  • Event format validation: Enabling event format validation can improve data quality by ensuring that only valid events are collected.
  • Domain hosting: Using a custom domain rather than the native Azure URL. This adds the flexibility to rewrite the different endpoints and redirect them to various resources.
  • ARM templates: Setting up the resources using ARM templates ensures that the configuration is consistent and that the resources are easy to recreate should they get deleted.

Wrap Up

The combination of Logic App, EventHub and DataLake Storage provides a quick way to deploy a collector and gather clickstream data. Some more work might still be required to make it production-ready, but by default Logic Apps handle availability, scalability, retry logic and logging.

One of the main factors to consider is the pricing model, which is per execution. Sites with a low number of visits would benefit price-wise from relying on this type of integration, but sites with a high volume of visits might want to look at a different approach.

© 2024 WiseAnalytics