Strategies to effectively propagate Master Data Merges
By Julien Kervizic

I have recently written about the approach to handling master data management matching and merging of records into master records. Leveraging a single source of truth (SSOT) is, however, a bit more complicated than just being able to create these records; it is also about being able to propagate the changes.

Hard Merge

A hard merge consolidates records in a way that does not allow the merged records to be disassociated afterwards. Propagating the changes coming from a hard merge versus a soft merge is easier in some respects and harder in others. There are quite a few distinct approaches to propagating the unification of records, from exposing it through an API to pushing actual updates.

Using an API

API Routing

Using an API for a hard merge can be quite straightforward: most of the work happens under the hood and is abstracted away from the other systems, provided they are not making calls to scrape the API incrementally.

An example of how that would be set up is shown above, where every GET request to any of the merged IDs points to the new, updated record.
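As a rough illustration, the routing behind such an API can boil down to resolving whichever id was requested to its surviving master record. The table and column names below (a products table with a merged_into column) are assumptions for the sketch, not a prescribed schema:

```sql
-- Resolve any requested id (1, 2 or 3) to the surviving master record.
SELECT COALESCE(p.merged_into, p.id) AS master_id
FROM   products p
WHERE  p.id = :requested_id;

-- The API then serves the record identified by master_id, so
-- GET /products/2 and GET /products/3 both return record 1.
```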

API Scrape

For an API that allows clients to scrape the changes to the data it contains, a different strategy needs to be applied. The minimum consists in providing an updated set of information that contains all the records that were part of the merge, along with a status indicating whether each record is still active. Better practice is to also provide information on whether a record has been merged onto a different id.

The example above is a mockup of what an API that allows scraping could look like. The API exposes a product catalog where the same product has been set up multiple times for different languages: the same product is represented with an English, a French, and a Dutch name, each under a different id. A merge strategy has been applied to consolidate the records representing the same product onto a master product, here the English version of the product.

The French (record 2) and Dutch (record 3) versions have been merged into the English version (record 1). Each record initially contained one similar product, and the records have been merged using an aggregation merging rule. Records 2 and 3 have in turn been de-activated and annotated with the record they have been merged onto (the English version, record 1).
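A sketch of the query that could back such a scrape endpoint, assuming a hypothetical products table that carries an active flag and a merged_into reference:

```sql
-- Every record touched since the last scrape, with its activity status and,
-- for de-activated records, the master id it has been merged onto.
SELECT id,
       name,
       active,        -- FALSE for records 2 and 3 after the merge
       merged_into,   -- 1 for records 2 and 3, NULL for the master record
       updated_at
FROM   products
WHERE  updated_at > :last_scrape_time
ORDER  BY updated_at;
```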

It is the responsibility of the client system, i.e., the one calling the API, to reflect the merge changes in its own system based on this information.

The activity status allows the client system to ignore records that are going stale and will no longer be updated.

The merge id gives the client system the information it needs to consolidate related events; in our example, this could be the case for product reviews, for instance. The client system would be responsible for aggregating these reviews onto the master product record.
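On the client side, consolidating such linked events could look like the following sketch, where product_reviews and its columns are assumptions:

```sql
-- Roll reviews up to the master product: reviews attached to the merged
-- ids (2 and 3) are counted against the master record (1).
SELECT COALESCE(p.merged_into, p.id) AS master_product_id,
       COUNT(r.id)                   AS review_count,
       AVG(r.rating)                 AS avg_rating
FROM   product_reviews r
JOIN   products p ON p.id = r.product_id
GROUP  BY COALESCE(p.merged_into, p.id);
```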

Data Feed Integration

There are different ways to propagate a hard merge as a data feed integration: from an active update, which goes back to old records and updates their id references; to a passive update, which pushes the information required to do a local merge; to a snapshot update, which provides a full replica of the different merged entities.

Active Update

In some cases, a system allows its records and their associated data to be actively updated. It is then the role of the integration pattern to update the different pieces of related data with the newly merged id.

Let’s have a look at what is happening in the example below. There are different customer records present in the database, each with separate transactions associated with their ids. IDs 2 and 3 are duplicate records of ID 1. Using an active integration with an external system, once the merge has occurred, the information gets updated in the other systems: the activity status of the merged records is set to inactive, and the ids of the transactions associated with these user ids are updated to the master record.

This type of integration often relies on the ability to perform an update or upsert operation keyed on an id (in this case, a transaction_id) other than the id you want to modify. This usually requires that this other id is set as a primary key in the external system you want to integrate with, and that this system allows this kind of update.

It is often a better pattern to provide a master_id field in these associated events rather than directly updating the record ids. This allows us to retain the original data while at the same time offering the merging functionality.
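A sketch of what both variants could look like on a hypothetical transactions table (the customers table and its master_id and active columns are likewise assumptions):

```sql
-- Variant 1: actively rewrite the foreign key on the linked transactions,
-- pointing the merged customer ids (2, 3) at the master record (1).
UPDATE transactions
SET    customer_id = 1
WHERE  customer_id IN (2, 3);

-- Variant 2 (often preferable): keep the original customer_id untouched and
-- only maintain a separate master_id column, preserving the source data.
UPDATE transactions
SET    master_id = 1
WHERE  customer_id IN (1, 2, 3);

-- In both cases, flag the duplicated customer records as inactive.
UPDATE customers
SET    active = FALSE
WHERE  id IN (2, 3);
```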

Passive Update

The passive update relies on an approach similar to the one taken for API scrapes. The main difference is that instead of the information being pulled from an API, it is pushed to the other systems. When already offering an incremental API and having taken the API scrape strategy for record propagation, it is a natural extension to provide a passive update feed through a webhook.

Snapshot Update

Another approach to dealing with record deduplication is to periodically provide a full snapshot of the merged dataset and clear any other records. This type of integration pattern, however, has some limitations:

  • Record routing: Merged records will not re-route to their master record. Think of the previous example where we merged three records of a “chair” into one: if this was pushed to an e-commerce website using a snapshot update, records 2 and 3 would result in either a 404 or a fairly generic redirect.
  • Linked entities: Snapshot updates don’t usually provide a way to tackle linked entities, such as product reviews.

One of the ways to mitigate these issues with snapshot updates is to consolidate, on the master record, a list of all the identities previously used for the record.

In the example below, “chaise” (ID 2) and “stoel” (ID 3) have been merged onto the record “chair” (ID 1), but this master record still holds a memory of the previously used ids.

Once the snapshot has been pushed, it is the role of the target system to leverage the information contained in the id_list to re-route and re-link the different entities.
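A rough, Postgres-style sketch of that re-routing, assuming the snapshot's id_list has already been unnested into a helper table snapshot_previous_ids(master_id, previous_id) and that the target keeps a redirects table:

```sql
-- Rebuild redirects so that the old ids (2, 3) route to the master record (1).
INSERT INTO redirects (from_id, to_id)
SELECT previous_id, master_id
FROM   snapshot_previous_ids
WHERE  previous_id <> master_id;

-- Re-link dependent entities, e.g. product reviews, to the master record.
UPDATE product_reviews r
SET    product_id = s.master_id
FROM   snapshot_previous_ids s
WHERE  r.product_id = s.previous_id;
```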

Considerations

To get a sense of which integration pattern to use for hard merge propagation, the first criterion to check is the system landscape and what the different systems can support. If a system does not support the update of records, for instance through an API call, the active update strategy would be somewhat challenging to implement.

Another consideration is the overall storage needs and transfer size of the different approaches. The snapshot strategy, for instance, can require a decent amount of storage in the different systems it integrates with, and usually involves some clearing of previously sent snapshots. Sending a full snapshot also requires a certain amount of data transfer, which might run into API throttling limits for large volumes of data.

System responsibility is another factor to take into consideration when looking at which strategy to apply. While, in most cases, it is the original merging system that is responsible for applying the merging and authority strategies, it can be the case that some of the data needs to be merged locally. In that particular case, it is vital to ensure that the merging algorithm used in the target system matches the one used for the original merging system.

Soft Merge

Soft merging, in contrast to hard merging, allows us to keep the original records and to remove the associations between records as needed. Propagating the changes to both the associations and the merged entity records requires a different approach than for hard merges.

API

API Routing

Similarly to the hard merge strategy, in the case of an API routing call, everything works under the hood. The client can extract the different values related to a given id, but is not able to scrape the full dataset. In this context, it is the role of the API system to apply the different merging and authority strategies to aggregate the records.

The diagram above shows how this specific setup could work. When an API call is made to any of the records to retrieve information, a check is performed against an internal association table to determine which master id the record belongs to. A query is then performed to extract all the records having the same master id. An authority strategy is applied to the result set, and the API responds with the merged data.
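A minimal sketch of that lookup and aggregation, with associations and customers as assumed table names:

```sql
-- Step 1: resolve the requested id to its master id via the association table.
SELECT master_id
FROM   associations
WHERE  record_id = :requested_id;

-- Step 2: pull every record attached to that master id; a simple
-- "latest update wins" authority strategy can then be applied over
-- the result set before the API responds with the merged data.
SELECT c.*
FROM   customers c
JOIN   associations a ON a.record_id = c.id
WHERE  a.master_id = :master_id
ORDER  BY c.updated_at DESC;
```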

API Scrape Incremental Update

In an API scrape incremental update, there are a few data points that should be provided (a sketch of the corresponding feeds follows the list):

1. The original records: The original records need to be provided if there is an expectation that the target system handles the application of the authority strategy itself.

2. The association records: An association table mapping every original record to the master record needs to be provided. The association table enables routing from merged ids to the master record and allows for the application of the authority strategy.

3. Merged master records: A merged record that has already “materialized” the authority strategy needs to be provided if there is no expectation that the target system applies the authority strategy.
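A sketch of the three feeds such an endpoint could expose, assuming hypothetical customers, associations, and master_customers tables keyed on an updated_at timestamp:

```sql
-- 1. Original records: every source record changed since the last scrape.
SELECT id, name, email, updated_at
FROM   customers
WHERE  updated_at > :last_scrape_time;

-- 2. Association records: the mapping from each original id to its master id.
SELECT record_id, master_id, updated_at
FROM   associations
WHERE  updated_at > :last_scrape_time;

-- 3. Merged master records: the materialized result of the authority
--    strategy, for targets that are not expected to apply it themselves.
SELECT master_id, name, email, updated_at
FROM   master_customers
WHERE  updated_at > :last_scrape_time;
```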

Data Feed Integration

Association Table

For a hard merge propagation, applying an authority strategy is relatively trivial, and the more difficult issues to tackle revolve around record routing and entity linking. For a soft merge, routing and entity linking can usually be handled by providing an update of the association table, either through an incremental update (pictured below) or by providing a full snapshot.

The more complicated part for soft merge associations is the application of the authority strategy on the merged records.

One way to deal with this is to provide both the original record and the “authoritative” master record at the same time. In this case, the record is only fully soft-merged in the source system, and changes in the association result in an update to both the association table and the affected master records (master id 1 in this example).

Another approach is to reconstruct the same authority strategy on each set of records associated with the same master id. This is something that might be practical when dealing with a reasonably simple authority strategy (e.g., a time-bound strategy applied at record level) and when the target system lets you implement that kind of business logic.
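A sketch of that reconstruction for a time-bound, record-level authority strategy, assuming customers and associations tables exist in the target system:

```sql
-- For each master id, the most recently updated member record wins.
SELECT master_id, name, email
FROM (
    SELECT a.master_id,
           c.name,
           c.email,
           ROW_NUMBER() OVER (PARTITION BY a.master_id
                              ORDER BY c.updated_at DESC) AS rn
    FROM   customers c
    JOIN   associations a ON a.record_id = c.id
) ranked
WHERE rn = 1;
```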

Association with filtering

Association with filtering can add an extra layer of complexity concerning both the calculation of the Master Record and the association of linked entities.

Master record needs to be resolved locally: In our example below, we are running a small sports apps business and have a user signed up for three of our apps: Frisbee Fan, Football Freak, and Hoops & Hoops. “Hoops & Hoops” was our first app, and we initially didn’t think we would need to merge user profiles until we came up with our other app, “Football Freak.” When Football Freak came out, we revised our terms of service to a v2, which allows us to merge user profiles.

In this example, only records with an agreed ToS version ≥ 2 can be merged. Yet we would like to provide the fully combined record (all three identities) as soon as John logs onto Hoops & Hoops and agrees to the new terms of service.

To achieve this, we need to provide the target system with:

1. The association table

2. For each entry in the association table, the specific filter conditions to be used

3. The authority strategy logic that needs to be applied

Using this set of information, it is possible to re-apply the merging logic in the target system.
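A sketch of that re-application, assuming the association table carries the per-entry filter condition as a minimum ToS version and user_profiles holds the version each profile agreed to:

```sql
-- Hypothetical schema: associations(record_id, master_id, min_tos_version),
-- user_profiles(id, name, email, agreed_tos_version, updated_at).
WITH valid_associations AS (
    -- Keep only the associations whose filter condition currently passes.
    SELECT a.record_id, a.master_id
    FROM   associations a
    JOIN   user_profiles u ON u.id = a.record_id
    WHERE  u.agreed_tos_version >= a.min_tos_version
)
-- Re-apply the authority strategy over the valid associations only
-- (here again, the most recently updated profile wins).
SELECT master_id, name, email
FROM (
    SELECT v.master_id, u.name, u.email,
           ROW_NUMBER() OVER (PARTITION BY v.master_id
                              ORDER BY u.updated_at DESC) AS rn
    FROM   user_profiles u
    JOIN   valid_associations v ON v.record_id = u.id
) ranked
WHERE  rn = 1;
```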

Master record does not need to be resolved locally: In case the master record does not need to be resolved locally, we can use the same approach as for the normal association table integration with master records. The specific associations with filtering conditions can be collapsed and summarized into valid associations (i.e., those which passed the filtering conditions).

The example above shows how the original records have been kept intact, while pre-calculated merged master records have also been provided for “John Smith” (ID 4) and “John Andrew Smith” (ID 6).

Linked entities: In the case of linked entities, we need to consider the fact that we are merging one profile onto another.

The master record should keep all its associated entities, while the entities linked to a record that is being soft merged should pass a filtering condition. This can be the case, for instance, when legislation or the terms of service only allow leveraging data points created after the association of the profiles.

Below is just such an example. John, our user of the Frisbee Fan and Football Freak apps, has agreed to the v2 terms of service at time 12. His master record is ID 4, the one he uses for Frisbee Fan. Because this is the primary identity, we can leverage all the event data linked to it (events A, B, and C). We are furthermore able to leverage the event data linked to his associated profile, ID 5 (Football Freak), that passes the filtering condition (events E and F).

An example of how the soft merged linked entities could be computed in SQL is shown below:
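One possible shape for it, assuming hypothetical events, associations, and user_profiles tables, with ID 4 as the master profile and the v2 ToS agreement time stored as tos_agreed_at:

```sql
-- Events linked to the master record (ID 4) are kept unconditionally.
SELECT e.*
FROM   events e
WHERE  e.profile_id = 4

UNION ALL

-- Events linked to soft-merged profiles (ID 5) only count when they occur
-- after that profile agreed to the v2 terms of service (time 12).
SELECT e.*
FROM   events e
JOIN   associations  a ON a.record_id = e.profile_id
                      AND a.master_id = 4
JOIN   user_profiles u ON u.id = e.profile_id
WHERE  e.profile_id <> 4
  AND  u.agreed_tos_version >= 2
  AND  e.event_time >= u.tos_agreed_at;
```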

Considerations

When looking at propagating soft merges, there are a few factors to consider: resource utilization, processing speed, and the advantages and disadvantages of deferring part of the unification logic to the other systems.

Resource utilization: Calculating the master entities can be quite intensive when dealing with transient identities (e.g., session ids); it is important to understand what type of load you would be putting on the system.

Speed: There are two aspects to consider in terms of speed: the processing time to calculate the merged profile, and the time it takes to re-calculate the combined profile after a disassociation.

Logic deferral: Different approaches defer a different amount of logic to the external systems. When integrating with them, it is worth checking whether the systems can apply the strategies needed to unify the records and whether it would be desirable for the systems to do so.

Summary

Propagating record merge changes is a complicated affair that depends on the chosen merging and authority strategies, the target systems’ capabilities, and the overall vision of how the different systems should interact.

There is more than one way to approach the problem, and identifying which approach is appropriate requires an intimate knowledge of the landscape, data, and strategies used for unification.
