Synthetic data refers to data that is algorithmically generated. It is often used as a stand-in to test the behavior of systems when real data is not available, to validate mathematical models, and sometimes to train machine learning models. It turns out that synthetic data can also be used to bring a level of control and drive decision making.
Cost data from our logistic service providers was only provided after a full month had closed, and sometimes with a further delay for certain cost elements such as outbound shipments. The data itself was usually provided as a multi-tab Excel file in which costs were mixed across different projects and fixed costs were mixed with variable costs.
Getting a clear understanding of our cost structure and marginal costs was difficult, to say the least, and we needed visibility for a growing number of reasons: pricing decisions, financial control, performance measurement, and tracking against our operational plan. The decision was made that we needed to virtualize our logistic costs.
Deep diving into our invoices, we could see that the rate card structures differed considerably among the four logistic service providers (LSPs) we were using. What also stood out was that not only did we lack visibility on the costs, we also lacked direct visibility into some of the underlying variables impacting those costs, i.e. the cost drivers.
Invoices from LSPs could be split into five main categories, each with its own set of cost drivers:
We were pushing data related to orders and items, and were getting charged in parcels and daily cubic meters. We needed to virtualize not only our cost structure but also some of its underlying variables. Ultimately the cost drivers could sometimes be even more complex variables, such as the number of unique items per box beyond the first one.
We needed a way to compensate for the gaps in our data and to handle the variety of cost drivers used within our logistical landscape. The first step was to build a data pipeline out of the available information:
Each step of the data pipeline would augment the information provided by the previous step.
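As a very rough sketch, the shape of that pipeline could be outlined as below; the step names are placeholders for the stages described in the rest of this post, not the actual implementation.

```python
# Placeholder outline of the pipeline stages; each step enriches the output of the previous one.
def aggregate_orders(raw_orders):
    """Consolidate the raw order/item feed into one record per order."""
    ...

def add_virtualized_variables(orders):
    """Derive variables we are billed on but cannot observe directly, e.g. number of shipments."""
    ...

def derive_cost_drivers(orders):
    """Decompose the virtualized variables into the billable units used by each LSP rate card."""
    ...

def score_orders(orders, rate_cards):
    """Apply the applicable rate card to each order to estimate its logistic cost."""
    ...

def run_pipeline(raw_orders, rate_cards):
    orders = aggregate_orders(raw_orders)
    orders = add_virtualized_variables(orders)
    orders = derive_cost_drivers(orders)
    return score_orders(orders, rate_cards)
```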
Our raw data was based on an order feed from our e-commerce website, augmented with some limited metadata about the types of items present within the order. At the time we were mostly selling three different types of products:
Each product category was fairly consistent in the attributes and dimensions required to estimate the logistic costs. The Sub was a fairly bulky product, taking up much more space than a Torp. Torps, at the time, contained beer exclusively, which required an age check by the shipping company, and glasses sometimes incurred a surcharge for being a fragile item.
Our order aggregation step, which consisted in consolidating the important characteristics at an order level, needed to reflect this split of product types; it also needed to retain any extra information required to convert orders into virtualized variables or cost drivers.
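As an illustration, assuming a hypothetical order-line feed with made-up column names, an order aggregation step in pandas might look something like this:

```python
import pandas as pd

# Hypothetical order-line feed: one row per item line; column names are illustrative.
order_lines = pd.DataFrame({
    "order_id":     ["A1", "A1", "A1", "A2"],
    "product_type": ["sub", "torp", "glasses_6pack", "torp"],
    "quantity":     [1, 3, 1, 2],
    "country":      ["NL", "NL", "NL", "DE"],
})

# Consolidate to one row per order: per-product-type quantities plus the
# attributes (destination, age check, fragile item) needed further down the pipeline.
orders = (
    order_lines
    .pivot_table(index=["order_id", "country"], columns="product_type",
                 values="quantity", aggfunc="sum", fill_value=0)
    .reset_index()
)
orders["needs_age_check"] = orders["torp"] > 0            # Torps contain beer
orders["has_fragile_item"] = orders["glasses_6pack"] > 0  # glasses may incur a surcharge
print(orders)
```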
Virtualized variables were in turn created after this order aggregation step. One of the most important variables to simulate was the number of shipments an order would generate. Our box size was mostly consistent within a given LSP but could vary by LSP. Given the fairly consistent dimensions within a product type, we could use a simple heuristic to calculate the number of shipments we would be creating from an order. For instance, if we said that a box could contain up to 10 units of measure, a Sub could be the equivalent of 5 units of measure, a Torp of 1, and a 6-pack of glasses of 3. We could then apply a simple formula to get an estimate of the number of shipments we would be generating.
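The formula itself is not reproduced here, but one plausible form of the heuristic, assuming the units of measure above and a rounded-up division by the box capacity, is:

```python
import math

# Units of measure per product type and box capacity, as in the example above.
UNITS_OF_MEASURE = {"sub": 5, "torp": 1, "glasses_6pack": 3}
BOX_CAPACITY = 10

def estimate_shipments(quantities: dict) -> int:
    """Estimate the number of boxes (shipments) an order would generate."""
    total_units = sum(UNITS_OF_MEASURE[p] * qty for p, qty in quantities.items())
    return math.ceil(total_units / BOX_CAPACITY)

# 1 Sub (5) + 3 Torps (3) + 1 pack of glasses (3) = 11 units of measure -> 2 shipments
print(estimate_shipments({"sub": 1, "torp": 3, "glasses_6pack": 1}))
```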
Cost drivers: our cost drivers sometimes differed from the virtualized variables, and negotiated rates for certain process paths could be quite different from simply being charged per shipment. For instance, during a pick and pack process we could be charged for the cost of putting together a box, but we could also incur a cost based on the number of SKUs that needed to be picked, with the cost of handling a box itself including an allowance of 2 SKUs. The logic by which we were charged could be quite complex, and our virtualized variables needed to be decomposed into cost drivers that could then be applied to the rates provided by the different LSPs. For this we needed to create unit tests and scenarios to double check the logic of the cost drivers.
Let’s take an example order of a Sub, 3 Heineken Torps and a pack of 6 Heineken glasses. Based on the previous constraint of 10 units of measure fitting in a box, there are multiple ways the order could be packed.
In scenario 1, we pack the Sub and the three Heineken Torps in one box and the pack of glasses in a second box. Each box stays within our allowance of two SKUs, so we don’t get charged extra.
In scenario 2, we pack 3 SKUs in the first box: the Sub, two Torps and the pack of glasses; the second box contains a single SKU, the remaining Torp. Since the first box holds more than 2 SKUs, we are no longer within our allowance and get charged extra.
Since we could not directly know which way a warehouse worker would pack the order, we needed to decide on some logic to apply. The creation of unit tests / scenarios made this decision explicit. If we had an agreement with the LSP that they needed to pack in the most cost-effective way, this modeling could be used to identify whether we were being overcharged due to an inefficient process. If, however, we wanted to use this for pricing decisions, we might decide to go for a more conservative approach.
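As a sketch of what such unit tests / scenarios can look like, here are the two packing scenarios above expressed as pytest-style tests; the helper function and the way boxes are represented are assumptions made for the example, while the 2-SKU allowance comes from the text:

```python
# Illustrative pytest-style scenarios for the SKU allowance logic.
SKU_ALLOWANCE_PER_BOX = 2

def extra_sku_units(boxes):
    """Count chargeable extra SKUs across boxes, given the free allowance per box."""
    return sum(max(len(box) - SKU_ALLOWANCE_PER_BOX, 0) for box in boxes)

def test_cost_effective_packing_stays_within_allowance():
    # Scenario 1: Sub + three Torps in box 1 (2 SKUs), pack of glasses in box 2 (1 SKU).
    boxes = [{"sub", "torp"}, {"glasses_6pack"}]
    assert extra_sku_units(boxes) == 0

def test_inefficient_packing_triggers_extra_sku_charge():
    # Scenario 2: Sub + two Torps + glasses in box 1 (3 SKUs), remaining Torp in box 2.
    boxes = [{"sub", "torp", "glasses_6pack"}, {"torp"}]
    assert extra_sku_units(boxes) == 1
```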
Once we knew that we had the necessary information available to model the cost of an order, we needed to build some simple logic to apply rate cards to our synthetic order data.
Rate card files: The different rate card files are created in JSON in the following format:
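The actual file is not reproduced in this excerpt; a minimal sketch of what such a rate card could contain, shown here as the equivalent Python dictionary with made-up category names, components, and rates, might be:

```python
# Illustrative only: category and component names, units, and rates are made up.
# The real file lists all five invoice categories, each with its own components.
rate_card_lsp_a = {
    "lsp": "LSP_A",
    "categories": {
        "outbound": {
            "shipment": {
                "unit": "num_shipments",                        # cost driver fetched from order data
                "rates": {"NL": 4.50, "DE": 5.20, "FR": 5.80},  # rate by country of destination
            },
            "extra_sku": {
                "unit": "extra_sku_units",
                "rates": {"NL": 0.35, "DE": 0.35, "FR": 0.40},
            },
        },
        "pick_and_pack": {
            "box_handling": {
                "unit": "num_shipments",
                "rates": {"NL": 1.10, "DE": 1.10, "FR": 1.10},
            },
        },
        # ...the remaining categories follow the same structure
    },
}
```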
The rate card file covers, for each of the five categories, their own components, each with specific data by country. The split of cost driver rates by country was due to the large impact that the country of destination can have on outbound costs. Within the individual components, the unit refers to the variable that will be fetched from the order data, the “cost driver units”, which will be used in the calculation of the cost.
Router: We had different rate card structures by LSP, which were mapped to specific countries. We knew, however, that this logic could change: an LSP could provide a different rate card at some point in time, we could decide to ship items to a country from a different shop via a different LSP, and so on. We therefore needed a separate component to handle whatever logic was required to provide the applicable rate card for a given order, and it had to be flexible enough that new routing parameters could be added at a later stage.
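A minimal sketch of such a router, assuming hypothetical rule and rate card structures, could be:

```python
# A sketch of the routing idea: given an order, return the applicable rate card.
# The matching keys (country, shop) are assumptions; new routing parameters can be
# added as extra keys in the rules without changing the callers.
def route_rate_card(order: dict, routing_rules: list, rate_cards: dict) -> dict:
    """Return the first rate card whose rule matches the order's attributes."""
    for rule in routing_rules:
        if all(order.get(key) == value for key, value in rule["match"].items()):
            return rate_cards[rule["rate_card"]]
    raise LookupError(f"no rate card configured for order {order.get('order_id')}")

routing_rules = [
    {"match": {"country": "NL"}, "rate_card": "LSP_A"},
    {"match": {"country": "DE", "shop": "de_shop"}, "rate_card": "LSP_B"},
]
```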
Scorer: The role of the scorer was, for every order, to retrieve the applicable rate card from the router and, for each rate card component, retrieve the associated cost driver and perform a simple multiplication, rate card component value * cost driver units, providing the result as extra output on our order data. In the current implementation the scorer is quite limited in the scope of operations it performs; a further improvement would be to allow it to handle other types of operations, such as weight band matching, which would allow a more granular and accurate estimate of the cost incurred.
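A minimal sketch of the scorer logic, under the same assumed rate card structure as above, could be:

```python
# A sketch of the scoring step, reusing the rate_card_lsp_a dictionary sketched earlier:
# for each rate card component, multiply the component value by the matching
# cost driver units found on the order, and append the results to the order record.
def score_order(order: dict, rate_card: dict) -> dict:
    country = order["country"]
    costs = {}
    for category, components in rate_card["categories"].items():
        for name, component in components.items():
            units = order.get(component["unit"], 0)   # cost driver units from the order data
            rate = component["rates"][country]        # rate card component value
            costs[f"{category}.{name}"] = rate * units
    return {**order, **costs, "estimated_cost": sum(costs.values())}

# Example: the order from earlier, estimated at 2 shipments with no extra SKUs, shipped to NL.
order = {"order_id": "A1", "country": "NL", "num_shipments": 2, "extra_sku_units": 0}
print(score_order(order, rate_card_lsp_a))
```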
Right now, leveraging the rate card scorer requires a certain level of analytical sophistication. It is, however, set up in our reporting pipelines to provide visibility to the business on accrual levels, and certain order simulations and scenarios can be run from a Jupyter notebook.
By virtue of being visible and embedded into business processes such as simulation modeling, negotiation discussions, and accruals, a virtuous cycle of data quality exists where potential changes can be noticed and incorporated.
This rate card scorer is an example of a proof-of-concept data product that could be taken further by a development team:
It is also an example of how simple analytic concepts can drive significant business value.