The field of data-science is evolving quickly, with developments such as AutoML chipping away at some of the core tasks of data-scientists and freeing up valuable time. This has led some to see data-science evolving towards a product manager role, or further towards engineering:
"Every time we ask our guests about the direction data science is heading in, we get one of two answers: either 1) data science is becoming a product/business role, and data scientists need to think like data-savvy product managers; or 2) data science is becoming an engineering problem, and data scientists need to think more like engineers." (TDS blog post)
I had already touched on some of that evolution in a previous article, but never fully detailed why or how data-scientists should tackle this shift. People often talk about data-scientists being of Type A or Type B, and to a certain extent that separation is as much a reflection of the data-scientist's skills as of the organization hiring them and of how to be most effective in that organizational context.
One of the common myths about data-science is that it is:
80% data wrangling / 20% doing analysis/ML
Data-science is not just a set of data manipulation steps, machine learning algorithms, or analyses; there is much more to it. Cassie Kozyrkov notably defined data-science as:
“The discipline of making data useful.”
Making data useful requires more than an analysis or a prediction model. For some, it might mean having a good grasp of the engineering skills needed to deploy a model to production; for others, it might mean being able to get the organization to change its processes based on the insights provided.
Data-scientists often struggle with the product and project management aspects of the role, such as establishing the right scope of work, handling stakeholder communication, coordinating with other teams to handle dependencies, advocating and pushing for last-mile delivery, and constantly making sure that the work they do adds value.
New roles have emerged in data teams focusing on this particular area. This alleviates the problem in larger data teams that can afford dedicated people in these roles, but smaller teams usually don't have that luxury, and having data-scientists take on a PM role from time to time is often the difference between failure and success.
Defining the scope of the project is one of the crucial first steps to ensure success. The approach taken by many data-labs is to focus on proofs of concept (PoC), usually tackling "low-hanging fruit" in terms of business cases, but often relying on "advanced analytics", "machine learning" or "AI" where a simple business rule would have sufficed.
This approach makes it possible to get quick wins and executive buy-in around data topics; however, the key to unlocking the potential of data in an organization lies in commitment, rather than in establishing short-term wins and focusing on short-term ROI.
A lot of the wins that data-science can enable come when an organization moves across the data maturity stages, from data-proficient to data-savvy. Data is a pain to master, and there needs to be a sense of direction in which initiatives should be tackled.
Equipped with their knowledge of the data, data-scientists are well placed to size different opportunities: they can help provide project impact analyses, set up what-if scenarios, estimate the potential room to grow through benchmarks and entitlement values, and help shape a business case for the initiatives.
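As a minimal sketch of what such a sizing exercise can look like, here is a back-of-envelope what-if calculation; all the figures and the churn-reduction scenario are hypothetical placeholders, not real benchmarks:

```python
# Back-of-envelope opportunity sizing for a churn-reduction initiative.
# All figures below are hypothetical placeholders, not real benchmarks.

customers = 200_000          # current customer base
monthly_churn = 0.05         # observed churn rate
revenue_per_customer = 30.0  # average monthly revenue per customer

# Entitlement value: churn rate achieved by a best-in-class benchmark.
benchmark_churn = 0.035

# What-if scenario: we close half of the gap to the benchmark.
target_churn = monthly_churn - 0.5 * (monthly_churn - benchmark_churn)

saved_customers = customers * (monthly_churn - target_churn)
monthly_uplift = saved_customers * revenue_per_customer
print(f"Retained customers/month: {saved_customers:,.0f}")       # 1,500
print(f"Estimated revenue uplift/month: ${monthly_uplift:,.0f}")  # $45,000
```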
Project scoping should start by focusing on delivering an MVP without requiring too much involvement. For instance, if a SQL query is enough to create customer segments for use in marketing automation, it might be enough to use that as an MVP rather than building a full prediction model early on, as in the sketch below.
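A minimal sketch of such a rule-based MVP, assuming a hypothetical `orders` table and using SQLite purely for illustration:

```python
import sqlite3

# Hypothetical schema: orders(customer_id, amount, order_date).
# A simple rule-based segmentation query can stand in for a full
# prediction model as a first MVP for marketing automation.
SEGMENT_QUERY = """
SELECT
    customer_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_spend,
    CASE
        WHEN SUM(amount) >= 500 THEN 'high_value'
        WHEN COUNT(*)    >= 3   THEN 'loyal'
        ELSE 'standard'
    END         AS segment
FROM orders
GROUP BY customer_id;
"""

conn = sqlite3.connect("crm.db")  # placeholder database name
for customer_id, order_count, total_spend, segment in conn.execute(SEGMENT_QUERY):
    print(customer_id, segment)
```

The segment thresholds here are arbitrary; the point is that a handful of business rules can feed the marketing automation tool while the value of the use case is being proven.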
Scoping the project requires handling questions of data assets, measurement, and organizational structure, and assessing the potential impact of the project. Data-scientists are in a good position to answer some of the more product-oriented scoping questions.
A lot has already been said about the data-science hierarchy of needs, and at its root lies data collection.
In most industries, there is a sense of what the high-value use cases for data-science could be, but more often than not the data necessary to enable them doesn't exist.
If, for instance, you are currently getting a feed of transaction information but do not have a product catalogue with sufficient data surfaced, it wouldn't really be feasible to dive deep into topics such as customer preference. This type of issue can be tackled in multiple ways: manually tagging the initial product catalogue with the necessary attributes, outsourcing that work to the likes of Amazon Mechanical Turk, or setting up a project to collect the information in the source system.
Take, for example, information on the types of complaints coming from clients: it would not be available unless captured by the customer service team, which usually requires a nomenclature for tagging communications with the customer.
In other cases, the logging needed to capture the data may not be implemented in the application or surfaced on the website. The data-scientist then needs to liaise with a development team, or with the analytics team responsible for the tag management system, to get the necessary logging implemented.
In some cases, it is not just about generating raw data, but about generating valuable data triggered by specific occurrences. This can be the case when you need to trigger certain behavior through experimentation. Imagine that a retail company wanted to better understand how it should organize its store layout: it could get some insights from historical data about what people tend to purchase together, but it would struggle to get a true understanding of the what-if situation, and of the impact on other core metrics, without at the very least experimenting.
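As a minimal sketch of the historical-data side of that example, pairwise lift on hypothetical basket data can hint at which products are bought together, while saying nothing about what a new layout would actually cause:

```python
from itertools import combinations
from collections import Counter

# Hypothetical basket data: one list of product IDs per transaction.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["beer", "chips"],
    ["bread", "jam"],
    ["beer", "chips", "bread"],
]

n = len(baskets)
item_counts = Counter(item for basket in baskets for item in set(basket))
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
)

# Lift > 1 suggests two items co-occur more often than chance would predict.
for (a, b), count in pair_counts.most_common(5):
    lift = (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
    print(f"{a} + {b}: lift = {lift:.2f}")
```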
In general, extracting and leveraging data raises more questions and calls for deeper dives. This creates a virtuous cycle, in which the data gets enriched at every iteration. Managing the data acquisition component of a project is an integral part of being a data-scientist.
Even when the data exists, it can happen that, as a data-scientist, you do not have access to it. Different reasons can be at play: priorities for sourcing the data, budget, legal requirements, or the need to acquire it from third parties.
Take as an example obtaining (raw) clickstream data. Most websites track events through Google Analytics (GA), and some of that data is exportable through either the GA UI or the API. If you need to export the raw data, however, you need to acquire the Google Analytics 360 version at a cost of $150k per year; the 360 version provides up to 13 months of historical data. The data is technically accessible, but unless your organization gets a license for the tool, it remains out of reach. If your organization is not willing to pay the license fee for Google Analytics 360, your only alternative for collecting raw clickstream data is to set up your own clickstream collector and wait for the data to accumulate.
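For illustration, GA 360's raw export lands in BigQuery; a minimal sketch of querying it, assuming the export is enabled and using placeholder project and dataset names with the standard `ga_sessions_*` export tables:

```python
from google.cloud import bigquery

# Assumes the GA 360 BigQuery export is enabled; the project and dataset
# names below are placeholders.
client = bigquery.Client(project="my-analytics-project")

query = """
SELECT
    fullVisitorId,
    visitStartTime,
    hit.page.pagePath AS page_path
FROM `my-analytics-project.ga_sessions_export.ga_sessions_20240101`,
     UNNEST(hits) AS hit
WHERE hit.type = 'PAGE'
LIMIT 100
"""

for row in client.query(query).result():
    print(row.fullVisitorId, row.page_path)
```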
When data needs to be ingested from different systems, especially in large non-technical organizations, resources need to be acquired, budget needs to be requested, communication may need to be handled between different parties, interfacing discussions need to take place, and a QA process needs to be set up to ensure that the data is as expected. When data needs to be acquired from a third party such as Experian or Nielsen, it can require an RFP and a procurement process.
Some industries can be quite strict with respect to data access; telecoms, which hold a lot of customer-related data such as location data, browsing history, and call patterns, are particularly sensitive on this topic.
In some cases the data exists but is of poor quality. This can happen when certain input fields are set up as free text, or when there is no master data management process. There are numerous variations of what bad-quality data looks like, and data quality has a significant impact on any analytics process.
Imagine doing a customer retention analysis or building a predictive model without any master data management process. Without customer deduplication, the retention numbers would typically be off, particularly when there is no central identity management: a retail customer would essentially be identified as a new customer at each store visit, or whenever they do not mention that they are a previous client. The toy example below illustrates the effect.
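A minimal sketch of the distortion, assuming a made-up visit log where the same person appears under several IDs and an email column stands in for identity resolution:

```python
import pandas as pd

# Toy visit log: the same physical customer appears under different IDs
# because there is no central identity management. All values are made up.
visits = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "email":       ["a@x.com", "a@x.com", "b@x.com", "b@x.com", "c@x.com", "d@x.com"],
    "month":       [1, 2, 1, 2, 1, 1],
})

# Naive view: retention as the share of month-1 IDs seen again in month 2.
m1 = set(visits.loc[visits.month == 1, "customer_id"])
m2 = set(visits.loc[visits.month == 2, "customer_id"])
print("Naive retention:", len(m1 & m2) / len(m1))          # 0.0

# Deduplicated view: resolve identity on email before computing retention.
m1 = set(visits.loc[visits.month == 1, "email"])
m2 = set(visits.loc[visits.month == 2, "email"])
print("Deduplicated retention:", len(m1 & m2) / len(m1))   # 0.5
```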
There are of course data-cleansing exercises a data-scientist can do to mitigate the impact of these issues on analyses or predictive models, but the right approach to data quality issues is often systemic. Lobbying and pushing for improvements at the source is part of the data-scientist's role, not just doing the cleansing itself.
Lifting some of these dependencies on the data side requires project management capabilities: coordination is needed to manage the key dependencies around data availability and relevance.
Data-scientists deal with the interpretation of data, insights, and metrics. They need to interact with different stakeholders, help shed light on what the data means, and act as translators for the insights and predictions generated. To accomplish this, data-scientists often have to drive a high level of stakeholder engagement and accompany stakeholders through the analytical process.
A data-scientist's job is to interpret data in context. Seeing that a variable is correlated, or seems to be predictive, is not very useful by itself: it needs to be understood within a given context, and for this, domain knowledge is essential to properly interpret the data.
For data-scientists, context makes it possible to better understand how the models actually work, separate what is causal from what is noise, better detect apparent anomalies within the data, and identify data sources that might need to be acquired. This domain knowledge needs to be acquired from domain experts or directly from users.
Establishing a review process with the different stakeholders, and sharing progress and current insights, helps to get feedback and to place these insights and predictions within their context, thus avoiding certain pitfalls and keeping the work on the right track.
Data-scientists don't just have to take in context; they also need to provide context around their interpretation of the data and the models they build.
Data-scientists often rely on the typical performance measures of statistics and machine learning to communicate how a model performs, but these provide no tangible context for the potential business impact and cannot readily be communicated to external stakeholders. For most (business-focused) stakeholders, an R², AUC, or F1 score has very little tangible significance. The offline scoring metric needs to be translated into a tangible benefit for the product: is it a milestone we set to hit before starting an online test? Is there a potential revenue uplift or risk associated with leveraging the prediction or insights? The sketch below shows one way to make that translation.
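A minimal sketch of translating an offline precision figure into a rough revenue estimate for a retention campaign; every business figure here is a hypothetical placeholder:

```python
# Translating an offline precision figure into a rough revenue estimate
# for a retention campaign. All business figures are hypothetical.

precision = 0.30           # share of flagged customers who would truly churn
flagged_customers = 1_000  # customers the model flags per month
save_rate = 0.20           # share of true churners the campaign retains
customer_value = 240.0     # yearly value of a retained customer
contact_cost = 2.0         # cost of contacting one flagged customer

true_churners = flagged_customers * precision
revenue_saved = true_churners * save_rate * customer_value
campaign_cost = flagged_customers * contact_cost

print(f"Expected revenue saved: ${revenue_saved:,.0f}")                  # $14,400
print(f"Campaign cost:          ${campaign_cost:,.0f}")                  # $2,000
print(f"Net impact:             ${revenue_saved - campaign_cost:,.0f}")  # $12,400
```

A number like "roughly $12k net per month at current precision" gives stakeholders something to weigh, in a way a standalone F1 score never will.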
Being able to acquire and provide context is essential for data-scientists. Like product managers, they need to do stakeholder management and handle communication, both to sort out dependencies and to share results and progress.
Producing insights or predictions is a fine thing to do, but by itself it doesn't have much business value. The work produced by data-scientists needs to be actioned to produce business value. In engineering-driven organizations, this can be tackled by injecting predictions and decision models into core parts of the application; there, a simple code commit might be sufficient to take new data and insights into account.
In other types of organizations, some lobbying may be needed: arranging for a process change, or for a specific live experiment to be set up and trialed, before being able to move ahead with productionizing the process.
To create value from data, it is important to leverage the insight or prediction within a business or product process.
For CRM activities, for instance, this means being able to tie offers and campaigns to the insights or predictions being created, tailoring the communication to address specific customers, and so on. In other domains, leveraging the data means something completely different in terms of business process: for insurance companies it might mean not catering to certain types of risk; for retailers, modifying their pricing strategy.
Data-scientists are the providers of the insights or predictions, and are therefore in a good position to know their potential pitfalls. They are also best placed to help define how these business and product processes should be changed to effectively leverage the data.
In order to ensure that a data project or product delivers value, there needs to be a proper evaluation quantifying this value. This means being able to set up a proper measurement plan, agree on metrics, and, if feasible, set up an experiment.
Setting up a proper evaluation makes it possible to avoid a lot of mistakes and to iterate until the desired outcome comes to fruition, or to pivot to alternatives. The prediction or insight is often just one bolt in the machinery required to derive business value; measuring the value may shed light on issues such as poor data quality, an inappropriate fit with a business process, ineffective communication, or a wrong hypothesis.
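As a minimal sketch of such an experimental evaluation, assuming hypothetical conversion counts from a two-variant test and using a standard two-proportion z-test:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical experiment results: conversions and sample sizes for a
# control group and a group exposed to the model-driven campaign.
conversions = [230, 289]   # control, treatment
samples = [5_000, 5_000]

z_stat, p_value = proportions_ztest(conversions, samples)
print(f"Control rate:   {conversions[0] / samples[0]:.3%}")
print(f"Treatment rate: {conversions[1] / samples[1]:.3%}")
print(f"p-value: {p_value:.4f}")  # a small p-value suggests the uplift is not noise
```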
Success criteria for a data initiative need to be clearly stated, and the measurement enforced, to ensure that value is extracted and that the initiative doesn't have a negative impact.
In order to reach the last-mile goal, data-scientists need to handle some inter-team coordination to make sure that the insights or predictions provided are actually used, used effectively, and actually provide value.
In some organizations, the investment in data is seen as a one-time investment. The truth is that data-science provides insights and predictions that need to be constantly maintained and refreshed for them to keep delivering the same amount of value. The Six Sigma/DMAIC methodology introduces the concept of a control phase: a phase that ensures the process changes keep providing the same value until the process becomes stable.
One of the issues in applying data-science in a company is that data-science products can be seen like any other IT project and treated as a one-off investment, rather than as the start of a virtuous continuous improvement cycle.
This can lead to negative consequences, such as using third parties on a per-project basis, completely ignoring the level of company- or area-specific domain knowledge required to make data-science work, and not considering the maintenance needs of the models and insights.
Data deliverables are often not fully and continuously evaluated to measure their overall performance; they are often just deployed, with at most a single check done to see that they deliver value.
Placing data products within an improvement stream for certain key process areas makes it possible to measure their impact on key business metrics in weekly, monthly, or quarterly business review meetings.
This is not as accurate as a full experimental evaluation, but it at least makes it possible to keep a pulse on how the data product might be contributing to the metrics, particularly if there is a thorough analysis of the key business drivers.
Data initiatives need to be considered a continuous investment, with ongoing effort on resources and maintenance, and recurring checks on the value they create.
There is a strong need for product and project management in the data realm. Data-scientists can bring significant impact to their organizations by taking on these tasks, rather than limiting themselves to a pure description of the role in terms of analysis and prediction.