The Data Mesh introduced quite a few concepts that enhanced organizational agility with respect to data, but more than five years after its introduction in a data platform context, some of its limitations are surfacing.
When the Data Mesh was introduced, systems such as Kubernetes or Spark were still nascent (both released in mid-2014), cloud infrastructure was not nearly as developed, and the data infrastructure landscape was not nearly as specialized.
The data mesh came into existence to tackle one of today's biggest challenges: how to deal with a data landscape of ever-increasing complexity.
“The only thing that gets in the way of linear scalability is coordination” — Daniel Abadi @ Starburst
The data mesh introduces the concept of data decentralization and federation to allow effective scaling:
“Centralization and parallelization are antonyms. Scalability requires independent units working in parallel, while centralization introduces coordination, resistance, and inertia.” — Daniel Abadi @ Starburst
The way this decentralization ends up being implemented in practice leads to different perspectives, from more logical separation to more physical demarcations.
The data mesh provides an interesting perspective on how to manage data in a domain-driven context. It provides a rationale for decentralized data ownership and architecture and introduces looser coupling compared to more traditional data architecture paradigms.
This decentralized model has had a number of positive effects, such as increased agility and a deeper embedding of domain knowledge into the data products.
The Data Product Owner is a new role within the data landscape that is often advocated as key to leveraging the Data Mesh. It is often viewed as being more specialized towards data than a traditional product owner role. There are, however, a number of shortcomings with the specificity of the role.
The main difficulty is finding an adequate “data product owner” with the right experience leveraging data. This has led to the development of a number of very short programs aimed at converting people into a “Data Product Owner” or “Analytics Translator” in a matter of days, similar to the agile boom, which led to the instant conversion of hundreds if not thousands of people into Scrum Product Owners and Scrum Masters of dubious quality.
The second difficulty is being able, within a given domain, to justify a dedicated data product owner. It is quite rare for a domain to have the critical size to justify a dedicated and specialized Data Product Owner.
Another challenge is that of responsibility. Within a DDD context, the domain's product owner should be responsible for the data produced by the domain. Data products should not be seen as separate deliverables for the domain but as part of its key responsibilities. There is a tight interconnection between a normal product's evolution and how its associated data products should be built and maintained.
Data products are a composition of three elements: code, data, and infrastructure. It is sometimes advocated that, with respect to data infrastructure, the “skills needed to provide this infrastructure is specialized and would be difficult to replicate in each domain”. Thoughtworks advocates, for this reason, the creation of “shared infrastructure” outside of the domains.
To a certain extent this is true, but the same could be said of the Data Engineer role to justify the existence of a central team; this is, however, one of the main problems the Data Mesh is trying to solve. It also does not fully consider the evolution of the role of the data engineer, who is taking on an increasing amount of DataOps skills and duties.
The challenge of trying to leverage a common shared infrastructure becomes apparent when domain teams have different hiring practices and leverage different tooling.
In that regard, certain extensions of the data mesh have emerged to handle this dichotomy, such as Jeffrey Pollock's “decentralized modular data mesh”, which sees as a viable option a constellation of Data Mesh deployments, each with its own control plane services, or Microsoft's Harmonized Mesh, which delegates to the node's data platform the ability to enrich a base set of blueprint capabilities and policies, increasing the autonomy of the nodes.
One of the key elements of the data mesh is the principle of federated governance. A governance team is formed through a federation of domain representatives.
Just as the data mesh follows a distributed system architecture, so do its teams. Domain ownership and decentralization are key to enabling a governance fit for the mesh.
The overall consumption layer can suffer from dataset isolation, particularly when dealing with system landscapes that include legacy components. When leveraging disparate systems with no clear data ownership, which is typical in legacy landscapes or in landscapes making use of multiple off-the-shelf SaaS products with overlapping responsibilities, the need for consolidation tools such as MDMs arises, forcing data consolidation at a central layer.
The dataset isolation resulting from this separation at the domain level has some drawbacks when looking at data modeling. Andriy Zabavskyy takes it to the extreme when suggesting that one option for approaching the problem of dimensional models in the context of the data mesh is to create a separate domain for each dimension. The data mesh does not require this level of atomicity, but it does require dimensions to be properly fleshed out and aligned to domains in terms of responsibility and ownership.
The approach does put constraints on what can be done compared to traditional data warehouse methodologies; for instance, it becomes harder to leverage surrogate keys compared to natural keys.
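To make that constraint concrete, here is a minimal sketch (using pandas, with hypothetical product and column names) of what a cross-domain join tends to look like when each domain publishes its own data product: without a central warehouse generating surrogate keys, the fact-like and dimension-like products can only be joined on a shared natural business key.

```python
import pandas as pd

# Hypothetical data products published by two different domains.
# The "orders" domain exposes a fact-like product, the "customers"
# domain a dimension-like product. Neither knows about the other's
# internal surrogate keys, so the natural business key is the only
# stable join column.
orders = pd.DataFrame({
    "customer_number": ["C-001", "C-002", "C-001"],   # natural key
    "order_amount": [120.0, 75.5, 30.0],
})

customers = pd.DataFrame({
    "customer_number": ["C-001", "C-002"],             # natural key
    "segment": ["enterprise", "self-service"],
})

# Cross-domain join on the natural key rather than on a
# warehouse-owned surrogate key.
enriched = orders.merge(customers, on="customer_number", how="left")
print(enriched.groupby("segment")["order_amount"].sum())
```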
With the data revolution we are facing, there is an increasing need for roles to become more cross-cutting. Data becomes a product, and along with that come the responsibilities of managing it as such, be it from a software or a platform perspective.
Data Engineer role: The evolution of the role of the data engineer has been geared towards embracing software engineering, cloud, and DevOps/DataOps practices. It is now increasingly common for data engineers to not just build ETL pipelines, but also to set up APIs to serve data and deploy services to the cloud. The need to set up APIs to serve data tends to be particularly acute in tech-oriented startups, where developers are not only producers of data but also consumers of it. This overall need for better software engineering and DevOps skills has led to the rise of what Felix Klemm calls the “new Full-Stack Data Engineers”: Data Engineers with T-shaped profiles mixing broad software engineering skills with deep data skills.
Backend Engineer role: In a similar fashion to the Data Engineering role, the role of backend engineers has also evolved. It is nowadays quite common for backend engineers to undertake data processing tasks. This tends to be especially the case in the area of real-time processing where the needs and tooling between the data side and the more traditional product platform have already somewhat converged.
DataOps / Platform Engineer role: This role is increasingly becoming an extension of a core platform team. Engineers specialized in setting up and managing data platform components should be seen as an extension of the core platform team. When operating in production at scale, there is quite a fuzzy line between a data component and a core platform component. Take the example of a message broker: does it belong to the core platform or to a data platform? The same goes for machine learning. The more a platform looks to integrate data into its core production systems, the more the needs of the core platform and the data platform converge.
Focus on integration into core systems: A traditional mistake is to see data as a siloed initiative. To truly derive value from data, the focus should be on integration from and to core systems. It is partly to that end that architectural paradigms such as the data mesh arose. As infrastructure and applications become more tightly connected, both typical platform components and data components need to adapt to facilitate the integration of data products into and out of the domains.
Cloud & self-service capabilities: The evolution of cloud platforms is making it easier than ever to provision and leverage data infrastructure components. Serverless capabilities like Lambda embed infrastructure and code together, while Docker and Kubernetes have made it much easier to run scalable applications. Helm templates provide an easy way to deploy infrastructure components as if they were just new software, and features such as autoscaling make it easier to operate a cluster at a smaller scale.
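As one illustration of infrastructure and code being packaged together, the sketch below uses the AWS CDK (v2) in Python to declare a serverless function alongside the application code it runs. The stack name, asset path, and handler are hypothetical, and the same idea applies to other infrastructure-as-code tools.

```python
# Minimal AWS CDK (v2) sketch: the infrastructure definition lives next
# to the application code and is deployed through the same pipeline.
# Stack name, asset path, and handler are hypothetical.
from aws_cdk import App, Stack, aws_lambda as _lambda
from constructs import Construct


class IngestionStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # The Lambda function and its code asset are declared in one place.
        _lambda.Function(
            self,
            "IngestFunction",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="handler.main",              # hypothetical module/function
            code=_lambda.Code.from_asset("src/ingest"),
        )


app = App()
IngestionStack(app, "ingestion-stack")
app.synth()
```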
Delivery/Data platform convergence: In the past, operating a data platform required intricate knowledge of deep and specific data infrastructure components, such as a Hadoop distribution. The evolution of data infrastructure, led by the cloud and Kubernetes, has brought a general trend of convergence between the domain/delivery platform and the data platform.
Another factor in this convergence is the evolution of the products provided by the typical domain teams, which increasingly incorporate an important share of data products themselves. Take the example of a customer data platform or a marketing automation tool: in these cases it is quite hard to distinguish where the marketing product ends and where the data product starts.
More and more data infrastructure components support the same orchestration layer, for instance: the microservices from the delivery platform run on Kubernetes, but so could an Airflow job with a Kubernetes executor or a SageMaker job.
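The sketch below, assuming an Airflow deployment with the CNCF Kubernetes provider installed (the exact import path varies across provider versions) and a hypothetical container image, shows how a data job can be scheduled as a pod onto the same Kubernetes cluster that runs the delivery platform's microservices.

```python
# Minimal Airflow DAG sketch: the data job runs as a pod on the same
# Kubernetes cluster as the rest of the platform. Image, namespace,
# and command are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="daily_transform",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = KubernetesPodOperator(
        task_id="transform",
        name="daily-transform",
        namespace="data-jobs",
        image="registry.example.com/domain/transform:latest",
        cmds=["python", "transform.py"],
    )
```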
Specialized data infrastructure needs: Data infrastructure needs are becoming increasingly specialized and highly modular, be it search engine capabilities such as ElasticSearch, in-memory data grid computing like Apache Ignite, large DataFrame processing capabilities like Spark, high write-throughput databases like Cassandra, stream processing such as Kafka, or queuing capabilities like RabbitMQ. A diverse ecosystem of infrastructure components becomes particularly important as the use cases for data increase. The data and platform infrastructure should converge to fulfill these needs, and infrastructure components should become more embedded within each domain.
It is sometimes advocated that the data platform sits outside of “domain” ownership and that this requires the existence of a dedicated team. In this respect, the development of a data platform should not be that different from how SRE teams are structured: they should operate based on a wide variety of operating models, depending on what is most pragmatic in the given organizational context.
Having this set of different operating models is highly relevant in the context of increasing data demand, with some needs driven primarily by specific domains having heavy or specialized infrastructure requirements. It is also increasingly relevant in the context of collective ownership of data platforms by data engineers and DataOps/platform engineers. Knowledge and experience around infrastructure best practices tend to be on the learning path of data engineering, and operating through collective ownership lets them hone these skills.
One way to see this is that the bounded context of a platform should not be just data or domain/delivery but should encompass both, with the clear intent of supporting the activities of the domain teams. In this new context, the evolutions of the mesh towards decentralized modular data meshes and harmonized meshes shine.
Faced with increasing system complexity, data consumers are confronted with more challenges than before. This increasing complexity requires a higher level of abstraction than previously offered in most data platforms.
The importance of APIs: In 2002, Jeff Bezos mandated that all teams at Amazon expose their data and functionality through service interfaces. This event became known as “The Bezos API Mandate” and came to mark software development for close to two decades.
In a data engineering and data platform context, APIs allow abstracting both the data's underlying schema and the underlying systems used to store and process the data. They also provide a certain translation layer on top of the local domain and hide some of the technical implementation or business logic. In the case of event sourcing, for example, they free the consumer from having to replay the set of events to obtain the picture of an entity (e.g. a Customer) at a given point in time.
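As a minimal sketch of that last point (hypothetical event and entity names, using FastAPI purely for illustration), the API below replays events internally and only exposes the resulting state, so consumers never need to know the data is event-sourced.

```python
# Minimal sketch of an API hiding event replay from consumers.
# The event log and the Customer entity are hypothetical.
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Hypothetical append-only event log for customer entities.
EVENTS = [
    {"customer_id": "C-001", "type": "CustomerCreated", "name": "Acme"},
    {"customer_id": "C-001", "type": "AddressChanged", "address": "1 Main St"},
]


def replay(customer_id: str) -> dict:
    """Fold the event stream into the current state of a customer."""
    state: dict = {}
    for event in (e for e in EVENTS if e["customer_id"] == customer_id):
        if event["type"] == "CustomerCreated":
            state = {"customer_id": customer_id, "name": event["name"]}
        elif event["type"] == "AddressChanged":
            state["address"] = event["address"]
    return state


@app.get("/customers/{customer_id}")
def get_customer(customer_id: str) -> dict:
    # Consumers receive the materialized state; the replay stays internal.
    state = replay(customer_id)
    if not state:
        raise HTTPException(status_code=404, detail="unknown customer")
    return state
```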
In the context of an evolution of the data mesh (Data Mesh 2.0), Jesse Paquette sees APIs as a way to interact not just with datasets, but with algorithms as well.
One of the advantages of leveraging APIs is that every data consumer outside the domain can be treated as an external party, which eases the transition when intending to expose your services more widely. Specific data-oriented protocols such as OData and gRPC [1] have emerged to support the data sharing use case.
gRPC especially is on the rise; contrary to systems such as OData, or REST APIs generally for that matter, it introduces strict(er) data contracts through Protobuf. gRPC is getting embedded into more data infrastructure components, such as Apache Pinot and Apache Arrow (Arrow Flight), and is now being used extensively in the area of streaming, for instance with Spark (processed with ScalaPB), or when building APIs.
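A minimal sketch of what such a contract-first data service can look like in Python is shown below. The generated modules (orders_pb2, orders_pb2_grpc) and the OrderFeed/Order message fields are hypothetical, standing in for a contract defined in a .proto file and compiled with protoc/grpcio-tools.

```python
# Minimal gRPC data service sketch. The generated modules and the
# OrderFeed contract are hypothetical; in practice they come from a
# .proto file compiled with protoc / grpcio-tools.
from concurrent import futures

import grpc

import orders_pb2        # hypothetical generated messages
import orders_pb2_grpc   # hypothetical generated service stubs


class OrderFeedServicer(orders_pb2_grpc.OrderFeedServicer):
    def GetOrders(self, request, context):
        # Stream strongly typed records back to the consumer; the Protobuf
        # schema acts as the data contract between producer and consumer.
        for order_id in ("O-1", "O-2"):
            yield orders_pb2.Order(order_id=order_id, amount=42.0)


def serve() -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    orders_pb2_grpc.add_OrderFeedServicer_to_server(OrderFeedServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()


if __name__ == "__main__":
    serve()
```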
Analytics Workbench: Analytics workbenches such as SageMaker provide a graphical interface for analytical development and a generic computational framework, enabling their users to integrate data from a diverse ecosystem and run their own set of computations on top.
These workbenches make the development of analytical products much less dependent on the actual source system implementation.
There is some drag on moving towards an after-data mesh, first and foremost in terms of mindset and perception.
An organization needs to be willing to decentralize past the point of the data mesh. For most organizations, the data mesh is already a big hurdle to clear, and most implementations tend to have more to do with branding than with staying true to the spirit of the mesh.
The misguided perception that “data” is a very different “product” from the traditional software product also creates an additional slowdown in the adoption of a universal mesh platform encompassing a merger of both data and traditional software products. Data use cases typically do contain a higher level of uncertainty than most software projects, but the methods of engineering still apply. A lot of work has been done in recent years to enable faster convergence of practices through automation; in machine learning, for example, it has become quite common to have both features and weights processed automatically rather than carefully studied and crafted by a data scientist.
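As one illustration of that automation (a sketch with hypothetical column names, using scikit-learn), feature encoding and model fitting can be wired into a single pipeline rather than hand-crafted feature by feature:

```python
# Minimal sketch of automated feature processing: categorical and numeric
# columns are encoded by the pipeline itself rather than hand-crafted.
# Column names and data are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({
    "segment": ["enterprise", "self-service", "enterprise", "self-service"],
    "monthly_spend": [1200.0, 80.0, 950.0, 40.0],
    "churned": [0, 1, 0, 1],
})

preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
    ("numeric", StandardScaler(), ["monthly_spend"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])

model.fit(data[["segment", "monthly_spend"]], data["churned"])
print(model.predict(data[["segment", "monthly_spend"]]))
```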
It is also a question of talent. Similar to traditional agile practices, being able to properly operate the data elements of this universal mesh requires T-shaped people, able to bridge the gap from their core discipline to facilitate the integration of data or the setup of infrastructure. The shift of mindset also needs to happen on the infrastructure side: the role of platform engineers should increasingly become that of an enabler, for instance by creating blueprints, rather than that of an operator of the platform.
A universal mesh looks beyond just the creation and consumption of “data products”; it looks at creating an internal structure that facilitates the integration and use of data within the domain.
It provides further independence to the domain teams, supported by platform engineers who provide them with blueprints and closely collaborate with them to fulfil their needs.
This increased independence for the teams creates additional complexity, necessitating some level of abstraction, such as APIs, to interact with some of the domain data.
In exchange, it provides enhanced agility to the teams, who are then able to choose systems, data models, and tools that fit their internal needs, and to raise improvement initiatives bottom-up rather than top-down.