What is a data mesh?
A data mesh is an architectural framework designed to address advanced data security challenges through a distributed and decentralized approach. Organizations often have multiple data sources across different business units that need to be integrated for analytics. A data mesh effectively connects these disparate data sources through centrally managed data sharing and governance guidelines. This framework allows business functions to control how shared data is accessed, who can access it, and in what formats.
Although a data mesh introduces additional complexity to the architecture, it also enhances efficiency by improving data access, security, and scalability. Similar to how software engineering has shifted from monolithic applications to microservices, a data mesh represents the data platform version of microservices. Unlike traditional monolithic data infrastructures that handle all data processes in a central data lake, a data mesh supports distributed, domain-specific data consumers and treats “data as a product,” with each domain managing its own data pipelines.
Notably, traditional data mesh principles advocate for domain teams owning the underlying platform or data storage layer. This ownership model can lead to some duplication of infrastructure, and some teams have instead adopted “data mesh-like” structures in which a platform team manages a more centralized platform. In either case, the connecting tissue between the domains and their data assets is a universal interoperability layer that enforces consistent syntax and data standards.
Why use a data mesh
Until recently, many companies leveraged a single data warehouse connected to myriad business intelligence platforms. Such solutions were maintained by a small group of specialists and frequently burdened by significant technical debt.
Today, the architecture du jour is a data lake with real-time data availability and stream processing, with the goal of ingesting, enriching, transforming, and serving data from a centralized data platform. For many organizations, this type of architecture falls short in a few ways:
- A central ETL pipeline gives teams less control over increasing volumes of data.
- As every company becomes a data company, different data use cases require different types of transformations, putting a heavy load on the central platform.
Such data lakes lead to disconnected data producers, impatient data consumers, and, worst of all, a backlogged data team struggling to keep pace with the demands of the business. Instead, domain-oriented data architectures, like data meshes, give teams the best of both worlds: a centralized database (or a distributed data lake) with domains (or business areas) responsible for handling their own pipelines.
Data meshes address the shortcomings of data lakes by giving data owners greater autonomy and flexibility, which facilitates data experimentation and innovation while lessening the burden on data teams to serve every data consumer through a single pipeline.
Meanwhile, the data mesh’s self-serve infrastructure-as-a-platform provides data teams with a universal, domain-agnostic, and often automated approach to data standardization, data product lineage, monitoring, alerting, logging, and quality metrics (in other words, data collection and sharing). Taken together, these benefits provide a competitive edge over traditional data architectures, which are often hamstrung by the lack of data standardization between ingestors and consumers.
This paradigm shift requires a new set of governing principles accompanied by a new language:
- Serving over ingesting
- Discovering and using over extracting and loading
- Publishing events as streams over flowing data around via centralized pipelines
- An ecosystem of data products over a centralized data platform
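As a rough illustration of these principles, the Python sketch below has a domain team register a data product in a shared catalog and publish events to it, while a consumer discovers the product by keyword and subscribes. Everything here is hypothetical: `DataCatalog`, its methods, and the `orders.placed` product are invented for the example, and the in-memory handler list merely stands in for a real event stream.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, object]


class DataCatalog:
    """Hypothetical in-memory catalog; a stand-in for a real streaming platform."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def register(self, product: str, owner_domain: str) -> None:
        # The owning domain serves the product; no central team ingests it.
        self._owners[product] = owner_domain

    def discover(self, keyword: str) -> List[str]:
        # Consumers find products by searching the catalog.
        return sorted(name for name in self._owners if keyword in name)

    def subscribe(self, product: str, handler: Callable[[Event], None]) -> None:
        if product not in self._owners:
            raise KeyError(f"unknown data product: {product}")
        self._subscribers[product].append(handler)

    def publish(self, product: str, event: Event) -> None:
        # Each event is pushed to every subscriber as soon as it is published.
        for handler in self._subscribers[product]:
            handler(event)


catalog = DataCatalog()
catalog.register("orders.placed", owner_domain="sales")

received: List[Event] = []
for name in catalog.discover("orders"):
    catalog.subscribe(name, received.append)

catalog.publish("orders.placed", {"order_id": 1, "total": 42.0})
```

The consumer never touches the sales team's storage directly; it only depends on the product name and the shape of the events.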
What challenges does a data mesh solve?
Even though organizations have access to ever-increasing data volume, they have to sort, filter, process, and analyze the data to derive practical benefits. Organizations often utilize a central team of engineers and scientists for managing data. The team uses a centralized data platform for the following purposes:
- Ingest the data from all the different business units (or business domains).
- Transform the data into a consistent, trustworthy, and useful format. For example, the team could make sure all dates in the system are in a common format or summarize daily reports.
- Prepare the data for data consumers, for example by generating reports for humans or preparing XML/YAML/Markdown files for applications.
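To make the transformation step concrete, here is a minimal Python sketch of the date example: it converts dates arriving in several formats into one common format. The list of input formats is an assumption made for illustration (including the choice to read `05/03/2024` as day first); a real pipeline would list the formats its sources actually emit.

```python
from datetime import datetime

# Assumed input formats, tried in order. Day-first for slash dates is an assumption.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def to_iso_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, trying each known format in order."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Rejecting unknown formats outright, rather than guessing, keeps bad dates from silently entering the shared data.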
As data volume increases, organizations face increasing costs to maintain the same agility as before. The monolithic system is difficult to scale for the following reasons.
Siloed data team
The central data team has specialist data scientists and engineers with limited business and domain knowledge. However, they still have to provide data for a diverse set of operational and analytical needs without a clear understanding of the motivation behind them.
Slow responsiveness to change
Data engineers typically implement pipelines that ingest the data and transform it over several steps before storing it in a central data lake. Any requested changes require modifications to the entire pipeline. The central team has to make these changes while managing conflicting priorities and with limited business domain knowledge.
Reduced accuracy
Business units are disconnected from the data consumers and the central data teams. As a result, they lack the incentive to provide meaningful, correct, and useful data.
Benefits of a data mesh
Data democratization: Data mesh architectures facilitate self-service applications from multiple data sources, broadening access to data beyond more technical users, such as data scientists, data engineers, and developers. By making data more discoverable and accessible, this domain-driven design reduces data silos and operational bottlenecks, enabling faster decision-making and freeing up technical users to prioritize tasks that better utilize their skillsets.
Cost efficiencies: This distributed architecture moves away from batch data processing and instead promotes the adoption of cloud data platforms and streaming pipelines to collect data in real time. Cloud storage provides an additional cost advantage by allowing data teams to spin up large clusters as needed, paying only for the storage specified. If you need additional compute power to run a job in a few hours rather than a few days, you can purchase additional compute nodes on a cloud data platform. This model also improves visibility into storage costs, enabling better budget and resource allocation for engineering teams.
Less technical debt: A centralized data infrastructure tends to accumulate more technical debt because of its complexity and the collaboration required to maintain the system. As data accumulates within a repository, it also begins to slow down the overall system. By distributing the data pipeline by domain ownership, data teams can better meet the demands of their data consumers and reduce technical strains on the storage system. They can also make data more accessible by providing APIs for consumers to interface with, reducing the overall volume of individual requests.
Interoperability: Under a data mesh model, data owners agree upfront on how to standardize domain-agnostic data fields, which facilitates interoperability. This way, when a domain team structures its datasets, it applies the relevant rules so that data can be linked across domains quickly and easily. Commonly standardized elements include field types, metadata, and schema flags. Consistency across domains enables data consumers to interface with APIs more easily and develop applications that serve their business needs more appropriately.
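One way to picture such an agreement is a shared contract that every domain checks its records against. The sketch below is illustrative only: the `SHARED_FIELDS` contract, its field names, and the `contains_pii` flag are hypothetical, and a real mesh would more likely rely on a schema registry or a dedicated validation tool.

```python
from typing import Dict, List

# Hypothetical shared contract: domain-agnostic fields every data product must expose.
SHARED_FIELDS: Dict[str, type] = {
    "customer_id": str,
    "event_time": str,     # assumed to hold an ISO 8601 timestamp string
    "contains_pii": bool,  # example of a schema flag
}


def contract_violations(record: Dict[str, object]) -> List[str]:
    """List the ways a record breaks the shared contract (empty means compliant)."""
    problems = []
    for field, expected in SHARED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems
```

Domain-specific fields beyond the contract are left alone, so each team can still shape the rest of its dataset as it sees fit.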
Security and compliance: Data mesh architectures promote stronger governance practices because they help enforce data standards for domain-agnostic data and access controls for sensitive data. This helps organizations follow government regulations, such as HIPAA restrictions, and the structure of this data ecosystem supports compliance by enabling data audits. Log and trace data in a data mesh architecture embed observability into the system, allowing auditors to understand which users are accessing specific data and the frequency of that access.
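The audit question, who accessed which data and how often, can be sketched as a simple count over access-log entries. The log entries and their `user` and `dataset` field names below are hypothetical; real field names depend on the logging system in use.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical access-log entries, made up for this example.
access_log = [
    {"user": "ana", "dataset": "claims.patient_visits"},
    {"user": "ben", "dataset": "claims.patient_visits"},
    {"user": "ana", "dataset": "claims.patient_visits"},
    {"user": "ana", "dataset": "sales.orders"},
]


def access_counts(entries: Iterable[Dict[str, str]]) -> Counter:
    """Count accesses per (user, dataset) pair."""
    return Counter((e["user"], e["dataset"]) for e in entries)


counts = access_counts(access_log)
```

An auditor could then look up any pair, for example `counts[("ana", "claims.patient_visits")]`, to see how often that user read that dataset.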
Should I build a data mesh?
Organizations handling multiple data sources and needing quick data experimentation and transformation should seriously think about adopting a data mesh approach. To assist in determining if a data mesh investment is suitable for your organization, we’ve developed an easy-to-use calculation. Just provide a numerical answer to each question, add them up, and get your overall ‘data mesh score.’
- Quantity of data sources. What is the number of data sources your company manages?
- Size of your data team. How many data analysts, data engineers, platform engineers, and product managers (if any) do you have on your data team?
- Number of data domains. How many functional teams (marketing, sales, operations, etc.) rely on your data sources to drive decision making, how many products does your company have, and how many data-driven features are being built? Add the total.
- Data engineering bottlenecks. On a scale of 1 to 10, with 1 meaning “never” and 10 meaning “always,” how often does the data engineering team become a bottleneck in implementing new data products?
- Data governance. How important is data governance for your organization on a scale of 1 to 10, with 1 meaning “it’s not a concern” and 10 meaning “it’s a top priority”?
Data mesh score breakdown
Generally, a higher score indicates more complex and demanding data infrastructure requirements, making a data mesh more beneficial. If you score above 10, adopting some data mesh best practices is likely a good idea. A score above 30 places your organization in the data mesh sweet spot, and joining the data revolution would be highly advisable.
Here’s the score breakdown:
1–14: Your data ecosystem’s size and simplicity might not necessitate a data mesh.
15–29: Your organization is rapidly maturing and may be at a pivotal point for leveraging data. We strongly recommend adopting some data mesh practices to ease future transitions.
30 or above: Your data organization is a key innovator, and a data mesh will support ongoing and future initiatives to democratize data and provide self-service analytics across the enterprise.
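The calculation can be sketched in a few lines of Python. The function and parameter names are made up for this example, and the code assumes that a score of exactly 15 or 30 belongs to the higher band.

```python
def data_mesh_score(data_sources, team_size, data_domains, bottleneck, governance):
    """Sum the five answers; the last two must be on the 1-10 scale."""
    for name, value in (("bottleneck", bottleneck), ("governance", governance)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10")
    return data_sources + team_size + data_domains + bottleneck + governance


def score_band(score):
    """Map a score to a band. Assumption: exactly 15 or 30 counts as the higher band."""
    if score >= 30:
        return "data mesh sweet spot"
    if score >= 15:
        return "adopt some data mesh practices"
    return "may not need a data mesh"


# Example: 8 data sources, a team of 6, 5 data domains, bottleneck 7, governance 9.
example_score = data_mesh_score(8, 6, 5, 7, 9)
example_band = score_band(example_score)
```

Because the first three answers are raw counts rather than 1-to-10 ratings, a large organization can land in the top band on size alone.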
As data becomes more prevalent and the needs of data consumers diversify, we expect data meshes to become increasingly common, especially for cloud-based companies with over 300 employees.