I was excited to attend the Gartner Data & Analytics Summit, held in early June this year, where I participated in more than 50 sessions focused on strategies and discussions to help data and analytics leaders meet the demands of today and the future.
Across these sessions, one recurring concept was active metadata: the hidden force enabling 2021's top trends, including augmented data catalogs, autonomous DataOps, data fabric and data mesh, data and analytics governance, and the consumerization of data tools.
In this article, I’ll unpack the basics of active metadata and share my five takeaways to help you leverage active metadata and build a forward-looking data stack.
We like to think of the modern data stack as a magical solution, but even modern data teams with modern infrastructure often struggle to find and document their data.
Today’s data teams are facing these cataloging challenges:
- They have little insight into what data lives where.
- They often spend more time looking for data than actually analyzing it.
- They find it difficult to share context across data assets among business users.
Today’s traditional data catalogs just don’t address these issues well.
However, machine learning–augmented data catalogs actively crawl and interpret metadata to fix these problems. They enable real-time data discovery, automatic cataloging of data assets, and better context around data, all of which significantly lowers the time it takes to go from problem to insight.
Not sure what a machine learning data catalog (MLDC) is? Learn more here.
“By 2023, organizations utilizing active metadata, machine learning, and data fabrics to dynamically connect and automate data management processes will reduce their time to data delivery, and its impact on value, by 30%, Gartner expects.”
— Roberto Torres, CIO Dive
Traditional data catalogs just passively contain and organize technical metadata — i.e. basic information about an organization’s data. Active metadata, though, pervasively finds, enriches, inventories, and uses all these kinds of metadata, taking a traditionally “passive” technology and making it truly action-oriented.
This helps organizations maximize the value of their data and find deeper insights as the catalog delves into user activity, connections across data assets, and more. Activating metadata is thus the first and most significant step towards setting up a DataOps framework that works for diverse data users in an organization.
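To make “activating” metadata concrete, here is a minimal sketch of the idea: passive catalog entries get enriched with usage signals mined from a query log. All names here (`Asset`, `activate`, the log format) are illustrative assumptions, not any vendor’s actual API.

```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class Asset:
    """A catalog entry: passive technical metadata plus active enrichment."""
    name: str
    columns: list
    popularity: int = 0                       # enriched from the query log
    top_users: list = field(default_factory=list)

def activate(assets, query_log):
    """Enrich passive catalog entries with usage signals from a query log."""
    usage = Counter(q["asset"] for q in query_log)
    users_per_asset = {}
    for q in query_log:
        users_per_asset.setdefault(q["asset"], Counter())[q["user"]] += 1
    for a in assets:
        a.popularity = usage.get(a.name, 0)
        a.top_users = [u for u, _ in
                       users_per_asset.get(a.name, Counter()).most_common(2)]
    return assets

# Usage: rank assets by how often they are actually queried,
# so the catalog surfaces living data instead of stale dumps.
catalog = [Asset("orders", ["id", "amount"]), Asset("legacy_dump", ["blob"])]
log = [{"asset": "orders", "user": "ana"},
       {"asset": "orders", "user": "ben"},
       {"asset": "orders", "user": "ana"}]
activate(catalog, log)
```

The point of the sketch is the direction of data flow: instead of humans manually documenting assets, behavioral metadata flows back into the catalog automatically.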
A data fabric is a unified environment — made up of an architecture and data services running on top of that architecture — that helps organizations manage their data. Think of it as a “fabric” that stretches across all different data sources and endpoints.
“A data fabric utilizes continuous analytics over existing, discoverable and inferenced metadata assets to support the design, deployment and utilization of integrated and reusable data across all environments, including hybrid and multi-cloud platforms.”
— Ashutosh Gupta, Gartner
The data fabric is not one ready-made tool or technology. Instead, it’s composed of various tools, resources, and processes. The data fabric is an emerging design framework that identifies and connects data from disparate applications to discover unique, business-relevant relationships between the available data points.
No standalone tool or solution today is equipped to serve as a full-fledged data fabric architecture. Instead, the starting point is to invest in metadata management solutions. These need to support metadata ingestion, sharing, curation, activation, and representation with a knowledge graph. Getting metadata right is the first step to setting up a composable data fabric for your data system.
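A knowledge graph of metadata can be sketched very simply: assets become nodes, and typed relationships (lineage, ownership) become edges. The class and relation names below are hypothetical, purely to illustrate how a graph representation makes lineage queries trivial.

```python
from collections import defaultdict

class MetadataGraph:
    """A minimal metadata knowledge graph: assets as nodes, typed edges
    like 'derived_from' or 'owned_by' connecting them."""
    def __init__(self):
        self.edges = defaultdict(list)   # node -> [(relation, target)]

    def add(self, source, relation, target):
        self.edges[source].append((relation, target))

    def upstream(self, asset):
        """Walk 'derived_from' edges to find every upstream source (lineage)."""
        seen, stack = set(), [asset]
        while stack:
            node = stack.pop()
            for relation, target in self.edges[node]:
                if relation == "derived_from" and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

g = MetadataGraph()
g.add("revenue_dashboard", "derived_from", "orders_clean")
g.add("orders_clean", "derived_from", "orders_raw")
g.add("orders_clean", "owned_by", "data_eng_team")
print(sorted(g.upstream("revenue_dashboard")))  # ['orders_clean', 'orders_raw']
```

Because relationships are first-class, the same structure answers very different questions (lineage, ownership, impact analysis) without reshaping the data, which is what makes a graph a natural backbone for a composable data fabric.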
The modern data stack is fast-evolving and diverse. It’s easy to scale up in seconds with little overhead, but bringing governance, trust, and context to data can be a pain — and that’s where active metadata makes itself indispensable in the ecosystem.
In the past, data catalogs and management tools were built for more technical users like data engineers and scientists. But these platforms are increasingly consumerized: designed for everyone to use in their daily work.
Less technical business staff now expect to quickly access and use trustworthy data — not by emailing an engineer, but with easy self-service tools.
Rather than being a burden, these users provide a new perspective that’s a big plus as organizations figure out how to use and structure their data.
As more business users dive into data, enterprises are realizing the importance of reducing data discovery and prep time and providing plenty of context to help less technical users generate and act on their insights. That’s why traditional data management tools are starting to give way to modern metadata management tools (e.g. Atlan, my company) that are focused on great end-user experiences, not unlike what we have seen with modern enterprise tech products like Slack.
These modern tools are not only more accessible to business users but also significantly improve productivity for data engineering teams. This is in line with the broader trend of the consumerization of enterprise tech, led by new tools like Slack and Notion.
To make sense of and trust data, it’s imperative that data not be in silos. Multiple levels of hierarchy and management make data problems worse when no one knows who is looking for a particular dataset and why.
That’s why governing data isn’t just about putting restrictions on data access but also about democratizing data and ensuring it reaches the right users at the right time.
The ultimate goal of data governance is to empower smoother and faster decision-making.
Today we are seeing a convergence, where data governance is becoming an active part of the data analytics pipeline. Rather than a nice bonus, governance is now a must-have feature for modern data ecosystems that need to be open and more accessible to all without compromising on data security or compliance.
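One way to picture “open by default, restricted where necessary” governance is as policy rules evaluated at access time. The roles, classifications, and rule table below are entirely hypothetical, a sketch of the idea rather than any real platform’s model.

```python
# Illustrative sketch: governance as access-time policy evaluation.
# Roles and classifications are hypothetical, not a real product's API.
POLICIES = {
    "public":       {"analyst", "engineer", "exec"},   # democratized by default
    "confidential": {"engineer", "exec"},
    "restricted":   {"exec"},                          # tightened only here
}

def can_access(role: str, classification: str) -> bool:
    """Return True if the role may read data with this classification."""
    return role in POLICIES.get(classification, set())

# Usage: most data stays broadly accessible; only sensitive tiers narrow.
print(can_access("analyst", "public"))      # True
print(can_access("analyst", "restricted"))  # False
```

The design choice worth noting is the default: the widest audience is granted access unless a classification explicitly narrows it, which matches the article’s framing of governance as democratization plus targeted restriction.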