The growing popularity of data products has spawned the development of a new technology for publishing and consuming data products: internal data marketplaces. 

Unlike commercial platforms, such as AWS Data Exchange or Snowflake Marketplace, internal data marketplaces are designed to foster data sharing among business units and departments within an enterprise.

But what if your organization has already invested in an enterprise data catalog? Isn’t that supposed to support one-stop shopping for data? Well, yes and no. Today, a data catalog gets users halfway to data, while a data marketplace closes the loop.

Why a Data Marketplace

A data marketplace makes it easy for data producers to create, publish, manage, and monitor data products and collaborate with data consumers. Conversely, it makes it easy for data consumers—whether internal or external to an organization—to find, evaluate, sample, and subscribe to data products and ask questions of data producers. 

Think of a data marketplace as a retail store for data. Rather than searching for products in a distribution warehouse, consumers can shop for data in an environment tailor-made for them.

A data marketplace fills a vacancy in large data environments, especially ones adhering to a data mesh approach, which distributes data product development across numerous business and data domains. A data marketplace sits on top of a data platform, whether a data lakehouse, data mesh, or data fabric, and together they create what Eckerson Group calls a Data Product Platform (DPP). (See figure 1.) I predict most large organizations will have an internal data marketplace within three to five years.

Figure 1. Data Product Platform

Role of a Data Catalog

A data catalog, on the other hand, is designed to index metadata so that business users—mostly power users—can discover and evaluate data assets that they might want to leverage in an ad hoc analysis. 

Data catalogs are not designed to provide direct access to data itself. The implication is that business users who find relevant assets in a data catalog will then use their analytical tool of choice to access and use those assets. Some data catalogs now offer provisioning mechanisms through which business users can submit a ticket to request access to an asset, and some even provide a direct query mode, but most IT administrators turn off this function for security reasons.

Given their exquisite ability to corral metadata from across the organization, data catalogs have become a critical tool for data stewards and data curators to validate and describe data and automate data governance functions.

Although data catalogs started as discovery tools, they have quickly become the foundation for metadata management and a key tool for data developers and administrators to monitor quality and compliance, troubleshoot issues, and evaluate the impact of schema and other changes on downstream applications.

Comparing Data Marketplaces and Data Catalogs

Both data catalogs and data marketplaces offer a single place for users to go to discover data. But that’s where the similarities end. 

A data catalog indexes all data assets, while a data marketplace lists only data products, which are typically a subset of data assets or are produced from one or more assets. Data assets can be anything for anybody, while data products are designed to serve an explicit target audience with high-value, exquisitely governed data. (See table 1.)

Table 1. Data Catalogs Versus Data Product Platforms
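
The difference between indexing every asset and listing only curated products can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, fields, and sample data below are hypothetical and do not come from any particular catalog or marketplace product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataAsset:
    """Anything a catalog indexes: a table, file, report, and so on."""
    name: str
    kind: str


@dataclass(frozen=True)
class DataProduct:
    """A curated offering built from one or more assets for a named audience."""
    name: str
    audience: str
    source_assets: Tuple[str, ...]


def catalog_index(assets: List[DataAsset]) -> List[str]:
    """A catalog indexes every asset, whoever it may be useful to."""
    return sorted(asset.name for asset in assets)


def marketplace_listing(products: List[DataProduct]) -> List[str]:
    """A marketplace lists only published products, never raw assets."""
    return sorted(product.name for product in products)


# Hypothetical sample data: three raw assets, one product built from two of them.
assets = [
    DataAsset("crm.accounts", "table"),
    DataAsset("web.clickstream", "file"),
    DataAsset("finance.ledger", "table"),
]
products = [
    DataProduct("customer-360", "marketing analysts",
                ("crm.accounts", "web.clickstream")),
]

print(catalog_index(assets))          # every asset appears in the catalog
print(marketplace_listing(products))  # only the curated product is for sale
```

In this sketch, the product records which assets it was built from, so a catalog could still supply the lineage behind whatever the marketplace puts on the shelf.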

A data catalog today is designed for data discovery and governance, while a data marketplace is designed for data sharing. 

Power users and stewards are the primary users of data catalogs, while data producers and consumers use a data marketplace. Data stewards and curators manage and maintain data catalogs, while data product managers do the same for DPPs. Catalog users are internal to an organization, while DPP data consumers can also be external.

Unlike data catalogs, data marketplaces predefine access to data, which eliminates data sharing friction. Data producers can define upfront who can access a data product, for how long, and for what purpose.

This eliminates the frustrating delays people experience when they submit requests and wait days, weeks, or months for permission to access data. In a data marketplace, the basic transaction is a subscription: users sign up to get an automated feed of data for a set period of time, delivered to their target of choice in their preferred format.
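
To make the idea of predefined access concrete, here is a minimal Python sketch of a subscription that is approved on the spot when it fits the terms the producer set in advance. Everything in it is an assumption made for illustration: the `AccessPolicy` and `Subscription` classes, the `subscribe` function, and the sample values are hypothetical and are not taken from any marketplace product.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import FrozenSet


@dataclass(frozen=True)
class AccessPolicy:
    """Terms a producer attaches to a data product before publishing it."""
    allowed_groups: FrozenSet[str]    # who can access the product
    max_days: int                     # for how long
    allowed_purposes: FrozenSet[str]  # for what purpose


@dataclass(frozen=True)
class Subscription:
    """An approved feed of a data product to a consumer's chosen target."""
    product: str
    subscriber_group: str
    purpose: str
    target: str
    data_format: str
    starts: date
    ends: date


def subscribe(product: str, policy: AccessPolicy, group: str, purpose: str,
              days: int, target: str, data_format: str,
              start: date) -> Subscription:
    """Approve immediately if the request fits the predefined policy."""
    if group not in policy.allowed_groups:
        raise PermissionError(f"{group} is not entitled to {product}")
    if purpose not in policy.allowed_purposes:
        raise PermissionError(f"{purpose!r} is not an approved purpose")
    if not 1 <= days <= policy.max_days:
        raise ValueError(f"duration must be 1 to {policy.max_days} days")
    return Subscription(product, group, purpose, target, data_format,
                        start, start + timedelta(days=days))


# Hypothetical policy and request: no ticket, no waiting for approval.
policy = AccessPolicy(frozenset({"marketing"}), 90,
                      frozenset({"campaign-analysis"}))
feed = subscribe("customer-360", policy, "marketing", "campaign-analysis",
                 30, "s3://example-bucket/feeds/", "parquet", date(2024, 1, 1))
print(feed.ends)
```

The point of the sketch is that the approval decision is made once, when the product is published, rather than each time a consumer asks for data.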

Convergence

It’s silly and expensive for organizations to invest in two data libraries: a data catalog and a data marketplace. Thus, organizations that already have an enterprise data catalog use it as a de facto data marketplace. This works fine for power users but is insufficient for casual users, and it doesn’t work at all for external users.

I’ve been prodding data catalog vendors to jump on the data marketplace bandwagon for months and create marketplace extensions to their data catalogs. Some have, namely Informatica and Zeenea. And Alation recently announced that it would follow suit. Meanwhile, data pipeline and data fabric vendors have been jumping into the fray, offering data marketplaces as extensions to their platforms.

Consequently, it’s likely that organizations will have a mishmash of data catalogs and data marketplaces that spring up in various departments and business units. At some point, data and business leaders will need to reconcile these various investments, either by standardizing on a single platform or by integrating them into a hub-and-spoke architecture.

Conclusion

Today, organizations that generate dozens or more data products should consider investing in a data marketplace, even if they already have a data catalog. That’s because these two platforms serve distinct purposes and audiences. A data catalog serves as a foundation for data governance and metadata management, while data marketplaces ease the friction of sharing and consuming data.

In the future, I hope that data marketplaces become extensions to data catalogs. This would be the easiest pathway for organizations, assuming that the extensions do justice to the retail data shopping experience. However, I suspect that data marketplaces and data catalogs will blossom throughout every organization because they’ll be built as extensions to existing products. We’ve already seen this happen with data catalogs and we’re starting to see it happen with data marketplaces.

Meanwhile, there are several pure-play data marketplace vendors, such as Harbr Data, that organizations should evaluate now if their data and domain teams are investing heavily in data products. These pure-play products also set the bar for the kind of producer and consumer experience that you should expect in a data marketplace, whether it is an extension of an existing product or not.

This article was originally published on Eckerson Group.