MarTech Landscape: What’s the difference between a data warehouse and a data lake?
For marketers, the difference is more than just the choice of metaphors.
It might seem odd to ask a marketer if they’d like their data in something described metaphorically as a building or a body of water.
In this article, part of our MarTech Landscape Series, we look at the characteristics of these two types of massive data storage.
Digital marketers are increasingly working with big data, the huge amounts of raw information pouring from social media, contact centers, online behavioral tracking and other sources. And two of the most common kinds of storage for large amounts of data are “data warehouses” and “data lakes.”
While marketers obviously involve IT in storage decisions, knowing what kind of data storage is employed helps you understand the capabilities and costs of your systems.
A data warehouse provides storage for data that is typically structured for databases as it enters, and the data often comes from operational systems, including transactions, customer records, human resources, customer relationship management systems, enterprise resource planning systems and so on. The data is usually sifted and prepared carefully before being stored in a warehouse, which is often the preferred mechanism if the information is legally binding and needs to be traceable.
A warehouse can store unstructured data like body cam footage from police officers, said James D’Arezzo, CEO of storage performance provider Condusiv Technologies. Even though that kind of data is not typically structured for a database, it can enter as a list of files. But, like the physical structures they are named after, data warehouses are designed primarily for storing data that is properly sorted, filtered and packaged when it enters.
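To make the "sorted, filtered and packaged when it enters" idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular warehouse product works: the in-memory SQLite database stands in for a warehouse, and the `purchases` table, its columns and the `load_into_warehouse` helper are all hypothetical names invented for this example.

```python
import sqlite3

# Hypothetical stand-in for a warehouse: an in-memory SQLite table with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    " customer_id INTEGER NOT NULL,"
    " sku TEXT NOT NULL,"
    " amount_usd REAL NOT NULL CHECK (amount_usd >= 0))"
)

def load_into_warehouse(record):
    """Prepare a raw record and insert it; return False if it does not fit the schema."""
    try:
        # Preparation happens before storage: pick the expected fields and coerce types.
        row = (
            int(record["customer_id"]),
            str(record["sku"]),
            float(record["amount_usd"]),
        )
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO purchases VALUES (?, ?, ?)", row)
        return True
    except (KeyError, TypeError, ValueError, sqlite3.IntegrityError):
        # Anything that is missing a field or fails a constraint is turned away.
        return False

accepted = load_into_warehouse({"customer_id": "42", "sku": "A-100", "amount_usd": "19.99"})
rejected = load_into_warehouse({"customer_id": "42", "note": "free-form social post"})
row_count = conn.execute("SELECT COUNT(*) FROM purchases").fetchone()[0]
```

In this sketch the transaction record is accepted and the free-form record is refused, which mirrors the article's point that a warehouse is selective at the door.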
As the names imply, data lakes are more amorphous than warehouses. They store all kinds of data from any source, including video feeds, audio streams, facial recognition data, social media posts, and the like.
Lakes sometimes use artificial intelligence to characterize the inflowing data, such as naming it, but the formatting, processing and management of the data are usually undertaken when it is exported for a given need, not before it is stored. While warehouses are typically much more discriminating in what kinds of data they allow in, lakes accept virtually everything.
Although lakes aren’t necessarily faster for accepting or processing data, D’Arezzo told me, their data managers don’t have to create structures and incoming criteria to accept the data. For a marketer, he added, lakes mean a greater depth and breadth of data sources than in a warehouse.
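The store-first, shape-later pattern described for lakes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a real lake implementation: the `lake` list stands in for bulk storage, and `store_raw`, `export_mentions` and the sample records (including the brand name "Acme") are hypothetical.

```python
import json

# Hypothetical "lake": an append-only list of raw items, standing in for bulk storage.
lake = []

def store_raw(source, payload):
    """Accept anything as-is; only a source label is attached on the way in."""
    lake.append({"source": source, "payload": payload})

def export_mentions(brand):
    """Shape the data at export time: pull out text items that mention the brand."""
    results = []
    for item in lake:
        payload = item["payload"]
        if isinstance(payload, str):
            try:
                payload = json.loads(payload)  # parse only when the data is needed
            except json.JSONDecodeError:
                continue
        if isinstance(payload, dict):
            text = str(payload.get("text", ""))
            if brand.lower() in text.lower():
                results.append({"source": item["source"], "text": text})
    return results

store_raw("social", '{"user": "ana", "text": "Loving my new Acme shoes"}')
store_raw("contact_center", {"call_id": 7, "text": "Customer asked about Acme returns"})
store_raw("video_feed", b"\x00\x01\x02")  # opaque binary is stored too, untouched
mentions = export_mentions("acme")
```

Nothing is rejected on the way in, and the structure a marketer cares about (here, brand mentions) is imposed only when the data is pulled out for a specific need.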
Why this matters to marketers
Data management systems can employ both warehouses and lakes, or they might focus on one type or the other. D’Arezzo recommends that marketers understand the kind of storage where their data lives, the analytical tools available, the integration with systems that can act on the data, costs, any performance issues, and whether the storage resides on the company’s physical premises, in the public cloud, in the company’s private cloud, or in some combination.
In terms of costs, preparing data before it is stored in a warehouse can be expensive and time-consuming. Warehouses have traditionally stored their huge amounts of data on cheap but slow magnetic tape, while lakes often use commodity drives.
D’Arezzo also notes that marketers sometimes don’t know what they want to do with the data before it is stored, so it might be limiting or difficult to prepare it for an unknown purpose. Facial recognition data, social posts or data from Internet of Things devices, he said, can fall into that category, where it might be better to store first and decide later.
Warehouse vendors include IBM, Google, Microsoft, Teradata and SAP, while some lake vendors are AWS, Microsoft, Informatica and Teradata.