Data Lake vs Data Warehouse

By Caleb Ochs on May 19, 2022 1:08:16 PM MDT

Data Lakes are rising in popularity, with some market analysts attributing nearly 33% of the value chain market to them [1]. It is important to understand what a Data Lake is and how it compares to a Data Warehouse.

What is a Data Lake?

A Data Lake is a centralized repository for structured, semi-structured, and unstructured data of all kinds. It stores data in its original format, without transformation or cleansing, and can scale cost-effectively to meet the needs of enterprise organizations.

A Data Lake's primary goal is to provide data scientists and analysts with a single repository of all the organization's data. They can use the Data Lake to execute deep analysis, combine disparate data sets, and distribute reports to support organizational operations.

Benefits of a Data Lake

Implementing a Data Lake enables an organization to create a centralized repository of their data quickly. Loading data to the Data Lake is straightforward, only requiring a connection to the source data. Instead of the traditional ETL (Extract, Transform, Load) framework, Data Lakes are implemented using ELT (Extract, Load, Transform). Transformation of the data is completed by analysts, scientists, and report writers after the Data Lake is established. A Data Lake makes data available quickly, with the caveat that analysts and data scientists will later need to do the hard work of scrubbing and modeling the data for effective reporting and analysis.
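To make the ELT order of operations concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular product: the "lake" is simply a temporary local folder, and the file name, sample rows, and the helper functions `load_raw` and `transform_on_read` are all hypothetical. The point is only that the load step stores the data untouched, while cleaning happens later, when an analyst reads it.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical raw export from a source system; note the messy values.
RAW_EXPORT = "order_id,amount,region\n1, 19.99 ,west\n2,not_a_number,East\n3,5.00, WEST \n"


def load_raw(lake_dir: Path, name: str, payload: str) -> Path:
    """Extract + Load: land the file in the lake exactly as received."""
    target = lake_dir / "raw" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(payload, encoding="utf-8")
    return target


def transform_on_read(raw_file: Path) -> list[dict]:
    """Transform: done later by the analyst, only when the data is needed."""
    cleaned = []
    with raw_file.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            try:
                amount = float(row["amount"].strip())
            except ValueError:
                continue  # skip rows whose amount cannot be parsed
            cleaned.append({
                "order_id": int(row["order_id"]),
                "amount": amount,
                "region": row["region"].strip().lower(),
            })
    return cleaned


# A temporary folder stands in for cloud storage (an assumption for this sketch).
with tempfile.TemporaryDirectory() as tmp:
    raw_path = load_raw(Path(tmp), "orders.csv", RAW_EXPORT)
    stored_unchanged = raw_path.read_text(encoding="utf-8") == RAW_EXPORT
    orders = transform_on_read(raw_path)

print(stored_unchanged)  # True: the lake holds the data as it arrived
print(len(orders))       # 2: one unparseable row was dropped at read time
```

In a traditional ETL flow, the cleaning logic in `transform_on_read` would run before anything is stored, so the unparseable row would never reach the repository. Here it stays in the raw file, available to anyone who later wants to handle it differently.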

Cloud storage is inexpensive, making enterprise use of cloud storage infrastructure a growing trend [2]. Specifically, a Data Lake in the cloud can be a more cost-effective option than traditional Data Warehouse storage. For example, Microsoft Azure offers five terabytes (TB) of data storage for $200 per month [3]. By comparison, five TB of data storage in a Data Warehouse (e.g., in Azure Synapse Analytics) could cost as much as $1,200 per month [4].
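The comparison is easier to see per terabyte. The short Python sketch below works only from the two figures quoted above; the variable names are illustrative, and it assumes cost scales linearly with volume, which real pricing tiers may not. The warehouse figure is the article's "as much as" number, so treat the results as an upper-bound comparison rather than a quote.

```python
# Figures quoted in the article (2021 pricing; check current pricing pages).
STORAGE_TB = 5
LAKE_MONTHLY_USD = 200
WAREHOUSE_MONTHLY_USD = 1200  # "as much as" figure, i.e. an upper bound

# Assumes cost scales linearly with volume (a simplification).
lake_per_tb = LAKE_MONTHLY_USD / STORAGE_TB
warehouse_per_tb = WAREHOUSE_MONTHLY_USD / STORAGE_TB
cost_ratio = WAREHOUSE_MONTHLY_USD / LAKE_MONTHLY_USD
annual_difference = (WAREHOUSE_MONTHLY_USD - LAKE_MONTHLY_USD) * 12

print(f"Data Lake:      ${lake_per_tb:.0f} per TB per month")       # $40
print(f"Data Warehouse: ${warehouse_per_tb:.0f} per TB per month")  # $240
print(f"Warehouse costs up to {cost_ratio:.0f}x as much")           # 6x
print(f"Difference over a year: ${annual_difference:,}")            # $12,000
```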

Data Lakes offer support for machine learning (ML) and artificial intelligence (AI). Azure provides tools that make ML and AI quick and easy to implement. Big data processing tools such as Hadoop and Spark can also be deployed on top of a Data Lake, making it a valuable asset for predictive and diagnostic analytics.

Analysts can connect to the raw data with various analytical tools, including Power BI, to visualize data. With an Azure Data Lake, users can take advantage of the native Power BI connector to quickly find the data files they need for their analysis.

The design of your Data Lake will be driven by the data available rather than by specific reporting requirements, which can be cumbersome to define, or by the available technology, which may change with time. Importing new data to a Data Lake is simply a matter of moving data, making it an expedient way to provide data access to analysts and report writers.

Comparison: Data Lake vs. Data Warehouse

In Summary...

The primary benefits of a Data Lake are centralized data and wide reporting support. Though these benefits have been shown to enable an increase in organizational growth [5], the Data Lake should not be considered a replacement for a traditional Data Warehouse. Rather, the two work best in conjunction with each other to support the operational and analytical needs of the organization.


References

  1. Data Lake Market to hit US $24,308 million by 2025 (2020) [Market Research]. Adroit Market Research https://www.globenewswire.com/news-release/2020/11/24/2132790/0/en/Data-Lake-Market-to-hit-US-24-308-0-million-by-2025-Global-Insights-on-Trends-Value-Chain-Analysis-Leading-Players-Growth-Divers-Key-Opportunity-and-Future-Outlook-Adroit-Market-Re.html
  2. Data Storage Trends in 2020 and Beyond (2019) [White Paper]. Spiceworks https://www.spiceworks.com/marketing/reports/storage-trends-in-2020-and-beyond/
  3. Microsoft Azure Storage Overview Pricing (2021) [Service Offering]. Microsoft https://azure.microsoft.com/en-us/pricing/details/storage/
  4. Microsoft Azure Synapse Analytics Pricing (2021) [Service Offering]. Microsoft https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/
  5. Angling For Insight In Today's Data Lake (2017) [Analysis Report]. Michael Lock, Senior Vice President, Analytics and Business Intelligence (Aberdeen) https://s3-ap-southeast-1.amazonaws.com/mktg-apac/Big+Data+Refresh+Q4+Campaign/Aberdeen+Research+-Angling+for+Insights+in+Today's+Data+Lake.pdf

Written by Caleb Ochs

Caleb Ochs is the VP of Delivery Operations at Blue Margin Inc.