Overview
In this episode of The Dashboard Effect, Brick Thompson and Landon Oaks take on a foundational data architecture decision that has long-term consequences for every analytics initiative: should you store raw data or pre-aggregated data? Their answer is clear and consistent throughout the conversation. In modern cloud environments, raw data is almost always the right choice, and the historical reasons for aggregating early no longer hold up the way they once did.
The episode is a useful reference for any team making storage and pipeline decisions today that they will have to live with for years. See how Blue Margin’s Managed Data Service applies these architectural principles to build data environments that preserve the flexibility and granularity organizations need to answer the questions they have not thought to ask yet.
What This Episode Covers
Why Raw Data Is Preferred (0:30 – 2:07)
The core argument for raw data is flexibility. When you pre-aggregate, you lock in assumptions about what questions will be asked and at what level of detail. Business needs change, new analysis requirements emerge, and the logic that seemed sufficient at build time rarely stays sufficient for long. Raw data preserves the ability to revisit those decisions without rebuilding from scratch.
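As a minimal sketch of that lock-in, consider the example below. The order rows, column layout, and the `total_by` helper are all hypothetical and are not taken from the episode. The daily total stands in for an aggregate chosen at build time, and the per-product total is the question that arrives later.

```python
from collections import defaultdict

# Hypothetical raw order events: (date, region, product, amount)
raw_orders = [
    ("2024-01-01", "west", "widget", 100.0),
    ("2024-01-01", "west", "gadget", 50.0),
    ("2024-01-01", "east", "widget", 70.0),
    ("2024-01-02", "east", "gadget", 30.0),
]

def total_by(rows, key_index):
    """Sum the amount column (index 3) grouped by the column at key_index."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[3]
    return dict(totals)

# The aggregate chosen at build time: revenue per day.
daily_revenue = total_by(raw_orders, 0)

# A later question: revenue per product. This is answerable only from the
# raw rows, because daily_revenue no longer carries the product column.
product_revenue = total_by(raw_orders, 2)

print(daily_revenue)
print(product_revenue)
```

If only `daily_revenue` had been stored, the per-product breakdown could not be recovered from it.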
The Changing Economics of Storage (1:30 – 2:07)
The historical case for aggregation was largely economic: storage and compute were expensive, and keeping everything was not practical. In modern cloud environments that constraint has largely fallen away. Storage costs have dropped to the point where retaining massive datasets is a low-cost, high-value insurance policy against future reporting needs that no one can fully anticipate at the outset.
When Aggregation Makes Sense (3:00 – 4:15)
There are legitimate use cases for aggregation. When data is captured at extremely high frequency, down to fractions of a second, a raw trend line becomes unreadable and aggregation is necessary for the data to be useful. Brittle legacy systems that cannot handle raw data extraction present a practical constraint as well. Outside of scenarios like these, the default should be to keep everything.
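The high-frequency case might look something like the sketch below, which averages sub-second readings into one-second buckets for display. The sample readings, the quarter-second sampling interval, and the `downsample` helper are assumptions made for illustration. The raw list is left untouched, so the aggregate is a view for charting, not a replacement.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sensor readings: (timestamp in seconds, value), every 0.25 s
readings = [
    (0.00, 10.0), (0.25, 12.0), (0.50, 11.0), (0.75, 13.0),
    (1.00, 20.0), (1.25, 22.0), (1.50, 21.0), (1.75, 23.0),
]

def downsample(rows, bucket_seconds):
    """Average values into fixed-width time buckets keyed by bucket start."""
    buckets = defaultdict(list)
    for ts, value in rows:
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}

# One point per second is readable on a trend line; the raw readings remain
# available for any later analysis that needs finer detail.
per_second = downsample(readings, 1)
print(per_second)
```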
Best Practices for Aggregation (6:09 – 7:06)
When aggregation is unavoidable, the hosts offer guidance on how to do it without creating unnecessary technical debt. Do not limit the columns you select, maintain enough context to understand what the aggregated data represents, and ensure there is a strategy in place to access the underlying raw data if a future requirement demands it. Aggregation done well leaves doors open. Aggregation done carelessly closes them.
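One way to read that guidance is sketched below. The field names, the choice of count, minimum, and maximum as context, and the `raw_source` location are hypothetical and are not a method the hosts prescribe. The idea is that each summary row keeps its grouping columns, says how many raw rows it represents, and records where those raw rows can be found.

```python
from collections import defaultdict

# Hypothetical raw rows: (date, store, amount)
raw_rows = [
    ("2024-01-01", "A", 10.0),
    ("2024-01-01", "A", 30.0),
    ("2024-01-01", "B", 5.0),
]

def summarize(rows, raw_location):
    """Aggregate per (date, store) while keeping context about the raw data."""
    groups = defaultdict(list)
    for date, store, amount in rows:
        groups[(date, store)].append(amount)
    return [
        {
            "date": date,
            "store": store,
            "row_count": len(amounts),   # how many raw rows this line represents
            "total": sum(amounts),
            "min": min(amounts),
            "max": max(amounts),
            "raw_source": raw_location,  # illustrative pointer back to raw data
        }
        for (date, store), amounts in sorted(groups.items())
    ]

summary = summarize(raw_rows, "archive/sales/2024-01-01")
print(summary[0])
```

The extra fields cost little, and they let a later reader judge what a summary row means and where to look if it turns out to be insufficient.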
Managing Large Volumes of Raw Data
Storing billions of rows of raw data presents real technical challenges, but the hosts are clear that these are manageable through partitioning and modern cloud-based processing. The challenges of working with large raw datasets are engineering problems with known solutions. The challenges of discovering you aggregated away data you needed are often not solvable at all.
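To give a rough sense of what partitioning means in practice, the sketch below splits raw events into one folder per month so that a query for a single month reads only that month's file. The event rows, the `month=YYYY-MM` folder naming, and both helper functions are assumptions made for illustration. Production systems would typically rely on a cloud storage layer and a columnar file format instead of local CSV files.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical raw events: (date, event_type, user_id)
events = [
    ("2024-01-15", "login", 1),
    ("2024-01-20", "purchase", 2),
    ("2024-02-03", "login", 3),
]

def write_partitioned(rows, root):
    """Write rows into one CSV file per month under root/month=YYYY-MM/."""
    by_month = defaultdict(list)
    for row in rows:
        by_month[row[0][:7]].append(row)
    for month, month_rows in by_month.items():
        part_dir = Path(root) / f"month={month}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            csv.writer(f).writerows(month_rows)

def read_partition(root, month):
    """Read only the file for one month, leaving other partitions untouched."""
    path = Path(root) / f"month={month}" / "part-0.csv"
    with open(path, newline="") as f:
        return [tuple(r) for r in csv.reader(f)]

root = tempfile.mkdtemp()
write_partitioned(events, root)

# Only the January partition is opened; values come back as strings from CSV.
january = read_partition(root, "2024-01")
print(january)
```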
Who It’s For
This episode is worth your time if you are a data engineer or architect making pipeline and storage decisions for a new or evolving analytics environment, a technology leader evaluating the long-term cost of data architecture choices made early in a project, or an analyst who has encountered a reporting requirement that could not be met because the underlying raw data was not retained. It is equally relevant to any organization that is building a data foundation and wants to avoid the most common form of self-inflicted technical debt.
Why It’s Worth a Listen
The economics argument alone is worth the listen. The instinct to aggregate early is often a habit inherited from an era when storage constraints made it necessary, and many teams are still operating under those assumptions without realizing the constraints no longer exist. This episode makes the updated case clearly and gives teams a principled reason to change their default behavior.
The guidance on how to aggregate when you must is equally practical. Most discussions of this topic treat it as binary, but Brick and Landon acknowledge the real-world scenarios where aggregation is unavoidable and offer a thoughtful set of practices for minimizing the damage when that is the case.
For any team that has ever had to tell a stakeholder that a reporting requirement cannot be met because the data was not retained at the right level of granularity, this episode makes the case for why that conversation should never have to happen again.