This line is spoken by Dorothy, the Tin Man, and the Scarecrow, played by Judy Garland, Jack Haley, and Ray Bolger (respectively) in the film *The Wizard of Oz* (1939), directed by Victor Fleming.
While walking through the dark forest, Dorothy, the Tin Man, and the Scarecrow get scared when they start hearing noises coming from the foliage around them. The Scarecrow wonders if there are any animals out there who might be interested in eating things like, um, straw, and the Tin Man’s like, Yeah, there are probably lions and tigers and bears out there. Whatever.
Naturally, this causes the Scarecrow to anxiously repeat, “Lions and tigers and bears,” and Dorothy throws in the good old “oh my” for dramatic effect. Pretty soon, the three rapscallions are skipping through the forest chanting, “Lions and tigers and bears, oh my!”
The Data Jungle
I still get scared wandering through the Data Jungle, and I am supposed to be an experienced Software Engineer, working at a company that has mastered data at scale. Often I hear people talking about data, but I am confused by what they mean. Is it a Data Lake, a Data Warehouse, a Data Mart? Am I a Data Engineer or a Data Scientist? My job title is Software Engineer, but I have a graduate degree in Computer Science. It's okay to admit when we're confused: yes, no, maybe?
There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton
You're so right, Phil: naming things is hard, and trying to understand what people mean when they do name things is even harder.
A good place to get some clarification is the video “What is a Data Warehouse?” on YouTube.
My personal take on this is:
- Data Lake — a place we go fishing for data…
- This could be raw customer/client data
- Data Warehouse — a place we store processed data
- This could be raw materials, intermediate materials, and/or finished product
- Data Mart — the part of the Data Warehouse where we shop for products
- This could be reports, projections, recommendations, etc.
The term ‘Data Lake’ is not as mature in our industry as ‘Data Warehouse’, and its definition is still evolving, so there continues to be much confusion, and much confusing use, of the term.
When I hear ‘data lake’ I usually think of source data, raw data, customer/client data, etc. When I hear of a badly managed Data Lake, I prefer the term Data Swamp.
In terms of raw data, this is source data that needs processing: it needs to go through some ETL process (see below), in which the data moves from the Lake to the Warehouse.
In terms of customer data, the customer often owns the lake, and while they may let us fish in it, it is very rude to pollute someone else's lake, so we should never be able to manipulate the data in their lake.
We may have our own internal Data Lake, for example, we might pull data from an external Lake, such as a customer Lake, into our internal lake. We may pull raw data, such as log streams, event streams, etc. into our internal Lake. We may even play with the data in our own Lake, transform it, and put it back in.
Other people have other ideas about what a Data Lake means, but this is the perspective I am most comfortable with. I also like the perspective in the Intricity video above, where the Lake represents Open-Mindedness and the Warehouse represents Orderliness.
An important process is Extract-Transform-Load, or ETL, where we Extract data from the Lake, Transform it into better, more orderly forms, and Load it into the Warehouse.
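The three ETL steps can be sketched with Python's built-in sqlite3 module. All of the names here (`raw_lake`, `warehouse`, the `events` table) are invented for illustration, not any real API:

```python
import sqlite3

# Extract: imagine these raw records were fished out of the Lake
raw_lake = [
    {"user": " Alice ", "event": "login", "ts": "2024-01-01T10:00:00"},
    {"user": "bob", "event": "LOGIN", "ts": "2024-01-01T10:05:00"},
]

def transform(record):
    # Transform: normalize whitespace and casing so the Warehouse stays orderly
    return (record["user"].strip().lower(), record["event"].lower(), record["ts"])

# Load: insert the cleaned rows into an in-memory "warehouse" table
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE events (user TEXT, event TEXT, ts TEXT)")
warehouse.executemany("INSERT INTO events VALUES (?, ?, ?)",
                      (transform(r) for r in raw_lake))

print(warehouse.execute("SELECT user, event FROM events ORDER BY ts").fetchall())
# → [('alice', 'login'), ('bob', 'login')]
```

Real ETL pipelines do the same three things, just at a scale where each step is its own distributed system.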
At ForgeRock we invest a scary amount of effort into ETL. Scary in the amount of resources that can be consumed when transforming data at scale. Scary in the sophistication of the technology we use and invent to make this possible. However, transforming data is central to the value we add via Data Mining, Machine Learning, and other processes for understanding the data, and for offering high-quality perspectives and even recommendations to our customers in terms of Identity and Access Management (IAM), our core business.
‘Data Warehouse’ is a more mature term with a more specific meaning in the data enterprise; there is much more mature technology behind it, though new innovations and new technologies keep emerging.
Source of Truth
Essential to a competent Data Warehouse is the Source of Truth: the official data that everything else depends on. The format of the data is not all that important, only that we are confident in its veracity; when we pull the data out, we can format it as we please. That said, the data does need to meet minimal quality criteria, which are beyond the scope of this discussion.
Personally, I believe that Relational Database Systems are best for maintaining the source of truth, and many enterprises agree: it is one of the most common Source of Truth practices. In particular, we really need to trust our source of truth, so we need to invest in correctness.
- Normalized data, a key feature of any RDBMS, is the most space-efficient form of storage, and also avoids various pathological data-access issues
- When you are dealing with Petabyte, Exabyte, Zettabyte, etc. scales, compact storage has payoffs
- On the other hand, Relational Database operations can be slow compared to other storage forms, especially when you have to ‘join’ data across relations (tables)
- Traditionally, RDBMS systems did not scale well and did not handle replication, sharding, etc.; however, modern products have mostly eliminated these concerns
- See the video on Database Normalization for the important aspects of good (orderly) data management
- Solid mathematical foundations
- With so much ‘hand waving’ in our industry, it is good to have some technologies that are rigorously well understood, and mathematically correct
- High data integrity
- Data and data relationships need to make sense
- High transactional integrity
- Because updating data concurrently is prone to risky pathological cases
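The normalization and join trade-offs listed above can be sketched with Python's built-in sqlite3 module (table and column names are made up for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- no duplicated names
        item TEXT
    );
    INSERT INTO customers VALUES (1, 'alice');
    INSERT INTO orders VALUES (1, 1, 'widget'), (2, 1, 'gadget');
""")

# Normalization stores each customer's name exactly once, however many
# orders they place; a join (the relatively expensive operation) is what
# reassembles the full denormalized view on demand.
rows = db.execute("""
    SELECT c.name, o.item
    FROM orders o JOIN customers c ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()
print(rows)  # → [('alice', 'widget'), ('alice', 'gadget')]
```

At two rows the saving is trivial; at Petabyte scale, storing each fact once is exactly the payoff the list above describes.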
Because the operational overheads of extracting data are often in conflict with how quickly people want to play with their data, it is common to use more specialized databases, often NoSQL databases. These databases deliver faster queries and transformational operations at the expense of larger storage overhead. They often lack the data integrity and transactional controls of relational databases, but in many applications this kind of control is not necessary, and if things get screwed up, you can always rebase from the source of truth.
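A small sketch of that “rebase from the source of truth” idea: a denormalized, NoSQL-style read model that can be regenerated from relational tables whenever needed (all names here are invented for illustration):

```python
import sqlite3

# The relational source of truth
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'alice');
    INSERT INTO orders VALUES (1, 1, 'widget'), (2, 1, 'gadget');
""")

def rebuild_read_model(db):
    # Denormalize into one document per customer, the shape a fast
    # query-serving store might hold; if that derived store is ever
    # corrupted, we just run this again.
    model = {}
    query = """
        SELECT c.name, o.item
        FROM customers c JOIN orders o ON o.customer_id = c.id
        ORDER BY o.id
    """
    for name, item in db.execute(query):
        model.setdefault(name, []).append(item)
    return model

print(rebuild_read_model(source))  # → {'alice': ['widget', 'gadget']}
```

The derived model trades storage (the name would be repeated in every document) for fast, join-free reads, which is the whole point of these specialized stores.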
Examples of Analytic Engines include
- Industry standard
- Great at handling temporal data, keeping track of when things happen
- Which can also be used in Data Lake applications to ingest data at scale (see also Data Jet Stream)
- Interesting features such as Semantic Web applications
Often, Data Visualization is driven from analytic engines because they can render new data perspectives faster.
Generally, the Data Mart is a specialized perspective on the Data Warehouse. In some data architectures, a Data Mart might be an application that accesses the Analytic Engine in the Data Warehouse and presents it to the user. The Analytic Engine is more suitable than the Source Of Truth because it can access and process data much faster so that the end user does not have to wait too long.
Personally, I like to think of a Data Mart like Costco, Wal*Mart, Real Canadian Superstore, etc., where the building is really just a warehouse with a retail entrance. If you ever shop in Costco, it is quite obvious you are in a Warehouse.
Data Jet Stream
I am inventing a new term here, because our new reality is that we increasingly have to deal with Streaming Data, and I have not found a better catchy term. This leads to situations where the view of the data is constantly changing, sometimes so fast that our conventional Data Lake, Data Warehouse, and Data Mart solutions cannot keep up.
For example, KSQL, which is built on Kafka Streams, is used to make a continuous stream of data look more like a database.
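KSQL itself is beyond our scope, but the core idea, a table continuously materialized over a stream of events, can be sketched in plain Python (a toy model, not how Kafka actually works internally):

```python
from collections import Counter

def stream_counts(events):
    # Each incoming event updates a running "table" of counts per event
    # type; every yield is a snapshot of that continuously changing view,
    # the way a KSQL table materializes over a stream.
    view = Counter()
    for event in events:
        view[event] += 1
        yield dict(view)

snapshots = list(stream_counts(["login", "login", "logout"]))
print(snapshots[-1])  # → {'login': 2, 'logout': 1}
```

The key shift from conventional warehousing is that there is no final answer, only the latest snapshot, and queries run against a view that never stops moving.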
Hmmm, I should plan a blog article on Data Streams as this touches so many areas in Computing Science now.
My sense of the Data Jet Stream is high-velocity, high-momentum, high-energy data: data that moves faster than conventional data management technologies were designed to handle.
While we don't use Kafka at ForgeRock, we do rely increasingly on Google Dataflow, together with its companion Pub/Sub messaging service, which plays a role similar to Kafka's. A key area of innovation at ForgeRock is dealing better with Streaming Data.