Every organization has enormous amounts of data that must be stored and managed. Effective data management is a critical aspect of any organization, and the ability to access and communicate data is a key factor in how efficiently it is managed. The sheer volume of data that organizations must handle has grown so markedly that it is often overwhelming.
Previously, the most common solution for storing accumulated data was the data warehouse, or enterprise data warehouse: a system used for reporting and data analysis, and considered a core component of business intelligence. In other words, data warehouses are central repositories of integrated data from one or more disparate sources.
A data warehouse differs from a data lake along the following key dimensions:
- Data: A data warehouse holds structured, processed data.
- Processing: A data warehouse uses schema-on-write.
- Storage: Storage tends to be expensive for a data warehouse.
- Agility: A data warehouse is, by its very nature, a fixed configuration and less agile.
- Security: A data warehouse has a mature security model.
- User perspective: A data warehouse is primarily designed for business professionals, via the tools provided.
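The schema-on-write discipline listed above can be sketched as follows. This is a minimal illustration, not a real warehouse loader; the `SCHEMA` definition and the sample records are hypothetical. The point is that every record is validated against a fixed schema *before* it is written:

```python
# Minimal schema-on-write sketch: records must conform to a fixed schema
# before they are accepted into the store. Schema and data are hypothetical.

SCHEMA = {"order_id": int, "customer": str, "amount": float}

def write_record(store, record):
    """Validate a record against SCHEMA, then append it to the store."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"fields do not match schema: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise TypeError(f"{field!r} must be of type {expected_type.__name__}")
    store.append(record)

warehouse = []
write_record(warehouse, {"order_id": 1, "customer": "Acme", "amount": 99.5})

try:
    # Rejected at write time: 'amount' is a string, not a float.
    write_record(warehouse, {"order_id": 2, "customer": "Beta", "amount": "n/a"})
except TypeError as err:
    print("rejected:", err)
```

The cost of this approach is paid upfront: nonconforming data never enters the store, which is what makes downstream reporting simple but ingestion comparatively expensive.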
A data lake is similar to the data warehouse and runs the same process (moving data physically), but it always keeps the source format. Data is ingested in as close to its raw form as possible, without enforcing any restrictive schema. Moving data into a data lake in its original format eliminates upfront ingestion costs, such as transformation, compared with placing the data in a purpose-built data store.
As a storage repository, a data lake holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data.
Data is readily available, and users do not have to connect to a live production system every time they want to access a record. Moreover, the data does not need to be harmonized or indexed before it is stored.
Data lakes, however, do not index data and cannot harmonize it, because of the incompatible forms of the source data. Once data is placed into the lake, it is available for analysis by everyone in the organization. A key feature is that a data lake can be built on relatively inexpensive hardware. Data lakes have been popularized by the Hadoop community, with the focus moving from disparate silos to a single Hadoop/HDFS cluster.
Despite all this, because data lakes lack semantic consistency and governed metadata, audiences throughout an enterprise are expected to be highly skilled at data manipulation and analysis.
A data lake has the following characteristics along the same key dimensions:
- Data: A data lake includes every source type, including unstructured and raw data.
- Processing: A data lake uses schema-on-read.
- Storage: Designed for low-cost storage.
- Agility: Highly agile; it can be configured and reconfigured as required.
- Security: Security needs to be aligned to business needs and goals.
- User perspective: Tends to be the focus for data scientists.
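By contrast, schema-on-read can be sketched like this: raw, heterogeneous records land in the lake untouched, and a schema is imposed only at query time. This is a minimal illustration; the JSON records and the order schema are hypothetical:

```python
import json

# Schema-on-read sketch: raw records are stored as-is in their source
# format; structure is applied only when reading. All data is hypothetical.

lake = [
    '{"order_id": 1, "amount": 99.5, "customer": "Acme"}',
    '{"order_id": 2, "amount": "n/a"}',           # messy record kept as-is
    '{"event": "page_view", "url": "/pricing"}',  # a different shape entirely
]

def read_orders(raw_lines):
    """Apply an order schema at read time, skipping non-conforming records."""
    for line in raw_lines:
        record = json.loads(line)
        if "order_id" in record and isinstance(record.get("amount"), (int, float)):
            yield {"order_id": record["order_id"], "amount": float(record["amount"])}

orders = list(read_orders(lake))
print(orders)  # only the records that fit the order schema survive the read
```

Here the ingestion cost is near zero, but each consumer carries the burden of interpreting, and possibly discarding, the raw records, which is why data lakes tend to suit more technically skilled users.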
Pentaho CTO James Dixon has generally been credited with coining the term “data lake”. He describes a data mart (a subset of a data warehouse) as akin to a bottle of water, “cleansed, packaged and structured for easy consumption,” while a data lake is more like a body of water in its natural state.
“In broad terms, data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format,” said Nick Heudecker, research director at Gartner.
Because data warehouses and data lakes have competing sets of characteristics, an alternative and complementary solution was sought, leading to the data hub, or enterprise data hub (EDH).
A data hub is a hub-and-spoke approach to data integration, where data is physically moved and re-indexed into a new system.
The key characteristic that differentiates a data hub from a data lake is that the data hub system supports discovery, indexing, and analytics. The prime objective of an EDH is to provide a centralized and unified data source for diverse business needs.
Major vendors have entered this space: EMC, for example, offers Isilon data lakes that, via Cloudera, can be turned into a data hub architecture.
Data must be stored from a multitude of sources and used by a very wide range of users who vary in technical competence, from business people who need report-driven analytics to data scientists using the latest deep learning algorithms.
How the data is stored follows from the use case: the simpler the use case, the more processed and structured the stored data needs to be; conversely, the more data science will be applied, the closer the data should stay to its raw state.
An enterprise is likely to see all of these use cases, so it is more about the complementary usage of these techniques than about seeing them as divergent.