Is a Data Lake Correct for You?
Ritesh Koul
Back to the Future
The late 2000s saw exponential growth in the number of people owning a computing device with some form of internet connectivity. By 2017, this had resulted in enormous amounts of often heterogeneous and unstructured data being generated daily. “Big Data” thus became the norm, and traditional Data Warehouses became increasingly unable to accommodate this change.
Today, owning an Enterprise Data Warehouse (EDW) is a hallmark of any medium-to-large 21st-century firm that has matured beyond infancy. Many firms strive to reach a stage where they can own and take full advantage of an EDW.
Yet, as we continue to embrace the technological advances of today, our views on Data Warehouses also need revisiting: is it really necessary to own an EDW, or is it better to leapfrog to a Data Lake?
Data Warehouses vs Data Lakes
The emergence of Big Data is why Data Lakes are so important today. Traditional Data Warehouses are built to handle a limited amount of well-structured data. In a warehouse, large amounts of storage space are often expensive, and unstructured data is very difficult to store.
Data Lakes are built to store enormous amounts of data – structured, unstructured, homogeneous, or heterogeneous – at affordable prices. The question of “Why should I invest in a Data Lake?” is thus straightforward to answer: it’s future proof.
Is a Data Lake Correct For You?
As we continue the transition towards a data-centric world, it only makes sense to have the right infrastructure to handle large loads of inbound data. To understand whether your firm needs a Data Lake, first take a look at how it works. If you expect the data you handle ten years down the line to be diverse and large in volume, then it may be best to invest in a Data Lake rather than a Data Warehouse.
In fact, there is a list of indicators that can help effectively identify whether your firm is positioned to require and take full advantage of a Data Lake. Keeping in mind the present and future, consider the following:
- You wish/expect to load all forms of data – raw and refined – without turning any of it away
- You expect to handle various data types simultaneously – including dynamic data
- Your data is equally important as both an operational and analytical element (that is, you wish to not just read your data, but also learn deeply from it)
- Your data is volatile. That is, its structure and nature tend to change often
- Your data is expected to include multiple protocols
If you answered yes for more than one of the above points, then a Data Lake may be more suitable for your firm than a Data Warehouse.
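The rule of thumb above can be written down as a small checklist. The following Python sketch is only an illustration: the class name, field names, and function name are hypothetical, and the threshold simply encodes the "more than one yes" guideline from this article rather than any formal scoring method.

```python
from dataclasses import dataclass, fields


@dataclass
class LakeIndicators:
    """Hypothetical yes/no answers to the five indicators listed above."""
    loads_raw_and_refined: bool       # load all forms of data, turning none away
    handles_varied_types: bool        # various data types at once, incl. dynamic data
    operational_and_analytical: bool  # data matters for operations and for analysis
    volatile_structure: bool          # structure and nature change often
    multiple_protocols: bool          # data is expected to include multiple protocols


def suggests_data_lake(indicators: LakeIndicators) -> bool:
    """Return True when more than one indicator is answered 'yes'."""
    yes_count = sum(getattr(indicators, f.name) for f in fields(indicators))
    return yes_count > 1


# Example: a firm that ingests raw data and whose data structure changes often.
firm = LakeIndicators(
    loads_raw_and_refined=True,
    handles_varied_types=False,
    operational_and_analytical=False,
    volatile_structure=True,
    multiple_protocols=False,
)
print(suggests_data_lake(firm))
```

A result of True here only means a Data Lake may be worth a closer look; it does not replace a proper assessment of your firm's needs.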
When Is a Data Lake Not Required?
Despite the advantages offered by Data Lakes, it isn’t reasonable to completely rule out EDWs as a data storage option. Regardless of their limitations, EDWs are still very good at what they do. To understand if your firm needs an EDW more than a Data Lake, consider the following points:
- You expect to deal only with a limited stream of data, or with a data stream growing at a restrained rate
- All inbound data is expected to be adequately well structured, homogeneous, and static
- The workings of your firm are heavily dependent on traditional data storage and analytics technologies (such as SQL)
- You do not possess the resources/manpower to ensure that data governance, quality, and security are maintained at every stage
If your requirements match the above points, it may be more suitable for your firm to install an EDW instead of a Data Lake.
Data Lakes offer many benefits. They are highly scalable and flexible, offer a large number of ways to query data, and eliminate silos. Furthermore, the widespread adoption of Hadoop technology means that there are powerful ways to process and learn from data stored in lakes. Unless your firm expects to handle only limited amounts of data, it may be best to skip the warehouse and leapfrog to a Data Lake, lest you be left behind.