A Deep Dive Into Data Lakes

Aug. 29, 2018
Data lakes are centralized storage and data repositories that allow you to work with a variety of different types of data. Bill Kleyman looks at why data lakes matter, and how you can use them in your business.

In the age of Big Data, we’ve had to come up with new terms to describe large-scale data storage. We have databases, data warehouses and now data lakes.

While they all contain data, these terms describe different ways of storing and using that data. Before we discuss data lakes and why they are important, let’s examine how they differ from databases and data warehouses.

Let’s start here: A data warehouse is not a database. Although you could argue that they’re both relational data systems, they serve different purposes. Data warehousing allows you to pull data together from a number of different sources for analysis and reporting. Data warehouses store vast amounts of historical data and support complex queries across all of the data types being pulled together.

There are lots of use cases for data warehouses. For example, if you’re doing very large amounts of data mining, a data warehouse is a far better fit than a traditional database. Similarly, if you need to analyze vast amounts of market data in depth, a data warehouse could help. You can use that data to understand the behavior of users in an online business and make better decisions around services and products.
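To make that concrete, here’s a minimal sketch of the kind of user-behavior aggregate a warehouse handles well, written in Python against BigQuery (which comes up again later as Google’s warehousing service). The project, dataset, table, and column names are hypothetical, and the client library and credentials are assumed to be set up.

```python
# A warehouse-style behavioral query: a complex aggregate over large
# historical data. Names below are illustrative, not real resources.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT user_id,
           COUNT(*) AS events,
           COUNTIF(action = 'purchase') AS purchases
    FROM `example-project.warehouse.user_events`
    WHERE event_date BETWEEN '2018-01-01' AND '2018-06-30'
    GROUP BY user_id
    ORDER BY purchases DESC
    LIMIT 10
"""

# Run the query and print the top users by purchase count.
for row in client.query(sql).result():
    print(row.user_id, row.events, row.purchases)
```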

Variety is the Spice of Data Lakes

A data lake is a newer data processing technology that focuses on structured, semi-structured, unstructured, and raw data points for analysis. Data warehouses, on the other hand, handle only structured, processed data. Where data warehousing is typically used by business professionals, a data lake is more commonly used by data scientists. Data lake examples include Amazon S3, Google Cloud Storage, and the Azure Blob Storage service.

Data lakes are centralized storage and data repositories that allow you to work with a variety of different types of data. The cool thing here is that you don’t need to structure the data and it can be imported “as-is.” This allows you to work with raw data and run analytics, data visualization, big data processing, machine learning tools, AI, and much more. This level of data agility can actually give you some pretty cool competitive advantages.
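As a quick illustration, here’s a minimal sketch of landing a raw event “as-is” in an S3-based data lake (Amazon S3 being one of the examples above), using Python and boto3. The bucket name and key layout are hypothetical, and AWS credentials are assumed to be configured.

```python
# Landing raw data in a data lake "as-is": no schema or transformation
# is required before the event is stored.
import json
import boto3

s3 = boto3.client("s3")

# A raw clickstream event, stored exactly as it arrived.
event = {"user_id": 42, "action": "add_to_cart", "ts": "2018-08-29T12:00:00Z"}

s3.put_object(
    Bucket="example-data-lake",  # hypothetical bucket name
    Key="raw/clickstream/2018-08-29/event-001.json",
    Body=json.dumps(event).encode("utf-8"),
)
```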


A recent Aberdeen survey found that a well-conceived data lake lays the groundwork for an elevated level of analytical activity. For leading companies, those analytical activities also translate to substantial ROI (return on investment) in the form of business growth and higher profits. This means being able to process vast amounts of data from business-critical sources like logs, clickstream data, social media, and even IoT data that’s funneled into the data lake. A data lake acts as a powerful tool to facilitate business growth, identify new market opportunities, and even help make better market and business decisions.

Data Lakes and Cloud Architecture

Data lakes also enable some really advanced use cases when it comes to working with data: machine learning, predictive analytics, data discovery, and much more. When it comes to integrating with other services and solutions, it’s important to understand where data lakes actually fit in. Consider the following chart:

(Chart: where a data lake fits within a cloud architecture. Source: Google Cloud)

Using Google as an example, you can see how data lakes act as a centralized point to store, process, and analyze massive volumes of structured and unstructured data. As part of the cloud infrastructure, this data is processed in a secure, cost-efficient, and agile manner. Within the cloud storage architecture, you’d integrate other key services that help facilitate data processing. For example, predictive services using machine learning, data warehousing with BigQuery, or data mining with Datalab.
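Here’s a hedged sketch of one such integration: defining a BigQuery external table over raw JSON files sitting in a Cloud Storage data lake, so they can be queried in place with ordinary SQL. It assumes the google-cloud-bigquery client library; the bucket, project, dataset, and table names are hypothetical.

```python
# Pointing BigQuery at raw files in a Cloud Storage data lake.
from google.cloud import bigquery

client = bigquery.Client()

# Describe the raw lake files: newline-delimited JSON under a
# hypothetical bucket and prefix.
config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
config.source_uris = ["gs://example-data-lake/raw/clickstream/*.json"]
config.autodetect = True  # let BigQuery infer a schema from the raw data

# Register an external table over those files (no data is copied).
table = bigquery.Table("example-project.analytics.clickstream_raw")
table.external_data_configuration = config
client.create_table(table)

# Query the lake data in place.
rows = client.query(
    "SELECT action, COUNT(*) AS n "
    "FROM `example-project.analytics.clickstream_raw` "
    "GROUP BY action"
).result()
```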

Data lakes give organizations the flexibility to capture every aspect of their business operations in data form. As the chart above indicates, once this data is captured, you’ll be able to leverage a variety of services to actually extract insights from it.

Getting Started and Overcoming Challenges

If a data lake is something that’s on your horizon, it’s important to take a few notes on how to get started and become successful. So, here are a few tips and best practices:

  • Data ingestion. Your data has to come from somewhere. This could be a sensor, user data, information from various e-commerce sites, user activity aggregation, and much more. The next step is to get it into your data lake. You’ll need to consider the source hardware, connectivity, and other key points that might slow down the ingestion process. For example, depending on your use case, you might need to deploy SSDs instead of HDDs for faster performance. (See the ingestion sketch after this list.)
  • Network connectivity into the data lake. This can be a major bottleneck if not designed properly. If you’re storing data outside of a major cloud provider, consider using one of the many powerful methods of connectivity into the cloud. For example, data center partners can leverage a dedicated link with Azure ExpressRoute to improve connectivity. Remember, you do have options as to where your data lake sits: on-premises, in a colocation facility, or with an actual cloud provider. Arguably, when you’re processing this much information, it’s not a bad idea to simply store the data as close as possible to where the jobs will be processed; that is, within the cloud itself.
  • Structuring data. First of all, this isn’t strictly a requirement. However, file sizes, the number of files, the structure of your folders, and the number of different data points can all impact performance. So, in some cases, you might need to structure some data points before storing them in a data lake.
  • Optimizing I/O. When placing data into a data lake, the expectation is to actually do something with it. So, let’s say you’re running a job on Hadoop. This job may be CPU-, memory-, or even I/O-intensive. For example, running a machine learning engine could take up quite a few CPU cycles. Real-time analytics, on the other hand, could be extremely memory-intensive. And transforming or prepping data for analytics might be I/O-intensive. Take this into consideration when designing your own use case. Finally, when leveraging a large cloud provider, it might make sense to place your data lake in the same cluster where you’re performing your jobs.
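Here’s the ingestion sketch referenced above: a hedged example of batching raw events into date-partitioned objects so file counts and folder structure stay manageable, which touches on both the ingestion and structuring points. It uses Python and boto3 against a hypothetical S3 bucket; the path conventions are illustrative, not prescriptive.

```python
# Batch ingestion into a date-partitioned lake layout. Writing one
# object per batch (rather than thousands of tiny files) helps both
# ingestion throughput and downstream job I/O.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name


def ingest_batch(events, source="sensors"):
    """Write a batch of raw events as one newline-delimited JSON object,
    keyed by source and date so downstream jobs can prune partitions."""
    now = datetime.now(timezone.utc)
    key = (
        f"raw/{source}/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"batch-{now:%H%M%S}.json"
    )
    body = "\n".join(json.dumps(e) for e in events)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return key


# Example usage: two sensor readings ingested as a single object.
ingest_batch([
    {"sensor": "t-101", "temp_c": 21.4},
    {"sensor": "t-102", "temp_c": 19.8},
])
```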

Data lakes, even though they’re newer solutions, are quickly becoming key parts of growing businesses that are trying to understand their various data sets. Plus, these data sets can get pretty big. To give you an idea, pay-as-you-go pricing for Azure Data Lake Storage runs $0.039 per GB for the first 100 TB. For 1,000 TB to 5,000 TB, it’s $0.037 per GB. And yes, there are options for configurations of 5,000 TB and up. The bottom line is that these lakes can become oceans pretty quickly. By the same token, this data isn’t benign. Rather, it’s valuable, and where performance is critical, so is durability. Google Cloud Storage, for example, is designed for 99.999999999% annual durability.
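To put those numbers in perspective, here’s a back-of-the-envelope sketch of the monthly storage bill at the quoted rates. It’s a simplification: it applies the $0.037 rate to everything above the first 100 TB (the article doesn’t list every tier) and assumes 1 TB = 1,024 GB.

```python
# Rough tiered-pricing arithmetic using the Azure Data Lake Storage
# rates quoted above: $0.039/GB for the first 100 TB, $0.037/GB after.
def monthly_storage_cost(tb_stored, gb_per_tb=1024):
    gb = tb_stored * gb_per_tb
    first_tier_gb = min(gb, 100 * gb_per_tb)
    rest_gb = max(gb - first_tier_gb, 0)
    return first_tier_gb * 0.039 + rest_gb * 0.037


print(f"${monthly_storage_cost(100):,.2f}")   # ~$3,993.60 for 100 TB
print(f"${monthly_storage_cost(1000):,.2f}")  # ~$38,092.80 for 1,000 TB
```

Even at fractions of a cent per gigabyte, 1,000 TB of raw data runs to tens of thousands of dollars a month, which is why the lakes-become-oceans point matters.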

So, if you’re looking to get started with a data lake, you’ll need to examine four issues:

  • The specific data types and sources you’ll be using.
  • How much data you’ll need and/or create.
  • When you’ll need this data (remember, you are charged based on read/write operations as well as other factors).
  • The types of data analytics, mining, or exploration that you’ll be doing.

After that, it might get a bit complicated. But don’t let that dissuade you from using a really powerful tool. Working with cloud partners and data science experts can help you get the most out of your data sources. That same Aberdeen survey I mentioned earlier found that organizations that deployed a data lake outperformed similar companies by 9% in organic revenue growth. So, using data intelligently can really give you a competitive edge. Just like anything else, a data lake is a tool to help you better grasp the vast amount of information we’re creating every day. Remember to apply best practices, ask the right questions, and don’t be afraid to jump in.

About the Author

Bill Kleyman

Bill Kleyman is a veteran, enthusiastic technologist with experience in data center design, management and deployment. Bill is currently a freelance analyst, speaker, and author for some of our industry's leading publications.
