
A Deep Dive Into Data Lakes

By Bill Kleyman - August 29, 2018


Data lakes are centralized storage and data repositories that allow you to work with a variety of different types of data. (Photo: Rich Miller)


In the age of Big Data, we’ve had to come up with new terms to describe large-scale data storage. We have databases, data warehouses and now data lakes.

While they all contain data, these terms describe different ways of storing and using that data. Before we discuss data lakes and why they are important, let’s examine how they differ from databases and data warehouses.

Let’s start here: A data warehouse is not a database. Although you could argue that they’re both relational data systems, they serve different purposes. Data warehousing allows you to pull data together from a number of different sources for analysis and reporting. Data warehouses store vast amounts of historical data and support complex queries across all of the data types being pulled together.

There are lots of use cases for data warehouses. For example, if you’re doing very large amounts of data mining, a data warehouse is way better than a traditional database. Similarly, if you need to quantify vast amounts of market data in-depth, a data warehouse could help. You can use that data to understand the behavior of users in an online business to help make better decisions around services and products.
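To make that concrete, here is a minimal, hedged sketch of the kind of aggregation a warehouse is built for, using BigQuery (one of the warehouse services touched on later in this article) purely as an example engine; the project, dataset, table, and column names are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical warehouse table of historical orders; names are for illustration only.
    sql = """
        SELECT user_id, COUNT(*) AS orders, SUM(order_total) AS revenue
        FROM `my-project.sales.orders`
        WHERE order_date >= '2018-01-01'
        GROUP BY user_id
        ORDER BY revenue DESC
        LIMIT 10
    """

    # One query scans years of historical rows; this is the workload warehouses are tuned for.
    for row in client.query(sql).result():
        print(row.user_id, row.orders, row.revenue)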

Variety is the Spice of Data Lakes

A data lake is a newer data storage and processing approach that focuses on structured, semi-structured, unstructured, and raw data for analysis. Data warehouses, on the other hand, handle only structured and processed data. Where data warehousing is typically used by business professionals, a data lake is more commonly used by data scientists. Data lake examples include Amazon S3, data lakes built on Google Cloud Storage, and the Azure Blob Storage service.

Data lakes are centralized storage and data repositories that allow you to work with a variety of different types of data. The cool thing here is that you don’t need to structure the data and it can be imported “as-is.” This allows you to work with raw data and run analytics, data visualization, big data processing, machine learning tools, AI, and much more. This level of data agility can actually give you some pretty cool competitive advantages.
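As a quick, hedged illustration of that “as-is” approach, the sketch below drops a raw JSON event straight into an Amazon S3 bucket with boto3; the bucket and key names are hypothetical, and no schema has to be defined up front.

    import json

    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")

    # A raw, semi-structured event captured exactly as produced; no schema required up front.
    event = {"user_id": 1234, "action": "click", "ts": "2018-08-29T14:05:00Z"}

    # Hypothetical bucket and key names, for illustration only.
    s3.put_object(
        Bucket="my-data-lake",
        Key="raw/clickstream/2018/08/29/event-0001.json",
        Body=json.dumps(event).encode("utf-8"),
    )

Analytics, visualization, or machine learning tools can then read that object later in whatever shape it was written.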


A recent Aberdeen survey found that a well-conceived data lake lays the groundwork for an elevated level of analytical activity. For leading companies, those analytical activities also translate to substantial ROI (return on investment) in the form of business growth and higher profits. This means being able to process vast amounts of data from business-critical sources like logs, click-stream data, social media, and even IoT data that’s funneled into the data lake. A data lake acts as a powerful tool to facilitate business growth, identify new market opportunities, and help make better market and business decisions.

Data Lakes and Cloud Architecture

Data lakes also enable some advanced use cases when it comes to working with data: machine learning, predictive analytics, data discovery, and much more. When it comes to integrating with other services and solutions, it’s important to understand where data lakes actually fit in. Consider the following chart:

(Chart source: Google Cloud)

Using Google as an example, you can see how data lakes act as a centralized point to store, process, and analyze massive volumes of structured and unstructured data. As part of the cloud infrastructure, this data is processed in a secure, cost-efficient, and agile manner. Within the cloud storage architecture, you’d integrate other key services that help facilitate data processing. For example, predictive services using machine learning, data warehousing with BigQuery, or data mining with Datalab.
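As a rough sketch of how that integration can look, the snippet below runs a BigQuery query directly over files sitting in a Cloud Storage data lake by declaring them as a temporary external table; the bucket path, project, and column names are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Point BigQuery at raw CSV files sitting in the (hypothetical) data lake bucket.
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-data-lake/raw/clickstream/*.csv"]
    external_config.autodetect = True  # infer the schema rather than defining it up front

    # Query the lake files as a temporary table, without loading them into the warehouse first.
    job_config = bigquery.QueryJobConfig(table_definitions={"clicks": external_config})
    sql = "SELECT action, COUNT(*) AS events FROM clicks GROUP BY action"
    for row in client.query(sql, job_config=job_config).result():
        print(row.action, row.events)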

Because of this flexibility, data lakes let organizations capture nearly every aspect of their business operations in data form. As the chart above indicates, once this data is captured, you’ll be able to leverage a variety of services to actually extract insights from it.

Getting Started and Overcoming Challenges

If a data lake is something that’s on your horizon, it’s worth looking at how to get started and how to be successful. So, here are a few tips and best practices:

  • Data ingestion. Your data has to come from somewhere. This could be a sensor, user data, information from various e-commerce sites, user activity aggregation, and much more. The next step is to get it into your data lake. You’ll need to consider the source hardware, connectivity, and other key points that might slow down the ingestion process (see the ingestion sketch after this list). For example, depending on your use case, you might need to deploy SSDs instead of HDDs for faster and better performance.
  • Network connectivity into the data lake. This can be a major bottleneck if not designed properly. If you’re storing data outside of a major cloud provider, consider using one of the many powerful methods of connectivity into the cloud. For example, data center partners can leverage a dedicated link with Azure ExpressRoute to improve connectivity. Remember, you do have options as to where your data lake sits. It can be on-premises, within a colocation facility, or within an actual cloud provider. Arguably, when you’re processing this much information, it’s not a bad idea to simply store the data as close as possible to where the jobs will be processed; that is, within the cloud itself.
  • Structuring data. First of all, this isn’t strictly a requirement. However, the size of your files, how many files you have, the structure of your folders, and the number of different data points can all have an impact on performance. So, in some cases, you might need to structure some data points before storing them in a data lake.
  • Optimizing I/O. When placing data into a data lake, the expectation is to actually do something with it. So, let’s say you’re running a job on Hadoop. This job may be CPU, memory, or even I/O intensive. For example, running a machine learning engine could take up quite a few CPU cycles. Real-time analytics, on the other hand, could be extremely memory intensive. Finally, transforming or prepping data for analytics might be I/O intensive. Take this into consideration when designing your own use case. And when leveraging a large cloud provider, it might make sense to place your data lake in the same cluster where you’re performing your jobs.
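To tie the ingestion and structuring points together, here is a small, hedged sketch (the one referenced in the first bullet above) that batches raw sensor readings, compresses them, and writes them under a date-partitioned prefix in a hypothetical S3-based data lake; the bucket name, prefix layout, and helper function are illustrative only.

    import gzip
    import json
    from datetime import datetime, timezone

    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")

    def ingest_batch(readings, bucket="my-data-lake"):
        """Write one batch of raw sensor readings as a single compressed object.

        Batching and compressing before upload avoids lots of tiny files, which
        slows later scans, and the dt= prefix gives analytics jobs a simple date
        partition to prune on.
        """
        now = datetime.now(timezone.utc)
        key = f"raw/sensors/dt={now:%Y-%m-%d}/batch-{now:%H%M%S}.json.gz"
        lines = "\n".join(json.dumps(r) for r in readings)
        s3.put_object(Bucket=bucket, Key=key, Body=gzip.compress(lines.encode("utf-8")))
        return key

    # Example usage with a couple of made-up readings.
    ingest_batch([{"sensor": "rack-12", "temp_c": 23.4}, {"sensor": "rack-13", "temp_c": 24.1}])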

Data lakes, even though they’re newer solutions, are quickly becoming key parts of growing businesses that are trying to understand their various data sets. Plus, these data sets within a data lake can get pretty big. To give you an idea, pay-as-you-go pricing for Azure Data Lake Storage will run you $0.039 per GB for the first 100 TB. For 1,000 TB to 5,000 TB, it’ll cost you $0.037 per GB. And yes, there are options for configurations of 5,000 TB and up. The bottom line is that these lakes can become oceans pretty quickly. By the same token, this data isn’t throwaway; it’s valuable, and where performance is definitely critical, so is durability. Google Cloud Storage, which underpins data lakes on GCP, is designed for 99.999999999% annual durability.
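As a back-of-the-envelope illustration of those tiers, here is a hypothetical worked example using only the per-GB prices quoted above; it assumes 1 TB = 1,024 GB, treats everything above the first 100 TB at the lower $0.037 rate for simplicity, and ignores transaction and egress charges (actual pricing varies by region and over time).

    # Tiered storage estimate using the monthly per-GB prices quoted above.
    size_tb = 500
    first_tier_tb = min(size_tb, 100)        # first 100 TB at $0.039/GB
    second_tier_tb = max(size_tb - 100, 0)   # remainder at $0.037/GB, as a simplification
    monthly_cost = first_tier_tb * 1024 * 0.039 + second_tier_tb * 1024 * 0.037
    print(f"~${monthly_cost:,.0f} per month for {size_tb} TB")  # roughly $19,149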

So, if you’re looking to get started with a data lake, you’ll need to examine four issues:

  • The specific data types and sources you’ll be using.
  • How much data you’ll need and/or create.
  • When you’ll need this data (remember, you are charged based on read/write operations as well as other factors).
  • The types of data analytics, mining, or exploration that you’ll be doing.

After that, it might get a bit complicated. But don’t let that dissuade you from using a really powerful tool. Working with cloud partners and data science experts can help you get the most out of your data sources. That same Aberdeen survey I mentioned earlier found that organizations that deployed a data lake outperformed similar companies by 9% in organic revenue growth. So, using data intelligently can really give you a competitive edge. Just like anything else, a data lake is a tool to help you better grasp the vast amount of information that we’re creating every day. Remember to apply best practices, ask the right questions – and don’t be afraid to jump in.


Tagged With: Big Data, Data Lakes


About Bill Kleyman

Bill Kleyman is a veteran, enthusiastic technologist with experience in data center design, management and deployment. Currently, Bill works as the Executive Vice President of Digital Solutions at Switch.
