Data lake

A data lake is a large-scale storage repository and processing engine. A data lake provides "massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs".^[1]

Invention

The term was coined by James Dixon, chief technology officer of Pentaho.^[2] Dixon initially used the term as a contrast with "data mart", which is a smaller repository of interesting attributes extracted from the raw data. He wrote: "If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."^[3] Dixon argued that data marts have several inherent problems, and that data lakes are the optimal solution.

Dixon identified two shortcomings of data marts: "Only a subset of the attributes are examined, so only pre-determined questions can be answered." and "The data is aggregated so visibility into the lowest levels is lost."^[3] These problems are often referred to as "siloing" and, in agreement with Dixon, PricewaterhouseCoopers says that data lakes could "put an end to data silos".^[4] In their study on data lakes they note that "Enterprises across industries are starting to extract and place data for analytics into a single, Hadoop based repository." They cite UC Irvine Medical Center, Google and Facebook as organizations that have embraced the data lake concept.
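
The two shortcomings can be illustrated with a small sketch. The records, field names and the helpers build_mart and revenue_by are hypothetical, invented for illustration and not taken from Dixon's post. The mart keeps one pre-selected attribute and aggregates it, so it can answer the question it was built for but not a new one that needs a dropped attribute or row-level detail.

```python
from collections import defaultdict

# Hypothetical raw events, as they might arrive from a source system.
raw_events = [
    {"region": "EU", "product": "A", "channel": "web", "amount": 10.0},
    {"region": "EU", "product": "B", "channel": "store", "amount": 25.0},
    {"region": "US", "product": "A", "channel": "web", "amount": 40.0},
    {"region": "US", "product": "A", "channel": "store", "amount": 5.0},
]


def build_mart(events):
    """Data-mart style extract: keep only 'region' and sum 'amount'."""
    totals = defaultdict(float)
    for event in events:
        totals[event["region"]] += event["amount"]
    return dict(totals)


def revenue_by(events, attribute):
    """Answer a question directly from the raw events, for any attribute."""
    totals = defaultdict(float)
    for event in events:
        totals[event[attribute]] += event["amount"]
    return dict(totals)


mart = build_mart(raw_events)

# The pre-determined question (revenue per region) is answerable from the mart.
print(mart)

# A new question (revenue per channel) needs the raw events: the mart
# dropped the 'channel' attribute and the individual rows.
print(revenue_by(raw_events, "channel"))
```

Because the raw events are retained, revenue_by can be called with any attribute the source recorded, whereas the mart would have to be rebuilt for each new question.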

The idea of a data lake is to have a single store for all data in the enterprise, ranging from raw data (an exact copy of source system data) to transformed data used for tasks such as reporting, visualization, analytics and machine learning.

The data lake includes structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, and newer formats such as JSON), unstructured data (emails, documents, PDFs) and even binary data (images, audio and video), thus creating a centralized data store that accommodates all forms of data.
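
As a minimal sketch of storing data of any form as an exact copy, the following uses a local directory as a stand-in for the lake. The raw/<source>/ layout, the ingest function and the sample payloads are assumptions made for illustration; a real deployment would typically write to a distributed file system or a cloud object store instead.

```python
import hashlib
import tempfile
from pathlib import Path


def ingest(lake_root: Path, source: str, filename: str, payload: bytes) -> dict:
    """Store the payload byte-for-byte under raw/<source>/ and describe it."""
    target = lake_root / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing and no schema: an exact copy
    return {
        "path": target.relative_to(lake_root).as_posix(),
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


# A temporary directory plays the role of the lake in this sketch.
lake = Path(tempfile.mkdtemp(prefix="lake_"))

entries = [
    ingest(lake, "crm", "customers.csv", b"id,name\n1,Ada\n"),           # structured
    ingest(lake, "web", "clicks.json", b'{"user": 1, "page": "/home"}'),  # semi-structured
    ingest(lake, "scans", "logo.png", bytes([0x89, 0x50, 0x4E, 0x47])),   # binary
]

for entry in entries:
    print(entry["path"], entry["bytes"])
```

Nothing is interpreted at write time, so the same function handles tabular, semi-structured and binary inputs; any structure is applied later by whichever job reads the file.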

Examples

One example of a data lake is the distributed file system used in Apache Hadoop.

Many companies also use cloud storage services such as Amazon S3.^[5] Academic interest in the concept of data lakes is gradually growing; one example is Personal DataLake,^[6] an ongoing project at Cardiff University to create a new type of data lake that aims to manage the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.^[7]

The earliest form of the data lake (Hadoop 1.0) had limited capabilities: batch-oriented processing with MapReduce was the only processing paradigm associated with it. Interacting with the data lake required expertise in Java and MapReduce, or in higher-level tools such as Pig and Hive (which were themselves batch-oriented). With Hadoop 2.0, resource management was separated out and taken over by YARN (Yet Another Resource Negotiator), and new processing paradigms such as streaming, interactive and online processing became available for Hadoop and the data lake.
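
The batch-oriented MapReduce model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps using word counting as the job; it is not Hadoop's Java API, and the function names are chosen for this example only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run the whole batch: the full input is processed before any output."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["the lake fills", "the lake drains"])
print(counts)
```

The job only produces results after every input line has been mapped and grouped, which is the batch characteristic that the later streaming and interactive paradigms were meant to relax.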

Criticism

In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data".^[8] PricewaterhouseCoopers were also careful to note in their research that not all data lake initiatives are successful. They quote Sean Martin, CTO of Cambridge Semantics:

"We see customers creating big data graveyards, dumping everything into HDFS and hoping to do something with it down the road. But then they just lose track of what’s there."^[4]

They advise that "The main challenge is not creating a data lake, but taking advantage of the opportunities it presents."^[4] They describe companies that build successful data lakes as gradually maturing their lake as they figure out which data and metadata are important to the organization.
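
One way to avoid losing track of what a lake contains is to record metadata for each dataset as it is added. The following is a minimal sketch under stated assumptions: the DatasetEntry fields, the Catalog class and the sample paths are hypothetical and are not drawn from the PricewaterhouseCoopers report.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetEntry:
    """Descriptive metadata for one dataset stored in the lake."""
    path: str
    owner: str
    description: str
    tags: List[str] = field(default_factory=list)


class Catalog:
    """In-memory index of what the lake contains (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.path] = entry

    def find_by_tag(self, tag: str) -> List[str]:
        """Return the paths of all datasets carrying the given tag."""
        return sorted(p for p, e in self._entries.items() if tag in e.tags)

    def undocumented(self, stored_paths: List[str]) -> List[str]:
        """Return stored paths that nobody has described yet."""
        return sorted(p for p in stored_paths if p not in self._entries)


catalog = Catalog()
catalog.register(DatasetEntry("raw/crm/customers.csv", "sales-team",
                              "Daily CRM export", ["customers", "pii"]))
catalog.register(DatasetEntry("raw/web/clicks.json", "web-team",
                              "Clickstream events", ["behaviour"]))

# Hypothetical listing of what is physically present in the lake.
stored = ["raw/crm/customers.csv", "raw/web/clicks.json", "raw/tmp/dump_0421.bin"]

print(catalog.find_by_tag("pii"))
print(catalog.undocumented(stored))
```

Files that appear in storage but not in the catalog are the candidates for the "graveyard" Martin describes, so listing them regularly gives an organization a concrete place to start when deciding which data and metadata matter.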

References

[1] "What is Hadoop?". SAS. Retrieved 7 November 2015.
[2] Woods, Dan (21 July 2011). "Big data requires a big architecture". Tech. Forbes.
[3] Dixon, James. "Pentaho, Hadoop, and Data Lakes". James Dixon’s Blog. Retrieved 7 November 2015.
[4] Stein, Brian; Morrison, Alan (2014). Data lakes and the promise of unsiloed data (pdf) (Report). Technology Forecast: Rethinking integration. PricewaterhouseCoopers.
[5] Tuulos, Ville (22 September 2015). "Petabyte-Scale Data Pipelines with Docker, Luigi and Elastic Spot Instances".
[6] http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?reload=true&arnumber=7310733
[7] http://www.researchgate.net/publication/283053696_Personal_Data_Lake_With_Data_Gravity_Pull
[8] Needle, David (10 June 2015). "Hadoop Summit: Wrangling Big Data Requires Novel Tools, Techniques". Enterprise Apps. eWeek. Retrieved 1 November 2015. "Walter Maguire, chief field technologist at HP's Big Data Business Unit, discussed one of the more controversial ways to manage big data, so-called data lakes."

This article is issued from Wikipedia - version of the Tuesday, April 26, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.