How do I set up a data lake

The data lake concept: the treasure in the data lake

Data and information play an increasingly important role in companies and have become the new production factor alongside labor, capital and the environment. Hardly any other term has shaped this field as much as “Big Data”.

It is by no means just about the size of the data volumes. Rather, new types of data have emerged in companies in recent years, ranging from growing volumes of sensor data and technical log files to social media content. These often contain valuable information, but are ignored in classic business intelligence systems. Ultimately, big data means taking advantage of a large share of the available data, or all of it.


“In order to gain new knowledge, it is important to intelligently combine traditional and new analysis methods. The optimal integration of big data technologies such as Hadoop with existing architectures is of crucial importance here.”
Matthias Reiss, IT Specialist Big Data, IBM

The heterogeneity of the “new” data and the rapid changeability of the formats can hardly be handled with classic data warehouse processes and methods. The main reasons for this are the high upfront effort for data integration and the resulting lack of flexibility when new requirements have to be implemented at very short notice and in an agile manner. In addition, some data can be converted into classic, relational structures only with difficulty. The data lake concept is therefore being used more and more to address these new analytical requirements.


Similarities and differences to the data warehouse

Compared to the classic data warehouse, this is a paradigm shift: whereas data was traditionally first transferred into defined structures using complex data quality and integration processes, in the data lake it is stored directly in its original form. This means that any data can be used quickly and easily for analyses and linked as required. While the classic data warehouse focuses on high (process) efficiency for interactive analyses and reports, with the information prepared fairly precisely for the user, a data lake above all enables the easy discovery of new relationships in unprocessed data. This “research-oriented” approach, which is widespread in data science, is particularly useful when it is not yet clear whether the data can be used to generate value. In practice, a combination of standardized self-service analyses and data-science-oriented procedures is almost always required, which leads to architectures like the one in Figure 1. The data lake is divided into a raw data area and an area with processed, integrated and quality-assured data (refined data).

Figure 1: The data lake and its areas.
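
To make the two areas concrete, the following minimal sketch lands raw data unchanged next to a refined zone that is populated by separate preparation jobs. It assumes PySpark on the Hadoop cluster; all HDFS paths are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake-zones").getOrCreate()

    # The two areas of the data lake from Figure 1 (paths are assumptions)
    RAW_ZONE = "hdfs:///datalake/raw"
    REFINED_ZONE = "hdfs:///datalake/refined"

    # Raw data is landed as-is, in its original form, without upfront modelling
    clicklog = spark.read.text("hdfs:///landing/clickstream/2015-06-01.log")
    clicklog.write.mode("append").text(RAW_ZONE + "/clickstream")

    # The refined zone holds the processed, integrated and quality-assured data;
    # it is filled by separate preparation jobs (see the examples further below).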

Data lakes: combining technologies

The heart of the data lake is usually the open source framework Hadoop. It can process any type of data in large quantities, with the computations spread across many nodes in a cluster. This makes it ideal for storing and analyzing raw data in its original form. Sometimes there is also a need to analyze data before saving it. Reasons for this are, for example, real-time requirements (e.g. actions that result directly from analyses, from warnings to fully automatic processes), or that full storage is not technologically or economically sensible and only certain events should be filtered out, or a pre-aggregation should take place. Streaming analysis systems such as IBM InfoSphere Streams are used for this; they enable analyses to be carried out directly on the data stream.
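
IBM InfoSphere Streams itself is not shown here; purely to illustrate the pattern of filtering events in the stream before anything is persisted, the following sketch uses Spark Structured Streaming with its built-in test source. The filter condition and all paths are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("stream-filter").getOrCreate()

    # A stream of events; the built-in "rate" test source stands in for real sensors
    events = (spark.readStream
              .format("rate")
              .option("rowsPerSecond", 100)
              .load())

    # Keep only the interesting events instead of storing everything
    filtered = events.filter(col("value") % 10 == 0)

    # Append the filtered events to the raw zone of the data lake
    query = (filtered.writeStream
             .format("parquet")
             .option("path", "hdfs:///datalake/raw/sensor_events")
             .option("checkpointLocation", "hdfs:///datalake/checkpoints/sensor_events")
             .outputMode("append")
             .start())

    query.awaitTermination()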

Within the data lake, data is partially processed to make the work of specialist users easier, for example by converting it into dimensional models with the corresponding dimensions, pre-calculating inventory key figures from incoming and outgoing goods, and applying data quality procedures. This prepared data is then often made available to a large group of users.
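
A pre-calculation of this kind can be sketched as a simple batch job from the raw to the refined zone. The sketch below assumes PySpark; the paths, column names and movement types are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("refine-inventory").getOrCreate()

    # Goods movements as they were dropped into the raw zone
    # (expected fields: item_id, movement_type "receipt"/"issue", qty)
    movements = spark.read.json("hdfs:///datalake/raw/goods_movements")

    # Pre-calculate the inventory key figure: receipts minus issues per item
    stock = (movements
             .withColumn("signed_qty",
                         F.when(F.col("movement_type") == "receipt", F.col("qty"))
                          .otherwise(-F.col("qty")))
             .groupBy("item_id")
             .agg(F.sum("signed_qty").alias("stock_on_hand")))

    # Write the integrated, quality-assured result to the refined zone
    stock.write.mode("overwrite").parquet("hdfs:///datalake/refined/stock_on_hand")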

The concept of the prepared part of the data lake essentially corresponds to that of the data warehouse. Classic database technologies are therefore often used here, although users' demands for simplicity have increased significantly. A multitude of innovative new products, from hybrid in-memory data warehouses to data warehouse appliances and cloud offerings, addresses this need. The main aim is to implement new requirements more quickly in order to be able to react quickly and with more agility to dynamic changes in the market environment.

The agility gained in this way is, however, a major challenge from a governance point of view. This is not limited to security, but also includes aspects such as the traceability of the processes, the documentation of data content and interpretations, and the masking of data for certain user groups. Effective governance requires a holistic approach across process and technology boundaries in order to get a complete picture of the “puzzle”.

The influence of technology on business models

In all technology discussions, it is important to always keep an eye on the benefits. New, flexible concepts such as the data lake and innovative products such as Hadoop and streaming analytics offer far more than the modernization of existing analysis landscapes. Above all, they enable completely new business models and fields of business.

The example of the Danish wind turbine manufacturer Vestas shows how technology can change a business model in the long term. Due to increasing price pressure and high production costs in Europe, new ways had to be found to stand out from the global competition. In order to deliver not only turbines but also complete projects with “built-in” investment security, Vestas decided to use big data to calculate optimal locations for wind turbines and wind farms and to make these results available to customers and sales staff.

Above all, the accuracy of the calculations and the type of data used were decisive. The accuracy depended strongly on the level of detail of the underlying data, in this case the wind information. Thanks to the transition from weather balloons to laser-based measurement technology, values with significantly higher precision and frequency are now available, which has multiplied the amount of data massively. The variety of data that had to be combined was also interesting: ranging from sensor data such as weather information, through historical information on the turbines, to master data such as map information. In addition to performance, the simple integration of new data formats in their original form was essential. This corresponds to the raw data area of the data lake. For the implementation, IBM BigInsights, a Hadoop distribution developed for enterprise use, was chosen.

Hadoop Basics for Successful Implementation

The Apache Hadoop framework, with its flexible approach based on a distributed file system, is one of the core components of a data lake architecture. Data in the most varied forms can be stored here efficiently and cost-effectively and made available for analysis. Installation, operation and maintenance of a Hadoop cluster, however, require a considerable amount of know-how, effort, time and cost.

Figure 2: Vestas offers its customers investment and planning security through big data.

Hadoop distributions take the horror out of a Hadoop implementation: they bundle optimally coordinated open source components, enhanced by useful extensions and tools oriented towards everyday practice in companies. This enables the construction of a data lake with optimal integration into existing system landscapes and makes Hadoop fit for use in the company, from installation through to finished analysis and visualization.

Data and analysis for everyone

Data lake concepts promise to give a broad group of people in the company access to data and analyses. The underlying platform should enable insights beyond the standardized BI reports and invite people to work creatively with the data. This is exactly where some hurdles have to be overcome in the Hadoop environment: Java APIs or languages like Pig require in-depth programming skills, and the corresponding skills have so far been available only to a limited extent, both within companies and on the market.

Success here is promised by components that build on know-how that has existed in the company for many years, offer a quick entry into the world of big data and can accelerate the implementation of analyses. One approach is, for example, tools that work similarly to spreadsheet programs but take the special requirements of big data into account. On the one hand, intuitive import processes are required for typical formats such as JSON, CSV or TSV, as well as integrated web crawlers. On the other hand, it makes sense to first define analyses on a small subset of the data (a sample) before applying them to the entire data set, which should ideally be automated, as sketched below.
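
A minimal sketch of this sample-first workflow, assuming PySpark and hypothetical paths, column names and a 1 percent sampling fraction:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("sample-first").getOrCreate()

    # Import a typical raw format (CSV) from the raw zone of the lake
    orders = spark.read.csv("hdfs:///datalake/raw/orders", header=True, inferSchema=True)

    def top_customers(df):
        # The analysis is defined once as a function ...
        return (df.groupBy("customer_id")
                  .agg(F.sum("order_value").alias("revenue"))
                  .orderBy(F.desc("revenue"))
                  .limit(10))

    # ... first checked interactively on a small sample ...
    top_customers(orders.sample(fraction=0.01, seed=42)).show()

    # ... and then applied unchanged to the entire data set
    top_customers(orders).write.mode("overwrite").parquet("hdfs:///datalake/refined/top_customers")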


“The focus of the data lake is not on collecting the data, but on using it. The high flexibility of this concept enables not only the modernization of existing analysis landscapes but also completely new, data-based business models.”

Stephan Reimann, IT Specialist Big Data, IBM

Another important point is the connection of analysis and reporting tools. This requires a Hadoop SQL engine that can be integrated via standard JDBC/ODBC drivers. The decisive factor is ANSI SQL compatibility, which makes the use of existing BI tools on Hadoop data possible in the first place.
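
Once such an engine is in place, any ODBC-capable client can query the lake with plain SQL. The following sketch assumes the Python pyodbc package and an already configured ODBC data source named DataLakeSQL pointing at the Hadoop SQL engine of the chosen distribution; table and column names are hypothetical.

    import pyodbc

    # Connect through a standard ODBC driver (the DSN "DataLakeSQL" is an assumption)
    conn = pyodbc.connect("DSN=DataLakeSQL;UID=analyst;PWD=secret")
    cursor = conn.cursor()

    # Plain ANSI SQL against data that lives on Hadoop
    cursor.execute("""
        SELECT item_id, stock_on_hand
        FROM refined.stock_on_hand
        WHERE stock_on_hand < 100
        ORDER BY stock_on_hand
    """)

    for item_id, stock_on_hand in cursor.fetchall():
        print(item_id, stock_on_hand)

    conn.close()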

In addition to SQL, complex statistical analyses, for example with R, are becoming increasingly important. The possibility of executing R code directly on Hadoop clusters opens up completely new fields of application, but requires special implementations of R, as the language was originally developed for single-user systems. Beyond these functional aspects, operating a Hadoop cluster also requires multi-tenancy and workload management capabilities in order to separate the various analytical workloads and make resources available efficiently.
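
The point here concerns R; purely to illustrate the underlying pattern of pushing per-group statistical code out to the cluster nodes, the following analogous Python sketch uses PySpark's applyInPandas. The data set, columns and statistics are hypothetical.

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("per-turbine-stats").getOrCreate()

    # Wind measurements from the refined zone (path and columns are assumptions)
    measurements = spark.read.parquet("hdfs:///datalake/refined/wind_measurements")

    def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
        # This statistical code runs per group, in parallel on the cluster nodes
        return pd.DataFrame({
            "turbine_id": [pdf["turbine_id"].iloc[0]],
            "mean_wind_speed": [pdf["wind_speed"].mean()],
            "p95_wind_speed": [pdf["wind_speed"].quantile(0.95)],
        })

    stats = (measurements
             .groupBy("turbine_id")
             .applyInPandas(summarize,
                            schema="turbine_id string, mean_wind_speed double, p95_wind_speed double"))

    stats.show()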

Clear water instead of cloudy broth

The best toolbox is of no use if the individual tools are not deployed with planning and care. With all the opportunities that a data lake offers, governance must not be neglected. More than ever, it is important not to lose track of things in the flood of information.

Setting up a data lake is therefore not just a matter of pouring data from all available sources into the central repository; the possibilities of the concept have to be used correctly. Without sensible information management and appropriate governance, this will not succeed.

Questions about the source, trustworthiness, protection and lifecycle management of the data are more important than ever. What data is available in the repository, how is it defined, and how does it relate to other data? This is the information that makes further insights possible in the first place. Powerful data integration tools with intelligent metadata management make it possible to maintain control and to guarantee the traceability of the processing steps.
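
As an illustration of the kind of metadata that answers these questions, here is a minimal, tool-agnostic sketch of a catalog entry for one data set in the lake. All field names and values are assumptions; real metadata management tools cover far more.

    from dataclasses import dataclass, field
    from datetime import date

    @dataclass
    class CatalogEntry:
        """Minimal metadata record for one data set in the data lake."""
        name: str                     # what data is available
        path: str                     # where it is stored
        source_system: str            # where it comes from
        owner: str                    # who is responsible for it
        description: str              # how it is defined
        derived_from: list = field(default_factory=list)        # lineage / traceability
        retention_until: date = None                             # lifecycle management
        restricted_columns: list = field(default_factory=list)  # masking for user groups

    stock_entry = CatalogEntry(
        name="stock_on_hand",
        path="hdfs:///datalake/refined/stock_on_hand",
        source_system="warehouse management",
        owner="logistics BI team",
        description="Current stock per item, calculated as receipts minus issues",
        derived_from=["hdfs:///datalake/raw/goods_movements"],
        retention_until=date(2020, 12, 31),
    )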

The treasure in the data lake

The data lake offers many opportunities to use data and information in companies profitably. In addition to completely new use cases and the resulting business opportunities, it enables above all a “democratization” of data, or in other words: having the right data available at the right time, or, if that is not the case, being able to make it available quickly and easily. And the right basis for an important decision is often worth its weight in gold. Good luck with your treasure hunt in the data lake!

>> To continue: The treasure hunt in the data lake continues

Matthias Reiss and Stephan Reimann 

www.ibm.com/de/de