Data Lake Architecture Blueprint

1 Purpose and Scope

The purpose of this document is to provide a blueprint for a GDPR-compliant Data Lake architecture and its integration. The blueprint is accompanied by the rationale behind the approach and the technology selection.

2 Business goal

One of the major requirements for XXXXX is to be capable of providing data-driven features based on Data Science (DS) and Machine Learning (ML) techniques and algorithms.

In order to enable a DS/ML-based approach, that is, a data-driven approach, XXXX needs an augmentation - a system component that will collect data and provide suitable interfaces for conducting DS/ML analysis, modeling and training. As such a component matches a data-centric architectural pattern known as a Data Lake, this new subsystem is referred to as the Data Lake (DL).

3 Data lake

A data lake, in general, is a system capable of holding a vast amount of raw/unstructured and/or semi-structured data in its native format and of providing a way for that data to be efficiently accessed and used (many times) later, when it is needed.

Gartner refers to Data Lakes in broad terms as “enterprise-wide data management platforms for analyzing disparate sources of data in its native format”.

Some of the key characteristics of a data lake are:

  1. Type of data: it can support data of any kind: raw/unstructured, semi-structured and structured
  2. Schema: schema is applied on read rather than on write (no schema is enforced when storing data)
  3. Storage: designed for low-cost storage
  4. Agility: highly agile; can be configured and reconfigured as needed
  5. Usage: used for data analysis and data exploration (mostly by data scientists)

In order to implement a successful Data Lake, the following key criteria have to be fulfilled:

  1. Adequate platform/technology
  2. Right data
  3. Appropriate interfaces

The adequate platform/technology criterion means that the selected technology has to sustain, over a long period, important aspects such as:

  1. scalability - the ability to accept a vast and increasing amount of data in a performant way
  2. durability - to prevent data loss
  3. flexibility - to efficiently support different kinds of data as well as data processing and reprocessing

The right data criterion means that data stored in the Data Lake has to be accompanied by a proper set of metadata (e.g. data source, timestamps, lineage). Without that, a Data Lake easily turns into a so-called Data Swamp - a lot of unusable data.
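
As an illustration of the "right data" criterion, a record stored in the Data Lake could carry its metadata in an envelope such as the one sketched below. All field names and values are hypothetical and only show the kind of metadata meant above (source, timestamps, lineage).

    # Hypothetical envelope wrapping each record stored in the Data Lake.
    record = {
        "payload": {"event": "button_clicked", "user_id": "42"},  # the raw data itself
        "metadata": {
            "source": "application-x",                 # originating data source
            "event_time": "2018-06-01T12:00:01Z",      # when the event occurred
            "ingested_at": "2018-06-01T12:00:03Z",     # when the Data Lake received it
            "lineage": ["application-x", "source-connector"],  # processing path so far
        },
    }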

The appropriate interfaces criterion stands for the requirement that Data Lake users should be able to easily serve themselves using the provided Data Lake APIs. Without adequate interfaces that are familiar to the DL users (data scientists), extra effort from Data Engineers will be required (and may become a resource bottleneck).

4 Data lake requirements

In the context of XXX, the Data Lake requirements are as follows:

  1. Data and data sources - to accept the data coming from XXXX
  2. Access patterns - to be used by Data Scientists to perform ad-hoc queries (exploratory data analysis - EDA) using different tools, and to allow training of ML models
  3. GDPR compliance - to provide a GDPR-compliant solution
  4. Vendor lock-in - to avoid vendor lock-in, i.e. to work in on-premises environments

The Data Lake should not be responsible for data processing and data governance. These aspects remain the responsibility of the products’ services.

Fig 1 - Data Lake context

4.1 Data and data sources

The Data Lake is expected to accept and store data coming from existing data sources (e.g. existing applications) as well as from future data sources (e.g. 3rd-party databases). It also needs to provide convenient integration mechanisms for different data sources.

In order to estimate the data volume, velocity and growth, a sizing estimation exercise needs to be performed upfront. This step provides the input for proper resource sizing (storage capacity, cluster sizes, etc.) for the current and expected load.

4.2 Access patterns

Data from the Data Lake will be used by Data Scientists with different tools for data analysis (EDA), visualization, testing model hypotheses and training models. This implies different access patterns (random search, multi-join queries, sequential access, etc.).

A Data Lake implementation needs to support all these different access patterns in an easy way, so that the data users (data scientists) are able to serve themselves (self-service API).

4.3 GDPR compliance

Among all the requirements and responsibilities that a GDPR-compliant system has to satisfy, only the following are applicable to the Data Lake implementation:

  1. Right to rectification - support for private data updates
  2. Right to be forgotten - support for private data erasure
  3. Data subject consent - support for erasing private data when the respective consent is revoked

Generally speaking, the Data Lake is a data store. In that sense, only the GDPR requirements that are applicable to stored data management, such as data update and data erasure, have to be supported by an actual Data Lake implementation.

4.4 Vendor lock-in

This requirement means that the DL implementation cannot be based on any environment-specific service (e.g. AWS cloud services), so that it can be deployed in an on-premises environment.

5 Criteria

Based on the above considerations and the requirements that a DL implementation is responsible for, the following is the list of criteria that a technology (or the used tech stack) needs to satisfy or ensure:

  1. Scalability - ability to accept a vast and increasing amount of data in a performant way by scaling (scaling out)
    • Support for expected write throughput
    • Support of the expected total data volume
  2. Durability - to safely persist data and to prevent data loss (replication, backup)
  3. Support for schema-less write
  4. Efficiently handling stream-like data (time series, series of events/messages)
  5. Support for data erasure - in case specific private data needs to be removed from DL in order to satisfy GDPR requirements
  6. Support for data updates - in case specific private data needs to be updated in DL in order to satisfy GDPR requirements
  7. Support for different read access patterns (random access, query/search language, sequential bulk reads) used by data science tools (access from SQL-capable tools, R, etc.).
  8. Simple and flexible connect API for connecting with XXXXX
  9. Ability to provide fast data updates - this is a soft requirement, as there are currently no products/services that would rely on this DL feature.

6 Available technology stacks

The following is a list of technologies that have been considered as candidates for implementing the Data Lake.

  1. Postgres (RDBMS)
  2. Apache Kafka
  3. Apache Hadoop
  4. Apache Cassandra (NoSQL)

6.1 Postgres (RDBMS)

Removed from anonymized example review, contains pros and cons of this technology choice.

6.2 Apache Kafka

Removed from anonymized example review, contains pros and cons of this technology choice.

6.3 Apache Hadoop

Removed from anonymized example review, contains pros and cons of this technology choice.

6.4 Apache Cassandra

Removed from anonymized example review, contains pros and cons of this technology choice.

7 Data lake architecture

After taking into account the above criteria, it is clear that using a single technology to implement the Data Lake is not possible.

There has to be one technology used as the Data Lake storage layer, responsible for persisting all historical data and satisfying all of the DL criteria listed above.

In order to provide the necessary interfaces supporting different access patterns, a collection of specific technologies is needed, each one tailored to a certain usage pattern. This collection implements the part of the system usually called the caching layer (it contains only the latest version of the data, constructed out of the historical data coming from the DL storage layer).

Fig 2 - Data Lake architecture

That is why we can only talk about selecting a base technology for the DL implementation. In addition to the storage function, the base technology has to support an easy connection mechanism both towards the data sources and towards the caching layer.

7.1 Rationales for selecting Apache Kafka

Removed from anonymized example review, contains pros and cons of this technology choice.

7.2 High-level Data Lake architecture

The following picture represents the Data Lake architecture.

Fig 4 - Data Lake architecture details

The DL architecture is shown in its planned context, where XXXX is the data source for the DL and Data Science (data scientists, machine learning algorithms) represents the way the DL is to be used.

Application represents an existing data source for the DL (e.g. user clicks, application events, data from 3rd-party databases). It is connected to the DL via Kafka connectors (Apache Kafka Connect API).

The Data Lake consists of two major parts: the DL storage (persistence layer) and the so-called caching layer. The storage layer is responsible for accepting, storing and retrieving data. The caching layer is responsible for using, transforming and providing data via different interfaces. While the DL storage contains the entire data history, the caching layer contains only the latest version of the data (hence the name caching layer).

The DL storage part is implemented as a Kafka cluster with the respective topics. The caching layer consists of the different technologies needed to support the different usage and access patterns. Caching-layer instances are connected to the Kafka storage topics via connectors implemented with the Kafka Connect API.

Neo4j is an example of how the DL can support graph-like queries; other graph databases can be used in the same way with an appropriate connector implementation. ElasticSearch (ES) provides the text search/indexing usage pattern. Apache Ignite provides a SQL-compatible interface (e.g. to be used from Tableau). The last part of the caching layer is the most complex one. It is based on Apache Spark and provides a way to process data in complex ways, using both the historical data from the Kafka topics and the latest version of it (via the other caching-layer interfaces), as well as machine learning algorithms (Spark ML library). It exposes an API that can be used from R (directly or through a Jupyter notebook).
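
As an illustration of how a caching-layer instance is attached to the storage topics, the sketch below registers an ElasticSearch sink connector through the Kafka Connect REST API. It assumes a Connect worker reachable at connect:8083, a storage topic named dl.events and the Confluent ElasticSearch sink connector installed on the workers; all of these names are illustrative assumptions, not decisions made by this blueprint.

    import json
    import requests

    # Kafka Connect REST endpoint (hypothetical host/port).
    CONNECT_URL = "http://connect:8083/connectors"

    connector = {
        "name": "dl-events-to-elasticsearch",
        "config": {
            # Confluent's ElasticSearch sink connector (assumed to be installed).
            "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
            "tasks.max": "2",
            "topics": "dl.events",                    # DL storage topic to index
            "connection.url": "http://elasticsearch:9200",
            "key.ignore": "true",                     # let ES generate document ids
            "schema.ignore": "true",                  # values are plain JSON, no registered schema
        },
    }

    resp = requests.post(CONNECT_URL, data=json.dumps(connector),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()

Connectors towards Neo4j and Ignite would be registered in the same way, differing only in the connector class and sink-specific settings.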

7.3 Data sources

For the time being, XXXX is the only data source for the Data Lake. The Kafka Connect API can be used to consume the data, filter and transform it if needed (e.g. GDPR-related transformations, decryption), and publish it to the DL storage topics.
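
A hedged sketch of such source-side filtering/transformation is given below: a connector configuration that uses Kafka Connect's built-in MaskField single message transform to blank out personal fields before the records reach the storage topics. The JDBC source connector, the connection details and the field names are purely illustrative assumptions; the same transform chain can be attached to whichever connector ingests data from XXXX.

    # Illustrative Kafka Connect source configuration (a Python dict, to be POSTed
    # to the Connect REST API as in the previous example). All names are examples.
    source_connector = {
        "name": "thirdparty-db-to-dl",
        "config": {
            # Confluent JDBC source connector (assumed available on the workers).
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": "jdbc:postgresql://thirdparty-db:5432/app",
            "mode": "incrementing",
            "incrementing.column.name": "id",
            "topic.prefix": "dl.",

            # GDPR-related transformation: mask personal fields on ingestion.
            "transforms": "maskPii",
            "transforms.maskPii.type": "org.apache.kafka.connect.transforms.MaskField$Value",
            "transforms.maskPii.fields": "email,phone",   # hypothetical field names
        },
    }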

7.4 Storage layer

Removed from anonymized example review, contains pros and cons of this technology choice.

7.5 Caching layer

Removed from anonymized example review, contains pros and cons of this technology choice.

7.6 GDPR compliance

Removed from anonymized example review, contains pros and cons of this technology choice.

8 Implementation Roadmap and PoC

The following steps represent a roadmap for implementing the DL. The first part (steps 1 - 7) can be considered a PoC implementation. The steps under "Handling encrypted data" are planned to test the system performance when the data in the DL is encrypted. All automation and provisioning for the PoC will be done in AWS using EC2. The automation/provisioning scripts will be developed in a platform-agnostic way so they can be used in any environment.


  1. Provide automation script (collection of Ansible roles and playbooks) for provisioning an Apache Kafka cluster.
  2. Provide automation mechanism (Ansible) for deploying an ElasticSearch cluster (in-memory text search/indexing cache)
  3. Implement and provision a distributed Kafka-to-ElasticSearch connector based on Kafka Connect API
  4. Provide automation script (Ansible) for provisioning InfluxDB+Grafana stack for collecting metrics
  5. Provide automation script for collecting Kafka broker/topic and connector metrics data and shipping it to the monitoring stack (using Telegraf)
  6. Using a Kafka load generator (Berserker), model test data for Kafka topics
  7. Execute functional, performance and load tests for the entire infrastructure (Kafka cluster, connector, ElasticSearch) - the goal is to simulate the anticipated production load and data volume

SQL and Spark:

  1. Provide automation script for provisioning an Apache Ignite in-memory cluster (a query sketch against the resulting SQL interface follows after this list)
  2. Implement Kafka-to-Ignite connector
  3. Provide automation script for provisioning Kafka-to-Ignite connector
  4. Provide automation script for provisioning Apache Spark cluster and connecting it to Kafka and Ignite
  5. Provide automation script for provisioning a Jupyter server (connected to the Spark cluster)
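
The end result of the Ignite-related steps is a cache that can be queried over a SQL interface. A minimal query sketch, assuming the pyignite thin client and a hypothetical EVENTS table created by the Kafka-to-Ignite connector, could look like this:

    from pyignite import Client

    # Connect to the Ignite cluster's thin-client port (default 10800).
    client = Client()
    client.connect("ignite-node", 10800)

    # Ad-hoc SQL query against a hypothetical EVENTS table.
    for row in client.sql("SELECT event_type, COUNT(*) FROM EVENTS GROUP BY event_type"):
        print(row)

    client.close()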

Handling encrypted data:

  1. Implement and provision a Kafka-to-Kafka connector for data encryption (encrypt test data using multiple keys and format-preserving encryption); a sketch of this step follows after this list
  2. Implement and provision the decryption service (to decrypt test data) to be used by Kafka-to-<cache> connectors
  3. Modify Kafka-to-<cache> (ElasticSearch, Ignite) connectors to use decryption service
  4. Perform performance tests using encrypted test data
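
A minimal sketch of the encryption step (step 1 above) is shown below. It uses the confluent_kafka client and Fernet from the cryptography package purely as stand-ins: the actual connector would be implemented with the Kafka Connect API and would use multiple managed keys and a format-preserving encryption scheme; the topic names here are illustrative.

    from confluent_kafka import Consumer, Producer
    from cryptography.fernet import Fernet

    # Stand-in symmetric key; Fernet is not format-preserving, illustration only.
    fernet = Fernet(Fernet.generate_key())

    consumer = Consumer({
        "bootstrap.servers": "kafka:9092",
        "group.id": "dl-encryptor",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["dl.events.plain"])            # hypothetical source topic

    producer = Producer({"bootstrap.servers": "kafka:9092"})

    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error() or msg.value() is None:
                continue
            # Encrypt the value and republish it to the encrypted topic.
            producer.produce("dl.events.encrypted", key=msg.key(),
                             value=fernet.encrypt(msg.value()))
            producer.poll(0)                           # serve delivery callbacks
    finally:
        consumer.close()
        producer.flush()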

Data reprocessing:

  1. Implement a Spark job for data reprocessing (the data erasure process); a sketch of such a job follows after this list
  2. Using test data and the load generator, test the data reprocessing Spark job
  3. Implement Kafka-to-Kafka connector
  4. Provision the Kafka-to-Kafka connector and connect the DL with XXXX
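
A hedged sketch of the erasure/reprocessing job (step 1 above) is shown below. It assumes the spark-sql-kafka package is on the classpath, that the storage topic holds JSON-encoded values containing a user_id field, and that the retained records are written to a new topic that then replaces the old one; all of these are illustrative assumptions, not decisions made here.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dl-erasure-reprocessing").getOrCreate()

    # Read the full history of a storage topic as a batch.
    raw = (spark.read
           .format("kafka")
           .option("kafka.bootstrap.servers", "kafka:9092")
           .option("subscribe", "dl.events")
           .option("startingOffsets", "earliest")
           .load())

    # Hypothetical: data subjects whose records must be erased.
    erased_ids = ["42", "1337"]

    # 'user_id' is an assumed field inside the JSON-encoded value.
    user_id = F.get_json_object(F.col("value").cast("string"), "$.user_id")
    kept = raw.filter(user_id.isNull() | ~user_id.isin(erased_ids))

    # Write the retained records to a new topic that replaces the old one.
    (kept.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
     .write
     .format("kafka")
     .option("kafka.bootstrap.servers", "kafka:9092")
     .option("topic", "dl.events.reprocessed")
     .save())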

Graph DB cache:

  1. Provision Neo4j instance
  2. Implement and provision Kafka-to-Neo4j connector