3 min read

What is a Data Lake?

A tsunami of data resulted in the creation of a lake. It just so happens that the aquatic life in this lake resembles 1's and 0's, making any data-fisher very excited.

The term 'Data Lake' is a metaphor describing data storage that does not discriminate on what data types* are stored in it (*data types = structured, unstructured, semi-structured).

As data navigates through the data lake, it undergoes cleaning, integration, and potentially aggregation to make it usable and useful. All these activities require computing power that interfaces with the data lake, extracts and manipulates the data, and then reintegrates it into the data lake.

[Figure: data types flowing into a data lake — clockwise: audio, relational data, XML, structured data, JSON, video.]
💡
Data Lakes use Schema-on-Read.
With Schema-on-Read, data is stored in its original form and a structure is imposed only when the data is read or queried. By contrast, Schema-on-Write requires the structure to be defined up front, before the data is written.
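The difference can be sketched in a few lines of Python. The field names and types below are hypothetical, purely for illustration: the raw text lands in the lake untouched, and the schema is chosen only when someone reads it.

```python
import csv
import io

# Schema-on-read sketch: the raw bytes are stored as-is in the lake...
raw = "id,amount\n1,9.99\n2,12.50\n"

# ...and a (hypothetical) schema is applied only at read time.
schema = {"id": int, "amount": float}

rows = [{col: schema[col](val) for col, val in record.items()}
        for record in csv.DictReader(io.StringIO(raw))]
print(rows)  # [{'id': 1, 'amount': 9.99}, {'id': 2, 'amount': 12.5}]
```

Two consumers could read the same raw file with two different schemas — that flexibility is the whole point of Schema-on-Read.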

Best Approaches for Creating a Data Lake

Of course, the biggest favour you can do for yourself as you create a Data Lake is to ensure it doesn't end up a Data Swamp.

💡
A "Data Swamp" is a term used to describe situations in which an organization's data repository becomes disorganized, chaotic, and largely unusable due to the accumulation of large volumes of data without proper organization, management, or governance. In a data swamp, data may lack metadata, context, or documentation, making it difficult for users to find, understand, or trust the data for decision-making or analysis purposes.

Designing a data lake is not a trivial task. It is a time-consuming process, and before designing one you must know what types of data will be stored, at what volumes, and how frequently the data is updated.

Divide the data lake into multiple layers (also called zones)

  • Each layer has an increasing data quality index, i.e. data quality improves as data moves from one layer to the next (Raw to Cleaned to Presentation).
    • RAW:
      • Raw
        • Reservoir for data in its original state.
        • Data is unfiltered and unpurified.
        • This layer is also called the Bronze Layer or Landing Layer.
      • Conformed
        • Data is transformed into a common storage file format (Parquet is the most popular format but CSV and JSON can also be used).
        • Also called the Base or Standardized Layer.
    • CLEANED:
      • Enriched
        • Data is available as consumable datasets.
        • Also called the Silver, Transformed, Refined, Processed or Enriched Layer.
    • PRESENTATION
      • Curated
        • This layer is business-facing i.e. non-technical resources can connect to it to satisfy their reporting needs.
        • Also called Trusted, Gold, Production, Curated, Analytics or Consumption.
    • SANDBOX
      • Used by data scientists, analysts and others for exploration and playing with the raw data.
💡
This is 'A' style for architecting a data lake, not the only one.
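As a minimal sketch, the layered layout above can be bootstrapped as a set of top-level zone directories. The zone names here follow the layering described, but they are an assumption — use whatever naming your team agrees on.

```python
from pathlib import Path
import tempfile

# Hypothetical zone names mirroring the layers described above.
ZONES = ["raw", "conformed", "enriched", "curated", "sandbox"]

# A throwaway root stands in for the real lake storage (S3 bucket, ADLS
# container, HDFS path, ...).
root = Path(tempfile.mkdtemp()) / "datalake"
for zone in ZONES:
    (root / zone).mkdir(parents=True, exist_ok=True)

created = sorted(p.name for p in root.iterdir())
print(created)  # ['conformed', 'curated', 'enriched', 'raw', 'sandbox']
```

On object stores the "directories" are just key prefixes, but the same zone-per-prefix convention applies.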

Divide the data lake into folders

This approach provides much more flexibility and, not surprisingly, an opportunity to shoot yourself in the foot if you are not cautious. Folders can be treated as 'zones', and their arrangement should be driven by decisions made collaboratively.

Examples of folder structures are:

  1. Save data into folders per business unit.
  2. Save data into folders broken down by specific roles.
  3. Save data in folders named after the data lifecycle stages.
💡
Other ways to divide data into folders are:
- Time partitions (the day the data was ingested)
- Frequency with which a folder/zone is accessed (e.g. Hot, Lukewarm, Cold)
- Owner's name
- Sensitivity of data (e.g. open access, limited access, top secret)
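Time partitioning in particular tends to follow a predictable path convention. A small sketch of such a path builder, with hypothetical zone and dataset names:

```python
from datetime import date
from pathlib import PurePosixPath


def ingestion_path(zone: str, dataset: str, day: date) -> str:
    """Build a time-partitioned lake path: <zone>/<dataset>/year=/month=/day=.

    The key=value partition style shown here is a common convention
    (Hive-style partitioning), not the only option.
    """
    return str(
        PurePosixPath(zone)
        / dataset
        / f"year={day.year:04d}"
        / f"month={day.month:02d}"
        / f"day={day.day:02d}"
    )


print(ingestion_path("raw", "sales", date(2024, 3, 7)))
# raw/sales/year=2024/month=03/day=07
```

Encoding the partition in the path lets query engines prune whole date ranges without opening a single file.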

I write to remember, and if, in the process, I can help someone learn about Containers, Orchestration (Docker Compose, Kubernetes), GitOps, DevSecOps, VR/AR, Architecture, and Data Management, that is just icing on the cake.