May 3, 2024 4 min read Data Architecture

Relational Data Warehouses: What are they?

We all know about Relational Databases but what is a Relational Data Warehouse?

A Relational Data Warehouse (RDW) is organized on the principle of relationship models, the same as a Relational Database. Data entities from different domains are represented as tables (e.g. a table to hold records about Customers and a table to hold records about Sales that were made to Customers). Tables representing different domains are usually linked to each other through a common attribute (e.g. a Customer has a unique ID which is also recorded in the Sales table to show the relationship model between a Customer and their purchases).

A RDW collects, stores and manages significant volumes of structured data from a variety of sources.

💡

Data warehouses can also be based on non-relational data sources. Non-relational data warehouses include types such as columnar, NoSQL, and graph data warehouses.

An RDW is, in theory, a Single Source of Organizational Truth.

This property of an RDW is its most valuable. It signifies that all data stored in the RDW adheres to a uniform, structured format, presenting a unified, precise portrayal of information. This guarantees uniform access for all users, eradicating disparities and data isolation. Such integration enhances organizational decision-making, collaboration, and operational efficiency. Additionally, it mitigates the likelihood of errors and misunderstandings inherent in disparate, incongruous data origins.

💡

When utilized by the entire company, an RDW is commonly referred to as an Enterprise Data Warehouse (EDW). The EDW offers a cohesive, consolidated perspective on the entirety of the organization's data.

How are RDW/EDW created?

Usually, when constructing an RDW/EDW, you'll establish data pipelines that execute three primary steps, known as extract, transform, and load (ETL):

Firstly, the pipeline retrieves data from the source systems, such as databases, flat files, JSON files, CSV, XML etc.
Next, the retrieved data undergoes transformation to conform to the requirements of the target systems (in this instance, aligning with the specifications of an RDW/EDW). This process may entail tasks like cleaning, filtering, aggregating, or integrating data from multiple sources.
Finally, the transformed data is loaded into the data warehouse.

💡

Developing RDW/EDW is not a walk in the park. It is therefore vital that before starting work on one, be sure an RDW/EDW is really the solution. Someone could just be asking for an ad-hoc report/dataset that does not warrant the creation of a full-blown RDW/EDW.

Use a top-down approach to design RDW/EDW.

An RDW/EDW has become popular in data-literate organizations because they can be used to determine strategy, product roadmaps, sales targets and operations.

It is therefore advisable that any RDW/EDW's construction begins with questioning the organization's goals and needs.

Be sure about the organization's strategy

If you don't know the corporate strategy, RDW/EDW will be useless to your needs. It is therefore critical to speak with the right people, ask them the right questions and use this information to devise a plan for the RDW/EDW. For example, an objective could be "increase sales in lagging geographies". To discover which geographies might be lagging in sales, the RDW/EDW must have data from sales and products divided across geographies.

If the objective is to "increase employee satisfaction", the RDW/EDW should have provision for Employee data and some sort of Survey outcomes.

Architect the data warehouse

Develop a high-level architectural framework for the data warehouse, encompassing its structure, data models, and data integration processes, per the business requirements. These elements will serve as your technical prerequisites.

Model the data domains

Construct an intricate data model for the data warehouse, considering the interrelations among different data entities and the level of detail (granularity) of the data.

Create the data model

Implement the requisite databases, schemas, tables, and fields for the data warehouse, following the schema-on-write approach.

Write ETL procedures

The ETL processes will extract data, manipulate it to a desired structure format and load it into the RDW/EDW.

Continuously test and improve the RDW/EDW

As you become proficient in managing RDW/EDW, tweak the architecture, data models and other constructs to make the whole system leaner and performant.

It's Not Always Sunny in Philadelphia

Yup.

Along with all the good things RDW/EDW allows us to do, there are serious side effects related to their development.

They have limited flexibility

RDW/EDW are usually supportive of specific analysis types (such as those that need structured data) and this limits their ability when it comes to processing semi/unstructured data sources. Additional hacks/tools are needed to expand the RDW/EDW's analytical capabilities.

They require strong security

With so much important and proprietary data in the same place, it is important to apply the latest and greatest security protocols to keep malicious users out. If one does get into the RDW/EDW, they have unhindered access to ALL of the organizations' data.

They are complex

It is no surprise they are. An RDW/EDW is the preferred Single Source of Organizational Truths and therefore designing it requires input from multiple stakeholders, some with simple needs and some with overlapping, complex demands.

Data integrations may not be straightforward

This issue is more about the tools we use rather than any inherent problems in the RDW/EDW design activity. The selected technology for the RDW/EDW may not have the right connectors to extract data from a Postgres database and architects have to find ways around this limitation.

💡

This might not be a concern nowadays since most RDW/EDW providers have connectors and integration capabilities with many data sources.

Data latency is a real concern

It's not surprising this is a concern. Remember RDW/EDW are processing, moving, and saving massive quantities of datasets, with each dataset requiring its transformation logic. This can lead to a situation where the data in the RDW/EDW is not up to date.

I write to remember, and if, in the process, I can help someone learn about Containers, Orchestration (Docker Compose, Kubernetes), GitOps, DevSecOps, VR/AR, Architecture, and Data Management, that is just icing on the cake.