Relational Data Warehouses: What are they?
A Relational Data Warehouse (RDW) is organized on the principle of relationship models, the same as a Relational Database. Data entities from different domains are represented as tables (e.g. a table to hold records about Customers and a table to hold records about Sales that were made to Customers). Tables representing different domains are usually linked to each other through a common attribute (e.g. a Customer has a unique ID which is also recorded in the Sales table to show the relationship model between a Customer and their purchases).
A RDW collects, stores and manages significant volumes of structured data from a variety of sources.
An RDW is, in theory, a Single Source of Organizational Truth.
This property of an RDW is its most valuable. It signifies that all data stored in the RDW adheres to a uniform, structured format, presenting a unified, precise portrayal of information. This guarantees uniform access for all users, eradicating disparities and data isolation. Such integration enhances organizational decision-making, collaboration, and operational efficiency. Additionally, it mitigates the likelihood of errors and misunderstandings inherent in disparate, incongruous data origins.
How are RDW/EDW created?
Usually, when constructing an RDW/EDW, you'll establish data pipelines that execute three primary steps, known as extract, transform, and load (ETL):
- Firstly, the pipeline retrieves data from the source systems, such as databases, flat files, JSON files, CSV, XML etc.
- Next, the retrieved data undergoes transformation to conform to the requirements of the target systems (in this instance, aligning with the specifications of an RDW/EDW). This process may entail tasks like cleaning, filtering, aggregating, or integrating data from multiple sources.
- Finally, the transformed data is loaded into the data warehouse.
Use a top-down approach to design RDW/EDW.
An RDW/EDW has become popular in data-literate organizations because they can be used to determine strategy, product roadmaps, sales targets and operations.
It is therefore advisable that any RDW/EDW's construction begins with questioning the organization's goals and needs.
Be sure about the organization's strategy
If you don't know the corporate strategy, RDW/EDW will be useless to your needs. It is therefore critical to speak with the right people, ask them the right questions and use this information to devise a plan for the RDW/EDW. For example, an objective could be "increase sales in lagging geographies". To discover which geographies might be lagging in sales, the RDW/EDW must have data from sales and products divided across geographies.
If the objective is to "increase employee satisfaction", the RDW/EDW should have provision for Employee data and some sort of Survey outcomes.
Architect the data warehouse
Develop a high-level architectural framework for the data warehouse, encompassing its structure, data models, and data integration processes, per the business requirements. These elements will serve as your technical prerequisites.
Model the data domains
Construct an intricate data model for the data warehouse, considering the interrelations among different data entities and the level of detail (granularity) of the data.
Create the data model
Implement the requisite databases, schemas, tables, and fields for the data warehouse, following the schema-on-write approach.
Write ETL procedures
The ETL processes will extract data, manipulate it to a desired structure format and load it into the RDW/EDW.
Continuously test and improve the RDW/EDW
As you become proficient in managing RDW/EDW, tweak the architecture, data models and other constructs to make the whole system leaner and performant.
It's Not Always Sunny in Philadelphia
Yup.
Along with all the good things RDW/EDW allows us to do, there are serious side effects related to their development.
They have limited flexibility
RDW/EDW are usually supportive of specific analysis types (such as those that need structured data) and this limits their ability when it comes to processing semi/unstructured data sources. Additional hacks/tools are needed to expand the RDW/EDW's analytical capabilities.
They require strong security
With so much important and proprietary data in the same place, it is important to apply the latest and greatest security protocols to keep malicious users out. If one does get into the RDW/EDW, they have unhindered access to ALL of the organizations' data.
They are complex
It is no surprise they are. An RDW/EDW is the preferred Single Source of Organizational Truths and therefore designing it requires input from multiple stakeholders, some with simple needs and some with overlapping, complex demands.
Data integrations may not be straightforward
This issue is more about the tools we use rather than any inherent problems in the RDW/EDW design activity. The selected technology for the RDW/EDW may not have the right connectors to extract data from a Postgres database and architects have to find ways around this limitation.
Data latency is a real concern
It's not surprising this is a concern. Remember RDW/EDW are processing, moving, and saving massive quantities of datasets, with each dataset requiring its transformation logic. This can lead to a situation where the data in the RDW/EDW is not up to date.
I write to remember, and if, in the process, I can help someone learn about Containers, Orchestration (Docker Compose, Kubernetes), GitOps, DevSecOps, VR/AR, Architecture, and Data Management, that is just icing on the cake.