A Future of Data Warehousing

Data is growing bigger, and the traditional way of turning data into insights or reports using the classic EDW will become obsolete in time. Still, it is hard to believe that the future will show a clean break from the classic EDW into... what? Although data volumes can become huge, not all of the data is meaningful, and not even all meaningful data is needed in the EDW. I believe that a near-future BI or BA architecture should account for both the classic EDW and Big Data components. Maybe this can be based on the following thoughts:

(1) The Data Lake contains raw data. The Hadoop-based Data Lake is favoured by the Big Data world for storing massive amounts of structured, semi-structured, or unstructured data, derived from batches or streams, from internal or external sources. Depending on the size of the data sources, the data is physically copied to the Data Lake, virtually linked to it, or made available in any physical or virtual mixture. Among other functions, it serves as the location where a Data Scientist will find data to analyze.
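As a purely illustrative sketch of that ingestion step, the PySpark snippet below lands a raw batch in a Hadoop-based Data Lake without transforming it. The paths, the app name and the "clickstream" source are assumptions of mine, not part of any specific architecture.

```python
# Minimal sketch: land a raw batch in the Data Lake "as is".
# Paths, the app name and the "clickstream" source are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

# Read a raw, semi-structured batch (JSON events) from a landing zone.
raw = spark.read.json("hdfs:///landing/clickstream/2016-11-14/")

# Store it untouched; only tag it with an ingestion date so the Lake
# remains a faithful, partitioned copy of the source.
(raw.withColumn("ingest_date", F.current_date())
    .write.mode("append")
    .partitionBy("ingest_date")
    .parquet("hdfs:///datalake/raw/clickstream/"))
```

The point is that the Lake stays a faithful copy of the source; any structuring or cleaning is deferred to the Data Hub.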

(2) A Data Scientist isn't always a data quality expert, or lacks the time to clean data before starting to analyze it. The Data Hub attempts to structure or clean the data in the Data Lake beforehand, either physically or virtually by means of views. Cleaning all Data Lake data will generally not be possible, but parts can be done. It is important to realize that Data Hub data might no longer be realtime, while Data Lake data can be. The Data Lake and Data Hub will probably reside in the same Hadoop cluster. Both are designed to feed the Analytic Marts of the Data Scientists.
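To make the virtual variant concrete, here is a hedged sketch of a Data Hub defined as a cleaned view on top of the Lake, continuing the hypothetical clickstream example and Spark session from the previous sketch; the column names and the deduplication rule are assumptions.

```python
# Sketch of a Data Hub as a virtual, view-based cleaning layer over the Lake.
# Column names and the deduplication rule are illustrative assumptions.
raw = spark.read.parquet("hdfs:///datalake/raw/clickstream/")
raw.createOrReplaceTempView("lake_clickstream")

spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW hub_clickstream AS
    SELECT CAST(event_time AS TIMESTAMP)   AS event_time,
           TRIM(LOWER(customer_id))        AS customer_id,
           CAST(amount AS DECIMAL(12, 2))  AS amount
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY event_id
                                  ORDER BY event_time DESC) AS rn
        FROM lake_clickstream
        WHERE customer_id IS NOT NULL          -- drop unusable records
    ) deduped
    WHERE rn = 1                               -- keep one row per event_id
""")
```

Because the Hub is just a view here, it is only as fresh as the moment it is queried, which is exactly the realtime caveat mentioned above.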

(3) The Data Lake and Data Hub look like the classic Staging Area and the ODS. The Staging Area fed the classic EDW, and the ODS was primarily designed for business analysis purposes. I assume that the Data Lake or Data Hub can be used to feed the classic EDW, replacing the Staging Area and the ODS, while still serving its primary function for the Data Scientists. Of course, the classic EDW will only use parts of the Data Lake/Hub.
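A sketch of what that feed could look like: a selected slice of the hypothetical Hub view from the previous sketch is pushed into a relational EDW over JDBC. The URL, credentials, table name and load window are placeholders.

```python
# Sketch: push the slice of the Hub that the EDW needs into the classic EDW.
# JDBC URL, credentials, table and the load window are placeholders; the
# appropriate JDBC driver is assumed to be on the classpath.
edw_feed = spark.sql("""
    SELECT event_time, customer_id, amount
    FROM hub_clickstream
    WHERE event_time >= to_date('2016-11-01')   -- current load window only
""")

(edw_feed.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://edw-host:5432/edw")
    .option("dbtable", "staging.clickstream_feed")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("append")
    .save())
```

Only the subset the EDW actually needs leaves the cluster; the rest stays in the Lake/Hub.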

(4) The classic EDW will, as it always has been, be used for periodic tactical/strategic reporting and analyses using dedicated Data Marts. Its data can therefore best be derived from the cleaned and structured Data Hub, replacing the data cleaning process originally present in the Staging Area or ODS.
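As an illustration of such a dedicated mart, the sketch below aggregates the hypothetical Hub view from the earlier sketches to a monthly, per-customer grain; the grain and the output path are assumptions.

```python
# Sketch: a dedicated Data Mart derived from the cleaned Hub data.
# The monthly, per-customer grain and the output path are assumptions.
mart = spark.sql("""
    SELECT year(event_time)  AS event_year,
           month(event_time) AS event_month,
           customer_id,
           SUM(amount)       AS revenue,
           COUNT(*)          AS events
    FROM hub_clickstream
    GROUP BY year(event_time), month(event_time), customer_id
""")

# Persist the mart where the reporting or analysis tool expects it.
mart.write.mode("overwrite").parquet("hdfs:///marts/monthly_revenue/")
```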

(5) Realtime Operational BI, Event-Based Analysis, or Business Activity Monitoring needs realtime data to be able to produce realtime alerts. The classic EDW isn't suitable for this. This function must be based on virtualised data management, like the Logical Data Warehouse. The data needed can best be derived from the Data Lake, since the Data Lake will contain the realtime data needed. For reasons of performance, only a fraction of it can be used.
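A minimal sketch of such a realtime alert, assuming the streaming side of the Data Lake arrives through a Kafka topic called "transactions" and that a simple amount threshold is enough to raise an alert; the topic, the message schema and the threshold are all assumptions of mine.

```python
# Sketch: realtime operational alerting on the stream that feeds the Data Lake.
# Assumes the spark-sql-kafka package is available; the topic name, the
# message schema and the alert threshold are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("realtime-alerts").getOrCreate()

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount",      DoubleType()),
    StructField("event_time",  TimestampType()),
])

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
    .select("t.*"))

# Only a fraction of the stream matters: raise an alert for large amounts.
alerts = stream.filter(F.col("amount") > 10000)

(alerts.writeStream
    .outputMode("append")
    .format("console")        # in practice: an alerting sink or a dashboard
    .start()
    .awaitTermination())
```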

(6) Big Data reporting or analysis can't use the copy-copy-copy process of the classic EDW approach anymore; there is simply too much data. For non-realtime periodic or ad-hoc reporting or analysis of Big Data, cleaning, and therefore a (virtual) copy process to the same Hadoop cluster, is still imaginable. The data source can be the Data Hub. If the Big Data resides in the cloud, it would be advisable to build your Data Lake/Hub and BI/BA process in the same cloud.
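For instance, a hedged sketch of an ad-hoc report that never leaves the cluster: it reads the hypothetical Hub view from the earlier sketches, aggregates it, and stores the result in the cluster's own warehouse (the "reports" database is assumed to exist).

```python
# Sketch: a non-realtime, ad-hoc report that stays inside the cluster.
# It reads the Hub view from the earlier sketches and stores the result in
# the cluster's own warehouse; the "reports" database is assumed to exist.
report = spark.sql("""
    SELECT customer_id,
           COUNT(*)    AS events,
           SUM(amount) AS revenue
    FROM hub_clickstream
    GROUP BY customer_id
""")

# saveAsTable keeps the result in the same Hadoop cluster (or the same cloud).
report.write.mode("overwrite").saveAsTable("reports.clickstream_by_customer")
```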

(7) For realtime Big Data reporting or analysis a separate cleaning process is hard to imagine. Therefore, the source of data must be the Data Lake. The data in the Data Lake should be virtual for the most part, or even better, entirely virtual. This means that the Data Lake is only a logical component; in reality it resides at the source or sources of the Big Data. The Big Data BI/BA process that reports or analyses in realtime should also run where the Big Data is. If clean data is required, the Big Data should be clean the moment it is created. To achieve data processing performance, the hardware and software generating and maintaining the Big Data should deal with concepts like parallelism, replication, and segmentation in an MPP architecture, as Hadoop provides for Data Lakes that (partly) exist physically. It seems that everything must be in the same cluster: the Big Data, the Data Lake, and the data processing are all situated in one Big Data appliance in order to handle Big Data in realtime. To achieve the performance needed, that Big Data appliance will structure the Big Data automatically, based on the way it is created and demanded, and it can change this structure when changes in creation or demand are noticed. This automatic data structuring resembles the tedious job we data warehouse specialists had to do manually to construct a classic EDW. Finally, if the Big Data resides in the cloud, everything is in the same cloud, in the same appliance in the cloud.
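The sketch below shows what a purely logical Data Lake component can look like in practice: a view that points at the source system, so that nothing is copied up front and rows only move when a query runs. The JDBC URL, table and column names are placeholders, and the aggregation itself still runs in Spark rather than in the source.

```python
# Sketch: a purely logical ("virtual") Data Lake component.
# Nothing is copied up front; the view points at the source system and rows
# only move when a query runs. Simple filters can be pushed down to the
# source, while the aggregation runs in Spark. All names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("virtual-lake").getOrCreate()

virtual_events = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://source-host:5432/operational")
    .option("dbtable", "public.events")
    .option("user", "reader")
    .option("password", "***")
    .load())

virtual_events.createOrReplaceTempView("lake_events_virtual")

# Query the source "in place": only rows newer than the cut-off are fetched.
spark.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM lake_events_virtual
    WHERE event_time >= '2016-11-14 00:00:00'
    GROUP BY event_type
""").show()
```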

(8) Until now I never used the word agile. I don't think the agile way of working was "invented" to manage Big Data. Many years ago, before data became Big Data, I already built classic EDWs in an agile way. Methods like the "Rapid Warehousing Methodology" were based on agility. But of course it can also be used to deal with Big Data.

(9) Finally, as data is constantly growing, it seems inevitable that every analysis environment, be it a Lake, Hub, EDW, Mart, or Appliance, will only deal with data without a long history, as probably observed by everyone who has dealt with classic EDWs. The data that is gathered today wasn't gathered earlier and is therefore unavailable. In general, this means that keeping historic data in your warehouse or lake or whatever isn't your greatest challenge, relative to data sizes. Storing, maintaining, and accessing the last year of data is most likely a much bigger issue.

dd 14-11-16