Data Warehouse Hybrid Modeling Methodology

Viktor László Takács (1), Katalin Bubnó (2), Gergely Gábor Ráthonyi (1), Éva Bácsné Bába (3) and Róbert Szilágyi (1)

1 Institute of Applied Informatics and Logistics, Faculty of Economics and Business, University of Debrecen, Debrecen, HU
2 School of Mathematical and Computational Sciences, University of Debrecen, Debrecen, HU
3 Institute of Rural Development, Tourism and Sports Management, Faculty of Economics and Business, University of Debrecen, Debrecen, HU

Corresponding author: Viktor László Takács (takacs.viktor@econ.unideb.hu)


Introduction
From the 2000s, a new concept had to be studied in management sciences: 'Business Intelligence' (BI). Krauth (2008) summarized the state of Business Intelligence at the time and also forecasted the expected development and changes of BI for the following ten years. One of the priority changes he emphasized was that 'Technologies providing business intelligence will leave the corporate framework and move on a much wider scale to serve the growing demands of organizations and individuals for accurate, substantial and comprehensible information.' Ten years later we saw that this demand had indeed arisen among the leaders of small and medium enterprises as well, and several so-called self-service business intelligence solutions are now available to satisfy it. From the 2010s, the basic technology of business informatics, online analytical processing (OLAP), also appeared in scientific research, mainly in the processing of measured data.
In the business environment of the 2000s, mainly in small and medium enterprises, we unfortunately saw that neither the economists nor the IT staff of these small firms could work with these BI solutions (software products). The reason for the problem lies not in technical difficulties but in acquiring the right way of thinking: modeling the functioning of their own firm, correctly defining the information requirements of management (conceptual modeling), and translating them into a logical model. We participated in a data warehouse project where we realized that our 'corporate bus matrix' contained about 80-120 indicators with nearly 200 dimensions (dimensional attributes), therefore we started to work on an early stenography to formalize management questions (Takács & Bubnó 2012).

Related Works
It is an often-mentioned problem in the literature that there is no standardized or widely agreed method for implementing the conceptual model (Bánné 2012; Macedo & Oliviera 2015; Rizzi 2008). Furthermore, it is good practice to follow the classical design steps of database systems (Halassy 1994) in the design of the data warehouse (conceptual model -> logical model -> physical model -> implementation), but opinions differ in the literature on the right order of the steps, and there is considerable overlap between them. Conceptual and logical modeling in particular are often mixed with each other, or their borders are blurred, although Halassy clearly defined the levels of database planning and proved that these levels must be separated from each other (Halassy 1994).
According to the method presented here, the conceptual model is nothing more than a set of formalized leadership questions. A management question can already be seen as part of an OLAP data cube, and OLAP cubes can be built from subsets of management questions. The task is to optimize their number and distribution: how many cubes do we have to build, and how many questions can they answer? While the former is a matter of cost for the customer, the latter concerns the efficiency of the information system.
As regards data warehouses as information systems, efficiency comes down to the two aspects mentioned above: cost and the amount of information that can be extracted. Di Tria, Lefons, & Tangorra (2017) tested design methodologies and carried out a cost-effectiveness analysis, setting up a framework and metrics for this purpose. The classic approaches to data warehouse design can be sorted into two sets: data-driven methods and requirement-driven methods. Both have advantages and limitations (Di Tria, Lefons, & Tangorra 2017). For example, the requirement-driven approach leads to multidimensional schemas that usually result in one data cube answering only one management question. The main problem with the multidimensional schemas of the data-driven approach is the large number of potential questions, which produces data lakes that degrade into data swamps. We consider a cuboid as the base of a potential management question. The problem is: what is the minimal (cost-optimal) number of cuboids? These problems with both approaches led to the birth of several so-called hybrid modeling methods for designing data warehouses. Di Tria, Lefons, & Tangorra (2017) identified criteria from the literature to evaluate these hybrid methods. They used the four common and main criteria necessary to evaluate a data warehouse design methodology: correctness, completeness, minimality and understandability (Halassy 1994). Furthermore, based on the literature, they defined metrics for the evaluation of costs and benefits, specifically 'Metrics for schema quality' and 'Metrics for design effort'. They then compared six hybrid methods within this framework: 1. Graph-based Hybrid Multidimensional Model (GrHyMM), 2. UML for Data Warehouse (UMLDW), 3. Multidimensional Design by Example (MDBE), 4. Phipps & Davis Methodology (PDM), 5. Goal-oriented Requirement Analysis for Data Warehouse Design (GRAnD), 6. Goal/Question/Metric-based Methodology (GQM).
Based on Di Tria, Lefons, & Tangorra (2017), in Table 1 we present the steps of the six methodologies above and the features or results of each step, extended with our Visualized Management Question-based Design methodology (referred to as VMQD*).
Our Visualized Management Question-based Design methodology is closest to GQM in the steps, but there are some differences.
1. Requirement analysis. In this phase we collect questions, together with the metrics and dimensionality of the problem and the required visualizations, from the decision makers via interviews. We then formalize these specifications with our special structured stenography, based on the terminology of the current problem. The output of this step is a set of formalized questions. 2. Deriving minimal granularity. Based on the set of formalized questions, we specify the required minimal granularity for every indicator. The output of this step is the set of indicators with minimally detailed dimensions.

3. Deriving ideal schemata. We map the dimensional attributes and values to keys and produce the initial conceptual schemata. The output of this step contains ideal dimensions (keys, attributes and hierarchies) and ideal facts (with dimension keys for joins), independently of the sources. 4. Source analysis. The main question of this step is: what kind of transactions can we derive the facts from?
We decompose ideal facts into potential elementary transactional attributes and identify them in the source systems. The output of this step is the set of derived potential schemata. 5. Integration. The ideal schemata from the requirement analysis are compared with the potential star schemata.
A match occurs when the two schemata contain the same fact and both have the same dimensionality at the same granularity level. In this step we define the required transformations and calculate fact tables and common dimensions with attributes. 6. Multidimensional modeling. We build the cube(s) with dimensions, dimension hierarchies and measures.
We used the notation of Di Tria, Lefons, & Tangorra (2017) to visualize the framework of our methodology for better comparison (Figure 1). Every management question is a visualization of aggregated indicator(s) detailed by dimensional attribute(s), so we must formalize these questions of the management.
The following section explains what the formalisation has to cover.

Methodology of formalization of conceptual design
In the following paragraph we would like to explain our methodology.
1. In the phase of requirement analysis, we collect and formalize management questions. Each question and its description are defined in both a textual and a formal way. The 'manager' is the person for whom the system provides information. As a result, she expects to see data-visualization reports appearing on different dashboards. The reports and dashboards need to be configured according to access-right requirements (which level of managers can access which statements).
In the requirement analysis, we 'analyse' the management question based on the following considerations (Table 2): What is the indicator? In which aggregation? What is the unit? Which visualization do we want to see? In which detail(s)? Is there a slicer?
This formal definition relates to one diagram. A manager may request several different visualizations for a single question in the requirement specification, so several formal descriptions may be produced in this step.
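The considerations listed above can be captured in a simple record type. The following is a minimal sketch in Python; the class and field names are our own illustration, not part of the paper's stenography:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ManagementQuestion:
    """One formalized management question from the requirement analysis."""
    text: str                     # the textual question from the interview
    indicator: str                # what is measured (e.g. "daily steps")
    aggregation: str              # aggregate function: sum, avg, count, ...
    unit: str                     # unit of measure
    visualization: str            # requested chart type: table, radar, ...
    details: List[str] = field(default_factory=list)  # dimensional attributes
    slicer: Optional[str] = None  # optional filter, e.g. "Month = March"

# Question 3 of the later case study, expressed in this sketch:
q3 = ManagementQuestion(
    text="Average daily steps of men, women and all by day of week in March",
    indicator="daily steps",
    aggregation="avg",
    unit="steps/day",
    visualization="radar chart",
    details=["gender", "day of week"],
    slicer="Month = March",
)
```

The set of all such records is the output of the requirement analysis step and the input of the granularity derivation.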
The fact table is defined formally depending on whether it contains one or more indicators. The dimension table is defined formally as well.

During optimization we can notice one of the following cases (Table 3):
• Different indicators with the same dimensionality and granularity can be grouped into one fact table.
• Different indicators can be similar when they have the same dimensionality but diverging granularity; in this case we examine whether they can be produced from one another.
• The dimensionality of one indicator is a proper subset of the dimensionality of the other indicator; in this case the dimensionality of the other indicator will be the minimum required.
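The first optimization case, grouping indicators with identical dimensionality and granularity into one fact table, can be sketched as a simple grouping step. The sample indicators below are invented for illustration and are not taken from the paper's case study:

```python
from collections import defaultdict

# Each indicator is described by its set of (dimension key, granularity) pairs.
indicators = {
    "daily_steps":    frozenset([("Date", "day"), ("Person", "person")]),
    "active_minutes": frozenset([("Date", "day"), ("Person", "person")]),
    "monthly_steps":  frozenset([("Date", "month"), ("Person", "person")]),
}

# Indicators sharing dimensionality and granularity go into one fact table.
fact_tables = defaultdict(list)
for name, dims in indicators.items():
    fact_tables[dims].append(name)

for dims, names in fact_tables.items():
    print(sorted(names), "->", sorted(dims))
```

Here `daily_steps` and `active_minutes` end up in a common fact table, while `monthly_steps` (coarser Date granularity) gets its own, matching the second case where we would then check whether it can be produced from the daily one.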

I1{D} × I2{D} -> I{D}

Combining indicators I1 and I2 with the same dimensionality D: we create the Cartesian (Descartes) product of the two indicators.

I1{A} × I2{A,B} -> I{A,B}

Here A and B are dimensions of indicators I1 and I2, and the dimensionality of I1 is a proper subset of that of I2.

Notation:
○ E{pk} — pk is the primary key of entity E
○ E{a} — a is an attribute of entity E
○ E{fk} — fk is a foreign key of entity E
○ R( ) — R is a relation, a transaction between different entities, with its relational attributes
○ R{a} — a is an attribute of relation R

• In the case of an analysis-oriented source system (OLAP), the multidimensional relational model is typical.
The hierarchy key dhk appears in the lower right index when the hierarchy levels are stored in separate tables. 5. During integration, we coordinate the ideal data model with the available source-system data and describe the requirements, including the attributes, dimension keys, and additional indicators to use. We design the data loading (ETL/ELT) process (Table 4).
In the data loading process we extract, transform, and load only those keys, attributes, and indicators from the source system to the data warehouse that we actually need.
During the transformation of data, two types of data loading processes can be distinguished:
• In the ETL (Extract, Transform, Load) process, the various data manipulations carried out after extracting the data from the source system typically take place in an intermediate system (staging area), and the results are then transferred to the data warehouse. The intermediate area is needed for complex multi-step data manipulations (transform), but even for simpler systems it is worth considering, since it can serve several data warehouses with commonly used general dimensions such as a date dimension, a geographical dimension, or an organizational dimension (enterprise data lake).
The transformations can be simple: the indicators are loaded from a single source-system relation, and the dimensions are also derived from a single source system.
The transformations may be more complicated when the values of the indicators are loaded from several relations' attributes or from several different source systems. The same can occur with the attributes of a dimension.
• In the ELT (Extract, Load, Transform) process, the various data manipulations take place after the data is extracted from the source system and loaded into the data warehouse.
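The difference between the two loading styles can be illustrated with a toy example; the staging area and warehouse here are plain Python lists and the source rows are invented, so this is only a sketch of where the transform step runs, not of the paper's actual infrastructure:

```python
# Toy source rows: (person, minute-accurate timestamp, cumulative steps)
source = [("p1", "2018-03-01 08:03", 120),
          ("p1", "2018-03-01 08:14", 180)]

def transform(row):
    """Example transformation: keep only the keys and measures we need."""
    person, ts, steps = row
    return {"PersonKey": person, "DateKey": ts[:10], "steps": steps}

# ETL: transform in a staging area, then load into the warehouse.
staging = [transform(r) for r in source]                # Transform
warehouse_etl = list(staging)                           # Load

# ELT: load the raw rows first, then transform inside the warehouse.
warehouse_raw = list(source)                            # Load
warehouse_elt = [transform(r) for r in warehouse_raw]   # Transform

assert warehouse_etl == warehouse_elt  # same result, different place of work
```

Both paths produce the same rows; the choice affects where the intermediate data lives and which systems can reuse it.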
When we formalize a business question, the method is purely requirement-driven; we do not deal with the data-driven approach at this stage. We turn to it after all the questions have been collected, formalized and modelled and the required data model has been generated: then we examine the questions through the data-driven approach as well. At this point we examine how the formalized elements of the questions are stored in the source systems. If we can produce them in the required granularity (or in more detail), then we optimize them (for example with some kind of aggregation function) in the ETL/ELT (Extract-Transform-Load / Extract-Load-Transform) processes.

Research data processing example (Collecting and processing fitness tracker data)
The relationship between physical inactivity and some chronic health conditions is a widely researched area, but further efforts are needed to assist people to adopt healthier lifestyles (Lee et al. 2012). Using wearable activity trackers can be a promising opportunity for individuals to improve lifestyle behaviour (Maher et al. 2017). There are several studies in this area, mainly from the lifestyle behaviour and health perspective (Henriksen et al. 2018; Kaewkannate & Kim 2016; Karapanos et al. 2016). Our research does not examine activity trackers from the aspect of health; we want to present how data collected with a very simple device can be processed with OLAP technology. An end user who wants to know his own activity using such an activity tracker usually downloads software that processes his data every day and informs him. But if we plan a wide study in which we want to collect several persons' data and recognize trends and patterns in the behaviour of society, it is a useful approach to plan a data cube with the OLAP approach. With our hybrid design methodology we get metadata from the dimensions and attributes, so if we extract our data into a dataset, we also get the formal description of the structure of our dataset, so that it can be shared and compared with other similar studies.
The value of the indicator I can be obtained by summing over dimension D (roll-up) with the aggregate function written in the lower left index of I. The aggregation runs from D{dk} at the bottom of the summation symbol to the level at its top (either all, or the D{dhk} hierarchy level), leaving out the original key.
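The roll-up operation described above corresponds to grouping by the coarser hierarchy key and applying the aggregate function; a minimal sketch in plain Python, with invented fact rows for illustration:

```python
from collections import defaultdict

# Fact rows at DateKey (daily) granularity: DateKey, MonthKey, steps.
facts = [
    {"DateKey": "2018-03-01", "MonthKey": "2018-03", "steps": 7500},
    {"DateKey": "2018-03-02", "MonthKey": "2018-03", "steps": 9100},
    {"DateKey": "2018-04-01", "MonthKey": "2018-04", "steps": 6400},
]

# Roll up from D{dk} (DateKey) to the D{dhk} hierarchy level (MonthKey)
# with the sum aggregate function, leaving out the original key.
rolled = defaultdict(int)
for row in facts:
    rolled[row["MonthKey"]] += row["steps"]

print(dict(rolled))  # {'2018-03': 16600, '2018-04': 6400}
```

Summing to the "all" level is the same operation with a single constant group key.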

Our research investigated the physical activity of university students using fitness trackers. Participants in the pilot test had to meet several criteria: they had to wear the device under normal living conditions for 90 consecutive days, which reflected the everyday life of most students. An important element of the long-term pilot test is that it can represent the full range of normal people's activities in a real environment. Each participant was informed about the most important features of the device and about managing possible sources of error. The battery of the bracelet was recharged by the users every 20 days, depending on use, and the data was collected at the same time. The data was sent by the users daily for one week and then at the recharges mentioned above. This data-supply regime served to reduce the potential loss of data. We informed the students about the study, and all the participants provided informed consent in compliance with the principles of the Declaration of Helsinki (WMA 2013) and the GDPR (EP 2016). The study was approved by the Regional Ethics Board (DE RKEB/IKEB: 4843-2017) at the Clinical Center of the University of Debrecen.
The collected bracelet data is processed using OLAP technology. We use the hybrid design methodology and formal descriptive techniques presented above to design, implement, and document the operations related to the information system (Research Data Warehouse) that we produce. Table 5: The students' daily activity by daily-step intensity categories in March:

Question 1, formally:
It shows how many days were completed by the students in March by daily step categories.
The 'daily step categories' naturally require a detailed discussion during the requirement analysis and at the same time predict a clustering task in the integration phase. Table 6: Students' average daily activity in March by category and gender:

Question 2, formally:
It shows the students' average number of completed days in March by daily step categories and gender.
The terms 'students' and 'daily' refer to the maximal detail of the data average that we can deal with in the minimal-granularity, optimal data model or in the integration phase. Table 7: Average daily steps of men, women and all, by day of the week in March:

Question 3, formally:
It shows a comparison of men's, women's and combined average daily numbers of steps in March, by day of the week.

Minimum granularity for each indicator
We define attributes for values and keys for dimensional attributes in the questions.

March is a month value of the Date dimension (D{March}) and weekday is an attribute of the Date dimension (D{weekday}), so the related dimension keys should be MonthKey (D{MK}) and DayofWeek (D{DoW}); the common dimension key for both is DateKey (D{DK}).
Gender is an attribute of the Person dimension (P{gender}), so the possible dimension key could be PersonKey (P{PK}) or GenderKey (P{GK}); the minimum granularity is GenderKey (P{GK}).
The following indicators, with the minimal required granularity, form the base for answering each question we have:

Ideal schemata for OLAP system
In this step we determine which indicators can be stored in a common fact table.
During optimization, we can see which indicators are similar and can be produced from one another. In this case the activity indicator detailed by GenderKey (GK) can be generated from the activity indicator detailed by PersonKey (PK). This means that we have already managed to optimize the number of indicators to be built.
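Deriving the GenderKey-level indicator from the PersonKey-level one is a roll-up through the Person dimension; the following sketch uses invented sample values, not the study's data:

```python
from collections import defaultdict

# Person dimension: PersonKey -> GenderKey (sample values).
person_dim = {"p1": "M", "p2": "F", "p3": "F"}

# Activity indicator at PersonKey granularity: (PersonKey, DateKey) -> steps.
activity_pk = {("p1", "2018-03-01"): 8000,
               ("p2", "2018-03-01"): 6500,
               ("p3", "2018-03-01"): 7200}

# Roll up to GenderKey granularity through the dimension attribute.
activity_gk = defaultdict(int)
for (pk, dk), steps in activity_pk.items():
    activity_gk[(person_dim[pk], dk)] += steps

print(dict(activity_gk))  # {('M', '2018-03-01'): 8000, ('F', '2018-03-01'): 13700}
```

Because the coarser indicator is derivable, only the PersonKey-level fact needs to be stored.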

Source analysis
In this phase we discover the data of the source systems driven by the facts and dimensions specified in the ideal data model.
The data from the Android phone (S{cumulative steps}) is a cumulative step count for a given time, so we must first calculate its dynamics (the increment relative to the previous measurement).
Finally, we generate one large common data source from many individual files: we generate 10-minute interval data from the activity tracker data. The Time property of the Android phone data has minute accuracy, so to obtain 10-minute raw data it must be normalized to 10-minute accuracy:

S{time10} = 10 · ⌊S{time} / 10⌋,

where ⌊x⌋ is the whole (integer) part of the real number x. Each S{steps} value must be broken down into the current 10-minute increment and the previous 10-minute increment. This yields the normalized Android bracelet data.
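The two preparation steps, turning cumulative readings into increments and truncating minute timestamps to 10-minute intervals, can be sketched as follows. This is a simplified illustration with invented readings; it assumes the counter starts at zero and assigns each increment wholly to its reading's interval rather than splitting it across the boundary:

```python
def normalize_10min(minute_ts):
    """Truncate a minute-accurate timestamp (minutes since midnight)
    to its 10-minute interval: t10 = 10 * floor(t / 10)."""
    return 10 * (minute_ts // 10)

def increments(cumulative):
    """Turn cumulative step readings into per-reading increments."""
    prev = 0
    out = []
    for value in cumulative:
        out.append(value - prev)
        prev = value
    return out

# Sample readings: (minutes since midnight, cumulative steps) — invented data.
readings = [(483, 120), (494, 300), (505, 450)]
steps = increments([s for _, s in readings])
intervals = [normalize_10min(t) for t, _ in readings]
print(list(zip(intervals, steps)))  # [(480, 120), (490, 180), (500, 150)]
```

The resulting (interval, increment) pairs form the normalized rows loaded into the fact table.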

Integration
During the integration phase, we describe the production of the fact tables and dimensions specified in the ideal data model that are to be used to answer the questions. We determine the indicators and dimensions needed for integration (not necessarily in this order); as a result of integration, these should form a kind of documentation.
Finally, we determine the steps of the data loading process (ETL/ELT) and their sequence. Our strategies for achieving the integration goal are the top-down (ideal model -> source) and bottom-up (source -> ideal model) strategies; both are widely used in information processing and knowledge ordering, and in practice they can be seen as styles of thinking, teaching, or leadership.

The data load ETL process

During this process, we load the S (Steps) relation properties of the source system and match the dimension keys of the fact table that contains the 10-minute normalized steps in the OLAP system (Table 8).
After the base ETL we load the dimensions defined in the ideal data model (Tables 9-11) and make the necessary conversions. After we have performed the basic ETL of the DateMonth dimension, the number of rows corresponds to DateKey granularity with monthly duplicated values, so we must deduplicate the rows as noted in Table 12. We calculate the gender-independent daily step and student number indicators by summing the gender-dependent daily step and student number indicators, and combine these two gender-independent indicators with the four gender-dependent indicators. The last step is to divide the three daily step indicators by the related three student number indicators to get the three average daily step indicators.
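The final averaging step above can be sketched with a few lines of Python; the numbers are illustrative, not the study's results:

```python
# Gender-dependent aggregates: total daily steps and student counts.
daily_steps = {"M": 160000, "F": 189000}
students    = {"M": 20, "F": 27}

# Gender-independent ("all") indicators, obtained by summing over gender.
daily_steps["all"] = daily_steps["M"] + daily_steps["F"]
students["all"] = students["M"] + students["F"]

# Average daily steps: divide each step total by the matching student count.
avg_daily_steps = {g: daily_steps[g] / students[g] for g in daily_steps}
print(avg_daily_steps)
```

The three resulting averages (men, women, all) are exactly the measures requested by Question 3.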

The Multidimensional modeling phase
We build the cube(s) with dimensions, dimension hierarchies and measures. In our example we implemented our galaxy schema (Figure 2) in Microsoft Power BI as the result of our hybrid methodology; the design framework itself was shown in Figure 1. This data cube is the optimal cube to answer the researchers' questions.
The researchers' questions were: • The students' daily activity by daily-step intensity categories in March (preferred visualization: table). • Students' average daily activity in March by category and gender (preferred visualization: table). • Men's, women's and all students' average daily steps by day of the week in March (preferred visualization: radar chart).
Figures 3-5 show screenshots of the dashboards corresponding to the questions. The data cube with the dashboards (Takács 2018) was also implemented in Microsoft Power BI.

Summary
In our study we presented a method and a concrete design tool that can remedy a serious deficiency in the data warehouse conceptual design phase, when the customer and the vendor should think together to draw up the conceptual plan of a management information system. We provide a kind of 'business intelligence problem-solving thinking' and a kind of descriptive language that can serve it. We demonstrated with an example that this approach can work very efficiently in a currently popular research area, activity tracking. The problem we presented was simple and involved a minimal number of management questions, but this hybrid conceptual modeling works in the same way during the conceptual design of a more complex management information system; the visual version of the design process of our example (Takács 2019) results in a rather complex graph. The thinking method and the formalisation help to describe the managerial questions exactly in the conceptual design phase, so it can be an effective intermediate language between the designers and creators of the management information system, supporting successful implementation and, in the long run, helping to supply the management or researchers with useful and correct information about their company or research. Our method has a limitation related to the ETL process: we focused on the transformations made after the extract-load processes, first in the intermediate storage and then in our Research Data Warehouse. In this example we have not defined notations for the complex transformations of the ETL process.

Funding Information
The publication is supported by the GINOP-2.3.2-15-2016-00005 project. The project is co-financed by the European Union under the European Regional Development Fund.

Éva Bácsné Bába initiated the research on the physical activity of university students in 2018 and specified the research environment. Viktor László Takács, Katalin Bubnó, Gergely Gábor Ráthonyi and Róbert Szilágyi improved the early stenography into a hybrid modeling method in 2018, working closely with the activity researchers.