Datawarehousing, Datawarehousing,... Datawarehousing. This is becoming the new buzz word in the world of computer software.Datawarehousing consultants are now being the most sought after people in the software industry, and with just a little over 2+ years of experience, software companies and organizations are willing to pay any amount they demand. So, what constitutes datawarehousing?. Let's explore it.
Datawarehousing (DW) is also known as Business Intelligence (BI), Business Warehouse (BW), and Business Information Warehouse (BIW).
In order to explain the basic definition of what exactly constitutes a datawarehouse in software terms, let's take an example of none other than our own Indian Railways. We all know that many trains are operated by railways, and they maintain a computerized software system for the same. This system takes care of the following thigs
- Reservation /Cancellation of Tickets
- Preparation of charts
- Train-timings
- Station details
- Goods, Cargo,Parcel details
- Railway employee salaries
and properly maintains and updates these on a regular basis. This means, this system is very frequently updated. Such a system; which is updated regularly is called a transaction processing system, or simply OLTP System (OnLine Transaction Processing system)
Now, say for example, if the top-management of Southern Railways, wants to do an analysis for
- Introducing new trains, new routes, new stations
- Analyze profit and loss of a given train or station
- Analyze railway employees' salary
- Analyze the percentage of cancellations for a particular train or station
- Analyze the percentage of waiting list for a particular train or station
then, they cannot use the OLTP system because it is very frequently updated every second,every minute,every hour,every day,every week and every month. Also, it will be very helpful for the senior management to take decisions based on multi-dimensional views of the system. But unfortunately, OLTP systems are not designed for multi-dimensional views. Again, if analysis is done on OLTP system's data, it will be changed data. For example, if analysis is done in the morning on reservations, there is a possibility that reservation might have been cancelled, which implies incompleteness of the data (for analysis purposes), and analysis has to begin from scratch once again, which is again a waste of time.
In order to over come this, a new concept called OnLine Analytical Processing system (OLAP) was introduced (popularly known as datawarehouse or business intelligence). In this system, there wont be any updates to the data regularly, which means historical data or past data only will be available, in a multi-dimensional view, specifically for analysis purposes.
OLTP vs. OLAP
OLTP -> Updates to data is very frequent.
OLAP -> No updates to data.
OLTP -> Latest/Current data will be available.
OLAP -> Historical data will be available.
OLTP -> Data will be available in normalized tables.
OLAP -> Data will be available in cubes.
OLTP -> Data that is available, will be perfectly suited for day-to-day business.
OLAP -> Data that is available, will be perfectly suited for analysis purposes.
OLTP -> Data that is available, can be acessed by all business users, customers, senior management, et all.
OLAP -> Data that is available, can be accessed only by senior management of the enterprise and/or restricted users.
OLTP -> Live data (Detailed data)
OLAP -> Summarized data.
OLTP -> (simply)Data
OLAP -> Metadata (i.e., Information about data)
OLTP -> Data that is available, can be modified.
OLAP -> Data cannot be modified.
OLTP -> Database design is (very) complex.
OLAP -> Database design is (very) simple.
OLTP -> No. of transaction per users is very high 100s to 1000s.
OLAP -> No. of transaction per users is very low.
OLTP -> Response time is very quick.
OLAP -> Response time is reasonable.
OLTP -> Security is about which users can access which business functionality.
OLAP -> Security is about which data; whether should be at info-provider level, etc. and not about user access.
OLTP -> focuses on getting day-to-day work of the business completed quickly and efficiently. Specific users work on specific business functionality.
OLAP -> There's nothing like creation, updation, deletion, insertion, modification, etc. Only specific users (usually the senior management) display and analyze the data.
Now, we have got a basic idea as well as a brief overview of what is a datawarehouse and what is an OLAP system. Now let's further explore it.
Datawarehouse systems use what is known as cubes, dimensions, and measures. Let's take another example of reservations.
Reservation can be done against a train, a station, a class, date, berth, and no. of passengers. So, in this example, reservation will be considered as a 'Measure', against which various 'Dimensions' like train, station, class, date, etc., can be analyzed. These various dimensions, against a single measure, is called multi-dimensional view. The place where data about both dimensions and measures are available (for a multi-dimensional view) is known as 'Cube'. Similarly, there can be a separate cube for cancellation, trains, stations, passengers, etc.
Now let's analyze some common terms associated with datawarehousing.
Data Scrubbing - Scrubbing, as the name suggests means rubbing or erasing or deleting. Data scrubbing means, erasing (deleting rather) unnecessary data. For example, the data that is loaded into the datawarehouse comes from various OLTP sources. So, scrubbing ensures the inaccuracies in the data due to different OLTP sources are removed and contradicting data is also removed.
Fact table - Stores detailed values about values for the measures (otherwise known as facts)
Data Marts - Business divisions of the datawarehouse based on an organization's (enterprise) business divisions (or departments). For example, in our example of railways Tickets, Finance, etc.
Data Preparation Area - Also known as data staging area. Data which is in OLTP need not necessarily be maintained in a single OLTP system, which means, for datawarehouse, data from various OLTP systems are added. Therefore, in the staging area, data from various OLTP sources is converted into an uniform and unique types, i.e., data is prepared and ready to be loaded into the datawarehouse database (after type conversion).
Datamining - Mining as the name suggests, is to search for something unknown, with the belief that something will be discovered. For example, in the railways large number of cancellations (or reservations), could mean a particular impact on a particular station, etc. Discovering the reason for the same and thereby show a perfect (or near-perfect) future trend is known as datamining.
ETL - ETL stands for Extraction Transformation and Loading.
Extracting is the technique of collecting data, scrubbing it.
Transformation means preparing the data, making the data ready to be available in the datawarehouse.
Loading is the technique of moving the transformed data into the datawarehouse's database (i.e., moving the transformed data into cubes).
Once ETL is completed, data will be available in cubes in multi-dimensional views, and analysis can be carried out.
Popular datawarehouse software - SAP BI, Informatica, COGNOS, SQL Server Analysis Services and many more
SAP BI - Provided by SAP Corporation (www.sap.com)
Informatica-Provided by Informatica Corporation (www.informatica.com)
Cognos - www.cognos.com
SQL Server Analysis Services - Provided by Microsoft Corporation (www.microsoft.com)
So, the next time you hear in news that railways is going to introduce a new train/route, you know how they would have done analysis !!