Taking time to extract, transform, and load all your organization’s stored data to a single data image could be the difference between organizational success and failure. When it comes down to it, an organization cannot build its success on a disorganized, crumbling data infrastructure. Stored data has to be regularly consolidated so that all the diverse sources, operating systems and measurement formats can contribute to a single data image that can communicate what is really happening within the organization. Creating a single data image involves a core process of data extraction, transformation and loading (ETL).
The ETL process affects your data in different ways, all resulting in a single data image for easier data interpretation and decision making. Often times this ETL process includes using specialized ETL software to make the entire process go smoothly.
Extraction
Data extraction refers to the process of taking the data from disparate servers and other sources, including (in theory) paper sources and accumulating it in one location for processing. Extraction can often take place in the final location and the data can be transformed in a single step. Many people think that the rise of cloud servers may mean the end of ETL software. The step of data extraction can, in many cases, be eliminated because of the cloud's capacity for indefinite data collection. However, this theory may not account for the full range of data transformation possibilities.
Transformation
The transformation of data refers to changing the form of data so that units are relatable and all data fits into compatible formats. If you have many data sources, it may require some major transformations. The growth of databases and the increasing variety of data types would make the consolidation of data into a single data image difficult or impossible without ETL software. If the data is not properly integrated, easy to find and merged from a variety of sources and metrics, it would be very difficult to make decisions based on data. Decisions are often made in the absence of data just because of the difficulty extracting information and having it in an interpretable form.
There are two main varieties of data, measurement date (often in numerical format) and qualitative data (often names, images, video or pictorial forms).
- Numeric data that derives from measurement or counts can provide useful graphical analysis to assess changes and states to describe the operation of a business.
- Qualitative data are also the kinds of data you "look up," like names, addresses, or images. These are indexed and indexes can be searched to come up with the piece of information you want. Qualitative data can also be "quantified" by content analyzing the data and making content counts, or by frequency counts of the various categories the user establishes.
The one thing you need to avoid in a data stream is to mix "apples and oranges." If you want to integrate apple counts with orange counts, you have to change the units to number of fruits. To make full use of data from a variety of sources with different methods of measurement and different units of measure, you have to create a "synthetical, i.e., not actually observed," quantitative unit.
Here are some different ways of combining quantitative data with different units:
- The most statistically sophisticated way is to use multivariate regression. By inter-correlating all the variables, one can develop a formula for combining all the disparate data into a single number that reflects all of their influences.
- One can also make a simple index by calculating standard scores [(score-mean score)/ standard deviation of scores] then adding across standard scores to get an index. Standard scores represent deviations from the average (above or below it) in a format that makes different measurement units compatible.
- One can also create a useful index that expresses the influence of all the variables by ranking all of the variables across an "independent variable" like time, then averaging the ranks. Ranks are unit-less and are therefore compatible across variables.
Loading
Data loading is about the process of putting all this transformed data into an orderly set of records on a database. To be useful it has to be organized in a way that it can be accessible, readable, and practical so that end users can easily interface with it. In general, before data can be loaded, the database structure and tables to load must have already been created.
The ETL system includes the capacity to build databases and define the user table. When the load process begins, the database system builds primary key indices and user-defined indexes. The indexes improve the speed of retrieval and a way of accessing records quickly. A powerful loading process is needed to enable full tabulation and cross tabulation of your database.
At 5x Technology our goal is to provide timely, efficient, intelligible, and readily accessible information, useful in the corporate decision-making process. Please contact us to learn more.
Related Articles:
Using Uniserv in Your Data Warehousing & Data mart Design Strategy
Using Cloud Based Solutions for Big Data: How Can This Affect Your Business