Things to know about ETL
In one of the previous blogs ,we got to know about BI and how to get a kick start to a BI career
Here,we will be moving a step ahead into a more specific domain – “ETL ”
What is ETL??
ETL is shortened form of EXTRACT, TRANSFORM, LOAD. It is a process or method in which data gets extracted from a source through different means, changed (transformed ) as per specific need and the changed ( transformed ) data is loaded to another system, mostly known as target system.
Let’s take a simple example of a school to understand this. Let’s Say, your school principal wants the number of students successfully clearing subject tests every month. In another words, how many (%) students are failing and in the subject they are failing in?
So the Prinicpal would need an application that would do three main functions:
- Pull data from exam management system and HR system – EXTRACT
- Consolidate the pull data to sum up student marks – TRANSFORM
- Place consolidated data in Principal’s application – LOAD
Companies need to make smart decisions ,for which the OLTP data is insufficient and results in poor performance. Thus the need of DataWarehouse ,which is a repository of data from all sources in a transformed manner which helps in analyzing data.
The question here is “How to change and transfer data generated by source systems to DataWarehouse/Data Marts which would be further used by a reporting tool?”.
This is where “ETL” comes in the picture.
As already mentioned, 3 main processes involved in ETL(as the name suggests!!) are:
The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, as extracting data correctly will set the stage for how subsequent processes will work.
Most data warehousing projects consolidate data from different source systems. These could be CSV files or any Relational databases.
The transform stage applies a series of rules or functions to the extracted data. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database:
Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), for example roll_no, age, and salary, then the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all those records where salary is not present (salary = null).
Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
Encoding free-form values (e.g., mapping “Male” to “1”)
Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
Aggregation (for example, rollup — summarizing multiple rows of data — total sales for each store, and for each region, etc.)
Generating surrogate-key values
Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)
Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linkedaddress table)
Lookup and validate the relevant data from tables or referential files for slowly changing dimensions
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may add new information to the existing one and other might just overwrite existing information with cumulative information. Updating of extracted data is done periodically – daily, weekly or monthly.
Steps involved in ETL
1) Extract (from sources) to Staging area *
3) Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
4) Load to datawarehouse
What is Staging area and why do we need it
Staging area is a temp schema used to
1. Do Flat mapping i.e dumping all the OLTP data in to it without applying any business rules.pushing data into staging will take less time becasuse there is no business rules or transformation applied on it.
2. Used for data cleansing,transformation and validation
We can achieve ETL using SQL and procedures .However using an apt ETL tool makes it faster ,easy and efficient and lets one focus on the business perspective than on syntax
ETL tools present in market range from open source to paid These could be categorized based on functions, engine type ,Development mode and open source or paid
Following link gives the details for it
Useful links for reference
Following link provides you with a simple case study for ETL followed by its implementation in Informatica http://www.learnbi.com/informatica12.htm