ETL (Extract, Transform, and Load) extracts data from different source systems, transforms it, and loads it into a data warehouse system.
ETL testing is done before data is moved into a production data warehouse system. It is sometimes also called table balancing or production reconciliation. The main goal of ETL testing is to identify and mitigate data defects and general errors before the data is processed for analytical reporting.
ETL testing tasks include:
- Understanding the data to be used for reporting
- Reviewing the data model
- Source-to-target mapping
- Data checks on the source data
- Package and schema validation
- Data integrity and quality checks in the target system
- Performance testing of the data
ETL testing involves operations such as:
- Validating the data movement from the source to the target system
- Checking the data counts in the source and the target system
- Validating data extraction and transformation against requirements and expectations
- Verifying that table relations such as joins and keys are preserved during the transformation
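The count check above can be sketched as a simple reconciliation script. This is a minimal illustration using in-memory SQLite databases; the `orders` table and its columns are hypothetical, not from any particular system.

```python
import sqlite3

# Hypothetical in-memory source and target databases for illustration.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")

src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10.0), (2, 20.5), (3, 7.25)])

tgt.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
tgt.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10.0), (2, 20.5), (3, 7.25)])

def row_count(conn, table):
    """Return the number of rows in `table`."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# Count check: source and target should hold the same number of rows.
assert row_count(src, "orders") == row_count(tgt, "orders")
print("count check passed:", row_count(src, "orders"), "rows")
```

A real reconciliation would also compare checksums or column-level aggregates, not just row counts, since counts can match while values differ.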
We need ETL because it helps companies analyse their business data to support business decisions. A data warehouse provides a common data repository, and ETL provides a method of moving data from many different sources into that warehouse; as the data sources change, the data warehouse is updated accordingly. The ETL process allows sample data comparison between the source and the target system, can perform complex transformations, and requires an extra staging area to store data. ETL is a predefined process for accessing and manipulating source data into the target database. It offers deep historical context for the business and helps improve data quality, because transformation rules are codified and reused without the need for technical skills.
It consists of three processes: extraction, transformation, and loading.
In extraction, data is taken from the source system into a staging area. The staging area gives an opportunity to validate the extracted data before it moves into the data warehouse. There are three data extraction methods:
- Full extraction
- Partial extraction without update notification
- Partial Extraction with update notification
Irrespective of the method used, extraction should not affect the performance and response time of the source systems, because these source systems are live production databases. The validations done during extraction are:
- Reconciling records with the source data
- Making sure that no spam or unwanted data is loaded
- Data type checks
- Removal of all types of duplicate data
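Two of these validations, the data type check and duplicate removal, can be sketched as a small cleansing pass. The record layout and field names below are assumptions made for illustration only.

```python
# Hypothetical extracted rows; field names are illustrative only.
rows = [
    {"id": 1, "email": "a@example.com", "amount": "10.0"},
    {"id": 2, "email": "b@example.com", "amount": "20.5"},
    {"id": 2, "email": "b@example.com", "amount": "20.5"},  # duplicate record
    {"id": 3, "email": "c@example.com", "amount": "oops"},  # fails type check
]

def is_valid(row):
    """Data type check: amount must parse as a number."""
    try:
        float(row["amount"])
        return True
    except ValueError:
        return False

# Keep the first occurrence of each id, dropping duplicates and bad rows.
seen = set()
clean = []
for row in rows:
    if row["id"] in seen:
        continue  # duplicate data
    if not is_valid(row):
        continue  # wrong data type
    seen.add(row["id"])
    clean.append(row)

print(len(clean))  # 2
```

In practice these checks usually run in the staging area, so that rejected rows can be logged and reconciled against the source counts.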
Data extracted from the source server is raw and not usable in its original form, so it needs to be cleansed, mapped, and transformed. In the transformation step we can also perform customised operations on the data; for example, the user may want a sum of sales revenue that is not stored in the source database. Validations done during this stage include filtering using rules and lookup tables for data standardisation, character set conversion and encoding handling, using lookups to merge data, and applying any complex data validations.
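The transformation step described above can be sketched as follows: a lookup table standardises an inconsistent field, and a derived value (the sum of sales revenue) is computed that does not exist in the source. The region codes and lookup table are hypothetical examples.

```python
# Hypothetical raw rows extracted from the source; values are illustrative.
raw = [
    {"region": "ny", "revenue": 100.0},
    {"region": "NY", "revenue": 50.0},
    {"region": "ca", "revenue": 75.0},
]

# Lookup table used to standardise inconsistent region codes.
region_lookup = {"ny": "New York", "ca": "California"}

# Transformation: map each code through the lookup table.
transformed = [
    {"region": region_lookup[r["region"].lower()], "revenue": r["revenue"]}
    for r in raw
]

# Derived value not stored in the source: total sales revenue.
total_revenue = sum(r["revenue"] for r in transformed)
print(total_revenue)  # 225.0
```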
Loading data into the target data warehouse database is the last step of the ETL process. A huge volume of data typically needs to be loaded in a relatively short window, so the load process should be optimized for performance.
Types of loading:
- Initial load: populating all data warehouse tables
- Incremental load: applying ongoing changes periodically, as needed
- Full refresh: erasing the contents of one or more tables and reloading them with fresh data
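The three loading types can be contrasted in a small sketch. This is a minimal illustration against an in-memory SQLite database; the `sales` table and its schema are assumptions, and the incremental load is shown here as an upsert (SQLite's `ON CONFLICT ... DO UPDATE`, available since SQLite 3.24).

```python
import sqlite3

# Hypothetical warehouse table; the schema is illustrative only.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

def initial_load(conn, rows):
    """Initial load: populate the empty table."""
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def incremental_load(conn, changed_rows):
    """Incremental load: apply only the rows that changed (upsert)."""
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
        changed_rows)

def full_refresh(conn, rows):
    """Full refresh: erase the table contents and reload with fresh data."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

initial_load(wh, [(1, 10.0), (2, 20.0)])
incremental_load(wh, [(2, 25.0), (3, 30.0)])  # one update, one new row
print(wh.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone())
```

An incremental load touches far fewer rows than a full refresh, which is why it is the usual choice for the periodic ongoing changes described above.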
Examples of ETL tools are MarkLogic, Oracle, and Amazon Redshift.