ETL
Extract
Transform
Load
ETL stands for Extract, Transform, and Load: a set of common processes for collecting, integrating, and distributing data so that it is available for other uses, such as analytics, machine learning, and reporting.
ETL enables an organization to carry out data-driven analysis and decision-making using operational data.
Extract: data is collected from one or more sources.
Transform: the required business logic is applied to the data, including cleaning it, re-modeling it, and joining it to other data.
Load: data is stored in its new destination (such as a data warehouse or non-relational database).
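The three steps above can be sketched end to end with only the Python standard library. This is a minimal illustration, not a reference implementation: the CSV content, the `orders` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for the destination.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file, an API, or a database.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
"""

def extract(text):
    """Extract: read raw rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and convert field types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # cleaning: skip incomplete records
        cleaned.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return cleaned

def load(rows, conn):
    """Load: write the transformed rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```

Keeping each step as a separate function makes it possible to swap the source or destination without touching the business logic in the transform step.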
In the extract step, the first task is to understand where useful source data is stored and what format or formats it is in. Processes are then implemented to access it, for example through recurring nightly batch jobs, in real time, or triggered by specific events or actions.
In the transform step, source data is cleaned, formats are changed, and data is aggregated so that it is in the proper form to be stored in a data warehouse or other destination, where it can be used by reporting tools or other parts of the business.
Techniques:
Cleaning
Filtering
Joining
Normalizing
Data Structuring
Feature Engineering
Anonymizing and Encrypting
Sorting
Aggregating
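A few of the techniques listed above (filtering, joining, aggregating, and sorting) can be illustrated on plain Python data structures. The `orders` and `customers` data are made up for the example, and the rule that an amount must be positive is an assumed business rule.

```python
from collections import defaultdict

# Hypothetical extracted records.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 20.0},
    {"order_id": 2, "customer_id": 11, "amount": -5.0},  # fails the assumed rule
    {"order_id": 3, "customer_id": 10, "amount": 7.5},
    {"order_id": 4, "customer_id": 12, "amount": 3.0},
]
customers = {10: "Alice", 11: "Bob", 12: "Carol"}

# Filtering: keep only orders with a positive amount.
valid = [o for o in orders if o["amount"] > 0]

# Joining: attach the customer name by looking up customer_id.
joined = [{**o, "customer_name": customers[o["customer_id"]]} for o in valid]

# Aggregating: total amount per customer.
totals = defaultdict(float)
for row in joined:
    totals[row["customer_name"]] += row["amount"]

# Sorting: largest total first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

At larger volumes the same operations are usually expressed in SQL or a dataframe library, but the logic of each technique is unchanged.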
Example transformations include:
Deriving calculated values based on the raw data.
Re-ordering or transposing the data.
Adding metadata or associating key-value pairs to the data.
Removing duplicate data or adding counts of how often each record occurs.
Encoding or decoding the data to match destination requirements.
Validating the data.
Performing search and replace functions on the data.
Changing the schema of the data, for example from text to values or IDs.
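Several of the example transformations above can be combined in one pass. The sketch below derives a calculated value, removes duplicates while counting occurrences, validates records, and replaces a text field with an ID. The record layout, the `COUNTRY_IDS` lookup, and the validation rules are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical raw records; the second is an exact duplicate of the first.
raw = [
    {"sku": "A1", "qty": 2, "unit_price": 3.0, "country": "DE"},
    {"sku": "A1", "qty": 2, "unit_price": 3.0, "country": "DE"},
    {"sku": "B2", "qty": 1, "unit_price": 9.5, "country": "FR"},
]
COUNTRY_IDS = {"DE": 1, "FR": 2}  # assumed lookup table: text code -> ID

def validate(rec):
    """Validation: positive quantity, non-negative price, known country."""
    return rec["qty"] > 0 and rec["unit_price"] >= 0 and rec["country"] in COUNTRY_IDS

# De-duplication with counts: identical records collapse into one key.
counts = Counter(tuple(sorted(r.items())) for r in raw)

result = []
for key, occurrences in counts.items():
    rec = dict(key)
    if not validate(rec):
        continue
    result.append({
        "sku": rec["sku"],
        "total": rec["qty"] * rec["unit_price"],    # derived value
        "country_id": COUNTRY_IDS[rec["country"]],  # text -> ID
        "occurrences": occurrences,                 # count of duplicates
    })
print(result)
```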
Approaches:
Schema-on-write.
Schema-on-read.
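The difference between the two approaches is mainly a question of when the schema is enforced. In the sketch below, schema-on-write validates records before they are stored and rejects those that do not conform, while schema-on-read stores the raw input as-is and applies the schema only when the data is queried. The `SCHEMA` definition, the `conform` helper, and the sample records are hypothetical.

```python
import json

# Hypothetical schema: field name -> type used to coerce the value.
SCHEMA = {"user_id": int, "email": str}

def conform(record):
    """Coerce a record to SCHEMA; raises if a field is missing or invalid."""
    return {name: typ(record[name]) for name, typ in SCHEMA.items()}

incoming = [
    '{"user_id": "7", "email": "a@example.com"}',
    '{"email": "b@example.com"}',  # missing user_id
]

# Schema-on-write: enforce the schema before storing; bad records are rejected.
typed_store, rejected = [], []
for line in incoming:
    try:
        typed_store.append(conform(json.loads(line)))
    except (KeyError, ValueError, TypeError):
        rejected.append(line)

# Schema-on-read: store everything raw; apply the schema at query time.
raw_store = list(incoming)

def read_with_schema(store):
    for line in store:
        try:
            yield conform(json.loads(line))
        except (KeyError, ValueError, TypeError):
            continue  # non-conforming records are skipped by this reader only

print(typed_store)
print(list(read_with_schema(raw_store)))
```

With schema-on-read the non-conforming record is still in `raw_store`, so a later reader with a different schema could still use it; with schema-on-write it never reaches the store.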
In the load step, the transformed data is stored in one or more places where applications, reporting tools, and other business processes can access it, such as unstructured object stores, text files, or complex data warehouses.
The process varies widely depending on the nature of the business requirements and the applications and users the data serves.
Techniques:
Full loading.
Incremental loading.
Scheduled loading.
On-demand loading.
Batch and stream.
Push and pull.
Parallel and serial.
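As an illustration of the first two techniques, the sketch below contrasts a full load, which replaces the whole target table, with an incremental load, which writes only rows changed since the last load. It assumes each source row carries an `updated_at` value that increases on every change (a common but not universal convention); the `target` table and the function names are hypothetical, and an in-memory SQLite database stands in for the destination.

```python
import sqlite3

def full_load(conn, rows):
    """Full loading: discard the target contents and reload everything."""
    conn.execute("DELETE FROM target")
    conn.executemany(
        "INSERT INTO target (id, value, updated_at) VALUES (?, ?, ?)", rows
    )
    conn.commit()

def incremental_load(conn, rows):
    """Incremental loading: upsert only rows newer than the stored watermark."""
    watermark = conn.execute(
        "SELECT COALESCE(MAX(updated_at), 0) FROM target"
    ).fetchone()[0]
    new_rows = [r for r in rows if r[2] > watermark]
    conn.executemany(
        "INSERT OR REPLACE INTO target (id, value, updated_at) VALUES (?, ?, ?)",
        new_rows,
    )
    conn.commit()
    return len(new_rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT, updated_at INTEGER)"
)

# Initial full load.
full_load(conn, [(1, "a", 100), (2, "b", 110)])

# Later, the source has one changed row (id 1) and one new row (id 3).
source = [(1, "a2", 120), (2, "b", 110), (3, "c", 130)]
loaded = incremental_load(conn, source)

rows = conn.execute("SELECT id, value, updated_at FROM target ORDER BY id").fetchall()
print(loaded, rows)
```

A watermark on `updated_at` does not detect rows deleted at the source; handling deletions typically needs a periodic full load or a separate change log.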