
An Intro to Data Ingestion and Data Transformation

Big data is defined by the three Vs: velocity - the speed with which the data is generated; variety - the different types of data; and volume - the size of the data. Because of these characteristics, some of the daily tasks of a data engineer are to ensure that pipelines execute correctly, that incoming data is stored in the system, and that the data is accessible to whoever wants to use it. In this article we dive into the most common strategies for pipelines (ETL - extract, transform, load - and ELT - extract, load, transform) and for storage (data warehouse and data lake).

Data Lifecycle

Data goes through different stages, starting with creation, where the job is to collect and centralize data generated by various sources. After collection, the data needs to be processed by applying different transformation strategies (formatting, labeling, filtering, and so on). Processed data is then stored in a destination, where it is ready to be used by analytics and visualization applications.

What Is a Pipeline?

Because of all of these stages that data goes through, engineers started to create pipelines. A pipeline is a series of steps where the output of one step is the input of the next, and independent steps can be executed in parallel. A pipeline has a single direction and contains no cycles; each step represents an operation applied to the incoming data. You can picture a pipeline as a directed acyclic graph (DAG). The following image shows an example of a pipeline for data collected from sensors.

[Image: data pipeline for sensor data]
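To make the DAG idea concrete, here is a minimal sketch of such a sensor pipeline written as plain Python functions. The step names (read_sensors, clean, aggregate, load) and the in-memory sink are illustrative assumptions, not part of any particular framework.

# A toy linear pipeline: each step consumes the output of the previous one.
def read_sensors():
    # Extract: pretend these readings just arrived from two sensors.
    return [{"sensor": "s1", "temp_c": 21.7}, {"sensor": "s2", "temp_c": None}]

def clean(readings):
    # Transform: drop readings with missing values.
    return [r for r in readings if r["temp_c"] is not None]

def aggregate(readings):
    # Transform: compute a summary over the cleaned readings.
    temps = [r["temp_c"] for r in readings]
    return {"avg_temp_c": sum(temps) / len(temps), "count": len(temps)}

def load(summary, sink):
    # Load: append the result to the destination (a plain list here).
    sink.append(summary)

sink = []
load(aggregate(clean(read_sensors())), sink)
print(sink)  # [{'avg_temp_c': 21.7, 'count': 1}]

In an orchestrator such as Airflow or Dagster each of these functions would become a task, but the shape stays the same: a directed chain (or graph) of steps with no cycles.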

There are two main strategies for ingesting data into a big data platform: batch and streaming. Batch ingestion is used when a large amount of data enters the system at once (for example, a set of files). To store this chunk of data, a pipeline is scheduled at a suitable interval to run over the incoming information.
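As a rough sketch of batch ingestion, the snippet below assumes CSV files land in an incoming/ directory and that a scheduled job (triggered manually here) loads whatever has arrived since the last run; the directory name, file format, and in-memory destination are all illustrative.

import csv
from pathlib import Path

def ingest_batch(incoming_dir, destination):
    # Scheduled batch job: process every file that landed since the last run.
    for path in sorted(Path(incoming_dir).glob("*.csv")):
        with path.open() as f:
            for row in csv.DictReader(f):
                destination.append(row)  # stand-in for a real load step
        path.rename(path.with_suffix(".done"))  # mark the file as processed

destination = []
ingest_batch("incoming", destination)  # in production a scheduler would run this call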

On the other hand, streaming ingestion is used when working with real-time data (for example, messages pushed into the system every couple of seconds). In this case a publisher/subscriber design pattern is typically applied, and the processing pipelines run continuously to ensure that the incoming data is stored in the system.
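Here is a minimal sketch of the publisher/subscriber idea, using Python's built-in queue module as a stand-in for a real message broker (Kafka, a cloud pub/sub service, and so on); the message shape is an assumption.

import json
import queue

broker = queue.Queue()  # stand-in for a message broker topic

def publish(message):
    # Producer side: push a JSON-encoded message onto the topic.
    broker.put(json.dumps(message))

def consume(sink, max_messages):
    # Subscriber side: a real streaming pipeline would loop forever;
    # here we stop after a fixed number of messages.
    for _ in range(max_messages):
        event = json.loads(broker.get())
        sink.append(event)  # stand-in for storing the event

publish({"sensor": "s1", "temp_c": 22.4})
publish({"sensor": "s2", "temp_c": 19.8})
sink = []
consume(sink, max_messages=2)
print(sink)  # the two events, stored in arrival order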

ETL versus ELT

ETL - Extract, Transform, Load | ELT - Extract, Load, Transform
Data is stored after different transformations are applied to it | Data is stored directly in the sink; the transformation happens after saving it
Commonly used with a data warehouse | Commonly used with a data lake
Needs a lower-capacity destination because data is already structured and ready to use | Needs a higher-capacity destination for saving all the raw data

ETL and ELT are both data integration processes used to move data from multiple sources into a single, centralized target data store. The main difference between them is when the data is loaded into the sink and when the transformation step happens. Because of these differences, they are preferred for different use cases.

ETL pipelines are commonly used for ingesting data into a data warehouse, because the raw data is transformed into a structured format before it is stored. On the other hand, ELT pipelines are commonly used with cloud-based data lakes, which benefit from the storage and processing power the cloud provides.

ETL

Extract, Transform, and Load (ETL) is a data integration methodology that extracts raw data from the sources, transforms the data, and then loads it into a sink. The method emerged in the 1970s and remains prevalent among on-premises databases with limited memory and processing power.

The extracted data only moves from the processing server to the data warehouse once it has been successfully transformed. This strategy helps when you want to shrink the volume of data that needs to be stored by applying the transformations before loading.
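Here is a minimal ETL sketch along those lines, using an in-memory sqlite3 database as a stand-in for the warehouse; the source records, table name, and columns are made up for the example.

import sqlite3

def extract():
    # Extract: raw records from a source system (illustrative data).
    return [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": None}]

def transform(rows):
    # Transform before loading: drop incomplete rows and cast types.
    return [(r["order_id"], float(r["amount"])) for r in rows if r["amount"]]

def load(rows, conn):
    # Load: only cleaned, structured rows reach the warehouse table.
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT * FROM orders").fetchall())  # [(1, 19.99)]

Because the transformation happens before the load, the incomplete record never takes up space in the destination.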

Data Warehouse

A data warehouse is a central data hub used for reporting and analysis. Data in a data warehouse is typically highly formatted and structured for analytics use cases. It’s among the oldest and most well-established data architectures.

The key features of this approach are: data is organized, may have been transformed, and is stored in a structured way. Data exists in the warehouse for a defined purpose, and in a format where it is ready to be consumed. Data may be easier to understand, but less up-to-date. Structures are hard to change.

You can imagine a data warehouse like a store where every product is well organized on a shelf, ready to be consumed.

[Image: data warehouse]

ELT

Unlike ETL, Extract, Load, and Transform (ELT) does not require data transformation to take place before the loading process. ELT loads raw data directly into a target data hub, instead of loading it after the transformation phase finishes.

ELT is relatively new, made possible by the advent of scalable cloud-based data lakes. Cloud providers supply the storage and processing power needed to keep raw data repositories and run the transformations inside the target.

Although the ELT pipeline is not used universally, the method is becoming more popular as companies adopt cloud infrastructure.
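For contrast, here is a minimal ELT sketch that mirrors the ETL example above, again with sqlite3 standing in for the target; it assumes the bundled SQLite exposes the JSON1 functions (json_extract). The raw payloads are loaded as-is into a staging table, and the transformation runs inside the target afterwards.

import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Load first: store the raw records untouched (as JSON text) in a staging table.
raw = [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": None}]
conn.execute("CREATE TABLE staging_orders (payload TEXT)")
conn.executemany("INSERT INTO staging_orders VALUES (?)",
                 [(json.dumps(r),) for r in raw])

# Transform later, inside the target, once the structured view is needed.
conn.execute("""
    CREATE TABLE orders AS
    SELECT json_extract(payload, '$.order_id') AS order_id,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount
    FROM staging_orders
    WHERE json_extract(payload, '$.amount') IS NOT NULL
""")
print(conn.execute("SELECT * FROM orders").fetchall())  # [(1, 19.99)]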

Data Lake

Instead of imposing tight structural limitations on data, the idea of a data lake is to dump all of your data—structured and unstructured—into a central location. The key features of this approach are: the data lake contains all the raw, unprocessed data, before any kind of transformation or organization; the data is more up to date, but may require more advanced tools for analysis; and no structure is enforced, so new types of data can be added at any time.

You can imagine a data lake like a giant pool into which any form of data can be thrown.

[Image: data lake]
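As a small illustration of "no structure is enforced", the sketch below uses a local directory tree as a stand-in for object storage (S3, GCS, and so on) and drops two unrelated kinds of raw data under partitioned paths; every name and path here is made up for the example.

import json
from pathlib import Path

lake_root = Path("lake")  # local directory standing in for an object store

def dump(source, date, filename, payload):
    # No schema is enforced: any format can be written under a partitioned path.
    target = lake_root / source / date
    target.mkdir(parents=True, exist_ok=True)
    (target / filename).write_text(payload)

dump("sensors", "2024-01-01", "readings.json",
     json.dumps({"sensor": "s1", "temp_c": 21.7}))
dump("webshop", "2024-01-01", "clicks.csv", "user_id,page\n42,/home\n")
# Resulting layout: lake/sensors/2024-01-01/readings.json and lake/webshop/2024-01-01/clicks.csv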

Data Warehouse versus Data Lake

Data Warehouse | Data Lake
Data is highly structured | Data can be in any format (raw or structured)
Data is ready to be used for analysis | Data needs to be processed before it can be used
Needs a lower disk capacity for storing data | Needs a higher disk capacity for storing data
Commonly used in on-premises infrastructure | Commonly used in cloud solutions

A data warehouse and a data lake are both data storage and management systems, but they have different architectures and are used for different purposes.

A data warehouse satisfies the need for a centralized repository for storing and analyzing structured data, supporting business intelligence (BI) and decision-making.

A data lake, on the other hand, is a centralized repository for storing and processing vast amounts of raw, unstructured data for big data and data science workloads.

Summary

The presented techniques are some of the most common patterns used for ingesting and storing data in a system. Each has its pros and cons; the right choice depends on the needs you have to fulfill. Maybe you have an on-premises system that can't store large amounts of raw data, and you need to apply some transformations before saving it to shrink its volume. Or you may need all the unprocessed data stored in a central repository so that you can process it later, for example when training a machine learning (ML) model.

Conclusion

As you can see, different business scenarios and technical requirements call for different frameworks. Since the choice depends on each individual project, it is always best to analyze your existing infrastructure and business needs to see which of the above best suits your business plan and can help you overcome your challenges in analyzing data. The first steps in a big data project implementation are auditing and technical consultancy.


About the Author

With a degree in Computer Science from the Politehnica University of Bucharest, Erhan Mamut worked as a Full Stack Engineer for three years before embarking on his data engineering journey. He enjoys stepping outside his comfort zone and learning new things, which led him to a new career path in big data, a decision he is still thoroughly excited about one year later.