The world is awash with data, but much of it is unusable. Whether stored in proprietary formats, unstructured or noisy, the best resources frequently need to be processed before they can deliver business insights.
This was a painstaking process in the early days of IT, but Extract, Transform, Load (ETL) tools, which have been common since the 1970s, simplify the task of ingesting data, converting it to a usable format and storing it in the recipient's managed data warehouse.
The importance of ETL automation tools can't be overstated. Without clean data to work from, business processes will be inefficient and sub-optimal, and enterprises themselves will be unable to derive actionable insights from raw metrics, or act on what those metrics are telling them with any degree of certainty.
What is data transformation in ETL?
Every system that generates data is influenced by the strictures of its creators. The oldest systems still in use would frequently have been optimized to write the most compact data possible. The aim was both to reduce the space consumed, at a time when storage was expensive, and to minimize the number of processor cycles required to process it when computer time was paid for by the minute or byte.
Moreover, frequently “the data warehouse’s database was designed to support the analytical functions required for the business intelligence and business insight function,” explains the team behind Shadowbase. Its design was “typically a highly structured database with complex indices to support online analytical processing (OLAP)… Databases configured for OLAP allowed complex analytical and ad hoc queries with rapid execution time.”
Today, when limits on storage and processing are looser, there’s a tendency towards more human-readable data, with metadata increasing the immediate value of data points by making them browsable and easier to manipulate, compare and contrast.
This is possible thanks to consistent increases in processing speed, paired with a reduction in the price of data storage. Simultaneously, increased use of the cloud facilitates ad hoc resource allocation, making it possible to increase the number of containers and processors, or the quantity of available bandwidth, whenever ETL tools or subsequent processing require it.
What would you use data transformation for?
ETL automation tools aren’t only used to save legacy data from obsolescence. We are generating more data than ever before, and much of it needs immediate analysis to provide insights that help enterprises make timely decisions.
A common example would be the processing of sales metrics and retail stock levels, which may arrive at a central point of collection from a broad range of point-of-sale devices spread across a region, nation or continent. These third-party devices will frequently use incompatible schemas, so ETL automation tools will be used to impose a degree of conformity before the output can be used to derive actionable insights.
Transporting data between systems frequently involves multiple steps, such as removing unnecessary fields, standardizing or expanding abbreviations, and applying descriptors. However, as Oracle outlines, "often, it is not possible to identify the specific subset of interest [for the business], therefore more data than necessary has to be extracted, so the identification of the relevant data will be done at a later point in time."
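As a rough illustration of these transformations, the Python sketch below maps records from two hypothetical point-of-sale feeds onto one canonical schema, drops fields the warehouse doesn't need, and expands abbreviated payment codes into readable descriptors. The vendor names, field mappings and abbreviation table are all invented for the example.

```python
# A minimal transform step. The source names, field mappings and abbreviation
# table are illustrative assumptions, not a real vendor schema.

PAYMENT_CODES = {"cc": "credit_card", "csh": "cash", "gft": "gift_card"}

FIELD_MAPS = {
    "vendor_a": {"txn_id": "transaction_id", "qty": "quantity",
                 "amt": "amount", "pmt": "payment_method"},
    "vendor_b": {"id": "transaction_id", "units": "quantity",
                 "total": "amount", "tender": "payment_method"},
}

def transform(record: dict, source: str) -> dict:
    """Map a raw record onto the canonical schema, dropping unmapped fields."""
    canonical = {target: record[raw]
                 for raw, target in FIELD_MAPS[source].items()
                 if raw in record}
    # Expand terse payment codes into human-readable descriptors.
    if "payment_method" in canonical:
        canonical["payment_method"] = PAYMENT_CODES.get(
            canonical["payment_method"], canonical["payment_method"])
    return canonical

print(transform({"txn_id": "A-1001", "qty": 3, "amt": 17.50,
                 "pmt": "cc", "register_fw": "2.1"}, "vendor_a"))
print(transform({"id": "B-2002", "units": 1, "total": 4.99,
                 "tender": "csh"}, "vendor_b"))
```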
More advanced processing has historically also been carried out at this stage, such as calculations that derive entirely new data from existing values. Examples include ranking elements on the basis of frequency or value, or calculating the interval between a defined moment and a timestamp within the data.
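A brief pandas sketch of those two derived values, where the column names and the reporting cutoff are assumptions made for the example rather than anything prescribed:

```python
import pandas as pd

sales = pd.DataFrame({
    "product": ["shirt", "jeans", "shirt", "hat"],
    "amount": [25.0, 60.0, 30.0, 15.0],
    "sold_at": pd.to_datetime(["2024-05-01 09:15", "2024-05-01 11:40",
                               "2024-05-02 10:05", "2024-05-02 16:30"]),
})

# Rank each product by its total sales value (1 = highest).
totals = sales.groupby("product")["amount"].sum()
sales["value_rank"] = sales["product"].map(totals.rank(ascending=False, method="dense"))

# Interval between a defined moment (here, an assumed reporting cutoff)
# and the timestamp on each record.
reporting_cutoff = pd.Timestamp("2024-05-03 00:00")
sales["hours_before_cutoff"] = (reporting_cutoff - sales["sold_at"]).dt.total_seconds() / 3600

print(sales)
```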
Where data is ingested from multiple sources, it may also be necessary to de-duplicate records that appear in each set, combine elements from each if the index matches but the fields differ, or split a single record across multiple tables if that better suits the end-use requirements.
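With pandas, those three operations might look something like the sketch below; the source tables, keys and column names are invented purely to show the shape of the work.

```python
import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"]})
billing = pd.DataFrame({"customer_id": [1, 2, 3],
                        "plan": ["basic", "pro", "basic"],
                        "country": ["UK", "US", "DE"]})

# 1. De-duplicate records that appear more than once in a source.
crm = crm.drop_duplicates(subset="customer_id")

# 2. Combine fields from each set where the index (customer_id) matches.
customers = crm.merge(billing, on="customer_id", how="outer")

# 3. Split the single wide record across two tables to suit the end use.
contact_table = customers[["customer_id", "email"]]
subscription_table = customers[["customer_id", "plan", "country"]]

print(contact_table)
print(subscription_table)
```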
Devising an effective ETL process has therefore always required careful consideration, a clear view of how the processed data will be used and an understanding of how the structure that this requires differs from its original format. Without these, the data recipient would do better to store the original resource in an unstructured data lake, and apply ETL on a per-project basis.
ETL is changing
In the pre-cloud era, when resources were scarce and expensive, ETL was the only option, as the three steps of extracting, transforming and loading could be run in concert, with the load process kicking off when the first batch of transformed data arrived, even if extraction was still underway at the start of the chain. This was particularly important when the data was effectively streamed, as sales data from a chain of stores would be, since there is no clear start and end point in the process.
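The pipelining idea is easiest to see in miniature. In the toy Python sketch below, built on generators rather than any real extraction source or warehouse, the load step starts writing batch 1 while batch 2 has not yet been extracted:

```python
# Stand-in pipeline: the source (an in-memory list) and sink (print) are
# placeholders for a real store feed and warehouse.

def extract(batches):
    for batch in batches:          # in practice: poll stores, read a queue, etc.
        print(f"extracted batch {batch['id']}")
        yield batch

def transform(stream):
    for batch in stream:
        batch["rows"] = [row.upper() for row in batch["rows"]]  # trivial transform
        yield batch

def load(stream):
    for batch in stream:
        print(f"loaded batch {batch['id']}: {batch['rows']}")   # in practice: write to the warehouse

incoming = [{"id": 1, "rows": ["a", "b"]}, {"id": 2, "rows": ["c"]}]
load(transform(extract(incoming)))  # output interleaves extract and load steps
```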
Despite the overall process being designed to work as a system, Talend explains how, "ETL tools are dedicated platforms for the intermediate steps between extracting data and loading it into target repositories." Each of these platforms must be maintained, and a breakdown in one can impact the others.
However, this architecture is changing.
Today, we can call on exponentially more powerful data processing tools, and the infrastructure to support them, which gives enterprises the option of not only ETL, but ELT: loading the data onto the eventual host system and performing the transformation in situ. This is particularly relevant when using systems like Hadoop to run multiple data processing threads in parallel, and will become more important as data continues to be generated in ever-larger quantities by IoT devices, without human input.
It also offers significant benefits in terms of simplification, since there are fewer discrete steps in the process.
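A minimal ELT sketch, using sqlite3 purely as a stand-in for a cloud data warehouse and with invented table and column names: the raw records are loaded first, untouched, and the transformation is then expressed as SQL that runs where the data already lives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute("CREATE TABLE raw_sales (store TEXT, amt_pence INTEGER, sold_at TEXT)")

# Load: push the extracted records in as-is, with no intermediate transform step.
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", [
    ("london", 1750, "2024-05-01T09:15:00"),
    ("leeds",   499, "2024-05-01T11:40:00"),
    ("london",  999, "2024-05-02T10:05:00"),
])

# Transform in situ: derive a cleaned, aggregated table inside the warehouse.
conn.execute("""
    CREATE TABLE daily_store_sales AS
    SELECT store,
           date(sold_at)          AS sale_date,
           SUM(amt_pence) / 100.0 AS total_gbp
    FROM raw_sales
    GROUP BY store, date(sold_at)
""")

for row in conn.execute("SELECT * FROM daily_store_sales ORDER BY store, sale_date"):
    print(row)
```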
As Matillion outlines, continuing to rely on traditional ETL tools, many of which were built for bespoke or on-premises data warehouse infrastructure, can cause bottlenecks. But, “cloud native ELT (instead of ETL) is built to leverage the best features of a cloud data warehouse: elastic scalability as needed, massively parallel processing of many jobs at once, and the ability to spin up and tear down jobs quickly.”
This is a boon for enterprises, and for end users, who will gain access to relevant data more quickly. As they become more data-savvy, we'll also see a trend towards self-service data preparation, where those users employ graphical interfaces to produce dashboards that monitor key metrics in real time. Built within no-code or low-code environments, these dashboards will unlock the full potential of the organization's data by making it immediately useful, without routing requests through central analysis, programming or support departments.
The end result will be users performing personal ETL on a far smaller scale, to transform data that’s been processed once already into the format they need to perform their ongoing, day-to-day work.