Key Takeaways:
- Apache Airflow is an open-source workflow management solution that was first built at Airbnb.
- It is scalable, flexible and extensible, and is often used by data engineers to schedule and manage complex workflows/pipelines.
- At Merit Group, our data engineers use Airflow to visualize data pipeline dependencies, check logs, trigger tasks and monitor workflows end to end.
Airbnb, the vacation rental company, operates as a marketplace that connects 4 million hosts with 1 billion guests across the globe. Hosts on Airbnb own and manage rental properties in over 100,000 cities. Overall, there are about 7 million properties listed in the Airbnb marketplace, which means that several hosts manage multiple properties.
Brian Chesky, CEO of Airbnb, has repeatedly spoken about Airbnb’s deep focus on technology and innovation. The company has also built a world-class data engineering layer that plays a key role in enabling data-driven decision making. As the company’s data platform scaled, they realised that there was a critical need for a next-generation workflow management solution and decided to build what became Apache Airflow.
Apache Airflow enables data engineers and data scientists to programmatically define and deploy data pipeline components, their dependencies, and their execution order using the directed acyclic graph (DAG) model and a set of APIs. The pipelines are developed in Python using familiar constructs.
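To make this concrete, here is a minimal sketch of how such a pipeline can be declared, assuming Airflow 2.x; the dag_id, task names and the print_hello callable are illustrative placeholders rather than part of any real pipeline:

```python
# A minimal sketch of an Airflow DAG (Airflow 2.x style).
# The dag_id, task names and print_hello callable are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def print_hello():
    """Trivial task body used only to illustrate a callable task."""
    print("Hello from Airflow")


with DAG(
    dag_id="example_pipeline",        # unique name of the pipeline
    start_date=datetime(2024, 1, 1),  # first logical date the scheduler considers
    schedule_interval="@daily",       # run once per day
    catchup=False,                    # do not backfill missed intervals
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=print_hello)
    transform = PythonOperator(task_id="transform", python_callable=print_hello)
    load = PythonOperator(task_id="load", python_callable=print_hello)

    # The >> operator declares dependencies between tasks,
    # which is what encodes the execution order as a DAG.
    extract >> transform >> load
```

The `>>` operator at the end is what captures the dependencies and execution order described above; Airflow then takes care of running each task only when its upstream tasks have succeeded.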
Automating Data Pipeline Workflows
Typically, data processing is automated using batch jobs that need to be scheduled to run at specific times. These jobs may depend on existing datasets and/or have other jobs that depend on them. As data volumes increase, so does the number of batch jobs, along with the underlying data infrastructure, resulting in a complex network of computation jobs that needs to be orchestrated. Such a network can prove challenging to manage or even understand.
These networks of jobs, which typically form DAGs (directed acyclic graphs), share some key properties:
- Scheduled: Jobs need to run at specific intervals
- Mission-critical: If certain jobs fail to run, there are real business consequences
- Evolving: As the company grows, the team and its data processing mature with it
- Heterogeneous: To keep pace with the rapid strides being made in modern analytics, companies run multiple, integrated systems
Businesses try workarounds, beginning with schedulers such as cron, then moving on to home-grown scripts, frameworks and so on. These reactive solutions are focused on processing and analysing data rather than on managing workflows. As a result, authoring, scheduling and troubleshooting jobs becomes harder, leading to inefficiencies and lost productivity.
Apache Airflow, on the other hand, was developed as a workflow engine and therefore facilitates the scheduling and running of complex data pipelines. Tasks are executed in the right sequence with the required resources, it scales well, and its intuitive user interface allows issues to be spotted and fixed without wasting time.
Read about other advanced tools recommended by Merit’s data engineering experts for powering and optimising your BI Stack.
Key Features and Capabilities of Apache Airflow
Some of the features that make Airflow popular amongst data engineers are:
Ease of Use: Workflows are defined in Python, an easy language to learn and use, so everything from simple to complex workflows can be created quickly and flexibly
Open Source: Since it is free and open source, there is a large community of active users around it to provide support
Robust Integrations: The community also provides ready-to-use operators so that Airflow can work with Google Cloud Platform, Amazon Web Services (AWS), Microsoft Azure and more (see the sketch after this list)
Intuitive User Interface: The web UI makes monitoring and managing workflows, and checking the status of completed and ongoing tasks, straightforward
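As an illustration of these ready-made integrations, the sketch below uses the S3-to-Redshift transfer operator from the community-maintained Amazon provider. It assumes the apache-airflow-providers-amazon package is installed; the bucket, schema, table and connection IDs are placeholders, not real resources:

```python
# A sketch of a ready-made provider operator, assuming the
# apache-airflow-providers-amazon package is installed.
# Bucket, schema, table and connection IDs below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="s3_to_redshift_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_orders = S3ToRedshiftOperator(
        task_id="load_orders",
        s3_bucket="my-data-lake",             # placeholder S3 bucket
        s3_key="orders/{{ ds }}/orders.csv",  # templated with the logical date
        schema="analytics",                   # placeholder Redshift schema
        table="orders",                       # placeholder target table
        copy_options=["CSV"],
        aws_conn_id="aws_default",            # connections registered in Airflow
        redshift_conn_id="redshift_default",
    )
```

The operator itself handles the COPY from S3 into Redshift, so the pipeline author only declares the source, target and connections rather than writing custom transfer code.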
Components of Apache Airflow
The components of Apache Airflow that make it effective as a workflow automation tool include:
Directed Acyclic Graph (DAG): Defined in a Python script, the DAG is the collection of tasks to be run, organised so that the relationships and dependencies between tasks are explicit
Web Server: The user interface, built on Flask, lets users monitor the status of DAGs and trigger them
Metadata Database: A database that stores the status of all tasks and serves the read/write operations of a workflow
Scheduler: Schedules the execution of DAGs, and retrieves and updates task statuses in the metadata database
Some Common Use Cases
Airflow has a variety of uses across industries. It can be used for cleansing and organising data, checking data quality, and publishing data to the data warehouse. It can also help with the effective maintenance of data infrastructure through database scrapes, folder cleanup, applying data retention policies, and so on.
At Airbnb, it was used to measure growth by computing metrics around guest and host engagement. It was also effectively used to apply rules for targeted email campaigns for improving customer engagement.
In addition to computing metrics related to search ranking, Airbnb found Airflow useful for computing the logic and aggregates behind its A/B testing experimentation frameworks.
Ad hoc, unscheduled workloads can also be run, though Airflow is better suited to scheduled runs.
Advantages of Airflow
Some of the benefits of using Airflow over a traditional scheduler are:
- It is open source and hence free as well as supported by a large community of users
- There are multiple ways to monitor workflows from the UI, and emails can be sent on task failures or delays
- Airflow can be used to track the source of data and its movement, which is especially useful when multiple tasks read from and write to storage
- It has sensors, operators that wait for a specific criterion or condition to be met before letting downstream tasks run (see the sketch after this list)
- This is an extensible platform that provides access to commonly used Big Data tools such as Hive, HDFS, Postgres, S3, Presto, etc., and allows users to create custom Operators, Hooks, Executors and UI as plugins
- SQL queries can be run against registered connections, and simple charts can be built on top of the results
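As an example of the sensor mechanism mentioned above, the sketch below uses the built-in FileSensor to wait for a file to land before a downstream task runs. It assumes Airflow 2.x; the file path, connection ID and processing callable are placeholders:

```python
# A minimal sketch of a sensor, assuming Airflow 2.x.
# FileSensor waits for a file to appear before downstream tasks run;
# the file path and processing callable are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def process_file():
    print("File has landed; process it here.")


with DAG(
    dag_id="sensor_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/daily_extract.csv",  # placeholder path
        fs_conn_id="fs_default",  # filesystem connection registered in Airflow
        poke_interval=300,        # re-check every 5 minutes
        timeout=6 * 60 * 60,      # give up after 6 hours
    )
    process = PythonOperator(task_id="process", python_callable=process_file)

    wait_for_file >> process
```

Here the downstream `process` task only starts once the sensor reports that the expected file exists, which is the pattern behind condition-driven pipelines.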
About Merit Group
At Merit Group, we work with some of the world’s leading B2B intelligence companies like Wilmington, Dow Jones, Glenigan, and Haymarket. Our data and engineering teams work closely with our clients to build data products and business intelligence tools. Our work directly impacts business growth by helping our clients to identify high-growth opportunities.
Our specific services include high-volume data collection, data transformation using AI and ML, web watching, and customized application development.
Our team also brings to the table deep expertise in building real-time data streaming and data processing applications, and our data engineering capability is especially useful in this context. Our data engineering team brings to the fore specific expertise in a wide range of data tools including Airflow, Kafka, Python, PostgreSQL, MongoDB, Apache Spark, Snowflake, Redshift, Athena, Looker, and BigQuery.
If you’d like to learn more about our service offerings or speak to an Airflow expert, please contact us here: https://www.meritdata-tech.com/contact-us/
Related Case Studies
- A Hybrid Solution for Automotive Data Processing at Scale: Automotive products needed millions of price points and specification details to be tracked for a large range of vehicles.
- A Unified Data Management Platform for Processing Sports Deals: A global intelligence service provider was facing challenges due to the lack of a centralised data management system, which led to duplication of data, increased effort and the risk of manual errors.