Use these methods to orchestrate your pipelines

Build better workflows and ensure data quality in Databricks

Hector Andres Mejia Vallejo
4 min read · May 31, 2022

This post builds on my previous article on continuous training pipelines, but it can also be read on its own. Having covered the components of a training pipeline and the stack used to implement them (MLflow, Delta Lake, Hyperopt, and scikit-learn), we are now going to integrate and fully automate them. For that, I will introduce some new and exciting Databricks features. At the end of the article, I also discuss Apache Airflow as an alternative for more general workflows that are not restricted to Databricks.

Let’s illustrate the components of our training pipeline:

Continuous training pipeline. Made with ❤ by me.

To automate the data ETL we are going to introduce Delta Live Tables (DLT), and for the machine learning components we are going to use Databricks Workflows.

Data ETL with Delta Live Tables

DLT is a very useful feature in the Databricks ecosystem because it allows us to integrate all of our data ETL components in a Directed Acyclic Graph (DAG) and execute them periodically or continuously. In addition, it lets us:

  • Track data lineage
  • Monitor data quality after implementing checks
  • Define behavior if data does not pass the checks

It is a great tool for promoting good data governance practices with very little effort. You can read more about DLT here. Let's look at how to declare DLT tables using Python.
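The following is a minimal sketch; the source path, table names, and columns are hypothetical and only illustrate the shape of the declarations.

import dlt
from pyspark.sql.functions import col

# Bronze table: the function name becomes the table name.
@dlt.table(comment="Raw training data ingested from cloud storage.")
def raw_training_data():
    return spark.read.format("json").load("/mnt/landing/training/")

# Silver table with expectations attached as data quality checks.
@dlt.table(comment="Clean records ready for feature engineering.")
@dlt.expect("positive_amount", "amount > 0")             # keep failing rows, just record the violation
@dlt.expect_or_drop("valid_label", "label IS NOT NULL")  # drop rows that fail the check
def clean_training_data():
    return dlt.read("raw_training_data").select(col("label"), col("amount"))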

To declare a table, we define a function with the @dlt.table decorator; the function's name also becomes the name of the table. The same logic applies to views with @dlt.view. Note that the @dlt.expect decorator is used to define a data quality check called an expectation, and we can create as many expectations as we need. Once the pipeline is established, Databricks renders the DAG in the UI.

A render of a sample data pipeline DAG in DLT.

You can see that Databricks presents metrics for the records that do not pass the checks and shows the state of each stage in the DAG. This is how we build automated data pipelines and ensure data quality.

Machine Learning Components with Databricks Workflows

Recently, Databricks introduced Workflows. They replace the old Databricks Jobs, allowing developers to create multiple tasks and declare dependencies among them. In addition, these workflows can be executed periodically or on demand.

In the context of our training pipeline, each component (model training, model registry, and model publishing) has its own notebook, and each notebook will represent a single task in the workflow.

A workflow can be created using the Databricks Jobs UI. However, to make it versionable, we can instead declare the tasks, their dependencies, and the task arguments in a JSON configuration file:
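A trimmed sketch of such a file is shown below. The notebook paths, cluster ID, and pipeline ID are placeholders; the structure follows the Jobs API 2.1 multi-task format.

{
  "name": "continuous-training-pipeline",
  "tasks": [
    {
      "task_key": "data_etl",
      "pipeline_task": { "pipeline_id": "<your-dlt-pipeline-id>" }
    },
    {
      "task_key": "model_training",
      "depends_on": [{ "task_key": "data_etl" }],
      "notebook_task": { "notebook_path": "/Repos/ml/train_model" },
      "existing_cluster_id": "<your-cluster-id>"
    },
    {
      "task_key": "model_registry",
      "depends_on": [{ "task_key": "model_training" }],
      "notebook_task": { "notebook_path": "/Repos/ml/register_model" },
      "existing_cluster_id": "<your-cluster-id>"
    },
    {
      "task_key": "model_publishing",
      "depends_on": [{ "task_key": "model_registry" }],
      "notebook_task": { "notebook_path": "/Repos/ml/publish_model" },
      "existing_cluster_id": "<your-cluster-id>"
    }
  ]
}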

and then use Databricks Jobs CLI 2.1 to create the workflow:

databricks jobs create --json-file db_workflow.json

It will return something like:

{ "job_id": 246 } // 246 is just an example

Then you can execute it by entering:

databricks jobs run-now --job-id 246

Notice in the snippet that the first component, data_etl, is the Delta Live Tables pipeline we previously defined. You can execute a DLT pipeline as a task inside a workflow, which is very convenient. You can read more about Databricks Workflows here and about how to structure the JSON configuration here.

Now, if you go to the UI, you will see the training pipeline rendered, and you can track the state of each task. You may also start to wonder: if we can build pipelines from JSON and update them with the Jobs CLI, can we plug in a CI/CD tool to test the pipeline and update it automatically? Yes, we can implement all sorts of automation and adapt it to our data needs (see the short example after the screenshot below).

A sample workflow on Databricks.
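For instance, assuming the job already exists, a CI step could re-apply the versioned JSON with the Jobs CLI, overwriting the job's settings with the contents of the file:

databricks jobs reset --job-id 246 --json-file db_workflow.json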

Alternative: Apache Airflow as the orchestrator

Apache Airflow is an open-source technology that lets you easily build data pipelines around the same DAG concept through a Python interface. An immediate benefit of defining your pipelines as Python code is that they are versionable. You can read more about Apache Airflow and see a quick example here.
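To give a flavor, here is a minimal sketch of an Airflow DAG with two dependent tasks; the commands are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="sample_etl",
    start_date=datetime(2022, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Placeholder commands; a real pipeline would call your ETL code here.
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")

    extract >> transform  # declare the dependency, i.e. the edge of the DAG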

This option applies to more general data pipeline implementations that use other technologies. However, Airflow also provides dedicated Databricks operators (such as DatabricksRunNowOperator) to include Databricks tasks inside an Airflow DAG. See here for more information.
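As a sketch, assuming the apache-airflow-providers-databricks package is installed and a databricks_default connection is configured, the workflow we created earlier could be triggered from Airflow like this:

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="trigger_databricks_training",
    start_date=datetime(2022, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # 246 is the job_id returned by the Jobs CLI earlier; replace it with your own.
    run_training = DatabricksRunNowOperator(
        task_id="run_training_workflow",
        databricks_conn_id="databricks_default",
        job_id=246,
    )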

In addition, Airflow implements extra SQL operators to ensure data quality if you have other relational databases where you want to maximize observability and minimize data downtime.

An example render of a DAG in Airflow.

I hope this article is useful to you. Thank you for taking the time to read it. Please show some support by clapping for the article or leaving a constructive comment. What did I miss? What could I do better?

See you next time!
