Alternatives to dbt (Data Build Tool)

Update 3/11/2024


Top dbt Alternatives


What is dbt?

(If you are already familiar with dbt, skip to Considering Alternatives)

Data transformation is an integral part of the modern data stack, and dbt (data build tool) has emerged as a frontrunner in this space. This tool is unique in that it focuses exclusively on the 'T' in ELT (Extract, Load, Transform), setting it apart from traditional tools that encompass the entire end-to-end pipeline.

According to their latest YouTube video, titled “What is dbt”, dbt Labs tells us they built dbt on two core beliefs:

1 - Transformation logic (i.e. business logic) should be defined in code.  This is SQL, jinja, YAML, and more recently Python.  dbt is by all means a code-first platform, and seems like they have no intention of jumping on the low-code bandwagon.

2 - Treat data assets like a product.

This second point has been a differentiator for dbt.  They are the first to bring software engineering best practices to data pipelines.  This includes version control via Git, automated testing of your data pipelines, and deployment via CI/CD.  It also simplifies the workflow of working in environments (Dev, QA, Prod), and gives a very clear way to migrate through the environments.


But what IS dbt?  dbt is an open source transformation tool that allows anyone comfortable with SQL to author their own data pipelines.  When it comes down to it, dbt is just a clever way to organize and document your SQL files, visualize the data lineage via a DAG (Directed Acyclic Graph), and execute very specific segments of the DAG or pipeline.  Additionally, dbt is very tightly integrated with Git, so code is version controlled and has a defined release process.  dbt encourages, by does not require use of CI/CD for deployjment.  dbt is compatible with almost any database, and offers jinja functions to ensure cross-database compatibility.

To dive deeper on the question “What is dbt”, I recommend this YouTube video by Seattle Data Guy. He does a great job of comparing the old workflow in SSIS to the new workflow in dbt, highlighting why dbt represents a paradigm shift.

dbt Core

The “Core” offering, called dbt Core, is an open source python library.  You install it with `pip`, just like any python library, but that is where the Python ends. You can use dbt with zero Python knowledge; the Python is just responsible for executing SQL, displaying the documentation and other backend stuff.

dbt Core allows an analyst familiar with SQL to express complex transformations in the form of the SQL `SELECT` statement.  Instead of writing DDL (create table) and DML (Insert, update, delete), the analyst will simply create the SELECT statement and dbt will handle the rest.

Table models will re-create an entire table from scratch every time.  When you want to insert new rows, use an incremental model.  If you need a slowly changing dimension, dbt has an out of the box solution for that called dbt Snapshot.

dbt Cloud

dbt Cloud is dbt’s commercial product.  It is a fully managed environment which hosts dbt Core for you and provides these extra features:

  • Scheduling and pipeline observability.

  • Documentation browsing (or now “dbt Mesh”).

  • Cloud based IDE aptly named “dbt Cloud IDE”.

  • Git integration (GitHub, GitLab, etc.), with a simplified “Git with guardrails” workflow.  This makes it easy for beginners to get started using Git.

  • CI/CD Integrations.

  • API for calling jobs externally, or querying logs, etc.

dbt Bottom Line

dbt has become more or less a standard in the data transformation space and is the only open source tool in the modern data stack achieve this status.  While the market for Extract / Load tools, Integration tools, and Orchestration tools is quite crowded, dbt is currently the king of code-based transformation.

Personally, dbt has completely transformed (pun intended) the way I work. I cannot imagine doing a data engineering project without dbt. It is a common reaction for people to hear about dbt and say “why do I need that?”. This was also my first reaction too! You have to use dbt to get the first-hand “wow factor”. The first time you build a big project with 100+ dependencies and tests, it is a thrilling experience to watch it run what it can in parallel and wait for certain objects to run in series. You realize how much power and ease they have given you.


Considering Alternatives

If you are looking for an alternative, it is important to ask yourself if you are looking for an alternative to dbt Cloud (you want to use dbt but not their commercial product), or, if you are looking for a complete alternative that doesn’t use dbt under the hood.  This article highlights both pure alternatives to dbt and alternative ways to run dbt outside of dbt Cloud.


Transformation Only dbt Alternatives

Do It Yourself

This is a typical design pattern for DIY transformations:

  • Create stored procedures to manage transformations (insert, update, delete).

  • Likely, the stored procedure will be fed by many SQL views, or the procedures themselves will create or replace tables along the way.

  • If you are very advanced, a migration tool such as “Flyway” can be used to manage the various environments and deploy the code to Production.

  • Schedule it somehow:

    • Airflow or similar orchestrator.

    • Cron jobs to schedule the procedures.

    • If you are using Snowflake, it is common to use Snowflake Tasks to schedule the stored procedures.

Advantages to DIY

  • Full control.

  • No vendor lock-in.

Disadvantages to DIY

There are a few problems with this approach, many of which are the reason why dbt was invented.

  • Documentation is decoupled from the transformation itself.  Where is it anyway?

  • It is hard to understand the data lineage.

  • It is hard to know if everything is running on schedule.  (Depending on which scheduling approach you did).

  • You’re probably not version controlling everything, unless you fall into that advanced category.

  • Going from Environment to Environment is a lot harder.

  • You are not likely to come up with a better solution than using a tool.

While this is a valid approach and many teams are still doing this today, I would encourage you to find an alternative to this dbt alternative!  (Perhaps consider using dbt? 🙂)


Coalesce: A direct “transformation only” competitor

Coalesce (https://coalesce.io/) is a transformation tool that sits on top of Snowflake.  Like dbt, they are one of the few focusing exclusively on Transformation, but there are a few key differentiators.

Coalesce is very GUI focused, rather than being code-first, which speeds up the workflow tremendously.  For example, you can create all staging models for an entire schema with a few clicks, then add your code based transformations (calculations, etc).  The entire workflow continues with similar ease as you create your facts, dimensions, and reporting models.  They have struck a perfect balance between GUI and the power of code, enabling both technical and non-technical users to efficiently manage and transform data.

Advantages and Key Features of Coalesce

Here are some features I am particularly impressed with.

  • Automatically propagate a new column to all (or selected) downstream models.

  • Unnest JSON with no code.  If you have done this by hand like I have, you know how time saving this will be!  I have my own dbt Macro for this, but I love the out-of-the-box-done-for-you approach here!

  • Table models have truncate + insert instead of replacing.  This preserves Snowflake’s time travel in a more accessible way.

  • GUI for pre-SQL.

  • All tests and documentation are defined in GUI (unless you love YAML).

  • Column level lineage.

  • For projects of massive scale, they have great features for searching and sifting through the docs or DAG.

  • GUI environment management (dev schemas, etc).

  • They have their own take CI/CD for deployment.

  • Built from the ground up for Snowflake.  This means GUI features are available for many of Snowflake's key features, such as Transient tables, Dynamic Tables, External Tables, Streams, to name a few.

Disadvantages Coalesce

Here are what I consider to be the cons of Coalesce:

  • It is missing a native scheduler.  They expect you to have your own scheduler, or use Stored Procedures + Snowflake Tasks to schedule jobs.  However, they are working on building a scheduler.

  • Double Vendor lock-in: since it is not an open source framework, it is hard to take your transformations elsewhere.  And since it only sits on Snowflake, you won’t be able to switch warehouses easily.  You can, of course, easily get the resulting SQL from the transformations.  But it is not as “portable” as something like dbt.  To be fair, the only way to avoid vendor lock-in is to roll your own tools, which is not something I recommend.

Coalesce Quick Comparison to dbt:

  • Coalesce is for Snowflake only while dbt works with all databases.

  • Very similar in nature to dbt, solving a very similar problem to dbt.

  • GUI focus speeds up development while code is still version controlled in Git.

Coalesce Wrap Up

It is hard to describe the power of Coalesce in a blog post (as it is similarly hard to describe dbt); it is a very impressive tool with a lot of power and time saving tricks.

My best advice is to sign up for a demo and see for yourself.

Sign up for a demo, here: https://coalesce.io/request-demo/

SQLMesh

SQLMesh is a true alternative to dbt Core.  Like dbt Core, it is an open source python framework that helps you manage execution of SQL files in DAG order.  But that is almost where the similarities end.

It is backwards compatible with dbt, meaning you can bring your own dbt project and run SQLMesh.  But when coding from the ground up, we operate a little differently.  We don’t use {{source}} and {{ref}}:  One key differentiator of SQLMesh is that it can inspect your SQL to figure out your DAG without needing source and ref macros.

Another key differentiator is the idea of Virtual Data Environments, which ensures a table is never built more than once.  Instead of each developer creating their own dev environment (such as `dbt_jeffskodlberg` schema), SQLMesh’s environment management system will know to use `prod` upstream sources when editing a SQL file.  When you promote your changes from `dev` to `prod`, there is no need to build the `prod` table.  SQLMesh will automatically point to the asset you built during development.

Advantages of SQLMesh:

  • No need for YML files, but they are supported.

  • In-line documentation in the SQL file, instead of separate docs.

  • No need for source / ref macros.  Everything is simplified.

  • They have built a beautiful UI which is an IDE and DAG viewer.

  • The UI has Column Level Lineage

Disadvantages of SQLMesh:

  • There is a bit of a learning curve.  The environment management is just different enough that it will take a day or two to switch from dbt.  But certainly anyone who can learn dbt can learn SQLMesh.


Full ETL Suites

Moving on from “Transformation Only” tools, we need to look at tools that offer data movement + transformations.


Informatica

Informatica is a long-standing dynasty in the ETL space.  They are extremely popular with Fortune 500 companies and companies due to their monolithic nature; it is a single platform that handles all aspects of data integration, transformation, data quality, data governance, and API management.

Informatica has two main products in the ETL space, Informatica Power Center and Informatica Intelligent Cloud Services (IICS).  While there are important differences between the two products, for the sake of this discussion, suffice it to say that Power Center is the traditional, self hosted version and IICS is the cloud PaaS version of the product.  IICS offers additional features focused on cloud integration, but overall the two products are quite similar.

IICS Quick Comparison to dbt

  • If you want to do in-database transformations as dbt does, IICS sells connectors separately for each database.  You can pay a handsome fee for a Snowflake or Big Query connector.  (Pricing is consumption based, hard to estimate.)

  • IICS is capable of doing in-memory transformations, in the more traditional ETL approach.

  • The data lineage will not be as clear to non-technical users compared to dbt.

  • IICS is very expensive. The pricing is not transparent, but is on the high end of the spectrum. You also pay for additional connectors; a Snowflake connector may cost an extra $25,000.


Microsoft SQL Server Integration Services (SSIS)

SQL Server Integration Services (SSIS) is a versatile ETL tool from Microsoft that serves as an alternative to dbt. Unlike dbt, which focuses on in-warehouse transformations using SQL, SSIS provides a comprehensive suite of tools for data extraction, transformation, and loading (ETL). It excels in handling complex data integration workflows and offers a rich set of built-in tasks and transformations. SSIS is particularly strong in scenarios involving diverse data sources, including SQL Server, Oracle, and file-based sources. It's a great option for businesses already invested in the Microsoft ecosystem. However, SSIS's user interface and deployment process can be more complex compared to the simplicity of dbt's code-centric approach.

SSIS Quick Comparison to dbt:

  • More than just a transformation tool.

  • Historically transformations are done before loading (in memory of the SSIS server), although options exist to do in-database transforms.

  • Data Lineage will not be as clear to a non-technical user.

  • Documentation will be held outside of the platform.


Azure Data Factory:

Azure Data Factory (ADF) is a cloud-based data integration service that offers a broader range of functionalities compared to dbt. While dbt specializes in SQL-based data transformations within a data warehouse, ADF provides extensive capabilities for both data movement and transformation across various environments. It supports integrating data from diverse sources, both in the cloud and on-premises. ADF's integration with Azure services enhances its big data processing capabilities. However, ADF's GUI-driven approach contrasts with dbt's code-centric method, making it a better fit for scenarios requiring more complex data integration workflows beyond the scope of in-warehouse transformations.

ADF Quick Comparison to dbt:

  • ADF is a cloud-native integration platform, providing the entire spectrum of ETL / ELT.

  • Documentation tends to be external in ADF, whereas with dbt the documentation is alongside the code.

  • Data lineage is clear via the GUI but not accessible to non-technical users.

  • The deployment process in ADF relies on Azure’s deployment and release pipelines instead of relying on Git based CI/CD.


Matillion

Matillion was a very early player in the Cloud ETL / ELT space.  It was the first cloud-native ELT platform that I had heard of in the 2010’s (back when SSIS basically ruled my world).  It was designed to leverage the power of cloud data warehouses like Snowflake, Redshift, and BigQuery, offering a scalable solution for data extraction, loading, and transformation.  It is a comprehensive data integration platform covering dozens of possible sources, targets and transformations.

Recently Matillion added support to run dbt and Git directly inside of the platform.  Previously they had their own approach to transformation logic, but now I’m seeing a lot of clients use Matillion for integration, then using it to kick off dbt jobs within the platform.

Matillion Quick Comparison to dbt:

  • Matillion is a full integration platform, not just a transformation tool.

  • Follows the ELT design pattern.  All transformations are done in-database.

  • If using Matillions native transformation functionality, the data lineage is not quite as clear as it is with dbt.

  • Running dbt inside of Matillion will not come with a way to serve the dbt Docs.  You’ll need to host the docs elsewhere.

  • Running dbt inside of Matillion has certain limitations.  For example, specifying which environment (target.name variable) you want to run in is currently not possible.  This requires cumbersome work-arounds.

Alteryx

Alteryx is an interesting anomaly.  It is one of my favorite platforms to work in for very rapid transformations, analysis, complex data preparation, etc.  They can connect to dozens of sources and write to dozens of targets and accomplish any transformation you can think of.  Yet, they do not market themselves as an enterprise ETL solution; rather they are positioned as a business user friendly “data prep” tool and “data science” tool.  Alteryx is geared towards rapid insights, and rapid data automation processes, typically combining Excel with database sources, and writing to non-traditional targets like Tableau Server.  Their focus on speed and flexibility make it a bit of a pain point for large, control focused IT organizations.

Advantages of Alteryx

  • Input data from dozens of sources, output to dozens of destinations

  • Intuitive GUI based transformations.  They offer dozens if not hundreds of low-code “tools” to wrangle data.

  • Extremely powerful, flexible, and fun to work with.

  • Business user friendly, although this is for the business user with the engineer mindset.

  • Capable of data science, machine learning, and report creation.

  • Extremely customizable via macros and custom apps.

  • Can be used as an orchestrator and scheduler.

Disadvantages of Alteryx:

  • They were very late to the cloud and in fact, still not very cloud friendly.  

    • The S3 input tool feels like an afterthought, leading you to customize Alteryx.

    • The “Designer” software is desktop native, still no full cloud replacement.

  • No Git integration or version control.  (Although Alteryx Gallery saves prior versions)

  • Because they market towards business users with no data engineering experience, often messy, spaghetti-like workflows are created, leading to technical debt.  (Yes, these suboptimal workflows will become mission critical, with one business person supporting them).

Alteryx Quick Comparison to dbt:

  • Alteryx is more than just transformations, it is an end-to-end analytics and data prep platform.  (And could be considered an integration and orchestration platform).

  • Alteryx is Low Code vs dbt is code-first.

  • No git integration.

  • Documentation is done via text-box-like objects in the workflow itself.


Alternatives to dbt Cloud

dbt Core as a bolt-on offering

There are several companies out there that are leveraging the power of dbt Core within their own proprietary products.  There are too many to cover in detail, and in fact I don’t think I can successfully list them all.  Here is a list of popular tools which allow you to orchestrate (or run) dbt jobs:


Y42

We have saved the best for last!

Y42 is a data orchestration platform that packages best in breed data engineering tools under one roof and provides a ton of value-add on top of these tools.  Y42 uses Airbyte and cData connectors to ingest data into your warehouse, and hosts a managed instance of dbt Core for you.

Y42 has created a proprietary technology called “Virtual Data Build” (VDB) which allows you to version control your data along with your code.  When you roll back to a prior commit, you also roll back the data, automatically.

Advantages of Y42:

  • Beautiful, fun to use UI which speeds up the development process.

  • Best in breed ingest technology (Airbyte, cData) alongside best in breed transformation (dbt).

  • Runs Python throughout the platform: custom Python Ingest, orchestrate arbitrary Python Actions (like Airflow), fully featured Python dbt models.

  • Environment management:

    • When you branch main, you do not need to create a “development environment” in Y42 due to how VDB works. Each branch automatically creates an isolated environment, using zero compute out of the box!

    • When you merge back onto main, you don’t need to worry about CI/CD pipelines. Y42 will automatically switch to the latest version of the assets.

    • VDB saves 30% compute cost due to environment abstraction.

    • Contact me for a more detailed explanation

  • Almost all native dbt functionality exists, so it is easy to transition from dbt to Y42.

  • Column level lineage is available when you move beyond the $0 per month plan.

  • Due to the GUI nature and fully managed dbt, this could be the easiest way to get started using dbt.

  • cData brings a lot of Enterprise connectors including NetSuite, SAP, Oracle SCM, Workday, to name just a few. These are coming to Y42 soon.

  • Option to use GUI or Code to define documentation, tests, and asset metadata.

  • The orchestration pipelines (job schedules, etc.) are also version controlled.

  • Robust Asset Health monitoring.  At a glance view of status of each data asset.

  • Version control code and data together.

  • Automatically roll back data when dbt test fails!

Disadvantages of Y42

  • In order for VDB to work, Y42 has replaced dbt commands with y42 commands. Instead of dbt bulid, you would run y42 build. This would not be a disadvantage, except they have removed a few native dbt commands.

  • It does not have the Semantic Layer that comes with dbt Cloud.

  • Updates to dbt Core will not be available immediately in Y42.

Contact Y42

https://www.y42.com/book-a-call/discovery


Datacoves 

Datacoves is a company that solves the problem of hosting the open source data stack software.  They manage a full data stack for you, including Airbyte, dbt, Airflow, and VS Code.  Because they are hosting dbt on your behalf, and you can easily schedule DAGs with their hosted Airflow, Datacoves can be considered a valid alternative to dbt Cloud.

Datacoves is a very enticing alternative for those who are looking for the conventional tools in the Modern Data Stack, but don’t want to go through the trouble of hosting each tool.

Advantages of Datacoves:

  • Uses native / out of the box dbt.

  • Hosted VS Code plays nice with dbt and Airflow.  It provides a cloud IDE to work with your open source dbt.

  • Bundling dbt with with Airflow & Airbyte can be very powerful.

Disadvantages:

  • It may be cheaper to host everything yourself, but certainly not easier.

  • When compared to dbt Cloud, Datacoves would not have dbt Explorer or dbt’s semantic layer.

Wrap Up

The Modern Data Stack is made up of thousands of tools and it is impossible to touch on all dbt alternatives in a single article, but we covered the most popular alternatives here.  Remember, when looking for a dbt alternative, it is important to understand if you are looking for something to just replace the Transformation step in ELT, or if you are looking for a full ELT suite.  Your motives and business needs are unique, and there is no one-size-fits-all answer.  If you’d like bespoke advice, please reach out to me!

Previous
Previous

Snowpark Container Services: Quick Tutorial Running Metabase

Next
Next

dbt Tip of the Day Series