Databricks vs. Snowflake: when Koalas and Raccoons write SQL

Joe Maraschiello
6 min read · Aug 2, 2023


Koalas (Phascolarctos cinereus) are an example of an ecological specialist. They feed only on the leaves of eucalyptus trees in eastern and southern Australia, and some koalas specialize further, eating leaves from just a pair of specific trees. In contrast, the raccoon (Procyon lotor) is an ecological generalist; it’s the unofficial mascot of my hometown of Toronto, and I can personally attest to its ability to thrive in a wide variety of environments and on a wide variety of foods (including in my own backyard).

In recent years, we’ve witnessed the ongoing transformation of two competing analytical platforms: Snowflake and Databricks. Both have been working to evolve from specialized solutions into analytical generalists.

Like the eucalyptus-snacking koala, both solutions have origins in their respective areas of specialization. Snowflake’s roots are in the architecture we associate with data warehouses and languages like SQL. Databricks’ lineage, on the other hand, traces back to the “Big Data” technologies that emerged from the Hadoop ecosystem and eventually metamorphosed into Python on Spark.

As an ecological generalist, there is a lot to admire about the raccoon. Raccoons are omnivorous and can scarf down everything from fruit and nuts to insects, frogs, eggs, and (their favorite) whatever they can find in my waste bin. Any platform that can manage a comparable variety of data and analytics use cases just as easily is an impressive one.

Both Databricks and Snowflake have taken major strides toward presenting their platforms as increasingly generalist solutions akin to my unwanted backyard tenant, the humble raccoon. In this article, I’ll make my case for why Databricks is currently ahead in the evolution toward analytics generalization. I’ll also explain the strategies my peers used to test for and manage situations where specialization mattered, and where it didn’t.

Everybody Loves SQL

SQL is the language of business analytics, and it has been that way for some time.

Early in my professional career, I was enchanted by the allure of SQL. It allowed me to tap into troves of enterprise data and easily identify new insights and opportunities. Writing SQL was the equivalent of having superpowers compared with my less technical peers.

SQL remains the preferred language for BI (business intelligence) teams, especially when used in conjunction with the latest visualization tools.

Whether by proximity or coincidence, the ETL (extract, transform, load) jobs written by data engineers to power data warehouses are also commonly written in SQL. Although newer jobs are more likely to be written in other languages, there is no shortage of legacy jobs written in SQL.

Since SQL isn’t going anywhere (especially for BI workloads), the ideal analytics platform should excel when working with the language.

Can Raccoons Digest Bamboo?

Could a raccoon, under extreme circumstances, adapt to digest bamboo? That’s a question better answered by zoologists, but we do know that this ecological generalist’s adaptability and flexibility are virtues.

Let’s see how Databricks fares as a prospective analytics generalist. Can Databricks digest SQL (or “bamboo”)?

If we were talking about Databricks in 2013, perhaps not so well (remember, Databricks was born from the world of Apache Spark and the “Big Data” ecosystem). In 2023, a number of Databricks adaptations have evolved to make working with SQL a complete experience.

Contemporary Databricks SQL capabilities and features include:

  • SQL warehouse clusters
  • Serverless SQL warehouse
  • Query building & dashboarding optimized for SQL (e.g. syntax auto-complete, sample data, visualizations)
  • Query history for finding areas to optimize
  • Powerful SQL functions unique to Databricks that ease working with nested and semi-structured data (e.g. EXPLODE, TRANSFORM, REDUCE; see the sketch after this list)
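
To make that last bullet concrete, here’s a minimal sketch of those functions at work. The orders table, its columns, and the values are all hypothetical, and the queries are run through PySpark only to keep the example self-contained; the same SQL statements run as-is in the Databricks SQL editor.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nested-sql-demo").getOrCreate()

# Hypothetical sample data: one row per order, with an array of line-item prices.
spark.createDataFrame(
    [("order-1", [10.0, 20.0, 5.0]), ("order-2", [7.5, 2.5])],
    "order_id STRING, prices ARRAY<DOUBLE>",
).createOrReplaceTempView("orders")

# EXPLODE turns each array element into its own output row.
spark.sql("""
    SELECT order_id, EXPLODE(prices) AS price
    FROM orders
""").show()

# TRANSFORM maps a lambda over an array; REDUCE folds an array down to a
# scalar (REDUCE is a synonym of Spark SQL's AGGREGATE function).
spark.sql("""
    SELECT order_id,
           TRANSFORM(prices, p -> p * 1.13) AS prices_with_tax,
           REDUCE(prices, 0D, (acc, p) -> acc + p) AS order_total
    FROM orders
""").show()
```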

Taken together, these changes add up to a high “quality of life” for BI analysts and data engineers alike. Anyone writing and running SQL queries on contemporary Databricks is now a first-class citizen.

Yet, all the SQL creature comforts in the world wouldn’t mean much if they weren’t backed up by solid performance. To compete with the ecological specialists, the generalists need to keep up in terms of raw query performance.

Best Paw Forward: Query Optimizer

During the early 2000s, I remember the battle between legacy and (at the time) new data warehousing vendors. I was astonished by how much query performance could vary from vendor to vendor, how decisively a newer vendor could outperform an older one, and how effective levers such as query “hints” could be.

At the heart of each of these competing data warehousing technologies, like DB2, Oracle, and Netezza, is the SQL query optimizer. The optimizer contains each vendor’s secret sauce for breaking down a query plan and returning results as quickly as possible. In many cases, a smarter query optimizer could easily outperform a less sophisticated one, even with a hardware handicap.

As a descendant of legacy data warehousing technologies, Snowflake can be expected to have a SQL query optimizer that improves upon past performance. Indeed, Snowflake’s performance is perceived very positively within data analytics circles. It’s understandable that many prospective clients put Snowflake head-to-head against Databricks’ SQL performance to ensure they’re not leaving anything on the table.

This pursuit of raw performance is what drove clients undergoing analytics modernization projects to perform real-world testing unique to their environments.

The Proof Is in the Bamboo Pudding

My peers helped a prospective Databricks financial services client evaluate Databricks’ SQL performance with the following strategy:

  • Migrate all SQL workloads to Databricks (these were largely ETL jobs)
  • Where possible, maintain the original SQL code (avoid rewriting it)
  • Review performance and identify long-running jobs (there were four that looked like they would benefit from optimization)
  • Convert the four long-running jobs to PySpark and run them on a low-cost cluster (they completed in a fraction of the time; see the sketch after this list)
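
For illustration, here’s a sketch of what that final conversion step can look like. The job, table, and column names below are hypothetical stand-ins rather than the client’s actual code; the point is simply that the SQL and DataFrame versions express the same transformation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-rewrite-sketch").getOrCreate()

# The original (hypothetical) SQL job, kept as-is during the initial migration:
#
#   INSERT OVERWRITE daily_revenue
#   SELECT region, order_date, SUM(amount) AS revenue
#   FROM transactions
#   WHERE order_date = current_date()
#   GROUP BY region, order_date

# The same job expressed with the PySpark DataFrame API:
daily_revenue = (
    spark.table("transactions")
    .where(F.col("order_date") == F.current_date())
    .groupBy("region", "order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# Overwrite the target table, mirroring the SQL job's INSERT OVERWRITE.
daily_revenue.write.mode("overwrite").saveAsTable("daily_revenue")
```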

In this example, the client was able to identify the specific ETL jobs worth optimizing. This changed how they evaluated the competing analytics platforms. Instead of comparing the two head-to-head in one narrow area of relative performance (i.e. SQL queries), they found a solution that limited rewriting SQL ETL code to the jobs that would benefit most from PySpark.

This client chose to maintain a strategy focused on Databricks as a generalist analytics platform. In their situation, they compromised by rewriting some code for their edge cases, which allowed them to continue maximizing their benefits from the Databricks platform while maintaining adaptability.

Ready to Adapt?

Selecting an analytics platform is a forward-looking exercise. Your organization needs to meet not only today’s needs but tomorrow’s as well.

Today, the immediate need might be to maintain existing SQL ETL jobs, owned by data engineers who are most comfortable with SQL. Tomorrow, those same engineers may be writing in Python, joined by new peers with PySpark experience. This is why building on top of an analytical generalist platform is so crucial.

Your organization might be in the midst of a search for “the best bamboo eater in the world” (i.e. raw SQL query performance). Yet this is not what leading analytics groups are looking for. Instead, they prefer flexible and open platforms that offer the freedom to manipulate the underlying data files and storage. Where possible, they’re also looking to limit the total number of solutions, which simplifies getting value from the cataloging, security, and governance features built around solutions like Databricks’ Unity Catalog.

Are you ready to share your analytics generalist story? Reach out to discuss your own raccoon evolution and where your industry peers are on theirs.


Written by Joe Maraschiello

AI, analytics, marketing technology, and cloud insights from 15+ years of global consulting experience. http://bit.ly/JoeLinkedIN
