A Painter in a Print Shop: My Struggle with Databricks
This is a cry into the void. A single man against a freight train. As I have become fond of saying, “A painter in a print shop.” I don’t think I can stop Databricks. In a way, what they have done is brilliant and scratches a very real itch within organizations. They have built a single platform that provisions compute resources, manages databases, shows schemas, supports programming in multiple languages, and handles dashboarding, automation, monitoring, and more.
So, why would we ever stop?
Well, this is a bit of a nuanced topic. Databricks itself is only partially to blame. They made some things easy, far too easy in some cases. It’s borderline trivial to create a compute instance that runs 24/7, costing $90 a day, that does ABSOLUTELY NOTHING. Is this Databricks’ fault? Not really. They created an easy-to-use tool, and people will use it. Monitor it appropriately, and there won’t be any issues.
So What is Databricks Doing?
In short – everything. It has integrated storage, AWS/Azure compute management, querying, dashboarding, and scheduling all in one tool. More than that, each of those pieces is mostly okay on its own.
Painter in a Print Shop
Databricks creates bad habits by making things easy.
![[old_man_yells_at_cloud.png]]
This is where I feel like a “painter in a print shop.” Yes, you can program on Databricks, running your code line by line like a data scientist investigating data. In fact, that’s the exact person who should be doing that. But once you start using a notebook to create automated programs, that’s where I get uneasy. This isn’t a problem limited to Databricks – ANY automated code written using notebooks is bad practice. The issue comes back to how easy Databricks makes it to use their mediocre editor. It’s there, it works, I get paid whether we do it well or poorly. Right?
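To be clear about the alternative (a sketch with made-up names, not anything Databricks-specific): the same logic can live in a plain Python module with a real entry point, so it can be versioned, tested, and scheduled like any other program.

```python
# daily_report.py - a hypothetical job refactored out of a notebook.
# The logic is illustrative; the point is the structure, not the content.

import argparse
from datetime import date


def build_report(run_date: date) -> str:
    """Pure function: easy to unit test without any cluster."""
    return f"report for {run_date.isoformat()}"


def run(run_date: date) -> None:
    report = build_report(run_date)
    print(report)  # in real life: write to storage, not stdout


if __name__ == "__main__":
    # A real CLI entry point means any scheduler can run this,
    # not just a notebook "Run All".
    parser = argparse.ArgumentParser()
    parser.add_argument("--date", type=date.fromisoformat, default=date.today())
    args = parser.parse_args()
    run(args.date)
```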
And another thing! Not everyone needs a database that can expand to petabytes of storage! Database selection is a bit of an art – different databases have different I/O limitations, handle race conditions differently, and have extensions that can make them better for certain applications. Databricks gives you one option, and that’s it, much like the editor. Once again, it’s there, it works, and most engineers have neither the time, knowledge, nor motivation to build something better suited.
But surely the clusters are good? If I understand correctly, Databricks takes a slice of whatever we are paying for clusters. It pays for the compute – which it gets from some cloud provider – and pockets the rest. Which means, once again, it’s there, it works, and no one is asking questions about why I need 500 cores to run some enormous query.
Okay, What’s the Solution?
Ah, and here we come back to the nuance I mentioned at the beginning. Each of the issues above could justify its own team of support people. In an ideal world, engineers would have exactly that: a nearby support team they could ask for resources, security audits, and more. What a dream.
In reality, these people are hard to come by even in well-run tech organizations. Everyone wants to store and process data, and no one wants to wait around for support tickets.
I can’t avoid the freight train that is Databricks, but there are things I (and everyone else) can do to play nicely while keeping our programming dignity intact.
Get Everything Running Locally
I didn’t refuse to learn MATLAB just to get vendor-locked on my compute resources. I learned Python so I could build whatever I wanted and run it wherever I wanted. Databricks will fall, and we’ll all be scrambling to run our code on the next great thing. At the very least, running it locally allows programmers to use their testing and debugging tools to write better, faster code.
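Most Databricks jobs are PySpark under the hood, so a minimal sketch of what “run it locally” can look like is just isolating where the SparkSession comes from (assuming `pyspark` is installed locally; the helper name is my own invention):

```python
# spark_session.py - one place to get a SparkSession, local or otherwise.
# Assumes `pip install pyspark` for local runs; the helper name is made up.

from pyspark.sql import SparkSession


def get_spark(app_name: str = "my_job") -> SparkSession:
    # Locally, this builds a single-machine session using all cores.
    # On Databricks, a session already exists, so getOrCreate() returns
    # it and the local master setting has no effect.
    return (
        SparkSession.builder
        .master("local[*]")
        .appName(app_name)
        .getOrCreate()
    )
```

Everything downstream imports `get_spark()` and stops caring where it runs.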
Pin Down Your Queries and Stop Messing With Them
If we’re going to have to run a huge number of cores to query a huge amount of data, nail down the queries and only run them when you absolutely have to. Even waiting for the clusters to spin up can waste minutes of my day. Minutes I could have spent, I don’t know, staring at the wall in despair – but because I want to, not because a cluster made me.
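One low-tech way to pin queries down (a sketch; the table and column names are invented): keep them as named constants in a version-controlled module, so changing one means a code review instead of a quick notebook edit.

```python
# queries.py - pinned, version-controlled queries. Names are invented.

DAILY_EVENTS = """
    SELECT event_date, user_id, event_type
    FROM analytics.events
    WHERE event_date = '{run_date}'
"""


def daily_events(run_date: str) -> str:
    # Parameters get filled in deliberately, in one place,
    # instead of the SQL being edited ad hoc in a cell.
    return DAILY_EVENTS.format(run_date=run_date)
```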
Speaking of not waiting for queries…
Get Sample Data Locally
Keep the data from the queries and use it to build the rest of the code. This isn’t rocket science – even rocket science isn’t this painful. Sure, refresh the stored data occasionally to make sure the code still works, but running massive queries every time you want to test a change is ugly. In fact, you could keep multiple versions of the small data sample to test against.
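A minimal sketch of that caching step, with placeholder paths and row limits (and, again, made-up names):

```python
# sample_data.py - cache a small slice of an expensive query for local dev.
# The file path and row limit are arbitrary placeholders.

from pyspark.sql import DataFrame, SparkSession

SAMPLE_PATH = "data/events_sample.parquet"


def refresh_sample(spark: SparkSession, query: str, n_rows: int = 1000) -> None:
    # Run the expensive query once, keep only a small slice of the result.
    df = spark.sql(query).limit(n_rows)
    df.write.mode("overwrite").parquet(SAMPLE_PATH)


def load_sample(spark: SparkSession) -> DataFrame:
    # Everything downstream develops against this cheap local copy.
    return spark.read.parquet(SAMPLE_PATH)
```

Which brings me to my last point.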
Test Locally and Test Often
Automated tests. These are something like the holy grail of programming, and I’ve seen teams implement them with wildly varying degrees of success. If I have to explain why you need automated tests, I don’t know, read one of the many books written on the subject. Maybe I’ll write another post about testing code.
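As a taste, here’s a tiny pytest sketch that runs against a local session with no cluster anywhere in sight. The transformation under test, `daily_active_users`, is a made-up stand-in for whatever your job actually does:

```python
# test_transform.py - run with `pytest`; no cluster required.
# The transformation is a hypothetical stand-in, not a real job.

import pytest
from pyspark.sql import SparkSession, functions as F


def daily_active_users(df):
    # Hypothetical transformation under test.
    return df.groupBy("event_date").agg(F.countDistinct("user_id").alias("dau"))


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[*]").appName("tests").getOrCreate()


def test_daily_active_users_counts_each_user_once(spark):
    df = spark.createDataFrame(
        [("2024-01-01", "a"), ("2024-01-01", "a"), ("2024-01-01", "b")],
        ["event_date", "user_id"],
    )
    result = daily_active_users(df).collect()
    assert result[0]["dau"] == 2
```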
Down with Databricks…?
No, Databricks is fine, but as programmers, we should strive to program well and maintain good practices. Just because something is easy doesn’t mean it is the best way to do things. Databricks makes bad practices a click away.
And no, this doesn’t mean my code or practices are perfect. Very far from it.