I got an assignment the other day to produce documentation to send to a customer. The extraction of the table names required to execute a certain Databricks notebook was part of the task. The plan was to build an object dependency tree.
The query spanned 279 lines. How can you extract only the table names from a file without having to manually look for them? Can we make use of this technique again in the future?
During data engineering projects I tend to try and minimize the tools being used. I think it’s a good practice. Having too many tools causes sometimes for errors going unnoticed by the teams members.
One of the advantages of having a tool like Databricks is that it allows us to use all the power of python and avoid, like I did in the past, to have something like Azure Functions to compensate for the limitations of some platform.
I’m currently working on a project where I’m adapting a code base of Databricks notebooks for a new client. There are a few errors to hunt but the Web UI is not really friendly for this purpose.
Just wanted a quick and easy way to not have to click around to find the issues.
Here’s a quick script to just do that:
import os, json import configparser from databricks_cli.sdk.api_client import ApiClient from databricks_cli.
There’s not an official function to calculate workdays in Databricks. Here are some solutions.
Having a DimCalendar with holidays and at least Databricks Runtime 12 If you have a DimCalendar in the system you can now do LEFT LATERAL
JOIN without the correlated subquery errors when using non-equality predicates. Check SPARK-36114 for more details.
Calculating working days is then as simple as run a query like:
SELECT mt.StartDate, mt.EndDate, dc.Workdays FROM myTable mt LEFT JOIN LATERAL (SELECT COUNT(DateKey) - 1 AS Workdays FROM dimCalendar dc WHERE mt.