As a data enthusiast, I’m always on the lookout for ways to streamline
my workflow and make my life easier. In this post, I’ll be sharing two
Tabular Editor scripts that have saved me countless hours of manual
formatting in my Power BI projects.
There are two versions of this software; to use the scripts below you
only need version 2, which is free and open source:
https://tabulareditor.github.io/TabularEditor/
If you have the money for a license, you can support the author
(Daniel Otykier) by buying version 3: https://tabulareditor.com
I got an assignment the other day to produce documentation to send to
a customer. Part of the task was extracting the names of the tables
required to execute a certain Databricks notebook, with the goal of
building an object dependency tree.
The query spanned 279 lines. How do you extract only the table names
from a file without looking for them manually? And can we reuse the
technique in the future?
During data engineering projects I try to minimize the number of tools
in use; I think it's good practice, since too many tools sometimes let
errors go unnoticed by team members. One of the advantages of a tool
like Databricks is that it gives us the full power of Python, avoiding
the need, as I had in the past, for something like Azure Functions to
compensate for the limitations of some platform.
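Back to the table-name problem: as a sketch of the approach (the regex
and file name are illustrative, not the exact script from the
assignment), a few lines of Python in a notebook cell are enough to
pull every identifier that follows a FROM or JOIN:

import re

# Sketch only: a single regex pass over the query text. It will also
# catch CTE names referenced in FROM clauses, but it beats scanning
# 279 lines by eye.
with open("notebook_query.sql") as f:  # hypothetical file name
    query = f.read()

pattern = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
tables = sorted(set(pattern.findall(query)))

for table in tables:
    print(table)

A parser-based approach (e.g. sqlparse) handles edge cases better, but
for a quick dependency tree this gets you most of the way.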
I’m currently working on a project where I’m adapting a code base of
Databricks notebooks for a new client. There are a few errors to hunt
down, but the Web UI is not really friendly for this purpose. I just
wanted a quick and easy way to find the issues without clicking
around.
Here’s a quick script to do just that:
import os, json
import configparser

from databricks_cli.sdk.api_client import ApiClient
from databricks_cli.runs.api import RunsApi


def print_error(nb_path, nb_params, nb_run_url, nb_error="Unknown"):
    # Keep only the first line of the (often long) error message
    error = nb_error.partition("\n")[0]
    params = json.loads(nb_params) if nb_params != "" else {}
    print(
        f"""
Path:   {nb_path}
Params: {json.dumps(params, indent=2)}
RunUrl: {nb_run_url}
Error:  {error}
"""
    )


# Reuse the host and token already configured for the Databricks CLI
databricks_cfg = "~/.databrickscfg"
conf = configparser.ConfigParser()
conf.read(os.path.expanduser(databricks_cfg))

api_client = ApiClient(
    host=conf["DEFAULT"]["host"],
    token=conf["DEFAULT"]["password"]
)
runs_api = RunsApi(api_client)

# Page through the most recent 100 runs, 25 at a time
for offset in range(1, 101, 25):
    page = runs_api.list_runs(
        job_id=None,
        active_only=None,
        completed_only=None,
        offset=offset,
        limit=25,
        version="2.1",
    )
    for run in page.get("runs", []):
        if run["state"].get("result_state") == "FAILED":
            output = runs_api.get_run_output(run_id=run["run_id"])
            notebook_task = output["metadata"]["task"]["notebook_task"]
            if "error" in output:
                print_error(
                    notebook_task["notebook_path"],
                    notebook_task["base_parameters"]["Param1Value"],
                    output["metadata"]["run_page_url"],
                    output["error"],
                )
            else:
                print_error(
                    notebook_task["notebook_path"],
                    notebook_task["base_parameters"]["Param1Value"],
                    output["metadata"]["run_page_url"],
                )
Follow this documentation to install the requirements. There’s a lot
more you can do with databricks-cli to make your life easier. It’s a
great tool to add to your toolbox.
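For example, the same ApiClient from the script above can enumerate
jobs (a sketch only; the field names follow the 2.1 Jobs API):

from databricks_cli.jobs.api import JobsApi

# Sketch: list every job in the workspace with its id and name,
# reusing the api_client created earlier.
jobs_api = JobsApi(api_client)
for job in jobs_api.list_jobs(version="2.1").get("jobs", []):
    print(job["job_id"], job["settings"]["name"])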
There’s no built-in function to calculate workdays in Databricks.
Here are some solutions.
Having a DimCalendar with holidays and at least Databricks Runtime 12
If you have a DimCalendar in the system, you can now do a LEFT JOIN
LATERAL without the correlated subquery errors that used to occur with
non-equality predicates; check SPARK-36114 for more details.
Calculating working days is then as simple as running a query like
the one below (the - 1 makes a range that starts and ends on the same
day count as zero working days):
SELECT mt.StartDate,
       mt.EndDate,
       wd.Workdays
FROM myTable mt
LEFT JOIN LATERAL
    (SELECT COUNT(DateKey) - 1 AS Workdays
     FROM dimCalendar dc
     WHERE mt.StartDate <= dc.DateKey
       AND mt.EndDate >= dc.DateKey
       AND dc.IsDayWeekDay = TRUE
       AND dc.IsDayHoliday = FALSE
    ) wd
If this query is slow, check that the data types of the date columns
match: all DATE or all TIMESTAMP. If the sluggishness remains, check
whether any of the fields are missing from the table statistics.
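If you don’t have a DimCalendar at hand, a rough alternative in a
Databricks notebook (assuming spark is in scope; column and table
names reuse the example above) is to expand each date range with
sequence and count the weekdays:

from pyspark.sql import functions as F

# Sketch: count Monday-Friday days between StartDate and EndDate
# (inclusive), minus one to match the lateral-join query above.
# dayofweek() returns 1 for Sunday and 7 for Saturday.
# Holidays are NOT excluded here; that still needs a calendar table.
df = spark.table("myTable")

workdays = F.expr("""
    size(filter(
        sequence(StartDate, EndDate),
        d -> dayofweek(d) NOT IN (1, 7)
    )) - 1
""")

df.select("StartDate", "EndDate", workdays.alias("Workdays")).show()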
A while ago I received an email from Jason Alvarez asking me if I knew
a way to change Power Query connections inside an Excel file. The
problem is similar to my previous post on mass changing pivot table
connection strings in Excel using PowerShell. It turns out you can,
and he was able to find this solution:
#root path to the collection of excel docs:
$rootPath = "C:\Temp\PowerQueryDemo\"
#Current query file path:
$currentPath = "C:\Temp\PowerQueryDemo\"
#Updated query file path:
$newPath = "C:\Temp\PowerQueryDemo\PowerQueryNewSource"

Get-ChildItem -Path $rootPath -Recurse -Include *.xlsx | ForEach-Object {
    $docPath = $_.FullName

    ## Create an instance of excel:
    $ExcelAppInstance = New-Object -ComObject "Excel.Application"
    $ExcelAppInstance.Visible = $false

    ## Open the workbook object within our Excel instance:
    $workbook = $ExcelAppInstance.workbooks.Open($docPath)

    ## Iterate the list of queries embedded in the workbook,
    ## updating the file path to our new one:
    $workbook.Queries | ForEach-Object {
        Write-Output $_.Formula
        Write-Output $_.Formula.Replace($currentPath, $newPath)
        $_.Formula = $_.Formula.Replace($currentPath, $newPath)
    }

    $workbook.Save()
    $workbook.Close()
    $ExcelAppInstance.Quit()
}
The key here is the Queries property inside the Workbook object.
Yesterday I had the pleasure of being back in person to give a small,
fun presentation about bending Power BI to help us solve Wordle
puzzles.
The content is in Portuguese but I will upload an English version
during this week.
Here’s the link on GitHub where you can download the file:
https://github.com/fpvmorais/presentations
If you have any questions, please reach out to me.
Have fun!
While creating a new machine to be the Integration Runtime for
Purview, and after installing the mandatory JRE that allows the
connection to Snowflake, I kept getting this error:
Error: (3913) JavaException: Must have Java 8 or newer installed.
This puzzled me because I had installed version 17. I started
troubleshooting and followed this guide to check if the installation
was correct. Everything seemed alright, but the error was still there.
Here are some tips from what I’ve learned while creating Python Azure
Functions.
App Plan
If you already have a Windows ‘App Service Plan’, you’ll need a
different one to run Python.
You can’t host Linux and Windows apps in the same resource group. If
you have an existing resource group named AzureFunctionsQuickstart-rg
with a Windows function app or web app, you must use a different
resource group.
Python version
If you need to use ‘azure-identity’ you’ll need to use Python 3.8
instead of 3.9, otherwise you’ll get:
There’s a lot to chew on while unit testing Azure Functions. I’m going
to be quite liberal with the terminology because, technically, some of
this is in fact integration testing and not unit testing per se.
Either way, Azure Functions loads local.settings.json on startup,
creating several environment variables that we can then use in our
code. In C# we would access them like so:
tenantId = configRoot["TenantId"];
appId = configRoot["AppId"];
or in Python:
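import os

# Values from local.settings.json are exposed as environment variables
tenant_id = os.environ["TenantId"]
app_id = os.environ["AppId"]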