A #dataOp ramblings

Adventures in Business Intelligence

Terraform/OpenTofu Notes

Using Workspaces for Environment Separation

Terraform workspaces are a powerful feature that lets you manage multiple environments (such as dev, staging, and prod) within the same configuration. For example, to switch to the production workspace:

tofu workspace select prod
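A few companion commands cover the rest of the workspace lifecycle (the staging name below is just an example):

```shell
tofu workspace list          # list all workspaces; * marks the active one
tofu workspace new staging   # create and switch to a new workspace
tofu workspace show          # print the name of the active workspace
```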

Applying Configuration Changes

When applying changes to your Terraform configuration, you can specify a variable file for different environments and approve the changes automatically:

tofu apply -var-file ./dev.tfvars -auto-approve

Importing Existing Resources

To import an existing resource into Terraform’s state, use the import command with the appropriate variable file:
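For example, to bring a hypothetical AWS S3 bucket under management (the resource address and bucket name are placeholders; substitute your own):

```shell
tofu import -var-file ./dev.tfvars aws_s3_bucket.reports my-reports-bucket
```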

The credentials provided for the UsageMetricsDataConnector source are invalid.

Over the last few weeks I had several meetings with Microsoft support to troubleshoot this invalid-credentials issue. The problem was that after right-clicking a report and selecting “View Usage Metrics Report”, the usage report would come up empty, and after 24 hours you would get an email with the following message:

The credentials provided for the UsageMetricsDataConnector source are invalid. (Source at UsageMetricsDataConnector.) The exception was raised by the IDbCommand interface.

We tried several approaches until we found this piece of documentation: https://learn.microsoft.com/en-us/power-bi/collaborate-share/service-modern-usage-metrics#considerations-and-limitations

Issues with v4l2loopback and OBS

I started my Linux journey a long time ago with Red Hat in the mid-nineties and moved on to Slackware, still to this day one of my favourite distros. I tried Ubuntu for a short while but found a new home in Arch. It was my main distro for the better part of the 2010s, but then I tried Void Linux, and it has been my home for at least three years now.

Tabular Editor Scripts: Simplifying Your Power BI Modeling Experience

As a data enthusiast, I’m always on the lookout for ways to streamline my workflow and make my life easier. In this post, I’ll be sharing two Tabular Editor scripts that have saved me countless hours of manual formatting in my Power BI projects.

There are two versions of this software. To use the scripts below you only need version 2, which is free and open source: https://tabulareditor.github.io/TabularEditor/

If you have the budget, you can support the author (Daniel Otykier) by buying a license for version 3: https://tabulareditor.com

Extracting tables from SQL queries using Sqlfluff

I got an assignment the other day to produce documentation to send to a customer. Part of the task was extracting the table names required to execute a certain Databricks notebook. The plan was to build an object dependency tree.

The query spanned 279 lines. How do you extract just the table names from a file without hunting for them manually? And can we reuse the technique in the future?

Azure Active Directory extraction with Databricks

During data engineering projects I try to minimize the number of tools in use. I think it's a good practice: having too many tools sometimes lets errors go unnoticed by team members.

One of the advantages of a tool like Databricks is that it gives us the full power of Python, so we can avoid, as I did in the past, bolting on something like Azure Functions to compensate for the limitations of a platform.
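As a sketch of what that looks like in practice, the snippet below pages through Microsoft Graph's /users listing (the usual way to pull Azure Active Directory users) using only the standard library. It assumes you have already acquired an access token elsewhere, e.g. with a service principal; the helper names are my own:

```python
import json
import urllib.request

GRAPH_USERS = "https://graph.microsoft.com/v1.0/users"


def fetch_json(url: str, token: str) -> dict:
    """GET a Graph URL with a bearer token and decode the JSON body."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def collect_all(first_url: str, token: str, fetch=fetch_json) -> list:
    """Follow Graph's @odata.nextLink pagination until the listing is exhausted."""
    rows, url = [], first_url
    while url:
        page = fetch(url, token)
        rows.extend(page.get("value", []))
        url = page.get("@odata.nextLink")
    return rows
```

In a Databricks notebook the resulting list of dicts can be handed straight to spark.createDataFrame for further processing.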

List of errors from Databricks API

I’m currently working on a project where I’m adapting a code base of Databricks notebooks for a new client. There are a few errors to hunt down, but the Web UI is not really friendly for this purpose.

I just wanted a quick and easy way to find the issues without clicking around.

Here’s a quick script that does just that:

import os, json
import configparser
from databricks_cli.sdk.api_client import ApiClient
from databricks_cli.runs.api import RunsApi


def print_error(nb_path, nb_params, nb_run_url, nb_error="Unknown"):
    # Keep only the first line of the (often very long) error message.
    error = nb_error.partition("\n")[0]
    params = json.loads(nb_params) if nb_params != "" else {}
    print(
        f"""
Path:	{nb_path}
Params:	{json.dumps(params, indent=2)}
RunUrl:	{nb_run_url}
Error:	{error}
"""
    )


# Read host and token from the standard databricks-cli config file.
databricks_cfg = "~/.databrickscfg"

conf = configparser.ConfigParser()
conf.read(os.path.expanduser(databricks_cfg))

api_client = ApiClient(
    host=conf["DEFAULT"]["host"],
    token=conf["DEFAULT"]["password"]
)

runs_api = RunsApi(api_client)

# Page through the most recent 100 job runs, 25 at a time.
for offset in range(1, 101, 25):
    page = runs_api.list_runs(
        job_id=None,
        active_only=None,
        completed_only=None,
        offset=offset,
        limit=25,
        version="2.1",
    )
    for run in page.get("runs", []):
        # Runs still in progress have no result_state yet.
        if run["state"].get("result_state") != "FAILED":
            continue
        output = runs_api.get_run_output(run_id=run["run_id"])
        task = output["metadata"]["task"]["notebook_task"]
        print_error(
            task["notebook_path"],
            task["base_parameters"]["Param1Value"],
            output["metadata"]["run_page_url"],
            output.get("error", "Unknown"),
        )

Follow the databricks-cli documentation to install the requirements. There's a lot more you can do with databricks-cli to make your life easier; it's a great tool to add to your toolbox.

Calculating workdays in Databricks

There is no built-in function to calculate workdays in Databricks. Here are some solutions.

Having a DimCalendar with holidays and at least Databricks Runtime 12

If you have a DimCalendar in the system, you can now do a LEFT JOIN LATERAL without the correlated subquery errors that non-equality predicates used to trigger. Check SPARK-36114 for more details.

Calculating working days is then as simple as running a query like:

SELECT mt.StartDate,
       mt.EndDate,
       wd.Workdays
 FROM myTable mt
  LEFT JOIN LATERAL
    (SELECT COUNT(DateKey) - 1 AS Workdays
     FROM dimCalendar dc
     WHERE mt.StartDate <= dc.DateKey
        AND mt.EndDate >= dc.DateKey
        AND dc.IsDayWeekDay = TRUE
        AND dc.IsDayHoliday = FALSE
    ) wd

If this query is slow, check that the data types of the date columns match: either all DATE or all TIMESTAMP. If the sluggishness remains, check whether any of the fields are missing from the table statistics.
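When the interval data is already in Python in the same notebook, NumPy's busday_count is an alternative that skips the DimCalendar join entirely. A minimal sketch, assuming you supply the holiday list yourself (the dates below are made up):

```python
import numpy as np

# Hypothetical holiday list -- in practice this would come from your DimCalendar.
holidays = ["2024-01-01", "2024-05-01"]


def workdays(start: str, end: str) -> int:
    """Count business days in [start, end), skipping weekends and holidays."""
    return int(np.busday_count(start, end, holidays=holidays))


print(workdays("2024-01-01", "2024-01-08"))
```

Note that busday_count counts the half-open interval [start, end), so align the convention with the inclusive SQL version above before comparing results.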

Mass changing power query connection strings in Excel using PowerShell

A while ago I received an email from Jason Alvarez asking me if I knew a way to change Power Query connections inside an Excel file. The problem is similar to my previous post on mass changing pivot table connection strings in Excel using PowerShell. It turns out you can, and he was able to find this solution:

#root path to the collection of excel docs:
$rootPath = "C:\Temp\PowerQueryDemo\"

#Current query file path:
$currentPath = "C:\Temp\PowerQueryDemo\"
#Updated query file path:
$newPath = "C:\Temp\PowerQueryDemo\PowerQueryNewSource"

Get-ChildItem -Path $rootPath -Recurse -Include *.xlsx | ForEach-Object{
    $docPath = $_.FullName

    ## Create an instance of excel:
    $ExcelAppInstance = New-Object -ComObject "Excel.Application"
    $ExcelAppInstance.Visible = $false

    ## Open the workbook object within our Excel instance:
    $workbook = $ExcelAppInstance.workbooks.Open($docPath)

    ##iterate the list of queries embedded in the workbook,
    ##updating the file path to our new one:
    $workbook.Queries | ForEach-Object{
        Write-Output $_.Formula
        Write-Output $_.Formula.Replace($currentPath,$newPath)
        $_.Formula = $_.Formula.Replace($currentPath,$newPath)
    }

    $workbook.Save()
    $workbook.Close()
    $ExcelAppInstance.Quit()
}

The key here is the Queries property inside the Workbook object.