Calculating workdays in Databricks

There’s not an official function to calculate workdays in Databricks. Here are some solutions.

Having a DimCalendar with holidays and at least Databricks Runtime 12

If you have a DimCalendar in the system you can now do LEFT LATERAL JOIN without the correlated subquery errors when using non-equality predicates. Check SPARK-36114 for more details.

Calculating working days is then as simple as run a query like:

SELECT mt.StartDate,
       mt.EndDate,
       dc.Workdays
 FROM myTable mt
  LEFT JOIN LATERAL
    (SELECT COUNT(DateKey) - 1 AS Workdays
     FROM dimCalendar dc
     WHERE mt.StartDate <= dc.DateKey
        AND mt.EndDate >= dc.Datekey
        AND dc.IsDayWeekDay = TRUE
        AND dc.IsDayHoliday = FALSE
    )

If this query is slow, please check if the data types for the dates columns match. All are DATE or all are TIMESTAMP. If the sluginess remains check if any of the fields are not part of the statistics of the table.

Remember that the query inside the LEFT JOIN LATERAL will run once per row on myTable. This concept is similar to the one you might have seen in Oracle or in SQL Server as OUTER APPLY.

No DimCalendar but you have a recent Databricks Runtime

You can build an UDF that comnines REDUCE and SEQUENCE to calculate the desired value:

CREATE FUNCTION IF NOT EXISTS udf_workdays(StartDate date, EndDate date)
RETURNS int
RETURN
(
    REDUCE(
      SEQUENCE(StartDate, EndDate),
      0,
      (acc, x) -> acc + IF(DATE_FORMAT(x, 'E') IN ('Sat', 'Sun'), 0, 1))
);

SEQUENCE will generate all the dates from start to end date in an array. Then REDUCE will apply our lamba function over each element of the array adding 1 or 0 to the accumulator if the date is an weekday or not.

Then you can call it in your script:

 SELECT
    mt.StartDate,
    mt.EndDate,
    udf_workdays(mt.StartDate, mt.EndDate)
FROM myTable mt

This will be faster than using the first option but unless some logic is added to deal with the holidays you will lose that precision.

What about passing the holiday list as parameter?

If we have somewhere a list of the holidays we can change the UDF to accept an extra parameter with a default value. This will turn the paramener optional.

CREATE FUNCTION IF NOT EXISTS udf_workdays(StartDate date,
                                           EndDate date,
                                           Holidays array<date> array())
RETURNS int
RETURN
(
    REDUCE(
      ARRAY_EXCEPT(SEQUENCE(StartDate, EndDate),Holidays)
      0,
      (acc, x) -> acc + IF(DATE_FORMAT(x, 'E') IN ('Sat', 'Sun'), 0, 1))
);

The remaining logic stays the same, we are just excluding the dates from the Holidays array.

We can then do something like this to call it:

WITH cte as (
  SELECT
    collect_list(TO_DATE(holidays)) holidays
  FROM
  VALUES
    ('2020-12-25'),
    ('2021-01-01') AS tab(holidays)
)
select
  udf_workdays(TO_DATE('2020-12-23'),TO_DATE('2021-01-02')) NewWayDefaultParameter,
  udf_workdays(TO_DATE('2020-12-23'),TO_DATE('2021-01-02'), holidays) WithoutHolidays
from
  cte

Which should return:

-------------------------------------------
| EmptyDefaultParameter | WithoutHolidays |
-------------------------------------------
|                     7 |               5 |
-------------------------------------------

Have fun!

 Share!