There’s not an official function to calculate workdays in Databricks. Here are some solutions.
Having a DimCalendar with holidays and at least Databricks Runtime 12
If you have a DimCalendar in the system you can now do LEFT LATERAL
JOIN
without the correlated subquery errors when using non-equality
predicates. Check SPARK-36114 for more details.
Calculating working days is then as simple as run a query like:
SELECT mt.StartDate,
mt.EndDate,
dc.Workdays
FROM myTable mt
LEFT JOIN LATERAL
(SELECT COUNT(DateKey) - 1 AS Workdays
FROM dimCalendar dc
WHERE mt.StartDate <= dc.DateKey
AND mt.EndDate >= dc.Datekey
AND dc.IsDayWeekDay = TRUE
AND dc.IsDayHoliday = FALSE
)
If this query is slow, please check if the data types for the dates
columns match. All are DATE
or all are TIMESTAMP
. If the sluginess
remains check if any of the fields are not part of the statistics of
the table.
Remember that the query inside the LEFT JOIN LATERAL
will run once
per row on myTable
. This concept is similar to the one you might
have seen in Oracle or in SQL Server as OUTER APPLY
.
No DimCalendar but you have a recent Databricks Runtime
You can build an UDF that comnines REDUCE
and SEQUENCE
to
calculate the desired value:
CREATE FUNCTION IF NOT EXISTS udf_workdays(StartDate date, EndDate date)
RETURNS int
RETURN
(
REDUCE(
SEQUENCE(StartDate, EndDate),
0,
(acc, x) -> acc + IF(DATE_FORMAT(x, 'E') IN ('Sat', 'Sun'), 0, 1))
);
SEQUENCE
will generate all the dates from start to end date in an
array. Then REDUCE
will apply our lamba function over each element
of the array adding 1 or 0 to the accumulator if the date is an
weekday or not.
Then you can call it in your script:
SELECT
mt.StartDate,
mt.EndDate,
udf_workdays(mt.StartDate, mt.EndDate)
FROM myTable mt
This will be faster than using the first option but unless some logic is added to deal with the holidays you will lose that precision.
What about passing the holiday list as parameter?
If we have somewhere a list of the holidays we can change the UDF to accept an extra parameter with a default value. This will turn the paramener optional.
CREATE FUNCTION IF NOT EXISTS udf_workdays(StartDate date,
EndDate date,
Holidays array<date> array())
RETURNS int
RETURN
(
REDUCE(
ARRAY_EXCEPT(SEQUENCE(StartDate, EndDate),Holidays)
0,
(acc, x) -> acc + IF(DATE_FORMAT(x, 'E') IN ('Sat', 'Sun'), 0, 1))
);
The remaining logic stays the same, we are just excluding the dates from the Holidays array.
We can then do something like this to call it:
WITH cte as (
SELECT
collect_list(TO_DATE(holidays)) holidays
FROM
VALUES
('2020-12-25'),
('2021-01-01') AS tab(holidays)
)
select
udf_workdays(TO_DATE('2020-12-23'),TO_DATE('2021-01-02')) NewWayDefaultParameter,
udf_workdays(TO_DATE('2020-12-23'),TO_DATE('2021-01-02'), holidays) WithoutHolidays
from
cte
Which should return:
-------------------------------------------
| EmptyDefaultParameter | WithoutHolidays |
-------------------------------------------
| 7 | 5 |
-------------------------------------------
Have fun!