# Calculating workdays in Databricks

There’s not an official function to calculate workdays in Databricks. Here are some solutions.

## Having a DimCalendar with holidays and at least Databricks Runtime 12

If you have a DimCalendar in the system you can now do LEFT LATERAL JOIN without the correlated subquery errors when using non-equality predicates. Check SPARK-36114 for more details.

Calculating working days is then as simple as run a query like:

SELECT mt.StartDate,
mt.EndDate,
dc.Workdays
FROM myTable mt
LEFT JOIN LATERAL
(SELECT COUNT(DateKey) - 1 AS Workdays
FROM dimCalendar dc
WHERE mt.StartDate <= dc.DateKey
AND mt.EndDate >= dc.Datekey
AND dc.IsDayWeekDay = TRUE
AND dc.IsDayHoliday = FALSE
)


If this query is slow, please check if the data types for the dates columns match. All are DATE or all are TIMESTAMP. If the sluginess remains check if any of the fields are not part of the statistics of the table.

Remember that the query inside the LEFT JOIN LATERAL will run once per row on myTable. This concept is similar to the one you might have seen in Oracle or in SQL Server as OUTER APPLY.

## No DimCalendar but you have a recent Databricks Runtime

You can build an UDF that comnines REDUCE and SEQUENCE to calculate the desired value:

CREATE FUNCTION IF NOT EXISTS udf_workdays(StartDate date, EndDate date)
RETURNS int
RETURN
(
REDUCE(
SEQUENCE(StartDate, EndDate),
0,
(acc, x) -> acc + IF(DATE_FORMAT(x, 'E') IN ('Sat', 'Sun'), 0, 1))
);


SEQUENCE will generate all the dates from start to end date in an array. Then REDUCE will apply our lamba function over each element of the array adding 1 or 0 to the accumulator if the date is an weekday or not.

Then you can call it in your script:

 SELECT
mt.StartDate,
mt.EndDate,
udf_workdays(mt.StartDate, mt.EndDate)
FROM myTable mt


This will be faster than using the first option but unless some logic is added to deal with the holidays you will lose that precision.

## What about passing the holiday list as parameter?

If we have somewhere a list of the holidays we can change the UDF to accept an extra parameter with a default value. This will turn the paramener optional.

CREATE FUNCTION IF NOT EXISTS udf_workdays(StartDate date,
EndDate date,
Holidays array<date> array())
RETURNS int
RETURN
(
REDUCE(
ARRAY_EXCEPT(SEQUENCE(StartDate, EndDate),Holidays)
0,
(acc, x) -> acc + IF(DATE_FORMAT(x, 'E') IN ('Sat', 'Sun'), 0, 1))
);


The remaining logic stays the same, we are just excluding the dates from the Holidays array.

We can then do something like this to call it:

WITH cte as (
SELECT
collect_list(TO_DATE(holidays)) holidays
FROM
VALUES
('2020-12-25'),
('2021-01-01') AS tab(holidays)
)
select
udf_workdays(TO_DATE('2020-12-23'),TO_DATE('2021-01-02')) NewWayDefaultParameter,
udf_workdays(TO_DATE('2020-12-23'),TO_DATE('2021-01-02'), holidays) WithoutHolidays
from
cte


Which should return:

-------------------------------------------
| EmptyDefaultParameter | WithoutHolidays |
-------------------------------------------
|                     7 |               5 |
-------------------------------------------


Have fun!