I’m Mike O’Neill and I’m a data nerd who uses Azure Data Explorer, or Kusto, every day to glean insights into Azure’s developer and code-to-customers operations.
One of the challenges I face is handling seasonality and outliers. For example, large numbers of Microsoft employees take vacation three weeks every year: Thanksgiving week, Christmas and New Year’s.
I analyze what thousands of developers do and those weeks always have low activity, so I have to figure out how to gracefully handle that seasonality.
In this video, I’m going to show you how I used two built-in features of Kusto: startofweek and range, to develop a little function that finds those holiday weeks no matter what year we’re looking at, and making it easier for those engineering managers to do it as well at the same time.
Here’s a visualization of a certain type of developer activity related to bringing new features to production. Those dips are the weekends. Yeah, sometimes we work weekends, but Microsoft prides itself on work/life balance, and so the bulk of activity happens Monday through Friday.
Step one is to group the data to match the activity I’m measuring: by week, not by day. The startofweek function does this nicely, and… while it is a simple function, it’s also really powerful because of that simplicity. There’s no need to decide whether Sunday or Monday is the first day of the week, and those engineering managers can pick it up in seconds and use without being reminded. Puts us all on the same page.
Startofweek smooths things out quite nicely, doesn’t it? But now you can see my problem: those six sharp downward spikes: that’s U.S. Thanksgiving, Christmas and New Year’s. Whether I exclude them or replace the values with an average, I need to identify them dynamically.
But I can’t just exclude the 24th and 25th of December, for example: I have to exclude the entire work week, and each year, those holidays either fall on different days of the week, or on a particular Thursday in November.
Let’s tackle Christmas and New Year’s first:
Step 1: Create a little data table with numbers for month and day.
Step 2: we use the range operator, which lets you create a list of numbers or dates in series. It’s created as a blob of structured text in a new column.
Step 3: we need to explode that blob of text out so that I have a row for each date, like it was a cross join. For that, we use the mv-expand operator.
Step 4: is the easy part. I restrict the rowset to only dates where the month and day match my list of holidays, and then…
Step 5: use startofweek again.
This pretty much works. Except when Christmas and New Year’s fall on a weekend or on Monday. Look at 2016 for example: both these holidays fall on the first day of the week, meaning that New Year’s week ends on January 7th. That was a regular work and school week. To fix that, all I need to do is switch my holidays to Christmas Eve and New Year’s Eve.
What about Thanksgiving? In the U.S., that’s the Thursday in the fourth week of November.
Step 1: Again, I start with a little datatable, but in this case, instead of the numbered day of the month, I need the week for that month, and the day of the week. Kusto uses a timespan of 4 day to represent Thursday, rather than an integer.
Step 2: Again, use the range operator to generate a set of date and…
Step 3: Use the mv-expand operator to explode this out in a cross join.
Step 4: is where things change from the previous example. I grab only Thursdays from the month in question, November.
Step 5: I order the data. This is critical: kusto won’t order things for you. You might think the range operator would land things in order, but it may not.
Step 6: use the row_number operator so that you know which is the fourth Thursday in November.
Step 7: Use startofweek to find the Sunday before Thanksgiving.
That’s it. Now, all I need to do is train my engineering execs to use startofweek, and then do a left anti-join to remove data from those weeks.
Here’s the full code for the function
.create-or-alter function with (folder = @'') SeasonalityWeeks
(
rangeStart:datetime = datetime("2016-01-01")
,rangeEnd:datetime = datetime("2022-01-01")
)
{
let _rangeStart = iif(rangeStart > rangeEnd,rangeEnd,rangeStart);
let _rangeEnd = iif(rangeEnd < rangeStart,rangeStart,rangeEnd);
let _majorFixedHolidays =
datatable(Month:int,Day:int,Name:string)
[
12,24,"Christmas",
12,31,"New Year",
];
let _majorVariableHolidays =
datatable(Month:int,DayOfWeek:timespan ,WeekOfMonth:int,Name:string)
[
11,timespan(4d),4,"US Thanksgiving",
];
_majorVariableHolidays
| extend Date = range(_rangeStart,_rangeEnd,1d)
| mv-expand Date to typeof(datetime)
| where Month == datetime_part('Month',Date)
| extend Weekday = dayofweek(Date)
,Year = datetime_part('Year',Date)
| where Weekday == DayOfWeek
| order by Year asc, Date asc
| extend RowNum = row_number(1,prev(Year) != Year)
| where RowNum == WeekOfMonth
| project HolidayWeek = startofweek(Date)
, Name
| union kind = outer
(
_majorFixedHolidays
| extend Date = range(_rangeStart,_rangeEnd,1d)
| mv-expand Date to typeof(datetime)
| where Month == datetime_part('Month',Date)
and Day == datetime_part('Day',Date)
| project HolidayWeek = startofweek(Date)
, Name
)
| project Name
, HolidayWeekStart = HolidayWeek
, HolidayWeekEnd = datetime_add('Day',6,HolidayWeek)
}