Kusto: Creating Dimensions with the datatable operator

I’m Mike O’Neill and I’m a data nerd who uses Azure Data Explorer, or Kusto, every day to glean insights into Azure’s developer and code-to-customers operations.

In my last video, I talked about how to assess a new data source to identify potential simple dimensions.

Kusto presents a singular problem for creating simple dimensions, especially when you’re used to storing dimensions as a table. You can’t create a little table as you would in Azure SQL and just insert and update values as you please. You also can’t repeatedly overwrite a stream as I used to do in Azure Data Lake.

In fact, you can’t really update anything in Kusto, nor can you overwrite a table that easily either. But Kusto does have an interesting workaround: you can write a list of comma-delimited values, use the datatable operator to make it into a table, and embed that into a function. Need to make changes? Just edit the text.

Let me give you a real example I wrote last week. One of the things I need to track for Azure’s engineering pipeline work is the name of the datacenter to which we’re deploy. Now, there’s nothing secret about this dimension: I pulled the info from these two public webpages.

Instead of creating a table like I would in Azure SQL, I just format all the values as a CSV, and then wrap it with the datatable operator. That operator requires three things: the operator itself, the schema immediately after it, and then a pair of square brackets around the CSV values.

Create that as a function, et voila, you’ve got a dimension. Anytime I need to update the values, I just update the function.

It’s not a perfect solution, however: because you’re just working with text, you’ve got none of the referential integrity capabilities of Azure SQL. Nor can you easily rebuild the dimension programmatically so that you don’t make silly mistakes.

I made that mistake last week, though it was with a mapping function with about 300 values rather than a pure, simple dimension. My colleague left a comment on my pull request, asking me to “find Waldo.” It was a friendly tease because I hadn’t bothered to do a simple check for duplicates.

If you use this method, you’ll need to be extra careful to maintain and run regular unit tests every time you alter the function.

And since early March, all of us in the State of Washington have been living through social distancing for the novel corona virus, my colleague teased me a little bit more, with this new version of “where’s Waldo.”

Stay healthy.

Comments are closed.