As I explained in a previous post about Purview with the official Snowflake connector you need to create a scan per database/schema pair. In the customer where I was working there are a few hundred databases plus their schemas.
Doing this by using the UI would mean that we will be wasting a lot of time clicking away and be prone to forget some database/schema pair.
I’ve explored the documentation about the Scans - REST API and the
rest of this post is just some PowerShell
code to automate the
creation of scans and their schedules in Purview.
Requirements
You could run this with your account if it is just a one time thing.
In my scenario we will run this weekly to pick up newly created databases/schemas and as such the script will be run using a service principal triggered by one of our data pipelines.
This means that it will need to have the right permissions. In my case it was just data curator on the collection. Check this official documentation if you have trouble with permissions.
Parameters
Here’s some parameters for you to adjust to your settings:
$applicationId ="YOUR-APPLICATION-ID"
$tenantId= "YOUR-TENANT-ID"
$secret = "YOUR-APPLICATION-SECRET"
$resource = "https://purview.azure.net/"
$purviewEndpoint = "https://YOUR-PURVIEW-INSTANCE.purview.azure.com/scan/datasources/YOUR-COLLECTION/scans"
Snowflake connection
We are using an ODBC
connection you just have to change to match
the way you access the service. We then execute the query that lists the data we are interested:
#Create and open the connection
$conn = New-Object System.Data.Odbc.OdbcConnection
$conn.ConnectionString = "dsn=Snowflake"
$conn.Open()
#The metadata query
$query = @"
SELECT DB.DATABASE_NAME, SCH.SCHEMA_NAME
FROM SNOWFLAKE.ACCOUNT_USAGE.DATABASES DB
INNER JOIN SNOWFLAKE.ACCOUNT_USAGE.SCHEMATA SCH
ON DB.DATABASE_NAME = SCH.CATALOG_NAME
WHERE SCHEMA_NAME IN ('PRD', 'NOW', 'STG')
GROUP BY DB.DATABASE_NAME, SCH.SCHEMA_NAME
ORDER BY DB.DATABASE_NAME, SCH.SCHEMA_NAME
"@
#Execute the query, create the data structure and close the connection
$cmd = New-Object System.Data.Odbc.OdbcCommand($query, $conn)
$ds = New-Object System.Data.DataSet
(New-Object system.Data.odbc.odbcDataAdapter($cmd)).fill($ds) | out-null
$conn.close()
Get authentication token
You need to authenticate and get a token for the service principal:
## Get auth token
$encodedSecret = [System.Web.HttpUtility]::UrlEncode($secret)
$RequestAccessTokenUri = "https://login.microsoftonline.com/$tenantId/oauth2/token"
$body = "grant_type=client_credentials&client_id=$applicationId&client_secret=$encodedSecret&resource=$resource"
$contentType = 'application/x-www-form-urlencoded'
$token = Invoke-RestMethod -Method Post -Uri $RequestAccessTokenUri -Body $body -ContentType $contentType
The scan loop
In the main event we just go across the dataset and create a scan and schedule for each of the entries. Beware of the various things you will need to adapt like the integration runtime name, your snowflake credential, the collection reference, etc…
I’ve left them all in caps so they are easier to find.
Also, if you check the scanningdataplane documentation you will see that it is outdated and that’s why my calls seem different.
Pro tip: You can just inspect with your browser the calls made by the Purview Studio, that’s how I’ve found about it.
# Add the token we retrieved for the headers used in the calls
$headers = @{
Authorization="Bearer " + $token.access_token
Content='application/json'
}
#get the date to create the schedule
$currentDate = Get-Date -Format "yyyy-MM-ddTHH:MM:00Z"
foreach($r in $ds.Tables[0])
{
$db = $r.DATABASE_NAME
$schema = $r.SCHEMA_NAME
$scanName = "Scan-$db-$schema"
$scanPurviewUri = "$purviewEndpoint/$scanName`?api-version=2018-12-01-preview"
$triggerPurviewUri = "$purviewEndpoint/$scanName/triggers/default`?api-version=2018-12-01-preview"
$scanBody = @"
{
"properties": {
"warehouse": "YOUR-WAREHOUSE",
"database": "$db",
"schema": "$schema",
"credential": {
"referenceName": "YOUR-SNOWFLAKE-CREDENTIAL",
"credentialType": "BasicAuth"
},
"collection": {
"referenceName": "YOUR-COLLECTION-REFERENCE",
"type": "CollectionReference"
},
"connectedVia": {
"referenceName": "YOUR-INTEGRATION-RUNTIME-NAME",
"integrationRuntimeType": "SelfHosted"
}
},
"kind": "SnowflakeCredential",
"id": "datasources/YOUR-COLLECTION-NAME/scans/$scanName",
"name": "$scanName"
}
"@
Invoke-RestMethod -Method PUT -Uri $scanPurviewUri -Headers $headers -Body $scanBody -ContentType 'application/json'
$triggerBody = @"
{
"properties": {
"recurrence": {
"frequency": "Week",
"interval": 1,
"startTime": "$currentDate",
"endTime": null,
"schedule": {"hours": [12], "weekDays": ["Sunday"]},
"timezone": "UTC"
},
"scanLevel": "Full"
},
"id": datasources/YOUR-COLLECTION-NAME/scans/$scanName/triggers/default,
"name": "default"
}
"@
Invoke-RestMethod -Method PUT -Uri $triggerPurviewUri -Headers $headers -Body $triggerBody -ContentType 'application/json'
}
Have fun!