Using a certificate stored in Key Vault in an Azure App Service

For the last two days, I’ve been trying to deploy some new microservices using a certificate stored in Key Vault in an Azure App Service. By now, you’ve probably figured out that we love them around here. I’ve also been slamming my head against the wall because of some not-well-documented functionality about granting permissions to the Key Vault.

As a quick primer, here’s the basics of what I was trying to do:

resource "azurerm_app_service" "centralus-app-service" {
   name                = "${var.service-name}-centralus-app-service-${var.environment_name}"
   location            = "${azurerm_resource_group.centralus-rg.location}"
   resource_group_name = "${azurerm_resource_group.centralus-rg.name}"
   app_service_plan_id = "${azurerm_app_service_plan.centralus-app-service-plan.id}"

   identity {
     type = "SystemAssigned"
   }
 }

data "azurerm_key_vault" "cert" {
   name                = "${var.key-vault-name}"
   resource_group_name = "${var.key-vault-rg}"
 }
resource "azurerm_key_vault_access_policy" "centralus" {
   key_vault_id = "${data.azurerm_key_vault.cert.id}"
   tenant_id = "${azurerm_app_service.centralus-app-service.identity.0.tenant_id}"
   object_id = "${azurerm_app_service.centralus-app-service.identity.0.principal_id}"
   secret_permissions = [
     "get"
   ]
   certificate_permissions = [
     "get"
   ]
 }
resource "azurerm_app_service_certificate" "centralus" {
   name                = "${local.full_service_name}-cert"
   resource_group_name = "${azurerm_resource_group.centralus-rg.name}"
   location            = "${azurerm_resource_group.centralus-rg.location}"
   key_vault_secret_id = "${var.key-vault-secret-id}"
   depends_on          = [azurerm_key_vault_access_policy.centralus]
 }

and these are the relevant values I was passing into the module:

  key-vault-secret-id       = "https://example-keyvault.vault.azure.net/secrets/cert/0d599f0ec05c3bda8c3b8a68c32a1b47"
  key-vault-rg              = "example-keyvault"
  key-vault-name            = "example-keyvault"

But no matter what I did, I kept bumping up against this error:

Error: Error creating/updating App Service Certificate "example-app-dev-cert" (Resource Group "example-app-centralus-rg-dev"): web.CertificatesClient#CreateOrUpdate: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="BadRequest" Message="The service does not have access to '/subscriptions/[SUBSCRIPTIONID]/resourcegroups/example-keyvault/providers/microsoft.keyvault/vaults/example-keyvault' Key Vault. Please make sure that you have granted necessary permissions to the service to perform the request operation." Details=[{"Message":"The service does not have access to '/subscriptions/[SUBSCRIPTIONID]/resourcegroups/example-keyvault/providers/microsoft.keyvault/vaults/example-keyvault' Key Vault. Please make sure that you have granted necessary permissions to the service to perform the request operation."},{"Code":"BadRequest"},{"ErrorEntity":{"Code":"BadRequest","ExtendedCode":"59716","Message":"The service does not have access to '/subscriptions/[SUBSCRIPTIONID]/resourcegroups/example-keyvault/providers/microsoft.keyvault/vaults/example-keyvault' Key Vault. Please make sure that you have granted necessary permissions to the service to perform the request operation.","MessageTemplate":"The service does not have access to '{0}' Key Vault. Please make sure that you have granted necessary permissions to the service to perform the request operation.","Parameters":["/subscriptions/[SUBSCRIPTIONID]/resourcegroups/example-keyvault/providers/microsoft.keyvault/vaults/example-keyvault"]}}]

I checked and re-checked and triple-checked and had colleagues check, but no matter what I did, it kept puking with this permissions issue. I confirmed that the App Service’s identity was being provided and saved, but nothing seemed to work.

Then I found this blog post from 2016 talking about a magic Service Principal (or more specifically, a Resource Principal) that requires access to the Key Vault too. All I did was add the following resource with the magic SP, and everything worked perfectly.

resource "azurerm_key_vault_access_policy" "azure-app-service" {
   key_vault_id = "${data.azurerm_key_vault.cert.id}"
   tenant_id = "${azurerm_app_service.centralus-app-service.identity.0.tenant_id}"

   # This object is the Microsoft Azure Web App Service magic SP 
   # as per https://azure.github.io/AppService/2016/05/24/Deploying-Azure-Web-App-Certificate-through-Key-Vault.html
   object_id = "abfa0a7c-a6b6-4736-8310-5855508787cd" 

   secret_permissions = [
     "get"
   ]

   certificate_permissions = [
     "get"
   ]
 }

It’s frustrating that Microsoft hasn’t documented this piece (at least officially), but hopefully with this knowledge, you’ll be able to automate using a certificate stored in Key Vault in your next Azure App Service.

How to diagnose per-instance issues in Azure App Service

We have several micro-services that we run as App Services within Azure. In the past few weeks, multiple times we’ve experienced problems where a single instance has decided to go crazy but not crazy enough that Azure knows to take it out of rotation. Being able to diagnose these per-instance issues is imperative when it comes to offering a functioning App Service.

The first thing we’ve done is setup Alerts to monitor each App Service not just as a whole, but also per-instance. (NOTE: These are services we have not migrated to Terraform yet, so we have created these alerts manually, just like the services were as well. The alert definitions have been built into our Terraform stack and automatically get deployed as we move these micro-services over to the new Terraform-managed stack.) To get notified of broken single instances:

  1. Open the App Service in the Portal
  2. Under the ‘Monitoring’ section of the blade, select ‘Alerts’
  3. Select ‘New Alert Rule’
  4. Under ‘Condition’, choose ‘Add’
  5. Choose the ‘Http Server Errors’ signal
  6. Place a check in the box next to the ‘Instance’ dimension
  7. Set your Threshold settings according to your preference (for the record, we are using Dynamic / Medium / 5 minutes / Every 5 Minutes)
  8. Click ‘Done’
  9. Set up your Action Group accordingly
  10. Save the alert

Within 10 minutes, it will begin monitoring each instance which should give you better insight not just into the health of your application, but how each of the pieces that comprise your app are operating.

This is important because you may have 10 instances running a single App Service but the failure of a single instance may not create enough failures to trip an alert, depending on your routing scheme or traffic levels.

I personally believe that more visibility is always a positive, and so being able to detect per-instance issues in your Azure App Service can result in shorter outages and ultimately happier customers.

Terraform: “Error: insufficient items for attribute “sku”; must have at least 1″

Last week, we were attempting to deploy a new Terraform-owned resource but every time we ran terraform plan or terraform apply, we got the error Error: insufficient items for attribute "sku"; must have at least 1. We keep our Terraform code in a Azure DevOps project, with approvals being required for any new commits even into our dev environment, so we were flummoxed.

Our first thought was that we had upgraded the Terraform azurerm provider from 1.28.0 to 1.32.0 and we knew for a fact that the azurerm_key_vault resource had been changed from accepting a sku {} block to simply requiring a sku_name property. We tried every combination of having either, both, and none of them defined, and we still received the error. We even tried downgrading back to 1.28.0 as a fallback, but it made no change. At this point we were relatively confident that it wasn’t the provider.

The next thing we looked for was any other resources that had a sku {} block defined. This included our azurerm_app_service_plans, our azure_virtual_machines, and our azurerm_vpn_gateway. We searched for and commented out all of the respective declarations from our .tf files, but still we received the error.

Now we were starting to get nervous. Nothing we tried would solve the problem, and we were starting to get a backlog of requests for new resources that we couldn’t deploy because no matter what we did, whether adding or removing potentially broken code, we couldn’t deploy any new changes. To say the tension on our team was palpable would be the understatement of the year.

At this point we needed to take a step back and analyze the problem logically, so we all took a break from Terraform to clear our minds and de-stress a bit. We started to suspect something in the state file was causing the problem, but we weren’t really sure what. We decided to take the sledgehammer approach and using terraform state rm, we removed every instance of those commented out resources we found above.

This worked. Now we could run terraform plan and terraform apply without issue, but we still weren’t sure why. That didn’t bode well if the problem re-occured; we couldn’t just keep taking a sledgehammer to the environment, it’s just too disruptive. We needed to figure out the root cause.

We opened an issue on the provider’s GitHub page for further investigation, and after some digging by other community members and Terraform employees themselves, it seems that Microsoft’s API returns a different response for App Service Plans than any other resource when it is found to be missing. An assumption was being made that it would be the same for all resources, but it turned out that this was a bad assumption to make.

This turned out to be the key for us. Someone had deleted several App Service Plans from the Azure portal (thinking they were not being used) and so our assumption is that when the provider is checking for the status of a missing App Service Plan, the broken response makes Terraform think it actually exists, even though there’s no sku {} data in it, causing Terraform to think that that specific data was missing.

Knowing the core problem, the error message Error: insufficient items for attribute "sku"; must have at least 1 kind of makes sense now: the sku attribute is missing at least 1 item, it just doesn’t make clear that the “insufficient items” are on the Azure side, not the Terraform / .tf side.

They’ve added a workaround in the provider until Microsoft updates the API to respond like all of the other resources.

Have you seen this error before? What did you do to solve it?

Posts navigation