We run several micro-services as App Services within Azure. Several times in the past few weeks, a single instance has decided to go crazy, but not crazy enough for Azure to know to take it out of rotation. Being able to diagnose these per-instance issues is essential to running a dependable App Service.
The first thing we did was set up alerts that monitor each App Service not just as a whole, but also per instance. (NOTE: We have not migrated these services to Terraform yet, so we created these alerts manually, just as we created the services themselves. The alert definitions have been built into our Terraform stack and get deployed automatically as we move these micro-services over to the new Terraform-managed stack.) To get notified of a single broken instance:
- Open the App Service in the Portal
- Under the ‘Monitoring’ section of the blade, select ‘Alerts’
- Select ‘New Alert Rule’
- Under ‘Condition’, choose ‘Add’
- Choose the ‘Http Server Errors’ signal
- Place a check in the box next to the ‘Instance’ dimension
- Set your Threshold settings according to your preference (for the record, we are using Every 5 Minutes)
- Click ‘Done’
- Set up your Action Group accordingly
- Save the alert
Within 10 minutes, the alert will begin monitoring each instance, which should give you better insight not just into the health of your application as a whole, but also into how each of the pieces that make up your app is operating.
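For anyone who would rather script this than click through the Portal, here is a rough sketch of what the same rule might look like as a metric alert request body, written as a plain Python dict. The field names reflect my understanding of the `Microsoft.Insights/metricAlerts` resource schema, so check them against the current Azure documentation before relying on them. The function name, the threshold of 10, the severity, the 5-minute window size, and the resource IDs are all placeholders I made up for illustration; they are not our production values.

```python
import json


def build_per_instance_alert(app_service_id, action_group_id, threshold=10):
    """Build a metric alert body that splits HTTP server errors by instance.

    All argument values are caller-supplied placeholders; the default
    threshold is an arbitrary example, not a recommendation.
    """
    return {
        "location": "global",
        "properties": {
            "description": "HTTP server errors per App Service instance",
            "severity": 2,  # assumed value
            "enabled": True,
            "scopes": [app_service_id],
            "evaluationFrequency": "PT5M",  # evaluate every 5 minutes
            "windowSize": "PT5M",  # assumed look-back window
            "criteria": {
                "odata.type": (
                    "Microsoft.Azure.Monitor."
                    "SingleResourceMultipleMetricCriteria"
                ),
                "allOf": [
                    {
                        "name": "PerInstanceHttp5xx",
                        "metricNamespace": "Microsoft.Web/sites",
                        # 'Http Server Errors' in the Portal
                        "metricName": "Http5xx",
                        "operator": "GreaterThan",
                        "threshold": threshold,
                        "timeAggregation": "Total",
                        # Equivalent of ticking the 'Instance' dimension:
                        # "*" evaluates every instance separately.
                        "dimensions": [
                            {
                                "name": "Instance",
                                "operator": "Include",
                                "values": ["*"],
                            }
                        ],
                    }
                ],
            },
            "actions": [{"actionGroupId": action_group_id}],
        },
    }


def to_json(alert_body):
    """Serialize the alert body for use in a template or REST call."""
    return json.dumps(alert_body, indent=2)
```

The part that matters is the `dimensions` entry. Without it, the rule evaluates the error count for the App Service as a whole; with it, each instance gets its own evaluation.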
Per-instance monitoring matters because you may have 10 instances running a single App Service, and the failure of one of them may not generate enough errors to trip an alert on the service as a whole, depending on your routing scheme or traffic levels.
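To make that concrete, here is a small back-of-the-envelope sketch. Every number in it is hypothetical: 10 instances, a service-wide threshold of 500 errors per window (set high to tolerate normal fleet-wide noise), and a per-instance threshold of 50. It only illustrates the arithmetic and does not talk to Azure.

```python
def evaluate_alerts(errors_by_instance, aggregate_threshold, per_instance_threshold):
    """Compare a service-wide alert with a per-instance alert.

    errors_by_instance maps an instance name to its error count for one
    evaluation window. Returns (aggregate_fires, firing_instances).
    """
    total_errors = sum(errors_by_instance.values())
    aggregate_fires = total_errors > aggregate_threshold
    firing_instances = sorted(
        name
        for name, errors in errors_by_instance.items()
        if errors > per_instance_threshold
    )
    return aggregate_fires, firing_instances


# Hypothetical window: nine healthy instances with 2 errors each,
# and one broken instance returning 300 errors.
errors = {f"instance-{i}": 2 for i in range(10)}
errors["instance-3"] = 300

aggregate_fires, firing_instances = evaluate_alerts(
    errors, aggregate_threshold=500, per_instance_threshold=50
)
```

With these made-up numbers the service-wide total is 318 errors, which stays under the 500 threshold, so the aggregate alert never fires. The per-instance check flags `instance-3` immediately.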
I personally believe that more visibility is always a positive, and so being able to detect per-instance issues in your Azure App Service can result in shorter outages and ultimately happier customers.