For the topics we have covered in this chapter to be effective, we must continually monitor all aspects of our system. From security to resilience and performance, we must know what is happening at all times.

Monitoring for security

Maintaining the security of a solution requires monitoring that can detect, respond to, and ultimately recover from incidents. When an attack happens, the speed at which we respond will determine how much damage is incurred.

However, a monitoring solution also needs to be intelligent enough to prioritize alerts and filter out false positives.
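As a rough illustration of prioritizing and filtering, the sketch below drops alerts that match a list of known-benign combinations and orders the rest by severity. The `Alert` fields, the severity names, and the `known_benign` set are hypothetical and are not taken from any Azure API; a real solution would draw these from its own alert source.

```python
from dataclasses import dataclass

# Hypothetical severity scale: a lower rank is handled first.
SEVERITY_RANK = {"high": 0, "medium": 1, "low": 2}


@dataclass(frozen=True)
class Alert:
    rule: str      # name of the detection rule that fired (assumed field)
    source: str    # host or service that raised the alert (assumed field)
    severity: str  # one of the keys in SEVERITY_RANK


def triage(alerts, known_benign):
    """Drop alerts matching a known-benign (rule, source) pair,
    then return the remainder ordered from most to least severe."""
    kept = [a for a in alerts if (a.rule, a.source) not in known_benign]
    return sorted(kept, key=lambda a: SEVERITY_RANK[a.severity])


if __name__ == "__main__":
    alerts = [
        Alert("port-scan", "build-agent-01", "low"),
        Alert("failed-logins", "web-01", "medium"),
        Alert("malware-detected", "db-01", "high"),
    ]
    # Assume the build agent legitimately scans ports during test runs.
    benign = {("port-scan", "build-agent-01")}
    for alert in triage(alerts, benign):
        print(alert.severity, alert.rule, alert.source)
```

The suppression list is deliberately explicit so that each filtered false positive is a recorded decision rather than a silent omission.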

Azure provides several monitoring mechanisms, both general-purpose and security-specific, which can be configured according to your organization’s capabilities. Therefore, when designing a monitoring solution, you must align it with your company’s existing teams so that alerts and pertinent information are directed to the right people.

Monitoring requirements cover more than just alerts; the policies that define business requirements around configuration settings such as encryption, passwords, and allowed resources must be checked to confirm they are being adhered to. The Azure risk and compliance reports will highlight any items that deviate so that the necessary team can investigate and remediate.

Other tools, such as Azure Security Center (since renamed Microsoft Defender for Cloud), will continually monitor your risk profile and advise on improving your security posture.

Finally, security patching reports also need regular review to confirm that VMs are being patched and to identify insecure hosts that must be investigated and brought into line.

Monitoring for resilience

Monitoring your solution is not just about being alerted to issues after the fact; the ideal scenario is to detect the warning signs and remediate problems before they cause a failure. In other words, we can use monitoring as an early warning system.

Applications should include in their designs the ability to output relevant logs and errors; this then enables health alerts to be set up that, when combined with resource thresholds, provide details of the running processes.
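One way an application can emit logs that a monitoring tool can parse is to write each record as a single JSON object. The following is a minimal sketch using only Python's standard `logging` module; the logger name `orders-service` and the choice of fields are assumptions for illustration, not a format required by any Azure service.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Attach the traceback text when the record carries an exception.
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "orders-service") -> logging.Logger:
    """Return a logger that writes JSON lines to standard error.
    The default name is a hypothetical service name."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("order accepted")
    try:
        1 / 0
    except ZeroDivisionError:
        log.exception("order processing failed")
```

Because every line is a self-describing object, a health alert can match on the `level` or `message` field rather than on free-form text.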

Next, a set of baselines can be created that identifies what a healthy system looks like. When anomalies occur, such as long-running processes or specific error logs, they can then be spotted earlier.
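To make the idea of a baseline concrete, the sketch below summarizes a set of healthy readings by their mean and standard deviation, and flags a new reading that falls too far from that mean. This is one simple statistical approach chosen for illustration; the function names and the threshold of three standard deviations are assumptions, not a method prescribed by Azure.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize healthy readings as (mean, standard deviation).
    Requires at least two samples."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, z_threshold=3.0):
    """Return True when value lies more than z_threshold standard
    deviations away from the baseline mean."""
    avg, spread = baseline
    if spread == 0:
        # Every healthy sample was identical, so any change is unusual.
        return value != avg
    return abs(value - avg) / spread > z_threshold


if __name__ == "__main__":
    # Hypothetical response times in milliseconds from a healthy period.
    healthy = [100, 102, 98, 101, 99]
    baseline = build_baseline(healthy)
    print(is_anomalous(101, baseline))  # within the normal range
    print(is_anomalous(150, baseline))  # far outside the normal range
```

With the sample data shown, the first call returns `False` and the second returns `True`.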

As well as defined alerts that will proactively contact administrators when possible issues are detected, visualization dashboards and reporting can also help responsible teams see potential problems or irregular readings as part of their daily checks.

Monitoring for performance

The same CPU, RAM, and input/output (I/O) thresholds used for early warning signs of errors also help identify performance issues. By monitoring response times and resource usage over time, you can understand usage patterns and predict when more power is required.

Performance statistics can be used either to set scaling events manually through schedules or to define automated scaling rules more accurately.

Keeping track of scaling events throughout the life cycle of an application is useful. If an application is continually scaling up and down or not scaling at all, it could indicate that thresholds are set incorrectly.
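A review of scaling history along these lines could be sketched as follows. The event format (a timestamp paired with `"up"` or `"down"`), the 15-minute window, and the reversal limit are all hypothetical values for illustration; real scaling history would come from your platform's activity logs.

```python
from datetime import datetime, timedelta


def count_reversals(events, window):
    """Count consecutive event pairs that change direction within `window`.
    `events` is a time-ordered list of (datetime, "up" | "down") tuples."""
    reversals = 0
    for (prev_time, prev_dir), (time, direction) in zip(events, events[1:]):
        if direction != prev_dir and time - prev_time <= window:
            reversals += 1
    return reversals


def diagnose(events, window=timedelta(minutes=15), max_reversals=3):
    """Classify a scaling history as 'never scaled', 'flapping', or 'ok'."""
    if not events:
        return "never scaled"
    if count_reversals(events, window) > max_reversals:
        return "flapping"
    return "ok"


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    # Five events, five minutes apart, alternating direction.
    history = [
        (start + timedelta(minutes=5 * i), "up" if i % 2 == 0 else "down")
        for i in range(5)
    ]
    print(diagnose(history))
    print(diagnose([]))
```

Either of the first two results suggests that the scaling thresholds deserve a second look: too close together in the flapping case, or too far from real load when nothing ever scales.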

Again, creating and updating baseline metrics will help alert you to potential issues. If resource usage for a particular service is steadily increasing over time, this information can be used to predict future bottlenecks.
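As a simple illustration of predicting a bottleneck from a steady increase, the sketch below fits a straight line to evenly spaced usage samples using ordinary least squares and estimates how many more sampling periods remain before a limit is reached. It assumes the growth is roughly linear, which real workloads may not be, and the function names and sample figures are hypothetical.

```python
def fit_line(values):
    """Least-squares fit of y = slope * x + intercept, where x is the
    sample index (0, 1, 2, ...). Requires at least two samples."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    return slope, y_mean - slope * x_mean


def periods_until(values, limit):
    """Estimate how many sampling periods after the last sample the fitted
    trend reaches `limit`. Returns None when usage is flat or falling."""
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


if __name__ == "__main__":
    # Hypothetical daily disk usage as a percentage of capacity.
    usage = [50, 52, 54, 56, 58]
    print(periods_until(usage, limit=80))
```

For the sample data, usage grows by two points per day, so the sketch reports 11.0 days until the 80 percent limit.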