Using architectural best practices – Principles of Modern Architecture

Through years of research and experience, vendors such as Microsoft have collected a set of best practices that, when followed, provide a solid framework for good architecture.

With the business requirements in mind, we can perform a Failure Mode Analysis (FMA). An FMA is a process for identifying the common types of failure and where they might appear in our application.
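
An FMA can be captured as a simple table of components, their failure modes, and the planned mitigations. The following Python sketch is purely illustrative; the components and failures listed are hypothetical examples, not a prescribed catalogue.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    component: str    # where the failure can occur
    failure: str      # what can go wrong
    impact: str       # effect on the application
    mitigation: str   # planned response or design change

# Hypothetical entries for a simple web application
fma = [
    FailureMode("Web tier", "Instance becomes unresponsive",
                "Requests time out", "Run multiple instances behind a load balancer"),
    FailureMode("Database", "Primary node fails",
                "Writes are rejected", "Replicate data and fail over to a secondary"),
    FailureMode("Payment API", "External dependency is unreachable",
                "Orders cannot complete", "Queue requests and retry with backoff"),
]

for entry in fma:
    print(f"{entry.component}: {entry.failure} -> {entry.mitigation}")
```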

From the FMA, we can then create a redundancy and scalability plan. Designing with scalability in mind helps build a solution that is both resilient and performant, as the technologies that allow us to scale also protect us from failure.

A load balancer is a powerful tool for achieving both scale and resilience. It allows us to run multiple copies of a service and distribute the load between them, with unhealthy nodes automatically removed from rotation.
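
To make the idea concrete, here is a minimal sketch of round-robin distribution with health checking; the node names are hypothetical and the is_healthy check stands in for a real probe. In practice, a hardware, software, or cloud load balancer handles all of this for you.

```python
import itertools

class LoadBalancer:
    """Toy round-robin balancer that skips unhealthy nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._cycle = itertools.cycle(self.nodes)

    def is_healthy(self, node):
        # Stand-in for a real health probe (for example, an HTTP GET to /health)
        return node.get("healthy", True)

    def next_node(self):
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if self.is_healthy(node):
                return node
        raise RuntimeError("No healthy nodes available")

# Hypothetical service copies; web-2 is unhealthy and will be skipped
nodes = [{"name": "web-1", "healthy": True},
         {"name": "web-2", "healthy": False},
         {"name": "web-3", "healthy": True}]
lb = LoadBalancer(nodes)
for _ in range(4):
    print(lb.next_node()["name"])   # web-1, web-3, web-1, web-3
```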

Consider the cost implications of any choices. As mentioned previously, we need to balance the cost of downtime against the cost of providing protection. This, in turn, may influence the decision between Infrastructure-as-a-Service (IaaS) components such as VMs and Platform-as-a-Service (PaaS) technologies such as web apps, functions, and containers. Using VMs means we must build out server farms manually, which are challenging to scale, and explicitly include components such as load balancers. Opting for managed services such as Azure Web Apps or Azure Functions can be cheaper and far more dynamic, with load balancing and auto-scaling built in.

Data needs to be managed effectively, and there are multiple options for providing resilience and backup. Replication strategies involving geographically dispersed copies provide the best Recovery Point Objective (RPO) because the data is always consistent, but this comes at a financial cost.

For less critical data, or information that does not change often, cheaper daily backup tools may suffice, but these require manual intervention in the event of a failure.

A well-defined set of requirements and adherence to best practices will help design a robust solution, but regular testing should also be performed to ensure the correct choices have been made.

Testing and disaster recovery plans

A good architecture defines a blueprint for your solution, but it is only theory until it is built; therefore, solutions need to be tested to validate our design choices.

Work through the identified areas of concern and then forcefully attempt to break them. Document and run through simulations that trigger the danger points we are trying to protect against.
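
One simple way to run such a simulation is fault injection: wrap a dependency call so that it can be made to fail on demand, then confirm the application degrades the way the design expects. A minimal sketch, assuming a hypothetical place_order function:

```python
import random

def flaky(failure_rate):
    """Decorator that randomly raises to simulate a failing dependency."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError(f"Injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return inner
    return wrap

@flaky(failure_rate=0.3)          # fail roughly 3 in 10 calls during the test
def place_order(order_id):
    return f"order {order_id} accepted"

# Exercise the danger point and record how the system behaves
for i in range(10):
    try:
        print(place_order(i))
    except ConnectionError as exc:
        print(f"failure observed: {exc}")   # did retries, fallbacks, or alerts fire?
```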

Perform failover and failback tests to ensure that the application behaves as it should, and that data loss is within allowable tolerances.

Build test probes and monitoring systems to continually check for possible issues and to alert you to failed components so that these can be further investigated.
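
A basic probe can be as simple as calling a health endpoint on a schedule and alerting after repeated failures. The sketch below uses only Python's standard library; the URL, interval, and send_alert stub are placeholders for your own monitoring tooling.

```python
import time
import urllib.request

def probe(url, timeout=5):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def send_alert(message):
    # Placeholder: integrate with email, paging, or your monitoring platform
    print(f"ALERT: {message}")

def monitor(url, interval=60, failure_threshold=3):
    failures = 0
    while True:
        if probe(url):
            failures = 0
        else:
            failures += 1
            if failures >= failure_threshold:
                send_alert(f"{url} failed {failures} consecutive checks")
        time.sleep(interval)

# monitor("https://example.com/health")   # hypothetical health endpoint
```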

Always prepare for the worst—create a disaster recovery plan to detail how you would recover from complete system failure or loss, and then regularly run through that plan to ensure its integrity.

We have seen how a well-architected solution, combined with robust testing and detailed recovery plans, will prepare you for the worst outcomes. Next, we will look at a closely related aspect of design—performance.

Architecting for resilience and business continuity – Principles of Modern Architecture

Keeping your applications running can be important for different reasons. Depending on your solution’s nature, downtime can range from a loss of productivity to direct financial loss. Building systems that can withstand some form of failure has always been a critical aspect of architecture, and with the cloud, there are more options available to us.

Building resilient solutions comes at a cost; therefore, you need to balance the cost of an outage against the cost of preventing it.

High Availability (HA) is the traditional option and essentially involves doubling up on components so that if one fails, the other automatically takes over. An example might be a database server: building two or more nodes in a cluster, with data replication between them, protects against one of those servers failing, as traffic is redirected to the secondary replica, as per the example in the following diagram:

Figure 2.2 – Highly available database servers
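
The failover behaviour shown in the diagram can be sketched as follows. The node names are illustrative, and in practice the clustering or replication technology performs this redirection automatically.

```python
class DatabaseCluster:
    """Toy primary/secondary pair: route to the primary, fail over when it is down."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def active_node(self):
        if self.primary["healthy"]:
            return self.primary
        # Primary is down: redirect traffic to the replicated secondary
        return self.secondary

    def query(self, sql):
        node = self.active_node()
        return f"running '{sql}' on {node['name']}"

cluster = DatabaseCluster({"name": "db-primary", "healthy": True},
                          {"name": "db-secondary", "healthy": True})
print(cluster.query("SELECT 1"))        # served by db-primary
cluster.primary["healthy"] = False      # simulate a failure
print(cluster.query("SELECT 1"))        # traffic redirected to db-secondary
```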

However, multiple servers are always powered on, which in turn means increased cost. Quite often, the additional hardware is not used except in the event of a failure.

For some applications, this additional cost is less than the cost of a potential failure; however, for less critical systems, it may be more cost-effective to accept a short period of unavailability. In such cases, our design must focus on reducing how long recovery takes.

The purpose of HA is to increase the Mean Time Between Failures (MTBF). In contrast, the alternative is to reduce the Mean Time To Recovery (MTTR); in other words, rather than concentrating on preventing outages, spend resources on reducing their impact and speeding up recovery. Ultimately, it is the business that must decide which of these is more important, and therefore the first step is to define their requirements.
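
The two metrics relate directly to availability, which is commonly estimated as MTBF / (MTBF + MTTR): you can improve it either by failing less often or by recovering faster. A quick worked example with assumed figures:

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# One failure every 30 days (720 h) with a 4-hour recovery
print(f"{availability(720, 4):.4%}")    # ~99.45%

# Same failure rate, but recovery cut to 30 minutes
print(f"{availability(720, 0.5):.4%}")  # ~99.93%
```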

Defining requirements

When working with a business to understand their needs for a particular solution, you need to consider many aspects of how this might impact your design.

Identifying individual workloads is the first step—what are the individual tasks that are performed, and where do they happen? How does data flow around your system?

For each of these components, consider what failure would mean: would it cause the system as a whole to fail, or merely disrupt a non-essential task? For example, calculating costs during a transactional process is critical, whereas sending a confirmation email could withstand a delay, or even complete failure, in some cases.

Understand the usage patterns. For example, a global e-commerce site will be used 24/7, whereas a tax calculation service would be used most at particular times of the year or at the month-end.

The business will need to advise on two important metrics: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). The RTO dictates the acceptable amount of time a system can be offline, whereas the RPO determines the acceptable amount of data loss. For example, a daily backup might mean you lose up to a day's worth of data; if this is not acceptable, more frequent backups are required.
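
As a quick illustration of that last point, the worst-case data loss is roughly the interval between backups (or replication points), so the RPO dictates how frequently the data must be protected. The scenarios and figures below are assumptions for the sake of the example.

```python
def max_backup_interval(rpo_hours):
    """The backup (or replication) interval must not exceed the RPO."""
    return rpo_hours

scenarios = {"daily backup": 24, "hourly backup": 1, "15-minute log shipping": 0.25}
rpo_hours = 1  # the business accepts at most one hour of lost data
for name, interval in scenarios.items():
    ok = interval <= max_backup_interval(rpo_hours)
    print(f"{name}: worst-case loss {interval}h -> {'meets' if ok else 'violates'} a {rpo_hours}h RPO")
```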

Non-functional requirements such as these will help define our solution's design, which we can then build using industry best practices.