AWS Architecture: High Availability vs Fault Tolerance
You've likely heard that "time is money." For every minute a product can't be created, advertised, or sold, the bottom line suffers. This idea matters more than ever in IT: every minute a system is down costs revenue and productivity. In fact, one widely cited survey estimates that for 98% of organizations, a single hour of downtime costs over $100,000. That's a lot of dough.
With AWS, teams can build highly resilient web applications with next to no downtime. The key is to design systems that are at least highly available, and ideally fault tolerant. Both are disaster recovery strategies.
It takes time and resources to ensure your systems are sufficiently resilient. However, that pales in comparison to the potential losses incurred by downtime. That is why it is critical to ensure your system is either highly available or fault tolerant.
Let's explore how these concepts work and how they are different.
What is High Availability?
High availability means a system will almost always maintain uptime, albeit sometimes in a degraded state. In AWS terms, a system is generally considered highly available at 99.999% uptime, also known as "five nines." To put that in perspective, the system would be down for a mere five minutes and fifteen seconds a year. And yes, that is possible, and even routine, for AWS.
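Availability percentages map directly to a concrete downtime budget. A quick sketch in plain Python (no AWS dependencies, and using a simple 365-day year) shows where the "five nines" figure comes from:

```python
def downtime_per_year(availability: float) -> float:
    """Return the allowed downtime in seconds per 365-day year."""
    seconds_per_year = 365 * 24 * 60 * 60
    return seconds_per_year * (1 - availability)

# "Five nines" allows roughly five minutes and fifteen seconds per year.
secs = downtime_per_year(0.99999)
minutes, seconds = divmod(round(secs), 60)
print(f"{minutes}m {seconds}s")  # 5m 15s
```

Run the same function with 0.999 ("three nines") and the budget balloons to almost nine hours a year, which makes clear why each extra nine gets harder and more expensive to achieve.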
High availability is achieved by removing single points of failure through redundancy. For instance, if you had five computers connected to one server, that server would be a single point of failure. If the server room floods and the server is destroyed, you're out of luck.
To mitigate this risk, a backup server can be kept ready and switched on in an emergency. In short, high availability removes single points of failure by adding redundancy.
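The failover idea can be made concrete with a minimal sketch in plain Python. The server names and the health check are hypothetical; in a real deployment a load balancer or DNS health checks would do this routing for you:

```python
def first_healthy(servers, is_healthy):
    """Return the first server that passes its health check.

    Servers are tried in priority order: primary first, then backups.
    """
    for server in servers:
        if is_healthy(server):
            return server
    raise RuntimeError("All servers are down: no redundancy left")

# Hypothetical fleet: the primary has failed, so traffic shifts to a backup.
fleet = ["primary.example.com", "backup-1.example.com", "backup-2.example.com"]
healthy = lambda s: s != "primary.example.com"  # simulate a primary outage
print(first_healthy(fleet, healthy))  # backup-1.example.com
```

The point of the sketch is the ordering: users never notice which box answered, only that one did. Redundancy only helps if something is actively checking health and redirecting traffic.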
In pre-AWS days, this was expensive to build and maintain. It typically meant configuring complex RAID arrays to keep storage redundant. On top of that, hardware had to be housed in temperature-controlled, bomb-shelter-like facilities that are expensive to run.
The abundance of cloud services has made high availability an affordable and realistic option for just about any organization. Let's take a deep dive into how we can use AWS to eliminate single points of failure and maximize our system's redundancy.
Example: A Mobile Banking App
Let's say you are in charge of creating a mobile banking app. This app enables users to view their account balance, make transfers, and withdraw funds. Optimally, the user should be able to do all three of these operations all the time. But in the real world, something will fail — and that is where high availability architecture comes in.
In a highly available system, servers for our banking app will span multiple availability zones. Within these availability zones are multiple servers that are configured in the same manner. (Servers on AWS are generally referred to as EC2 Instances.) That way, if one availability zone goes down, the mobile app will point to the other EC2 instances on the backup availability zone. This eliminates any single point of failure.
What about the user data? That needs to be backed up, too. In a redundant system, the primary database is replicated to a read-only database in a separate availability zone. If the primary availability zone goes down, users will at least be able to view their account balances. However, they will be unable to withdraw funds or make transfers, because those are write operations, and the backup database is only set up for read operations.
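This read/write split can be sketched with a toy datastore. Everything here is illustrative: the dictionaries stand in for a real primary database and its read replica, and the failover flag stands in for what would really be health checks and connection-string switching:

```python
class BankingDatastore:
    """Toy model: reads fall back to a replica when the primary is down."""

    def __init__(self):
        self.primary_up = True
        self.primary = {"alice": 100}      # stand-in for the primary DB
        self.replica = dict(self.primary)  # read-only copy in another AZ

    def read_balance(self, user):
        source = self.primary if self.primary_up else self.replica
        return source[user]

    def transfer(self, user, amount):
        if not self.primary_up:
            raise RuntimeError("Write operations unavailable: primary is down")
        self.primary[user] -= amount
        self.replica = dict(self.primary)  # simulate replication to the replica

db = BankingDatastore()
db.primary_up = False            # simulate an availability zone outage
print(db.read_balance("alice"))  # 100 -- reads still work in degraded mode
```

Notice the asymmetry: during the outage, `read_balance` quietly succeeds while `transfer` raises. That is exactly the "degraded but up" behavior high availability promises.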
Remember, high availability guarantees essential services, not full functionality. Telling people they might be unable to make a transfer or withdrawal once or twice a year isn't too bad, though. Additionally, some of their banking data might not be fully up to date if the backup database only replicates once a day.
But what if a higher level of recovery is required? This is where fault tolerance comes into play.
What is Fault Tolerance?
Think of fault tolerance as high availability's older brother. Fault tolerance means that a system will almost always maintain uptime — and users will not notice any differences during a primary system outage.
If high availability was expensive in pre-AWS days, fault tolerance was exceedingly expensive. At a bare minimum, multiple servers would have to be load-balanced, databases would have to be replicated, and availability would need to span multiple regions.
All of this would need to be maintained by the company itself. Not only would the company have to foot the bill for all the hardware and expertise to run it, but they would have to follow esoteric IT security standards.
Most non-IT companies simply aren't designed to handle this level of complexity. AWS, however, takes care of all of these infrastructure requirements, allowing for a fault tolerant system without having to worry about the hardware itself. Let's explore a different example and see how it can be architected for fault tolerance.
Example: An E-Commerce App Combines Fault Tolerance and High Availability
Let's say we have an e-commerce application built using a microservices architecture, where each service is deployed in a Docker container. To ensure high availability, we'll use Amazon Elastic Container Service (ECS) with the Fargate launch type. This allows us to run our containerized applications without having to manage the underlying infrastructure.
For fault tolerance, we'll use a combination of Amazon Aurora Global Database and AWS Lambda. Aurora Global Database provides us with a highly available and fault-tolerant database by replicating our data across multiple AWS regions. In case of a regional outage, our application can continue to operate using the data from another region.
To handle any potential failures in our application, we'll use AWS Lambda and Amazon API Gateway. Lambda functions can be triggered by API Gateway, allowing us to create a serverless architecture that automatically scales based on demand. This helps us avoid single points of failure and ensures our application remains highly available.
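Here is a minimal sketch of what such a Lambda handler might look like. The routes and response bodies are invented for illustration; the general shape (an `event` dict in, a `statusCode`/`body` dict out) follows the API Gateway proxy integration convention, and locally we can exercise the handler by calling it directly:

```python
import json

def handler(event, context):
    """Hypothetical Lambda handler behind an API Gateway proxy integration."""
    path = event.get("path", "/")
    if path == "/health":
        body = {"status": "ok"}
    elif path == "/products":
        body = {"products": ["widget", "gadget"]}  # placeholder catalog
    else:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps(body)}

# In production, API Gateway invokes handler(event, context) for us.
resp = handler({"path": "/health"}, None)
print(resp["statusCode"])  # 200
```

Because the function itself holds no server state, AWS can run as many copies as demand requires, which is what removes the single point of failure.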
Now, let's add some recent AWS innovations to our example. We can use AWS Fault Injection Simulator (FIS) to test our application's resilience to various failure scenarios. FIS allows us to simulate various types of failures, such as network latency or instance termination, and measure our application's response. This helps us identify any potential weaknesses in our architecture and make improvements to increase our fault tolerance.
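FIS itself orchestrates experiments against real AWS resources, but the underlying chaos-engineering idea can be illustrated locally with a toy fault injector. The service call and failure rate below are made up, and the random generator is seeded so the experiment is repeatable:

```python
import random

def with_fault_injection(func, failure_rate, rng):
    """Wrap func so it fails with the given probability (chaos-style)."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapped

def checkout():  # stand-in for a real downstream service call
    return "order placed"

rng = random.Random(42)  # seeded so the experiment is repeatable
flaky = with_fault_injection(checkout, failure_rate=0.3, rng=rng)

successes = 0
for _ in range(100):
    try:
        flaky()
        successes += 1
    except ConnectionError:
        pass
print(f"{successes}/100 calls survived")
```

A resilient caller would wrap `flaky()` with retries or a fallback; measuring how the success rate changes with and without those defenses is exactly the kind of question a FIS experiment answers at infrastructure scale.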
Additionally, we can use Amazon CloudWatch Container Insights to monitor our containerized applications. Container Insights provides us with detailed metrics and logs, allowing us to quickly identify and resolve any performance issues or bottlenecks.
Lastly, Amazon Q, AWS's generative AI assistant, can round out the picture. Amazon Q can power an AI assistant that answers your customers' questions about the app, and you can also query it yourself for suggestions on improving your application's resilience in terms of fault tolerance and high availability.
In conclusion, our e-commerce application leverages High Availability and Fault Tolerance through a combination of Amazon ECS, Aurora Global Database, Lambda, API Gateway, and recent innovations like AWS Fault Injection Simulator and CloudWatch Container Insights. This ensures our application remains available and responsive even in the face of failures or disruptions.
High Availability vs. Fault Tolerance
Whether to settle for high availability or invest in fault tolerance depends on your budget and, consequently, the importance of the system. If you are running an e-commerce website with millions of hits a day, a fault-tolerant system is your best bet.
According to a Google study, mobile sites that load within 5 seconds see sessions about 70% longer than slower ones. With that in mind, high availability might not cut it, and you will need to architect a fault-tolerant website. If the system is running in a degraded state, there is a good chance you'll lose customers either way. So you may as well increase the budget to accommodate fault tolerance.
On the other hand, let's say you are designing a website for your employees that is only accessible from the company's intranet. On this site, employees need to look up data 80% of the time and write to a database 20% of the time. In a situation like this, high availability would be perfectly acceptable.
If the database server goes down, the system would fail over to a read-only database. Employees could then conduct 80% of their business unhindered until the primary system comes back up. Amazon SQS (Simple Queue Service) would be useful in this situation as well: employees could still submit writes, but the requests would simply be queued until the primary database is back up and running.
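The queued-write pattern can be sketched with a plain in-memory queue standing in for Amazon SQS. In production, the queue would be an actual SQS queue and a worker process would drain it; here a single class simulates both sides:

```python
from collections import deque

class QueuedWriter:
    """Toy model of the SQS pattern: writes queue up while the primary
    database is down, then drain in order once it recovers."""

    def __init__(self):
        self.primary_up = True
        self.database = []    # stand-in for the primary database
        self.queue = deque()  # stand-in for an SQS queue

    def write(self, record):
        if self.primary_up:
            self.database.append(record)
        else:
            self.queue.append(record)  # accepted, but deferred

    def recover(self):
        self.primary_up = True
        while self.queue:  # drain queued writes in arrival order
            self.database.append(self.queue.popleft())

w = QueuedWriter()
w.primary_up = False
w.write("timesheet update")  # queued, not lost
w.recover()
print(w.database)            # ['timesheet update']
```

From the employee's point of view, the write "just works"; the system merely delays durability instead of refusing the request, which is often all the resilience an internal tool needs.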
We need our employees to maintain productivity, so some amount of disaster recovery is vital. However, it may not be worth the time or money to maintain a level of fault tolerance.
Final Thoughts
To quote Werner Vogels, the CTO of Amazon, "Failures are a given, and everything will eventually fail over time." This quote is brilliant in its simplicity. A good solutions architect embraces the inevitability of failure. Disaster recovery should not be treated as a contingency plan but as a matter of course.
Want to learn more about AWS? Consider our {Applicable Training} Training!
AWS Cloud Practitioner (a must-have for product owners, scrum masters, and managers)
AWS Certified Developer (a must-have for software developers)
AWS Solutions Architect (a must-have for architects and senior software developers)
Not a CBT Nuggets subscriber? Sign up today.