This is a continuation of the previous blog titled “Introduction to Microservices Architecture”. If you stumbled on this page first, feel free to read the earlier blogs on microservices.
Resilience patterns help us design microservices that degrade gracefully to a lower-quality service instead of failing outright. Plan for auto scaling and highly available deployment of the services from day one. Resilience should be built at multiple levels: the infrastructure layer, the network layer, and the application layer. Management and teams also need to adapt their people and culture to resilient architectures.
The success of a microservices architecture lies in embracing failure and planning for it.
There are many patterns that help in designing a resilient microservices architecture. Let’s discuss a few of them.
Redundancy and high availability
Availability of a business application is critical for the business. Applications built on microservices architecture are generally large and need 24x7 availability. To achieve this, redundant deployment of services is a must. Redundancy does increase duplication of infrastructure that is not always fully utilized, but it is what makes the application highly available.
Each microservice should sit behind a load balancer that routes traffic to its different instances. The choice of load-balancing algorithm varies based on the service's needs and the sophistication of the load balancer; the most common algorithm is round robin. The load balancer performs a heartbeat (health) check against the instances and ensures they are working before routing traffic to them.
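To make the idea concrete, here is a minimal sketch of round-robin selection combined with a health check. The names `RoundRobinBalancer`, `instances`, and `health_check` are hypothetical, chosen just for illustration; real load balancers (ALB, NGINX, Envoy, etc.) implement this far more robustly.

```python
import itertools

class RoundRobinBalancer:
    """Round-robin instance selection that skips instances failing a health check."""

    def __init__(self, instances, health_check):
        self.instances = instances
        self.health_check = health_check  # callable: instance -> bool
        self._cycle = itertools.cycle(instances)

    def next_instance(self):
        # Try each instance at most once per call; skip unhealthy ones.
        for _ in range(len(self.instances)):
            candidate = next(self._cycle)
            if self.health_check(candidate):
                return candidate
        raise RuntimeError("no healthy instances available")
```

With instances `["a", "b", "c"]` and `"b"` marked unhealthy, successive calls cycle through `"a"` and `"c"` only.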
Business users access applications from across the world, so the deployment strategy is critical. Servers should be spread across the availability zones within a region, and possibly across regions. Load balancing across zones does not add noticeable network lag. Cross-region deployment architectures, however, are complex and add infrastructure cost, so careful thought needs to go into choosing one. Consider other options for improving application performance before going for cross-region deployment.
If you are hosting on AWS, services like API Gateway, Aurora RDS, Kinesis, and Lambda support high availability by default, and choosing them helps reduce complexity. Similar services are available in Azure, and GCP is catching up fast.
Traffic on every application varies, and there are situations where it peaks. Enabling auto scaling on the infrastructure helps the application scale up to meet high-traffic scenarios; a popular example is the traffic e-commerce firms experience during Black Friday sales. Auto scaling also cuts cost by scaling the infrastructure down to minimum levels during low-traffic periods.
The important prerequisites for auto scaling to work are immutable infrastructure and infrastructure as code. Instances provisioned on the cloud should always be new; never apply updates to existing instances. Packaging services as Docker containers makes scaling up a matter of adding new containers, and the containers are simply destroyed when scaling down.
Infrastructure as code reduces human dependency in provisioning infrastructure. Manual intervention is prone to mistakes and requires continuous training of infrastructure engineers. The infrastructure-as-code approach reduces these issues, brings in more automation, and removes the need for a huge infrastructure team monitoring production systems.
Patterns for Resilience
Failures in the system need quick detection and graceful degradation of the service. Architect your systems for failure, since the chance of failure in production is 100%. There are many patterns that help in handling failures.
How do we maintain a transaction across services? We don't, at least not as a single distributed transaction. There are patterns that help achieve transactional behavior across services.
Data consistency across microservices
The Saga pattern helps maintain data consistency across different microservices: a transaction spanning services is implemented as a saga, a sequence of local service calls. When the first service call succeeds, the next call in the saga is triggered. A failure in any service triggers compensating transactions that undo the changes already made.
One common failure scenario is timeout errors. While a service waits on resources from an external provider, there are chances of network errors or service-availability errors, and the service should be intelligent enough to retry the request. When a microservice interacts with an external system (legacy systems mostly), load spikes or network issues can cause the request to fail. Build in the "Retry pattern" to retry a certain number of times before reporting a failure back.
However, there is a potential problem if we retry all types of failures. Some failures, such as an authorization failure, can never succeed on retry, so the retry logic should be smart enough to retry only on network or connection issues.
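A minimal sketch of such a "smart" retry: only errors classified as transient are retried, with exponential backoff, while permanent errors (like an authorization failure) propagate immediately. The function name `call_with_retry` and the error classification are assumptions for illustration; production code would typically use a library such as a resilience framework.

```python
import time

# Assumed classification: only these error types are worth retrying.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)

def call_with_retry(operation, max_attempts=3, base_delay=0.1):
    """Retry `operation` on transient errors with exponential backoff.
    Permanent errors (e.g. authorization failures) are raised immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise  # retries exhausted: report the failure back
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A call that fails twice with a `ConnectionError` and then succeeds returns normally on the third attempt; a `PermissionError` is never retried.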
Another important consideration is that operations should be idempotent. A service call may throw an exception even though it completed the business transaction internally; since the caller saw an error, it may trigger a retry, and the same request gets processed again. These scenarios lead to data inconsistency and unique-constraint violations in the database. If the operation is idempotent, this situation does not occur. One option is to attach a unique identifier to every request and reject those requests which were already processed.
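The unique-identifier option can be sketched as below. The class name `IdempotentProcessor` is hypothetical, and the in-memory dictionary stands in for a durable store (a database table with a unique constraint, for example) that a real service would use.

```python
class IdempotentProcessor:
    """Process each request id at most once; replays return the cached result."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = {}  # request_id -> result of the first processing

    def process(self, request_id, payload):
        if request_id in self.seen:
            # Duplicate (e.g. a retry): return the earlier result
            # instead of running the business transaction again.
            return self.seen[request_id]
        result = self.handler(payload)
        self.seen[request_id] = result
        return result
```

A retried request with the same identifier gets the original response, and the underlying handler runs only once.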
Another common scenario is a failure that cascades into further failures. The Circuit Breaker pattern, applied to potentially failing method calls, helps avoid cascading failures. During synchronous service calls, there is a possibility the requested service shows high latency and becomes unresponsive. With a circuit breaker, the circuit is made "open" after a certain threshold of requests fail, and the response falls back to a default (dummy) response or a value fetched from cache. After the prescribed delay, the circuit breaker starts sending some requests to see if the service has started functioning again. If those requests succeed, calls to the external system gradually return to normal.
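A bare-bones sketch of that behavior, with hypothetical names (`CircuitBreaker`, `failure_threshold`, `reset_timeout`); production systems would use a mature library rather than hand-rolling this, and would add half-open probe counting, metrics, and thread safety.

```python
import time

class CircuitBreaker:
    """Open the circuit after `failure_threshold` consecutive failures,
    serve the fallback while open, and half-open after `reset_timeout`
    seconds to probe whether the downstream service has recovered."""

    def __init__(self, operation, fallback, failure_threshold=3, reset_timeout=30.0):
        self.operation = operation
        self.fallback = fallback
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return self.fallback()  # open: short-circuit, don't hit the service
            self.opened_at = None       # half-open: allow a probe through
        try:
            result = self.operation()
            self.failures = 0           # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return self.fallback()
```

Once the threshold is reached, further calls return the fallback without touching the unresponsive service, which is exactly what stops the cascade.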
Services should be designed to keep functioning, with reduced features, when some of their dependencies are down. A fallback mechanism needs to be in place for service failures, and a cache is a very good fallback source for the majority of scenarios: by serving default data from the cache, we can keep the system functioning with reduced features.
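The cache-as-fallback idea fits in a few lines. The helper name `fetch_with_cache_fallback` is illustrative; in practice the cache would be something like Redis with a TTL, not a plain dictionary.

```python
def fetch_with_cache_fallback(fetch, cache, key):
    """Try the live service first; on failure, serve the last known
    value from cache so the feature degrades instead of failing."""
    try:
        value = fetch(key)
        cache[key] = value  # refresh the cached copy on every success
        return value
    except Exception:
        if key in cache:
            return cache[key]  # stale but usable data
        raise  # nothing cached: the failure has to surface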
Ownership for writing or modifying business data should always lie with one microservice. However, to improve the performance of the overall system, other microservices can cache the data for reads. Cache the metadata commonly used by the service; for example, it is a good approach to cache the authorization data of a logged-in user. Effective cache management helps in building resilient and better-performing services.
Large systems are prone to malicious attacks. If an attacker exploits a weakness in the system and creates a huge load on a service, it is detrimental to the complete system. Another type is the distributed denial-of-service (DDoS) attack, designed to overrun an organization's defenses through a sheer volume of requests. Effectively implementing API rate limiting and similar patterns at the API Gateway helps handle these scenarios. Security audits should be in place to review public APIs at frequent intervals.
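One common way to implement rate limiting is a token bucket, sketched below. The name `TokenBucket` and its parameters are illustrative assumptions; managed gateways (e.g. AWS API Gateway throttling) provide this out of the box.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: each request consumes one token; tokens
    refill at `rate` per second up to `capacity`. When the bucket is
    empty, the request is rejected (e.g. with HTTP 429)."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill according to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The burst size is bounded by `capacity`, while `rate` caps the sustained throughput, which is why this shape is popular for API gateways.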
This concludes the discussion on some of the patterns we can use for building resilient microservices. In this blog I leaned toward hosting on the cloud, and AWS in particular; that only reflects my personal experience implementing microservices, but these patterns can be applied to any cloud provider or even to hosting in your own data centers.
I am eager to learn from your experience and your opinion on this blog. Please drop a comment if you would like to discuss more.