Sunday, 27 October 2019

Bringing the Chaos Monkey to heel

Introduction


Systems, microservices, services and applications can go wrong: error messages in the middle of the night, important messages getting stuck in the system, memory and disk space exhaustion, network glitches and any number of other unknown conditions might befall them.

Gaining confidence that these systems can continue working, despite issues, is a central consideration in any architecture. The ability of a system to service consumers despite issues in one or several locations is known as its availability, whilst the ability to restore working order after issues is the system's resilience. Through availability and resilience, a system can be created that is capable of dealing with the issues that befall it.

Confidence is gained by performing stringent tests on a system's hardware, software and connectivity points. Sometimes these tests are spread across multiple environments with specific purposes, e.g. integration test or user test. In other approaches, testing is performed on the live system itself to gain confidence that the system can handle, or recover from, these adverse scenarios.

Controlled destruction of parts of a live system, to prove its availability and resilience, is known as chaos engineering. It is best known in the cloud and container world through the Netflix model of the Chaos Monkey.

In this article we will look at what Chaos Monkey is, how it benefits systems, when it is appropriate to use it, what to consider when using it and what to consider when gaining confidence in systems generally.

What is Chaos Monkey?


In 2010, Netflix published an article about its transition to the cloud. One of the lessons learned was to fail constantly, in a controlled manner, in order to ensure that the system would continue working in the event of a failure. Thus was born the application ‘Chaos Monkey’, whose job was to randomly kill instances and services.

A further article in 2011 described Chaos Monkey in greater detail. An actively chaotic application would randomly disable, kill and remove instances to prove that the system could handle the failure and gracefully return itself to a healthy state.

The chaos was controlled by ensuring it only happened during the working day, with a team of specialists available to handle any fallout and deliver any fixes that might be required. This builds an understanding of how the system behaves in events that are not ‘expected’, and gives the teams handling the various systems a wealth of knowledge of both cause and effect.
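The working-day restriction described above can be sketched in a few lines of Python. This is a minimal illustration, not Netflix's implementation: the 09:00 to 17:00 weekday window and the names `chaos_allowed` and `pick_victim` are assumptions made for this example.

```python
import random
from datetime import datetime, time
from typing import List, Optional

# Assumed working window; the articles only say "during the working day".
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def chaos_allowed(now: datetime) -> bool:
    """Return True only on weekdays inside the working window."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and WORK_START <= now.time() < WORK_END


def pick_victim(instances: List[str], now: datetime,
                rng: random.Random) -> Optional[str]:
    """Choose one instance to terminate, or None if chaos is not permitted."""
    if not instances or not chaos_allowed(now):
        return None
    return rng.choice(instances)
```

Passing the clock and the random source in as arguments keeps the sketch testable: the same guard can be exercised for a Sunday or a late evening without waiting for one.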

The article went on to describe a whole “Simian Army” of chaos which would independently test: 
·      Latency
·      Cross Regional Dependencies 
·      Clean-up of old systems
·      Best Practices Conformance
·      Health Checks

Chaotic State


What is not addressed in either of the articles is the ability to handle State, that is, the messages and configuration that need to be retained. State is important for a number of different reasons: these might be security, assurance of service delivery or even regulatory, e.g. for an audit.

There are three main types of State:
·      Long Term State - State that is held for a long period of time, usually in a database but occasionally in long term memory.
·      Persisted Transient State - State that tracks and records a process through each of its parts so that it can be resolved in the event of a failure, e.g. if a payment fails before completion.
·      Non-Persisted Transient State - data that is required but does not need to be tracked, e.g. a stock check.

The state might be a message, the configuration of an application or the configuration of a deployment management system itself. The Chaos Monkey articles do not mention how to handle these concerns, how the chaos is mitigated or the restrictions you may need to place on the ‘chaos’.

Consider a container management system such as Kubernetes. How would you handle the deletion of a Helm chart from the system? Would the Helm chart be checked and pulled from another system? What about the reference and connection to the repository that holds all the state of the system?

Consider a system that handles a payment agreement between two systems: a sales point and a bank. If the transaction is accepted by the bank just before the Chaos Monkey application kills the process, the sales point will not have confirmation of the payment, and thus the customer making a purchase at the sales point will be charged but will not get their goods.

This speaks more to managing state in cloud and container-based systems generally, but it also means that the Chaos Monkey application needs to be aware of state too, and that state needs to be considered during application design.
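The payment scenario above is a case of persisted transient state. As a rough sketch only, the Python below records each step of a transaction in a journal so that a restarted process can resolve what was left half done. The `Journal` class, the step names and the `kill_after_bank` switch are all hypothetical, and a dict stands in for durable storage.

```python
class ProcessKilled(Exception):
    """Stands in for the chaos tool killing the process mid-transaction."""


class Journal:
    """Hypothetical durable store; a dict stands in for a database here."""

    def __init__(self):
        self.entries = {}

    def record(self, tx_id, step):
        self.entries[tx_id] = step

    def step(self, tx_id):
        return self.entries.get(tx_id)


def take_payment(journal, tx_id, kill_after_bank=False):
    """Run the payment, journalling each step as it completes."""
    journal.record(tx_id, "started")
    # ... the bank accepts the payment at this point ...
    journal.record(tx_id, "bank_accepted")
    if kill_after_bank:
        raise ProcessKilled(tx_id)  # simulated kill before confirmation
    journal.record(tx_id, "confirmed_to_sales_point")


def recover(journal, tx_id):
    """After a restart, resolve a transaction from its last journalled step."""
    step = journal.step(tx_id)
    if step == "bank_accepted":
        # The bank took the money, so the sales point must still be told.
        journal.record(tx_id, "confirmed_to_sales_point")
    elif step == "started":
        # Unknown outcome: a real system would have to ask the bank.
        journal.record(tx_id, "needs_bank_query")
    return journal.step(tx_id)
```

A kill between the bank's acceptance and the journal write would still leave the outcome unknown, which is why the "started" branch defers to the bank rather than assuming the payment failed.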

Chaos in Production


Chaos Monkey works well in systems like Netflix primarily because the loss of service on a live system during the working day is manageable from both a support and a customer relations point of view. If a user cannot stream a video for a period of time, they might be annoyed by the inconvenience but little more.

In a system where a vital service is being provided, the loss of service even for seconds may have adverse effects. For Defence Agencies, Health Care, Bank Processing, Aviation and other high-risk systems, it would be difficult to justify the risk of a system being down, even for testing.

In high-risk systems the loss of service can have real, life-changing effects on consumers. If, for example, a train line system were to drop a red-light signal, the result could be injury or even death. To intentionally break the live system as a form of testing seems irresponsible when the costs of these systems failing are so high.

To alleviate this risk, some of these systems will perform testing and upgrading in disaster recovery systems which exactly resemble the system in Production. This is not always possible though, as some systems require the disaster recovery system to be aware of, and available to, Production at all times, which can lead to contention over which of the two is the active production system.

In the previous section, the discussion of state raised the issue of a transaction being killed mid-process. In test environments this would only affect the test accounts being used, but in a live production system, real customers will face the effects of state failure.

A final thought is that the purposeful destruction of something that is live and working seems to go against what many infrastructure and solution teams try to build: resilient systems that are architected to run and survive. The Netflix article uses the analogy of getting a flat tyre, with Chaos Monkey being the equivalent of giving yourself a flat tyre on the driveway on a Sunday morning. If Chaos Monkey were truly chaotic, the analogy would instead involve blowing the tyre out whilst driving along the motorway.

This seems dangerously unsafe for any system; indeed, puncturing a tyre on the driveway is a significantly less random form of chaos. It is more like a test environment, and even then it is a very drastic action to take. In the flat tyre analogy, the driver could gain the same confidence by testing each of the component parts independently: inflating the tyre, changing the tyre, ensuring the correct tools are to hand, and so on.

In all readings, Chaos Monkey comes across more as controlled production testing. Even without Chaos Monkey, systems should complete a disaster recovery exercise once a year as the most drastic of tests. As such, Chaos Monkey is not truly chaos; it is testing, and should be treated as such.

Adequate Planning & Testing


If Chaos Monkey is treated as testing, there are reasons why it should not be described as ‘chaos’. The application that kills services and instances can only do what it has been programmed to do, so its remit is known and could be planned for without the actual destruction.
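As an illustration of planning for a known remit without destroying anything, the Python sketch below walks a description of a system and reports which services would fall below their minimum if a single replica were killed. The function name, the replica counts and the minimums are invented for this example and do not come from any real tool.

```python
from typing import Dict, List


def single_failure_gaps(replicas: Dict[str, int],
                        minimum: Dict[str, int]) -> List[str]:
    """Return services that would drop below their minimum after one loss.

    A service with no stated minimum is assumed to need at least one replica.
    """
    return sorted(
        name
        for name, count in replicas.items()
        if count - 1 < minimum.get(name, 1)
    )


# Hypothetical system description.
replicas = {"api": 3, "db": 1, "cache": 2}
minimum = {"api": 2, "db": 1, "cache": 1}
gaps = single_failure_gaps(replicas, minimum)  # ["db"]
```

Here only the single database replica is flagged, which is a finding a team could act on before any live instance is touched.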

Whilst unknown consequences may arise from Chaos Monkey, it can, and in most cases should, be run early in the test cycle in lower environments. This also raises the importance of having a production-like environment before Production which is a replica in every way.

Since Chaos Monkey is testing, the typical positive and negative considerations come with it: cost, value and risk. Testing on this scale requires a willingness to test in Production from those who own and support it, and the budget to spend time finding issues rather than completing new functionality.

Real Chaos


In the real world, chaos is just reality in play. There is no way to plan for the unthinkable, unplannable and unconsidered. Problems will always occur, and by its nature chaos is chaotic: unmanageable, unplannable and uncontrollable in a way that can never be represented through testing.

If chaos cannot be controlled, then it should instead be prepared for. Chaos Monkey allows for this preparation by testing the fallbacks that systems have in place. There are several lines of questioning that can be applied:
·      High Availability – what happens if each component part breaks? Can it fail over? Can service be resumed quickly and easily?
·      Disaster Recovery – what happens if an entire active system fails? Can it fail over? How long does it take to return to service and to full availability (Recovery Time Objective), and how much recent data could be lost in the process (Recovery Point Objective)?
·      State – how is state handled in the event of each failure? Is the system able to recover a transaction mid-process?
·      Risk – what risk is acceptable? How is each risk mitigated?
·      Testing – is there a safe place to test all eventualities outside of Production? Can Production be replicated faithfully?
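The Disaster Recovery questions above can be made concrete by measuring a recovery drill against its objectives. The Python below is a minimal sketch under assumed names (`Objectives`, `evaluate_drill`) and invented timings. It treats the Recovery Time Objective as the maximum tolerated downtime and the Recovery Point Objective as the maximum tolerated window of lost data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class Objectives:
    rto: timedelta  # maximum tolerated time to return to service
    rpo: timedelta  # maximum tolerated window of lost data


def evaluate_drill(failed_at: datetime, restored_at: datetime,
                   last_good_copy_at: datetime,
                   objectives: Objectives) -> Dict[str, bool]:
    """Compare one drill's measured timings against the stated objectives."""
    downtime = restored_at - failed_at
    data_loss_window = failed_at - last_good_copy_at
    return {
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Hypothetical drill: 30 minutes of downtime, last good copy 10 minutes old.
result = evaluate_drill(
    failed_at=datetime(2019, 10, 28, 12, 0),
    restored_at=datetime(2019, 10, 28, 12, 30),
    last_good_copy_at=datetime(2019, 10, 28, 11, 50),
    objectives=Objectives(rto=timedelta(hours=1), rpo=timedelta(minutes=5)),
)
```

In this made-up drill the system is back within the hour, so the time objective is met, but ten minutes of data predate the failure against a five-minute allowance, so the point objective is not.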

Conclusion


Whilst Chaos Monkey is not chaotic, it has great worth as a testing facility. It can be used to gain confidence in a system, understand the consequences of actions and mitigate risk. Chaos Monkey activities are relatively controlled and managed, and as such constitute an advanced cloud-based testing strategy.


In contrast to the teachings of Chaos Monkey, testing might not always be possible, or safe, to perform in Production and should instead be performed in a test environment. At least one of these test environments should look exactly like Production in every way, so that each component of each fail-safe strategy can be tested there.