Wednesday, 27 November 2019

A History of Application and Service Integration

For the last few years, I have been writing the knowledge I have been acquiring into a document, looking at how the products I work with in integration came to be, what led to their prominence and the problems they were and continue to try to solve.

This slowly got bigger and bigger as I incorporated the knowledge and discussions I had with my colleague Andy Garratt. The document looked at EAI, ESB, FTP, TCP/IP, SOA, Middleware, Hub and Spoke, ACID, AIA, Microservices, REST, SOAP, XML and many, many more integration topics. There was so much in there that we felt we needed to share it with the wider world, to help set the context of integration for those new to the industry and maybe even for those already in it.

As of this week, we have now published "A History of Application and Service Integration", which is available to download and read on iBooks:

Below is the accompanying video, which describes a brief history of integration in under 3 minutes.

Alternatively, you can download the PDF directly from here.

Sunday, 27 October 2019

Bringing the Chaos Monkey to heel


Systems, microservices, services and applications can go wrong. Error messages in the middle of the night, important messages getting stuck on the system, memory and disk space overload, network glitches and any number of other unknown conditions might befall these systems.

Gaining confidence that these systems can continue working despite issues is a focal consideration in any architecture. The ability of a system to service consumers despite issues in one or several locations is known as its availability, whilst the ability to restore working order after issues is the system's resilience. Through availability and resilience, a system can be created that is capable of dealing with any issues that befall it.

Confidence is gained by performing stringent tests on a system's hardware, software and connectivity points. Sometimes these tests are spread across multiple environments with specific purposes, e.g. integration test or user test. In other methods, testing is performed on the live system itself to gain confidence that the system can handle, or recover from, these adverse scenarios.

Controlled destruction of parts of a live system, to prove its availability and resilience, is known as chaos engineering, which has manifested in the container world through the Netflix model of the Chaos Monkey.

In this article we will look at what Chaos Monkey is, how it benefits systems, when it is appropriate to use Chaos Monkey, considerations when using it and considerations for gaining confidence in systems generally.

What is Chaos Monkey?

In 2010, Netflix published an article about its transition to the cloud. One of the lessons learned was to fail constantly in a controlled manner, in order to ensure that the system would continue working in the event of a failure. Thus was born the application ‘Chaos Monkey’, whose job was to randomly kill instances and services.

A further article in 2011 described Chaos Monkey in greater detail. An actively chaotic application would randomly disable, kill and remove instances to prove the system could handle the failure and gracefully bring itself to a healthy status.

The chaos was controlled by ensuring it only happened during the working day, with a team of specialists available to handle any fallout and deliver any fixes that might be required. This builds an understanding of how the system behaves in events that are not ‘expected’, and gives the teams handling the various systems a wealth of knowledge of both cause and effect.

The article went on to describe a whole “Simian Army” of chaos which would independently test: 
·      Latency
·      Cross Regional Dependencies 
·      Clean-up of old systems
·      Best Practices Conformance
·      Health Checks

Chaotic State

What is not addressed in either of the articles is the ability to handle state, that is, the messages and configuration that need to be retained. State is important for a number of different reasons, which might be security, assurance of service delivery or even regulatory, e.g. an audit.

There are three main types of state:
·      Long Term State - state that is held for a long period of time, usually in a database but occasionally in long term memory.
·      Persisted Transient State - tracking and recording a process through each of its parts so that it can be resolved in the event of a failure, e.g. if a payment fails before completion.
·      Non-Persisted Transient State - data that is required but does not need to be tracked, e.g. a stock check.

The state might be a message, the configuration of an application or the configuration of a deployment management system itself. Chaos Monkey doesn’t mention how to handle these concerns, how the chaos is mitigated or the restrictions you may need to place on the ‘chaos’.

Consider a container management system such as Kubernetes. How would you handle the deletion of a Helm chart from the system? Would the Helm chart be checked and pulled from another system? What about the reference and connection to the repository that holds all the state of the system?

Consider a system that handles a payment agreement between two systems: a sales point and a bank. If the transaction is accepted by the bank just before the Chaos Monkey application kills the process, the sales point will not have confirmation of the payment, and thus the customer making a purchase at the sales point will be charged but will not get their goods.

This says more about managing state in cloud and container-based systems than about Chaos Monkey itself, but it also means that the Chaos Monkey application needs to be aware of state, and that state must be considered during the application design.
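
To make the payment example concrete, here is a minimal sketch of persisted transient state. It is not taken from Netflix or from any real payment system: the step names, the recordStep and nextAction functions and the in-memory journal are all assumptions made for illustration, and in practice the journal would need to live in a durable store that survives the process being killed.

```javascript
// Hypothetical sketch of persisted transient state for the payment example.
// The Map stands in for a durable store (a database or message log) that
// survives the process being killed; every name here is illustrative.
const journal = new Map();

// Record that a step of the payment has completed.
function recordStep(paymentId, step) {
  const steps = journal.get(paymentId) || [];
  journal.set(paymentId, [...steps, step]);
}

// After a restart, decide what still needs to happen for a payment.
function nextAction(paymentId) {
  const steps = journal.get(paymentId) || [];
  if (steps.includes('SALES_POINT_CONFIRMED')) return 'NOTHING_TO_DO';
  if (steps.includes('BANK_ACCEPTED')) return 'CONFIRM_TO_SALES_POINT';
  if (steps.includes('REQUESTED')) return 'ASK_BANK_FOR_OUTCOME';
  return 'START_PAYMENT';
}

recordStep('payment-1', 'REQUESTED');
recordStep('payment-1', 'BANK_ACCEPTED');
// Imagine the process is killed here, before the sales point is told.
console.log(nextAction('payment-1')); // CONFIRM_TO_SALES_POINT
```

Because each step is recorded as it completes, a restarted process can see that the bank has accepted the payment and finish the job, rather than leaving the customer charged with no goods.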

Chaos in Production

Chaos Monkey works well in systems like Netflix primarily because the loss of service during the working day to a live system is manageable from both a support and customer relations point of view. If it isn’t possible to stream a video for a period of time, the consumer might be annoyed by the inconvenience but little more.

In a system where a vital service is being provided, the loss of service even for seconds may have adverse effects. For defence agencies, health care, bank processing, aviation and other high-risk systems, it would be difficult to justify the risk of a system being down, even for testing.

In high-risk systems the loss of service can have real life-changing effects on consumers. If, for example, a train line system were to drop a red-light signal, the result could be injury or even death. To intentionally break the live system through a form of testing seems irresponsible when the costs of these systems failing are so high.

To alleviate this risk, some of these systems will perform testing and upgrading in disaster recovery systems which exactly resemble the system in Production. This is not always possible though, as some systems require the disaster recovery system to be aware of production and available at all times, which can lead to contention over which of the two is the active production system.

In the previous section, the discussion of state raised the issue around a transaction being killed mid process. In test environments this would only have an effect on test accounts that are being used, but in a live production system, real customers will face the effects of state failure.

A final thought is that the purposeful destruction of something that is live and working seems to go against what many infrastructure and solution teams try to build: resilient systems that are architected to run and survive. The Netflix article uses the analogy of getting a flat tyre, with Chaos Monkey being the equivalent of giving yourself a flat tyre on the driveway on a Sunday morning. If Chaos Monkey were truly chaotic, the analogy would really involve blowing the tyre out whilst driving along the motorway.

This seems dangerously unsafe for any system. Indeed, puncturing a tyre on the driveway is a significantly less random form of chaos: it is more like a test environment, and is still a very drastic action to take. In the flat tyre analogy, the driver could get the same confidence by testing each of the component parts independently: inflating the tyre, changing the tyre, ensuring they have the correct tools, etc.

In all readings, Chaos Monkey comes across more as controlled production testing. Even without Chaos Monkey, systems should complete a disaster recovery exercise once a year to perform the most drastic of tests. As such, Chaos Monkey is not truly chaos and, since it is testing, it should be treated as such.

Adequate Planning & Testing

If Chaos Monkey can be dealt with as testing, there are some reasons why it should not be described as ‘chaos’. The application performing the killing of services and instances can only do what it has been programmed to do; as such, its remit is known and could be planned for without the actual destruction.

Whilst unknown consequences may arise from Chaos Monkey, it can, and in most cases should, be run early in the test cycle in lower environments. This also raises the importance of having a production-like environment before production which is a replica in every way.

Since Chaos Monkey is testing, it comes with the typical positive and negative considerations: cost, value and risk. Testing on the scale of Chaos Monkey requires a willingness to test in production from those who own and support it, and the budget to spend time finding issues rather than completing new functionality.

Real Chaos

In the real world, chaos is just reality in play. There is no way to plan for the unthinkable, unplannable and unconsidered. Problems will always occur and by its nature chaos is chaotic. Unmanageable, unplannable, uncontrollable – in a way that cannot ever be represented through test.

If chaos cannot be controlled, then it should instead be prepared for. Chaos Monkey allows for this preparation by testing the fallbacks systems have in place. There are several thought processes that can be put in place:
·      High Availability – what happens if each component part breaks? Can it fail over? Can I resume service quickly and easily?
·      Disaster Recovery – what happens if an entire active system fails? Can it fail over? How long does it take to return to service, and how much time should be set aside to get the system back to its full availability (Recovery Time Objective)? How much data can the system afford to lose (Recovery Point Objective)?
·      State – how is state handled in the event of each failure? Is the system able to recover a transaction mid process?
·      Risk – what risk is acceptable? How is each risk mitigated?
·      Testing – Is there a safe place to test all eventualities outside of Production? Can Production be replicated faithfully?


Whilst Chaos Monkey is not chaotic, it has great worth as a testing facility. It can be used to gain confidence in a system, understand the consequences of actions and mitigate risk. Chaos Monkey activities are relatively controlled and managed, and as such constitute an advanced cloud-based testing strategy.

In contrast to the teachings of Chaos Monkey, testing might not always be possible, or safe, to perform in Production and should instead be performed in a test environment. At least one of these test environments should look exactly like production in every way, where each component of each fail-safe strategy is tested.

Tuesday, 27 August 2019

Beard Trimmer Salvaging

For my 18th birthday (almost 9 years ago!) I got a Philips QT4013/23 Series 3000 Beard Trimmer from my Auntie and Uncle, and it has served me effortlessly since. I probably had no beard to trim until I was 20, which means it has worked hard these last seven years.

Recently the head snapped and unfortunately there were no replacement heads, which meant it was to be consigned to the scrap heap. Instead, I thought I would see what I could salvage from within.

On a side note - it's very, very frustrating that you cannot just buy parts and instead end up buying something completely new - incredibly wasteful.

The razor head itself came off quite easily but as you can see from the top all it would be good for now is a hair comb. The razor blade clicks out, which evidently means that *it* can be replaced - it's made up of a spring and two white prongs which are moved from left to right by the rest of the razor.

The rest of the workings were trapped inside the three plastic cases on the left. The dial in the centre of the far left plastic piece pushes a motor higher or lower. The motor causes the prongs on the razor to move from left to right, causing the cutting motion.

The motor is powered by two batteries on a motherboard with a switch at the back which can be turned on or off by the right most plastic cover. The additional benefit of the battery setup is that it can be recharged by applying the charger to the bottom which will be useful in future projects - as I can make a solar charger charge it up.

Not a lot can be saved here except the battery, metal pins and the motor. The rest will be consigned to the scrap heap :'(

Sunday, 11 August 2019

Football Simulation Engine v3

Two years ago I published my first Node package on npmjs to allow football match results to be simulated. Yesterday I uploaded the latest version, FootballSimulationEngine-v3.0.1, which can be installed by running:
          npm install footballsimulationengine

What's New?

startPOS and relativePOS have changed

The biggest - and breaking - change was to update the arrays used to track where players currently are on the pitch (startPOS) and where players are moving towards (relativePOS). These were not very descriptive and resulted in several GitHub issues so they have been changed to better reflect what they do. "startPOS" is now "currentPOS", "relativePOS" is now "intentPOS".

"currentPOS": [340,0],
"fitness": 100,
"injured": false,
"originPOS": [340,0],
"intentPOS": [340,0],

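If you have player data saved in the old format, the rename can be applied mechanically. The helper below is only a sketch and is not part of the engine: migratePlayer is a hypothetical name, and it assumes the only difference between old and new player objects is the two renamed keys.

```javascript
// Hypothetical helper (not part of footballsimulationengine): rename the
// old position keys to their v3 names and leave every other field alone.
function migratePlayer(player) {
  const { startPOS, relativePOS, ...rest } = player;
  const migrated = { ...rest };
  if (startPOS !== undefined) migrated.currentPOS = startPOS;
  if (relativePOS !== undefined) migrated.intentPOS = relativePOS;
  return migrated;
}

// Example data is made up for illustration.
const oldPlayer = {
  name: 'Example Player',
  startPOS: [340, 0],
  relativePOS: [340, 0]
};
console.log(migratePlayer(oldPlayer));
```

A player object that already uses currentPOS and intentPOS passes through unchanged, so the helper is safe to run over mixed data.
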
Tests tests tests

I have been reading more and more about the importance of tests and clean code; in fact it's been prominent on this blog.
As described, I have added lots and lots of tests to the code, which has made it more readable and easier to change, and has led to variables and functions being named more accurately. The result is 143 passing tests (199ms) and roughly 55% test coverage.
55.88% Statements 1401/2507
44.32% Branches 655/1478
82.13% Functions 170/207
53.14% Lines 1227/2309

Penalty Position - Corner and Freekicks

Players no longer form an arbitrary collection of bodies at a set location for freekicks and corners. Instead they now gather in the penalty box, which is a defined dimension on the pitch. This makes corner taking more realistic and improves goal opportunities from corners.


Previously, the current ball holder and the team they were playing for were matched by name for certain logic. For example, to decide whether a player can kick the ball, we first need to know if that player has the ball. Now each match, player and team is given an identification number during initialisation.

"matchID": "78883930303030001",
"kickOffTeam": {
"teamID": "78883930303030002",
"name": "ThisTeam",
"rating": 88,
"players": [
"playerID": "78883930303030100",

Red Cards

Players who get red cards will no longer be part of the playing action. This is shown by the current position being set to ['NP', 'NP'], which means 'Not Playing'. If the current position is set to this value, the player will not be targeted for passing and will not make any movements.
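
If you consume the match data yourself, the same marker can be used to skip sent-off players. This is a sketch of the idea rather than the engine's internal code: isPlaying and passTargets are hypothetical helpers, and the player objects are cut down to the two fields needed.

```javascript
// Hypothetical helpers (not the engine's own code) showing how the 'NP'
// marker can be used to exclude sent-off players from consideration.
const NOT_PLAYING = 'NP';

// A player is in play unless their current position holds the 'NP' marker.
function isPlaying(player) {
  return !player.currentPOS.includes(NOT_PLAYING);
}

// Everyone still on the pitch except the player making the pass.
function passTargets(players, passerID) {
  return players.filter((p) => isPlaying(p) && p.playerID !== passerID);
}

// Made-up squad for illustration.
const squad = [
  { playerID: '1', currentPOS: [340, 0] },
  { playerID: '2', currentPOS: ['NP', 'NP'] },
  { playerID: '3', currentPOS: [100, 200] }
];
console.log(passTargets(squad, '1').map((p) => p.playerID)); // [ '3' ]
```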

Skill Rating Utilization

Player skills are better utilised, including to decide whether shots are on or off target by comparing the player's skill against a randomly generated number, e.g. players with a shooting skill of 40 will be less likely to shoot on target than players with a shooting skill of 70. The same applies to saving shots that are in the box, with the skills of the goalkeeper determining whether the ball is 'saved' or not when in reach.

Future iterations will take this further for more sophistication. If you have any thoughts for how to do this be sure to raise an issue in GitHub.
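
As a rough illustration of the skill check described above, here is one way a skill rating can be compared against a random number. It is a simplified sketch and not the engine's actual formula: isOnTarget is a hypothetical function, and it assumes skills are rated from 0 to 100.

```javascript
// Simplified sketch, not the engine's actual formula: a shot is on target
// when a random roll between 0 and 100 falls below the shooting skill.
// The random source can be swapped out so the behaviour can be tested.
function isOnTarget(shootingSkill, random = Math.random) {
  const roll = random() * 100; // 0 <= roll < 100
  return roll < shootingSkill;
}

// Count how many of 10,000 shots are on target for two skill levels.
function onTargetCount(shootingSkill, shots = 10000) {
  let count = 0;
  for (let i = 0; i < shots; i++) {
    if (isOnTarget(shootingSkill)) count++;
  }
  return count;
}

console.log('skill 40:', onTargetCount(40));
console.log('skill 70:', onTargetCount(70));
```

Under this assumption a player with a skill of 70 is on target about 70% of the time, so the two counts should differ clearly over a large number of shots.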

Player Stats Improvements

In-game player statistics have been greatly improved, with statistics for fouls, passes, shots and tackles. This will improve reporting of player information.

If there are others that can/should be added, raise an issue in GitHub.
'goals': 0,
'shots': {
  'total': 0,
  'on': 0,
  'off': 0
},
'cards': {
  'yellow': 0,
  'red': 0
},
'passes': {
  'total': 0,
  'on': 0,
  'off': 0
},
'tackles': {
  'total': 0,
  'on': 0,
  'off': 0,
  'fouls': 0
}
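
These counters make simple reporting straightforward. The sketch below derives an accuracy percentage from any of the total/on/off groups. The accuracy function is hypothetical and not part of the engine, and it assumes 'on' counts successful attempts and 'total' counts all attempts.

```javascript
// Hypothetical reporting helper (not part of the engine). Assumes 'on'
// counts successful attempts and 'total' counts all attempts.
function accuracy(stat) {
  if (stat.total === 0) return null; // nothing attempted yet
  return Math.round((stat.on / stat.total) * 100);
}

// Made-up numbers for illustration.
const stats = {
  shots: { total: 4, on: 1, off: 3 },
  passes: { total: 8, on: 6, off: 2 },
  tackles: { total: 0, on: 0, off: 0, fouls: 0 }
};

console.log('shot accuracy:', accuracy(stats.shots));     // 25
console.log('pass accuracy:', accuracy(stats.passes));    // 75
console.log('tackle accuracy:', accuracy(stats.tackles)); // null
```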


There are still some improvements to be made to player movement (which still isn't quite there), passing statistics and the intelligence of decisions being made, and a whole lotta tests to be added. I'll be blogging soon about how I've used this latest version to simulate the 2019/2020 season to showcase how the engine can be used to simulate a season for a football manager game.

Let me know what you think, either in the comments, by raising an issue on GitHub or by emailing me! Happy footballing!

Saturday, 20 July 2019

Connecting to db2 with JDBC Driver in App Connect Enterprise (ACE)

Whilst recently connecting to an IBM DB2 database through an App Connect Enterprise (ACE) bar file, following (parts of) this tutorial on developerWorks, I hit a couple of issues due to using different naming standards and from (naughtily) plucking specific parts from the article rather than following it properly.

I have been using IBM Cloud Private, where an ACE helm chart and DB2 are both deployed with different names and data models from those defined in the example. If, like me, you bypassed parts of the example, here are some issues you might have seen.

1. In a containerised ACE deployment, the JDBC Driver JAR should be loaded into the project to be packaged with the bar (as described in the article above)

a.    In the ACE Toolkit go to Window -> Show View-> Java
b.    Drag and drop the db2jcc4 JAR file into Project window
c.    Right Click (double finger click on the Mac) on the Java Project Folder and select Properties -> Java Build Path -> Click Add JARs -> Select db2jcc4 jar file -> Click OK
d.    Ensure the Java Project is included in the bar file
e.    Ensure “User JAR files that have been deployed in a .bar file” is set to true in the JDBC Policy as defined in the tutorial and that the policy is included in the bar file build

2. After the deployment there was still an error that persisted.
I was able to find more detailed debug logs in ICP in the following file:
6231E: An error occurred in node: Broker 'integration_server'; 
Node Type 'DeleteProducts_JavaCompute Exception details:  message: stack trace: \ \ \njava.lang.ClassLoader.loadClassHelper(
The error looks like the jar has not been correctly loaded. To check, we can:

  • Look in the bar file in the toolkit, in the ‘Manage’ tab, which should show the jar file in the REST API Application.
  • Convert the bar file into a zip and unzip it to see the files inside, where the jar should be visible.
If that all looks good, next you should check the JDBC connection in the Java Compute Node.
Connection conn = getJDBCType4Connection("BLUDB", JDBC_TransactionType.MB_TRANSACTION_AUTO);
The example makes it look like BLUDB should match the name of the policy, since the policy, database name and connection definition all share the same name. The error also doesn’t point very clearly at what the issue is.

The connection should refer to the name of the database being connected to. In my personal view, it is simplest to follow the example and give the policy the same name as the database, so that one name can be used in getJDBCType4Connection.

Hopefully this has helped you with your development!