Tuesday, 23 March 2021

If Software Development Has Peaked, why do Projects still fail? Part 2

by Aiden Gallagher & Peter Reeves (our podcast)

Technical Considerations

There are also specifically technical considerations that cause projects to fail, such as software and hardware build and deploy processes. Sometimes it’s a simple error that hasn’t been accounted for; other times it might be poor technical implementation estimates or a large knowledge gap to fill. But other issues are less obvious:

 1. Over Engineering of Automation

Automation is the gold standard of modern, agile development. The more consistently and quickly something can be taken to production, the better. This might mean catching issues earlier to reduce the overall cost of testing in each environment, having an automated rollback strategy so that issues can be reverted quickly, or rolling new features out to a small sample of users first.


But getting overly hung up on automation can lead to setbacks. Forcing automation without a real use case can eat up valuable project time. For example, automating a yearly update that takes 5 minutes to complete is hardly worth the effort.


Additionally, where automation is too complex there can be little gain and lots of pain. Take automating a test that requires 5 or 6 different applications that regularly change, or where there is no control of the connecting service, e.g. scraping a webpage. The overall effort to build and maintain the automation is more hassle than running the test manually.


There are also areas which still cannot be automated, like user acceptance testing of a flow’s ease of use and ‘obviousness’. The same goes for user test cases that require negative testing – all the permutations of which cannot be known, else they would be covered in the initial testing.

Remediation:

  • Complete an analysis of savings vs cost to complete
  • Consider the complexity of the automation and what might be missed. In UAT, for example, we might be able to pass all technical flows but not visual tests. If it doesn’t feel intuitive or right, then people won’t want to use the application/site.
  • Be comfortable replacing automation with manual tests if the automation is taking too much time or fails regularly. This time should be factored back into the savings/cost analysis.
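
The savings vs cost analysis can be as simple as a few lines of arithmetic. The sketch below is one possible way to do it, not a standard formula: the class, the function names and the 8 hours of build effort are all assumptions made for illustration. Only the yearly 5-minute update comes from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutomationCase:
    """Hypothetical inputs for a savings-vs-cost estimate (all times in minutes)."""
    manual_minutes_per_run: float
    runs_per_year: float
    build_minutes: float             # one-off effort to write the automation
    upkeep_minutes_per_year: float   # time spent fixing the automation itself


def net_saving_minutes(case: AutomationCase, years: float) -> float:
    """Minutes saved over `years`; a negative value means automation costs more."""
    saved = case.manual_minutes_per_run * case.runs_per_year * years
    spent = case.build_minutes + case.upkeep_minutes_per_year * years
    return saved - spent


def break_even_years(case: AutomationCase) -> Optional[float]:
    """Years until the automation pays for itself, or None if it never does."""
    yearly_gain = (case.manual_minutes_per_run * case.runs_per_year
                   - case.upkeep_minutes_per_year)
    if yearly_gain <= 0:
        return None
    return case.build_minutes / yearly_gain


# The yearly 5-minute update; the 8 hours of build effort is an assumed figure.
yearly_update = AutomationCase(5, 1, build_minutes=8 * 60, upkeep_minutes_per_year=0)
print(break_even_years(yearly_update))  # 96.0
```

With those assumed numbers the automation would take 96 years to pay for itself, and any upkeep time only pushes that further out.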

Warning Signs:

  • Automation tasks overrunning
  • Lots of failures in the automation itself that require team time to fix

2. Infrastructure still exists

Projects continue to fail because, somewhere, infrastructure still exists. It may be hidden from the project team by a Software as a Service (SaaS) offering or controlled by a separate team, but infrastructure failures and constraints are still highly relevant to the project.


There’s also the hosting of the systems to consider. Moving to a cloud-based, off-premises network might severely limit what the project can do. For example, an organisation that wants to start working with a UK government entity might not be eligible if all its infrastructure lives in a European datacentre. This can be resolved, but there will be a cost associated with switching providers, whether it’s just one app or many, as security, connectivity and skilling up are all required.


As with all externalising of responsibilities to allow teams to focus on specific solutions, including to internal virtualization teams, there is a loss of control over system versions, upgrade strategy and when outages will or might take place. If a project needs the latest version of Linux to create a new feature or expand an existing one, the project timelines might be pushed back whilst awaiting a convenient time for the external team to action the upgrade.


There is also the case where off-project infrastructure ownership inhibits flexibility. A couple of examples might be the ability to store a file such as a private key locally, or the ability to integrate with certain software such as an Active Directory Group.

Remediation:

  • Make applications that are easily portable to other infrastructure providers
  • Understand infrastructure limitations and requirements during design phases. This is especially important when doing iterative designs, so the constraints and boundaries are well understood going forward
  • Set up alerts for infrastructure changes or updates so that the team can evaluate any effects as needed
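
As a rough illustration of the last point, an alert can start life as a comparison between what the design assumed and what the infrastructure reports today. This is only a sketch: the function name, the keys and the values are hypothetical, and how the snapshots are collected will depend entirely on the provider.

```python
def infrastructure_changes(baseline: dict, observed: dict) -> list:
    """List the differences between two snapshots of infrastructure facts."""
    changes = []
    for key in sorted(baseline.keys() | observed.keys()):
        before, after = baseline.get(key), observed.get(key)
        if before != after:
            changes.append(f"{key}: {before!r} -> {after!r}")
    return changes


# Hypothetical snapshots: what the design assumed vs. what is reported today.
baseline = {"os_version": "5.4", "datacentre_region": "EU"}
observed = {"os_version": "5.10", "datacentre_region": "EU", "tls_policy": "1.2-only"}

for change in infrastructure_changes(baseline, observed):
    print("Infrastructure change to review:", change)
```

Keeping the baseline alongside the design documents means the constraints agreed at design time and the checks run later are working from the same list.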

Warning Signs:

  • No mention of infrastructure in the design
  • Failure due to changes at the infrastructure level

3. Upgrade Acceleration

Many modern applications pride themselves on their ability to push out new updates very quickly, with weekly or even daily changes to production code. This works fine until the project is tied to the upgrade cycle of a connecting application. Take Kubernetes for example, which used to support each version for 9 months; this is up to 12 months at the time of writing.


This means an upgrade will definitely be required within the next 12 months, and the latest version could be completely different to the one being used. Changes to the mechanisms for assigning storage, managing deployments etc. are hard to predict and do not offer the stability that might be required for some projects.


Take an example where a customer feedback form is hosted in a container on Kubernetes. Support for the Kubernetes version expires after a year, but it took 2 months to productionise the form and 3 months to roll it out to all customers once the new feature was proven to work and be of value. A normal project might want to put the feedback form into a legacy status and just feed and water it to update logos etc. However, 4 months in, the team needs to divert resources to upgrade Kubernetes to a supported version.
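
The arithmetic behind this example is worth spelling out. The sketch below uses the 12, 2 and 3 month figures from the text. The function name and the `release_age_at_start` parameter are assumptions added for illustration, to show how much worse things get if the chosen version was not brand new when the project began.

```python
def months_of_support_left(support_window: int, months_used: int,
                           release_age_at_start: int = 0) -> int:
    """Months of version support remaining once the project work is done."""
    return max(support_window - release_age_at_start - months_used, 0)


SUPPORT_WINDOW = 12   # months of support per version, as quoted in the text
PRODUCTIONISE = 2     # months to get the form into production
ROLLOUT = 3           # months to roll out to all customers

# Best case: the version was released the day the project started.
print(months_of_support_left(SUPPORT_WINDOW, PRODUCTIONISE + ROLLOUT))     # 7
# Assumed worse case: the version was already 6 months old at project start.
print(months_of_support_left(SUPPORT_WINDOW, PRODUCTIONISE + ROLLOUT, 6))  # 1
```

Even in the best case, only 7 of the 12 supported months are left once the form is fully rolled out.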


A perfectly working app might ordinarily have been left alone for a long time, but short support timelines impose additional work, and the upgrade could have serious impacts on the feature/application. This increases the cost of the project and reduces the quantifiable monetary benefits.


This also extends to any dependencies which are subject to lots of change, and any integrating applications with short release cycles. It also makes the application hard for teams to support, as the inner workings of the underlying system, e.g. Kubernetes, change too fast to stay up to date with.


The final part of this problem is the impact it has on workers who are expected to become experts on the systems. This is more difficult when the system is under constant change, and when combined with a requirement to have a much broader knowledge base.

Remediation:

  • Factor upgrade cadence and complexity into solution design and selection
  • Create automation tools for upgrades that are maintainable and manageable by long term support teams
  • Select long term support options where possible

Warning Signs:

  • Multiple upgrade requirements over a short period of time e.g. a couple of development cycles
  • Lots of failures when performing upgrades 

4. Debugging takes time

With the increased deployment velocity achieved in lean and agile projects, we are able to deploy into production more quickly. But when bugs occur, they can be difficult to debug and understand. This might take time and may require that the production system is reserved for understanding the failure only. If this time hasn’t been accounted for, the whole project timescale can be knocked off balance.


In projects where debugging issues and fixing errors haven’t been accounted for, the onus to address defects can be pushed onto developers and testers who are forced to rush their work. This leads to further failures and to more new, avoidable bugs getting into Production.


This can be alleviated by the developing team owning its code in production. Project teams will be more likely to speak up and stop bad behaviours if they own the code in production, but this relies on understanding at a management level.


Another issue can arise when implementing ‘hot fixes’ based on assumptions about the cause of a failure. This usually occurs because of a lack of time to debug and understand a problem. It also relies on an assumption that automated tests work and are up to date, which requires trust. The less time a developer has to debug and get a fix that works, the less trust there will be, the more issues will arise and the more likely the project is to fail.

Remediation:

  • Ensure issues are well understood before implementing fixes by rigorously testing each hypothesis about the issue and keeping logs for long term analysis if required.
  • Plan in time for defect management, either as a dedicated resource or as dedicated time for the wider team

Warning Signs:

  • Bug fixes are getting into production but end up back with developers as issues continue
  • Unable to meet the current cycle plan because of handling defects, which suggests inadequate planning for defects
  • Fixes without adequate problem determination
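
The first warning sign can be tracked with a simple reopen rate per cycle. The following is a sketch only: the `Fix` record, the ticket names and the 25% threshold are all assumptions, and a real team would pull this data from its own issue tracker.

```python
from dataclasses import dataclass


@dataclass
class Fix:
    """Hypothetical record of one bug fix released in a development cycle."""
    ticket: str
    reopened: bool  # True if the issue came back to developers after release


def reopen_rate(fixes) -> float:
    """Fraction of released fixes that were later reopened (0.0 if none released)."""
    if not fixes:
        return 0.0
    return sum(fix.reopened for fix in fixes) / len(fixes)


REOPEN_THRESHOLD = 0.25  # assumed tolerance; pick one that suits the project

cycle_fixes = [Fix("BUG-1", False), Fix("BUG-2", True),
               Fix("BUG-3", True), Fix("BUG-4", False)]
rate = reopen_rate(cycle_fixes)
if rate > REOPEN_THRESHOLD:
    print(f"Warning: {rate:.0%} of fixes this cycle were reopened")
```

A rising rate from one cycle to the next suggests fixes are going out before the problem has been properly understood.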
