
Don’t build stable IT systems

Today’s IT systems are often large and complex, and it is hard to predict what might go wrong or how a system will behave under unforeseen circumstances. Trying to predict every situation in order to build a robust system is a rabbit hole, so we have to think differently when we design them.

What does this mean for system development? Should we stop building robust software through careful risk analysis, problem solving and testing?

No. Robust software is still required, but building it takes an understanding of what we really need to achieve: dynamic systems, built to handle a changing environment and unforeseen problems. Stable or robust systems in the static sense are not what we should be striving for.

One theory in risk analysis that has had a great impact in recent years comes from Nassim Taleb’s book Antifragile. Fragile systems crash and disappear over time because they can’t handle change or strain. Antifragile systems, on the other hand, adapt and thrive on change. Taleb claims that change is a key feature of long-lived systems, and this is what distinguishes antifragility from robustness or resilience, both of which strive to maintain existing functionality. Antifragile systems are therefore far from stable: they adapt to, and are improved by, external impact and stress (up to a certain limit).

Professor Kjell Jörgen Hole has taken Nassim Taleb’s ideas and researched how antifragility can be applied in the development of IT systems. According to Jörgen Hole, IT systems that are to handle unpredictable events and improve continuously need to be modular, weakly linked, diversified and redundant.

 

But what does that mean?

 

  • Modularity limits how far problems can spread and makes it easier to understand and change well-delimited parts of a system.

 

  • Weak linking, meaning that the parts of a system depend on each other as little as possible, facilitates change and further development. This is an interesting area that is often not understood or applied to a sufficient extent (and it deserves its own article).

 

  • Diversity has to do with adaptability. A diversified system finds it easier to change tack. If change becomes necessary, options are already available.

 

  • Redundancy is required so that the system does not buckle under stress. A certain amount of overcapacity is important, as are backup solutions, so that unforeseeable events can be managed or contained, giving time to adapt the system to the new conditions.
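
To make the four principles a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not something taken from Jörgen Hole's work: the provider functions, the exchange-rate scenario and the helper `first_available` are all hypothetical names. The caller depends on nothing more than "a function that takes a currency and returns a rate" (weak linking), each provider is a self-contained unit (modularity), the two providers work in different ways (diversity), and either one can answer the request (redundancy).

```python
from typing import Callable, Sequence

# All names below are hypothetical and exist only for this illustration.
Provider = Callable[[str], float]


def primary_rates(currency: str) -> float:
    # Stand-in for a main service that happens to be down right now.
    raise ConnectionError("primary rate service unavailable")


def cached_rates(currency: str) -> float:
    # Stand-in for an independent, differently built backup (diversity).
    return {"EUR": 1.0, "SEK": 11.2}[currency]


def first_available(providers: Sequence[Provider], currency: str) -> float:
    """Try each redundant provider in turn and return the first answer."""
    errors = []
    for provider in providers:
        try:
            return provider(currency)
        except Exception as exc:
            # The failure is contained here and does not spread to the caller.
            errors.append(exc)
    raise RuntimeError(f"all providers failed: {errors}")


rate = first_available([primary_rates, cached_rates], "SEK")
```

In this sketch the failing primary provider is skipped and the backup answers instead. Swapping in a third provider would not require any change to the calling code.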

 

In addition to these four design principles, Jörgen Hole adds another method: Fail Fast. To create an adaptable system that can learn from and adjust to problems and stress, the system needs to be subjected to stress early. It needs to fail early, while the consequences are still small. This way you can learn from the incidents, adjust the system and limit the consequences of future incidents.
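
One possible way to subject a system to stress early is to inject faults on purpose in a test environment. The Python sketch below is an assumption about how that could look, not a method described by Jörgen Hole: `with_injected_faults`, `lookup` and the 30 percent failure rate are hypothetical and chosen only for illustration.

```python
import functools
import random
from typing import Callable


def with_injected_faults(func: Callable, failure_rate: float,
                         rng: random.Random) -> Callable:
    """Wrap func so that a share of calls fail on purpose (hypothetical helper)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # rng.random() returns a value in [0.0, 1.0).
        if rng.random() < failure_rate:
            raise TimeoutError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def lookup(user_id: int) -> str:
    # Stand-in for a call to another part of the system.
    return f"user-{user_id}"


# In a test environment, let roughly 30% of the calls fail.
flaky_lookup = with_injected_faults(lookup, failure_rate=0.3,
                                    rng=random.Random(42))

failures = 0
for user_id in range(100):
    try:
        flaky_lookup(user_id)
    except TimeoutError:
        failures += 1  # each incident is a cheap chance to learn and adjust
```

The point of such a wrapper is that the calling code meets timeouts long before production does, while a failure still costs very little.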

 

I would also like to add one more important feature for making IT systems antifragile, which has to do with changeability. Good decision making is required to make informed choices about how the system should change when new conditions or unforeseen events occur. The system therefore needs feedback ability: a way to provide useful feedback on how its parts are operating, so that the parts that work well can continue to be improved or expanded. Many think primarily of logging, but being able to retrieve information from the running system (software probing) is becoming increasingly common, and feedback ability will be a very important tool for achieving antifragility.
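
As a rough illustration of feedback ability beyond logging, the following Python sketch lets a running program be asked how its parts are doing. The `Probe` class, its method names and the `parse_order` function are hypothetical, invented for this example rather than taken from any particular monitoring tool.

```python
import functools
import time
from collections import defaultdict


class Probe:
    """Collects per-function call counts, error counts and timings (hypothetical)."""

    def __init__(self) -> None:
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def observe(self, name: str):
        """Decorator that records every call to the wrapped function under name."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                except Exception:
                    self.errors[name] += 1
                    raise
                finally:
                    # Runs for both successful and failed calls.
                    self.calls[name] += 1
                    self.total_seconds[name] += time.perf_counter() - start
            return wrapper
        return decorator

    def snapshot(self) -> dict:
        """Return the current figures so they can be queried while the system runs."""
        return {
            name: {
                "calls": self.calls[name],
                "errors": self.errors[name],
                "avg_seconds": self.total_seconds[name] / self.calls[name],
            }
            for name in list(self.calls)
        }


probe = Probe()


@probe.observe("parse_order")
def parse_order(text: str) -> int:
    return int(text)


parse_order("12")
try:
    parse_order("not a number")
except ValueError:
    pass

stats = probe.snapshot()["parse_order"]
```

Here `stats` reports two calls, one of which failed, along with an average duration. Figures of this kind are what make it possible to decide which parts of a system to improve or expand.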

 

 

Johannes Jansson, Consultant and Associated Partner, Tagore.