Complex systems – this is how we manage them in practice

Complexity in modern IT systems cannot be avoided, but it can be managed. Here I go through theories and principles for managing complex systems, and how we need to think in the future so that we are not overwhelmed by the ever more complex structures of IT systems. This is part two in the series on complexity.

Complex systems are systems with many interacting components, which means that the system does not have predictable causality. The number of possible interactions between components, and the fact that they can change over time, makes causal relationships impossible to predict. This shows up as unexpected problems, bugs, security failures or crashes. Just as importantly, it can also produce unpredictable opportunities. The fact that complex systems cannot be fully analysed or modelled has direct consequences for many of the structured methods we normally use in system development. Detailed system specifications, risk analyses and requirement specifications are often wrong, because they describe a system in which unexpected events can occur and which will, with high probability, develop in an unexpected direction.

How does the system work today?

The question is how to handle complexity if complex systems, by definition, cannot be modelled exactly and you can’t predict how they will behave.

The only thing we know for certain about a complex system is how it is working right now, and that is where the focus needs to stay in order to gain control. What does the system’s environment look like, how does the system fit together, what do we want from the system, how well does it function, and how do we change it to better meet needs and wishes? By focusing on these questions, the system can be steered properly and followed up so that changes work. It is all about the current situation.

Anti-fragility – development with stress testing

If we dig deeper and look at the theories that exist around complex systems, Nassim Nicholas Taleb’s theory of anti-fragility has had a major impact in recent years. The term anti-fragility builds on the opposite of fragility. A fragile system is one that breaks when it is exposed to stress, that is, an unexpected event or shock of some kind. Its opposite, an anti-fragile system, is one that improves when it is exposed to stress.

Look at the complex systems that persist over time: they are neither robust (static) nor resilient (durable, elastic). Robust and resilient systems are good at withstanding stress, but they do not generally develop. Anti-fragile systems instead develop through stress or shock. The stress affects the system so that it adapts to survive similar stresses in the future. The stress becomes important information for the system about how it should adapt and harden. Remember that unexpected events, whether simple or disruptive (so-called black swan events), are unavoidable in complex systems. By learning from previous stresses, systems become better prepared to handle future unexpected events.

Four design principles for anti-fragile systems

What, then, are the characteristics that make anti-fragile systems anti-fragile and that are needed to handle unexpected events? The key to anti-fragility is that the system is exposed to stress and unexpected events. It must be able to handle the stress without failing (failing would mean that the system is fragile). In addition, the system must make use of the stress as a tool for its further development and improvement.

Taleb’s theories about anti-fragility describe these characteristics from a philosophical and financial perspective. From an IT perspective, Professor Kjell Jørgen Hole has written a book about what characterises anti-fragile IT systems: “Anti-fragile ICT Systems”.

Hole describes four design principles that anti-fragile IT systems need:

Modularity
Redundancy
Loose connections
Diversity

He also describes an important and popular method called fail fast.

Diversity and loose connections

These four design principles are nothing new within IT development, especially not modularity and redundancy. The interesting thing is how they work together to make a system anti-fragile. Diversity and loose connections are still worth looking at a little more closely:

Because the IT industry goes hand in hand with efficiency and rationalization, diversity is not typical. Is it smart, for example, to use two different database solutions for one system? It can cost more in maintenance, complexity, required knowledge and licenses. But in the event of major changes, serious database problems, or license or end-of-life problems, it can pay off significantly not to be locked into one technology or supplier. To a large extent, all types of changes can profit from diversification. And the importance of adaptability cannot be emphasized enough for complex systems. A system that consists of a healthy flora of technologies and solutions will be much more adaptable given the uncertainty that a complex system is unavoidably exposed to.
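As a minimal sketch of what diversity can look like in code, the example below puts two unrelated storage technologies behind one small interface, so that reads still succeed when one of them fails. The names Store, SqliteStore, JsonFileStore and DiverseStore are hypothetical and chosen only for illustration. A real system would more likely combine two independent database products than SQLite and a JSON file.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional, Protocol


class Store(Protocol):
    """Hypothetical minimal interface that every storage technology implements."""

    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class SqliteStore:
    """Storage technology 1: a relational database (in-memory SQLite)."""

    def __init__(self) -> None:
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

    def put(self, key: str, value: str) -> None:
        self._db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
        self._db.commit()

    def get(self, key: str) -> Optional[str]:
        row = self._db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None


class JsonFileStore:
    """Storage technology 2: a plain JSON file that shares no code with SQLite."""

    def __init__(self, path: str) -> None:
        self._path = path

    def _load(self) -> dict:
        if not os.path.exists(self._path):
            return {}
        with open(self._path, encoding="utf-8") as f:
            return json.load(f)

    def put(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(data, f)

    def get(self, key: str) -> Optional[str]:
        return self._load().get(key)


class DiverseStore:
    """Writes to every backend and reads from the first one that answers."""

    def __init__(self, *backends: Store) -> None:
        self._backends = backends

    def put(self, key: str, value: str) -> None:
        for backend in self._backends:
            backend.put(key, value)

    def get(self, key: str) -> Optional[str]:
        for backend in self._backends:
            try:
                return backend.get(key)
            except Exception:
                # Deliberately broad for the sketch: this technology failed,
                # so fall through to the next one.
                continue
        raise RuntimeError("all storage backends failed")


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "kv.json")
    primary = SqliteStore()
    store = DiverseStore(primary, JsonFileStore(path))
    store.put("order-1", "paid")
    primary._db.close()  # simulate a serious database problem
    print(store.get("order-1"))  # answered by the JSON file store
```

The price is visible in the sketch itself: two implementations to maintain and every write performed twice. That is the cost the paragraph above weighs against not being locked into one technology.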

Loose connections mean that the connections between components are simple to change or reconnect, and that the dependency between the components is generally small. In this way, a change to a component, or to how it is connected to other components, does not have knock-on effects. One example of the problem is inheritance in object-oriented programming, which often (unintentionally) creates dependencies between system parts and makes the system unstable and difficult to change. The use of a so-called enterprise service bus, where the business logic that connects system parts (the connections) is stored, also works against loose connections and should be avoided, since it leads to static systems where changes become complicated and costly.
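A minimal sketch of a loose connection, under assumed names: instead of inheriting from a concrete base class, OrderService below depends only on a small interface and receives its collaborator from outside. Notifier, EmailNotifier, SmsNotifier and OrderService are hypothetical names invented for the illustration.

```python
from typing import Protocol


class Notifier(Protocol):
    """The only thing OrderService knows about its collaborator."""

    def send(self, recipient: str, message: str) -> None: ...


class EmailNotifier:
    def __init__(self) -> None:
        self.sent = []

    def send(self, recipient: str, message: str) -> None:
        # A real implementation would talk to a mail server here.
        self.sent.append((recipient, f"EMAIL: {message}"))


class SmsNotifier:
    def __init__(self) -> None:
        self.sent = []

    def send(self, recipient: str, message: str) -> None:
        # A real implementation would call an SMS gateway here.
        self.sent.append((recipient, f"SMS: {message}"))


class OrderService:
    """Depends on the Notifier interface, not on a concrete class or base class."""

    def __init__(self, notifier: Notifier) -> None:
        self._notifier = notifier

    def confirm(self, customer: str, order_id: int) -> None:
        self._notifier.send(customer, f"Order {order_id} confirmed")


if __name__ == "__main__":
    # Reconnecting the component is a one-line change at the call site;
    # OrderService itself is untouched.
    for notifier in (EmailNotifier(), SmsNotifier()):
        OrderService(notifier).confirm("alice", 7)
        print(notifier.sent)
```

Because the connection is made where the objects are created, replacing one notifier with another does not spread into OrderService, which is the property the paragraph above describes.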

The design principles have two important functions for the system:

The system must be able to survive unexpected events and shocks. If the system, or parts of it, cannot continue to function in some way during an unexpected stress, there is no room to manage, learn from and improve the system based on the lessons that the event offers. All effort goes into “putting out the fires”.
The system must be flexible and adaptable. We know that unexpected events or behaviour can occur in complex systems, and the system must then change in response. If the system cannot be adapted at any time to the new information that comes from how it functions, how it is used and how its environment changes, it will not stay relevant and usable over time.

Errors are robust information to act on

Kjell Jørgen Hole also takes up the method fail fast. Based on the reasoning above this is natural, since it is a fundamental part of an anti-fragile system that stresses to the system are part of the system’s development. Errors and problems are therefore also an important asset for the system’s development. One way of looking at errors is that they are robust information to act on: an error is simple to analyse and remedy. If, on the other hand, you don’t know that something is wrong, it is very difficult, and for complex systems impossible, to know with certainty how you should handle it. This is why errors and failures are often celebrated by progressive IT developers, such as Google’s CEO Sundar Pichai, who states that Google’s developers should “wear your failure as a badge of honour”. Through errors, correct and true lessons are learned rather than assumptions, and above all the lessons can be put into effect.
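One common way to apply fail fast in code, sketched here under assumed names, is to validate input at startup and stop immediately with a precise error, rather than continuing with a silent default that hides the problem. ConfigError, Config, load_config and the two configuration keys are hypothetical and not taken from Hole’s book.

```python
from dataclasses import dataclass


class ConfigError(ValueError):
    """Raised at startup when the configuration is unusable."""


@dataclass(frozen=True)
class Config:
    database_url: str
    timeout_seconds: float


def load_config(raw: dict) -> Config:
    """Validate everything up front and stop at the first problem."""
    url = raw.get("database_url")
    if not url:
        raise ConfigError("database_url is missing or empty")

    value = raw.get("timeout_seconds")
    try:
        timeout = float(value)
    except (TypeError, ValueError):
        # No silent default: a bad value is reported where it is found.
        raise ConfigError(f"timeout_seconds is not a number: {value!r}") from None
    if timeout <= 0:
        raise ConfigError(f"timeout_seconds must be positive, got {timeout}")

    return Config(database_url=url, timeout_seconds=timeout)


if __name__ == "__main__":
    print(load_config({"database_url": "db://example", "timeout_seconds": "2.5"}))
    try:
        load_config({"database_url": "db://example", "timeout_seconds": "soon"})
    except ConfigError as error:
        print(f"refusing to start: {error}")
```

The error message names the exact field and value, which is what makes the failure simple to analyse and remedy.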

This is how we achieve the right technical maturity

In summary, we need to recognise that we can’t predict how complex systems will need to be developed or which problems and opportunities will arise. Therefore, we need to focus on what we can control, which is how the system functions now. To be able to manage the uncertainty that complex environments bring, we need to reach a technical maturity where problems and stresses to the system become a tool for change and improvement. Here Hole’s principles are an excellent technical basis for the development of IT systems.

In this way we can build complex systems that meet the demands of our modern world.

Johannes Jansson, consultant and associate partner, Tagore

Read also part one in the series: Keep track of IT complexity – as you do with black swans and fat tails