Microsoft’s Azure mishap betrays an industry blind to a big problem

If a tiny typo brings down half of Brazil, perhaps we’re the nuts

Opinion “ELY (n.) – The first, tiniest inkling that something, somewhere has gone terribly wrong.” One of the finest definitions from Douglas Adams and John Lloyd’s The Meaning of Liff, it perfectly describes the start of that nightmare scenario for ops at a big service provider - the rising tide of alerts just after that update went live.

Some poor sod at Microsoft just had the mother of all elys. A careful, intricate, tested and approved rewiring of the Azure DevOps suite got sent out into the world, only for South Brazil to go dark as it started to eat customer instances. You can read the gory details here; they’re just as compelling as an episode of Air Crash Investigation. The skinny is simple: a typo triggered unforeseen cascading errors, and continued attempts to restore order dragged the outage out to ten embarrassing hours.

It is easy to rag on Microsoft for oh so many reasons. Stuffing its OS with adware disguised as a help system. Leaning on everyone to move to Windows 11 while nearly half of Windows 10 PCs fail the hardware requirements. Teams. But these excrescences are corporate and cultural: the typo-induced Azure outage is an industry-wide phenomenon that good people perpetrate. Simple typos and their cousin, Mr Misconfiguration, can unleash chaos on anyone.

How is this possible in the year of our AI overlords 2023, when our inventions are smart enough to write sonnets? More to the point, given that we humans are never going to stop mucking up, is there any way of spotting mistakes before they do damage?

Part of the problem is that our technology will do what we tell it, and the difference between very useful and existentially threatening can be wafer thin. Take the infamous Unix/Linux command 'rm -r *'. For those of you whose palms aren’t sweating instinctively at the sight of this little beauty, it means "Remove all files in this directory and all directories beneath it." It's hugely useful when freeing up space or removing old installations, and the system won't let you get rid of things you don't have access privileges to.

Run it with root privileges at the root directory, which is just a sudo and a cd / away, and you have wiped out your entire universe. You may not even notice at first, as it will set about its suicide mission with quiet efficiency, but eventually some weird error will appear as what's left of the running software reaches for an essential file that's not there. Ely. The error messages that follow as you try to make sense of your apocalypse will be a lesson in digital madness.

Don't try this at home? You absolutely should. Nobody who has seen this happen ever forgets - or repeats - it. Just spin up a new virtual Linux machine and have at it. (El Reg takes no responsibility if you type into the wrong window. Don't.) This principle, of making your mistakes in a place that mercilessly demonstrates their consequences without them being consequential, is the gold standard in safety nets. In aviation, those places are called flight simulators. In electronics, circuit simulators. In humans, Ibiza.
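If you'd rather not dedicate a whole virtual machine to the exercise, a throwaway container is an even cheaper sandbox. A minimal sketch, assuming Docker is installed - everything inside the container evaporates the moment you exit:

    docker run --rm -it debian:stable bash   # a machine you can afford to lose
    # then, inside the container:
    rm -rf --no-preserve-root /              # the apocalypse, safely fenced in
    ls                                       # cue the lesson in digital madness

Modern GNU rm won't touch / without the --no-preserve-root flag, which tells you how often this particular ely has struck before. Expect a spray of complaints about busy virtual filesystems like /proc on the way down - consider it part of the education.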

Why isn't this principle adopted in large, complex systems such as those in major service and network outfits? To some extent, it is - testing and validation take place in some sort of environment that tries to behave something like the real thing. This works, except when it doesn't: large, complex, dynamic systems are usually too much of all those things to be modelled realistically at any reasonable cost, if they can be modelled at all. Where such models are used, they necessarily involve a lot of abstraction - certainly too much to capture detailed configuration changes in components.

This sort of thinking is a failure of imagination and engineering. Go back to flight simulators, which for regulatory reasons have to be developed alongside the aircraft they train pilots for. In devops heaven, test scripts and protocols are developed alongside the actual software - well, maybe. Once the software's out in the wild and interacting with other systems, all that falls away. The typo that leads to the cascading fault chain across components which are just doing what they're told has no systemic safety net.

All software comes from a functional specification - or at least, let's pretend. That same spec is used in testing and validation. Why not use it further, to create a simulated model of the software that can be run in a virtual environment? It can pretend to do the work that would otherwise soak up tons of physical resources, modelling the behaviour and testing the logic of the real thing. If you're managing a large service fabric with terabytes of customer data, you no more need to replicate the data in a virtual test environment than a flight simulator needs to replicate the planetary weather system. It just needs the local effects. You can't afford to replicate your internal network and its BGP routers - nor would it do much good. You can't even simulate it, because you don’t have good models of the components.
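To make that concrete, here's a toy sketch of the principle in shell - every path and the "change" itself are invented for illustration, and nothing here is anyone's shipping tool. Build a throwaway model of the estate, rehearse the change against it, and measure the blast radius before production ever sees it:

    estate=$(mktemp -d)                      # a disposable model of the fabric
    mkdir -p "$estate/org/my-project" "$estate/org/customer-a"
    touch "$estate/org/my-project/stale.db" "$estate/org/customer-a/live.db"

    # the change as actually typed: a stray space turns one target into two,
    # and the first of them is the entire org tree
    rm -rf "$estate/org/" "my-project/stale.db"

    find "$estate" -name '*.db'              # blast radius check: nothing left

In the model, customer-a's data vanishing alongside the stale file costs you a temp directory and a moment's embarrassment. In production, it costs ten hours and South Brazil.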

No aviation component company could do business if they held back on the functional specs that allow the simulator designers to do their work. Electronic components come with standard descriptions that can be virtually wired up. Software doesn't. Appliances don't. They could - the information exists - but there's no expectation of it, no standard way to express it, no tradition of delivering virtual components alongside the real.

If this changed, as it could, with automated tools to make costs manageable, we'd get a lot more than just safety nets for live systems. We could bring large systems into their own virtuous devops loop, and explore what-ifs with new hardware and software components as they arrived on the market without having to cobble together expensive testbeds. VM and OS support for new devices would be revolutionised by standard specs and functional models. And the discipline of having to build and test both virtual and real against spec could only improve the quality of software. Heavens, it's almost like real engineering.

Things will still go wrong. The map is not the territory. Still, given that we know this approach works, not having a conversation about how it might happen should inspire a real sense of ely for the future. ®
