October 4 demonstrated that costly outages can arise from relatively minor causes. Here are two key principles to reduce the risk of it happening to your organization.
By Stuart Stent, HPE Cloud Specialist
Once again, a massive outage has impacted one of the world’s largest companies, with Facebook becoming unreachable for an extended period on October 4, 2021 due to a network change. While the cause was relatively minor, the company’s value dropped by roughly 5%,1 so let’s casually place that at $50 billion.2
Without getting into deep technical detail: a change to a central network made Facebook’s DNS servers unreachable. Given the rigorous change policies and procedures at Facebook, how did this happen? From what has been reported so far,3 during routine maintenance a bug in an audit tool incorrectly permitted a command to be issued that took all backbone connections offline. Facebook’s DNS servers reacted to the backbone being down and stopped advertising their IP addresses (via BGP) to the internet, effectively taking themselves offline. I suspect that in the weeks to come, Facebook will be reviewing the architecture and making changes to remove this failure mode.
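The reported failure mode can be sketched in a few lines. To be clear, this is a loose illustration and not Facebook’s actual design or code; the node and data center names are invented:

```python
# Hypothetical sketch (not Facebook's actual code): DNS edge nodes withdraw
# their route advertisements when they can't reach the backbone, so a single
# change that downs the *whole* backbone silences every DNS node at once.

DNS_NODES = ["dns-edge-1", "dns-edge-2", "dns-edge-3"]  # invented names

def backbone_reachable(datacenters):
    """Health check: is any backbone data center reachable?"""
    return any(datacenters.values())

def advertising_nodes(datacenters):
    """DNS nodes that keep advertising their IP prefixes to the internet."""
    if backbone_reachable(datacenters):
        return list(DNS_NODES)
    return []  # every node withdraws at once: a total DNS outage

# Normal operation: at least one backbone link is up, so DNS stays reachable.
print(advertising_nodes({"dc-east": True, "dc-west": True}))

# The outage scenario: one bad change downs all backbone links, every DNS
# node withdraws, and the domain no longer resolves anywhere.
print(advertising_nodes({"dc-east": False, "dc-west": False}))  # []
```

Tying DNS advertisement to backbone health is a sensible design for a single data center failure; the surprise was that one change could fail the health check everywhere simultaneously.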
The question is: What can everyone else learn from this most recent outage? And how should you adjust your enterprise architectures and processes to reduce the chances of it happening to your systems?
1. Beware single points of failure. This may seem obvious; however, single points of failure (SPOFs) often lurk in plain sight and are easily overlooked. The non-obvious ones usually hide at scale. For example, you could say that the internet has a single point of failure in that it runs on only one planet (at least for now). Humanity has accepted the risk of this design (although some, like Elon Musk, are working hard to address it by colonizing Mars). While this is a tongue-in-cheek example, the principle holds true at smaller (but still large) scales.
A good example is the US power distribution system. Data center architects always aim to have multiple providers delivering power to a data center to ensure redundancy. However, in the Lower 48, what underpins these providers is in reality a small number of grid interconnections (Eastern, Western, and Texas) to which all the local providers are connected. And while failures are rare, they are not unheard of (the Northeast Blackout of 1965, for example).
You need to consider your risk appetite for this particular scenario. You may be comfortable that the event is so improbable that you can accept the risk without mitigation, or you may decide that some form of mitigation is necessary.
While these are extreme examples, SPOFs are everywhere and should be considered when designing your architecture. Some good questions to consider are:
- Are we reliant on a single vendor or upstream system?
- What is common between systems?
- Is there more than one Go/No-Go checkpoint?
- What is the failure mode of the Go/No-Go checkpoint?
2. Limit the blast radius. The second thing we can do is look closely at the blast radius of our systems. This idea is closely related to the SPOF concept, but instead of looking for the choke point, you are looking at the connectedness of the systems. A computer virus gives us a useful way to think about this. It is not uncommon to hear of viruses running rampant through entire organizations, and of the millions of dollars it takes to clean up those incidents. So, to examine the connectedness (and the resulting blast radius of an incident), ask: how far could a virus spread through connected systems, and where are the permanent “fire breaks” that would constrain it?
You might be thinking, “We have anti-virus; doesn’t that stop the spread?” The answer is yes. Well, most of the time. However, a virus propagates much like an outage: issues cascade from system to system. If there are no fire breaks or other limits on the blast radius, the effects of a bad change can be devastating. These propagating changes and failures can appear in almost any type of system but are most prevalent in networking, automation, CI/CD pipelines, and security systems.
Some good questions to consider here are:
- What systems are connected to this system/process (and do they need to be)?
- Does one system rely on another system?
- What happens when one component in the chain is down?
- How can we limit the blast radius?
- How can we insert Go/No-Go checkpoints?
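The questions above can be made concrete the same way as the SPOF analysis: walk the dependency graph outward from a failing component and see what it reaches, treating fire breaks (air gaps, Go/No-Go checkpoints) as nodes where propagation deliberately stops. A minimal sketch, with invented pipeline stage names:

```python
from collections import deque

def blast_radius(edges, start, fire_breaks=frozenset()):
    """Breadth-first walk of a directed dependency graph from a failing
    component; fire breaks halt the propagation."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    reached, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in reached and nxt not in fire_breaks:
                reached.add(nxt)
                queue.append(nxt)
    return reached

# Hypothetical CI/CD chain: a bad change propagates downstream unchecked...
edges = [("ci", "staging"), ("staging", "canary"), ("canary", "prod")]
print(blast_radius(edges, "ci"))  # all four stages are affected

# ...but a Go/No-Go checkpoint at the canary stage contains the damage.
print(blast_radius(edges, "ci", fire_breaks={"canary"}))  # {'ci', 'staging'}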
An iterative approach to resiliency
Incidents like this most recent Facebook outage, while highly disruptive and costly, offer a valuable learning opportunity for the industry as a whole and prompt us to re-examine our own systems and processes for similar vulnerabilities. SPOFs can lurk in plain sight and should always be considered when designing systems. In Facebook’s case, we saw that propagating changes can have large-scale effects that we need to design around in order to limit them.
Ultimately, inter-planetary redundancy for our systems may still be a few years off, but through open reporting and root-cause analyses there are numerous opportunities to make iterative improvements to the resiliency of our systems today. A small amount of effort now can mitigate the possibility of a significant impact on your stock price later.
Learn about IT risk management services from HPE Pointnext Services and how we can help you fortify your data’s confidentiality, integrity, and availability in hybrid IT and at the edge.
1. MarketWatch article: Facebook’s very, very bad day: Services go dark and stock plunges in wake of whistleblower revelations
2. See this Fortune company profile for Facebook, which shows a market value close to $1 trillion.
3. Facebook Engineering article: More Details About the October 4 Outage
Stuart Stent is a Cloud Specialist with over 20 years of global experience designing and implementing complex, large-scale technology solutions. Stuart leads professional services engagements at HPE for Fortune 500 companies and brings particular expertise in designing cloud solutions for highly regulated entities in the financial services and healthcare sectors that touch all aspects of cloud-native IT. He is a contributing author to the Doppler publications, regularly delivers security and architecture workshops, and works with groups across HPE to develop new best practices in cloud architecture, security, and application modernization.
Hewlett Packard Enterprise