A Single Point of Failure (SPoF) is an issue worth considering in the design of any system. As the name suggests, it is a single _point_ (e.g. a database, application, person, or tool) in your system whose failure can render the entire system incapable of performing its function.
[[Relevant XKCD]] illustrating a SPoF:
![[IMG_7028.png|Relevant XKCD]]
# Identification
The methods below can help you identify your SPoFs. No matter what you do, you will never identify every possible SPoF.
- [[FMEA]]
- [[Diagram Types (index)|Systems Diagrams]]
- Researching SPoFs in similar systems
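
As a rough sketch of how a systems diagram can be checked for SPoFs, the example below treats the diagram as a directed graph and flags any component whose removal cuts every path from a start point to a goal. The component names (`user`, `api_a`, `queue`, and so on) and both helper functions are hypothetical, and this only catches structural SPoFs that the diagram actually shows.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, pretending `removed` has failed."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """Return intermediate components whose failure disconnects start from goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates if not reachable(graph, start, goal, removed=n))


# Hypothetical system: two APIs and two workers, but only one queue.
system = {
    "user": ["api_a", "api_b"],
    "api_a": ["queue"],
    "api_b": ["queue"],
    "queue": ["worker_a", "worker_b"],
    "worker_a": ["report"],
    "worker_b": ["report"],
}

spofs = single_points_of_failure(system, "user", "report")
print(spofs)  # ['queue']
```

The redundant APIs and workers are not flagged because an alternative path exists for each; the queue is flagged because every path runs through it.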
# Handling Strategies
- **Acceptance** - not all SPoFs are worth mitigating, for example those that are:
    - Unlikely to fail
    - Impossible to mitigate
- **Notifications & Signaling** - monitoring an otherwise hidden SPoF in your system can be a "good enough" handling strategy. For anything important, don't use "lack of alarms" as an indicator of success: [[Lack of Signal is a Bad Signal]].
- **Single Point Hardening** - you can make your SPoF less likely to fail
- **Redundancy** - you can have a backup copy of the tool that might fail, or an alternative tool that's capable of doing the same thing
- **Load Balancing** - alongside redundant capabilities, having something that can route workload between them can help manage/mitigate failures
- **Fallback Process** - you can design a specific exception process for handling a SPoF. For example, if a person you depend on isn't available, you might be able to find a backup
- **[[Have a Buffer|Buffering]]** - expect failures and build in additional [[Overhead]] for time, capacity, etc
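
Two of the strategies above can be sketched in a few lines of code. This is a minimal illustration, not a production pattern: `call_with_fallback` tries redundant providers in order (Redundancy / Fallback Process), and `Heartbeat` treats prolonged silence as a failure rather than as success (Notifications & Signaling). All names, the 60-second threshold, and the fake clock are assumptions made for the example.

```python
import time


def call_with_fallback(providers, request):
    """Try each (name, callable) provider in order; return the first success."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(request)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")


class Heartbeat:
    """Reports unhealthy when no beat has arrived within the allowed silence."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence = max_silence_seconds
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        """Record that the monitored component just checked in."""
        self.last_beat = self.clock()

    def is_healthy(self):
        return self.clock() - self.last_beat <= self.max_silence


# Hypothetical providers: the primary is down, the backup works.
def flaky_primary(request):
    raise ConnectionError("primary is down")


def backup(request):
    return f"handled {request}"


used, result = call_with_fallback(
    [("primary", flaky_primary), ("backup", backup)], "job-1"
)
print(used, result)  # backup handled job-1

# A fake clock keeps the heartbeat example deterministic.
fake_now = [0.0]
hb = Heartbeat(max_silence_seconds=60, clock=lambda: fake_now[0])
fake_now[0] = 30.0
ok_at_30 = hb.is_healthy()    # True: last beat was 30 seconds ago
fake_now[0] = 120.0
ok_at_120 = hb.is_healthy()   # False: 120 seconds of silence
```

Note that the fallback wrapper and the heartbeat monitor are themselves components that can fail, so they belong on the SPoF list too.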
****
# More
## Source
- [[Myself]]