Single Point of Failure

A single point of failure (SPoF) is worth considering in the design of any system. It is what it sounds like: a single _point_ (e.g. a database, application, person, or tool) in your system whose failure can render the entire system incapable of performing its function.

[[Relevant XKCD]] illustrating a SPoF:

![[IMG_7028.png|Relevant XKCD]]

# Identification
Some ways to help identify your SPoFs. No matter what, you will never identify every possible SPoF.
- [[FMEA]]
- [[Diagram Types (index)|Systems Diagrams]]
- Researching SPoFs in similar systems

# Handling Strategies
- **Acceptance** - not all SPoFs are worth mitigating. For example, those that are:
    - Unlikely to fail
    - Impossible to mitigate
- **Notifications & Signaling** - monitoring an otherwise hidden SPoF in your system can be a "good enough" handling strategy. For anything important, don't treat a "lack of alarms" as an indicator of success: [[Lack of Signal is a Bad Signal]].
- **Single Point Hardening** - you can make your SPoF less likely to fail
- **Redundancy** - you can have a backup copy of the tool that might fail, or an alternative tool that's capable of doing the same thing
- **Load Balancing** - in line with having redundant capabilities, having something that can route workload away from a failed component helps manage/mitigate failures
- **Fallback Process** - you can design a specific exception process for handling a SPoF. E.g. if a person you depend on isn't available, you might be able to find a backup
- **[[Have a Buffer|Buffering]]** - expect failures and build in additional [[Overhead]] for time, capacity, etc.

****
# More
## Source
- [[Myself]]
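## Example Sketch
A minimal sketch of how the Redundancy, Fallback Process, and Notifications & Signaling strategies might combine in code. Everything here is hypothetical and for illustration only: `call_with_failover`, `AllBackendsFailed`, the `primary`/`replica` functions, and the `on_failure` hook are invented names, not part of any real library. It assumes each redundant component can be wrapped as a zero-argument callable.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when every redundant backend has failed (hypothetical name)."""


def call_with_failover(
    backends: Sequence[Tuple[str, Callable[[], T]]],
    on_failure: Optional[Callable[[str, Exception], None]] = None,
) -> Tuple[str, T]:
    """Try each backend in order and return (name, result) from the first success.

    on_failure is called for every backend that fails, so a failure that
    was masked by a backup is still signaled instead of staying hidden.
    """
    errors: List[Tuple[str, Exception]] = []
    for name, backend in backends:
        try:
            return name, backend()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append((name, exc))
            if on_failure is not None:
                on_failure(name, exc)
    # No backend succeeded: the redundancy itself is exhausted.
    raise AllBackendsFailed(errors)


# Hypothetical backends: the primary is down, the replica still works.
def primary() -> str:
    raise ConnectionError("primary database unreachable")


def replica() -> str:
    return "row-from-replica"


alerts: List[str] = []
served_by, value = call_with_failover(
    [("primary", primary), ("replica", replica)],
    on_failure=lambda name, exc: alerts.append(name),
)
print(served_by, value, alerts)  # replica row-from-replica ['primary']
```

The `on_failure` hook is the important design choice: without it, a working backup quietly hides the fact that the primary has failed, and the backup becomes the new SPoF without anyone noticing.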