Fault tolerance: what does it mean? Let me break it down simply. Pictured above is just a bad design, not fault tolerance. Having two or more of something is one factor, but how it's implemented is just as important. Fault tolerance incorporates two very important principles: high availability and redundancy.
Now, if we had a few toilets side by side, kept only one open with the other two on standby, and could move the user automatically to another toilet during a failure, then technically it would be fault tolerant. Anyway, let's move on from toilets to the real world. 🙂
Simply put, fault tolerance is the ability to continue non-stop when a hardware failure occurs. A fault-tolerant system is designed from the ground up for reliability by building multiples of all critical components, such as CPUs, memory, disks and power supplies, into the same computer. In the event one component fails, another takes over without skipping a beat.
Many systems are designed to recover from a failure by detecting the failed component and switching to another computer system. These systems, although sometimes called fault tolerant, are more widely known as “high availability” systems, requiring that the software re-submits the job when the second system is available.
True fault-tolerant systems with redundant hardware are the most costly because the additional components add to the overall system cost. However, fault-tolerant systems provide the same processing capacity after a failure as before, whereas high availability systems often provide reduced capacity. OK, let's move on to fault tolerance in S2D.
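Before moving on, here is a rough way to picture that capacity difference. This is only a toy model with made-up numbers, and the function name `remaining_capacity` is my own invention: spares absorb failures first, and capacity only drops once the spares run out.

```python
def remaining_capacity(active: int, spares: int, failed: int) -> int:
    """Capacity (in units) still available after `failed` units fail.

    Simplified model: spares absorb failures first; only once the
    spares are used up does the active capacity start to drop.
    """
    if failed < 0 or failed > active + spares:
        raise ValueError("failed must be between 0 and active + spares")
    uncovered = max(0, failed - spares)
    return active - uncovered


# Fault-tolerant style (hypothetical sizing): 2 active units, 2 redundant spares.
fault_tolerant = remaining_capacity(active=2, spares=2, failed=1)

# High-availability style (hypothetical sizing): 4 units all sharing load, no spares.
high_availability = remaining_capacity(active=4, spares=0, failed=1)

print(fault_tolerant, high_availability)
```

In this sketch the fault-tolerant setup still delivers its full 2 units after one failure, while the high availability setup drops from 4 units to 3.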
Fault Tolerance in S2D
Storage Spaces Direct (S2D) uses 3-way mirroring and spreads those mirrors across 3 different servers in the cluster. S2D also supports chassis and rack awareness, giving you the option to distribute data copies across these fault domains.
For disk failures, S2D also uses a self-healing approach. In basic terms, S2D takes the failed disk offline and rebuilds the data copy on another node in the cluster. Replacing the drive adds capacity back into the system. This is an important point, as not all HCI vendors support self-healing. For example, on vSAN and some other vendors, disk failures take out entire vDisks.
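To make the idea of spreading three copies across different servers concrete, here is a minimal sketch. It is not S2D's actual allocation logic; the round-robin placement, the server names and the function names are all assumptions for illustration only.

```python
def place_copies(slab_id: int, servers: list[str], copies: int = 3) -> list[str]:
    """Pick `copies` distinct servers for one chunk of data.

    Round-robin is a stand-in here, NOT the real S2D algorithm.
    Assumes the names in `servers` are unique.
    """
    if len(servers) < copies:
        raise ValueError("need at least as many servers as copies")
    start = slab_id % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(copies)]


def survives(placement: list[str], failed: set[str]) -> bool:
    """Data is still readable if at least one copy sits on a healthy server."""
    return any(server not in failed for server in placement)


servers = ["s1", "s2", "s3", "s4"]           # hypothetical 4-node cluster
placement = place_copies(slab_id=5, servers=servers)

print(placement)
print(survives(placement, failed={"s2", "s3"}))        # two servers down
print(survives(placement, failed={"s2", "s3", "s4"}))  # all three copies gone
```

The point of the sketch is that with three copies on three different servers, the data survives the loss of any two of them. The trade-off is capacity: three full copies mean roughly one third of the raw space is usable.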
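The self-healing idea can be sketched in a few lines. This is a simplified illustration working at node level, not how S2D is implemented internally; the `heal` function and the node names are hypothetical.

```python
def heal(placements: dict[int, set[str]], failed: str,
         healthy: list[str], copies: int = 3) -> dict[int, set[str]]:
    """Rebuild any copy lost on `failed` onto another healthy node.

    `placements` maps a data chunk id to the set of nodes holding a copy.
    Returns a new mapping; the input is left untouched.
    """
    healed: dict[int, set[str]] = {}
    for chunk, nodes in placements.items():
        survivors = nodes - {failed}
        # Nodes that are healthy and do not already hold this chunk.
        candidates = [n for n in healthy if n != failed and n not in survivors]
        while len(survivors) < copies and candidates:
            survivors.add(candidates.pop(0))
        healed[chunk] = survivors
    return healed


before = {0: {"a", "b", "c"}, 1: {"b", "c", "d"}}   # hypothetical layout
after = heal(before, failed="b", healthy=["a", "c", "d"])

for chunk, nodes in sorted(after.items()):
    print(chunk, sorted(nodes))
```

After the rebuild, every chunk is back to three copies without waiting for the failed hardware to be replaced. The replacement drive then simply adds capacity back.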
Multisite Replication
S2D uses Storage Replica (which ships with Windows Server 2016) for synchronous or asynchronous replication. It supports both stretched clusters and cluster-to-cluster DR. Storage Replica is part of Windows Server and can also be used for other data replication needs outside of S2D.
OK, next up: Storage QoS and networking.
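If the difference between synchronous and asynchronous replication is fuzzy, this toy model may help. It does not use or imitate the Storage Replica API; the `ReplicatedVolume` class and its methods are invented purely to show the general trade-off.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a volume replicated from a primary to a secondary site."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary: list[str] = []
        self.secondary: list[str] = []
        self.pending: deque[str] = deque()   # writes not yet shipped (async only)

    def write(self, block: str) -> str:
        self.primary.append(block)
        if self.synchronous:
            # Synchronous: the remote copy is written before we acknowledge.
            self.secondary.append(block)
        else:
            # Asynchronous: acknowledge now, ship the block later.
            self.pending.append(block)
        return "ack"

    def drain(self) -> None:
        """Ship all queued writes to the secondary site."""
        while self.pending:
            self.secondary.append(self.pending.popleft())

    def lost_if_primary_fails(self) -> int:
        """Number of acknowledged writes the secondary does not have yet."""
        return len(self.primary) - len(self.secondary)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for vol in (sync_vol, async_vol):
    vol.write("block-1")
    vol.write("block-2")

print(sync_vol.lost_if_primary_fails(), async_vol.lost_if_primary_fails())
```

In this model the synchronous volume never has acknowledged writes missing on the second site, at the cost of waiting for the remote write on every operation. The asynchronous volume acknowledges faster but can lose whatever is still queued if the primary site fails.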
Until next time, Rob….