Fault Domain Awareness with Storage Spaces Direct

Fault Domain Awareness is a feature introduced in Failover Clustering with Windows Server 2016. It brings a new approach to high availability that is more flexible and cloud-oriented. In previous editions, high availability was based only on the node: if a node failed, its resources were moved to another node. With Fault Domain Awareness, the point of failure can be a node (as before), a chassis, a rack or a site. Cloud-oriented datacenters need this kind of flexibility to move the point of failure of the cluster from a single node to an entire rack containing several nodes.

In Microsoft's definition, a fault domain is a set of hardware that shares a single point of failure. The default fault domain in a cluster is the node. You can also create fault domains based on chassis, rack and site. Moreover, a fault domain can belong to another fault domain. For example, you can create rack fault domains and set a site as their parent.

Storage Spaces Direct (S2D) can leverage Fault Domain Awareness to spread block replicas across fault domains (unfortunately, it is not yet possible to spread block replicas across sites because Storage Spaces Direct doesn’t support stretched clusters with Storage Replica). Consider a three-way mirroring implementation of S2D: the data exists three times (the original and two replicas). S2D can, for example, write the original data to a first rack and each replica to a different rack. In this way, even if you lose a rack, the storage keeps working.

The S2D documentation from Microsoft no longer states the number of nodes required for each resiliency type; it states the number of fault domains instead:

  • 2-way mirroring: two fault domains
  • 3-way mirroring: three fault domains
  • Erasure coding: at least four fault domains

These statements matter for the design. If you plan to use fault domain awareness with racks and erasure coding, you need at least four racks. Each rack must have the same number of nodes. So, with four racks, the number of nodes per cluster can be 4, 8, 12 or 16. By using fault domain awareness, you lose some deployment flexibility, but you increase availability.
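The arithmetic behind these cluster sizes can be sketched in a few lines of PowerShell. Get-ValidClusterSize is a hypothetical helper written for this illustration (it is not part of any module), and the default of 16 assumes the 16-node limit of a Storage Spaces Direct cluster:

```powershell
# Hypothetical helper: list the cluster sizes that keep the same
# number of nodes in every rack.
function Get-ValidClusterSize {
    param(
        [ValidateRange(1, 16)]
        [int]$RackCount,

        # Assumption: a Storage Spaces Direct cluster holds at most 16 nodes
        [int]$MaxNodes = 16
    )

    $nodesPerRack = 1
    while ($RackCount * $nodesPerRack -le $MaxNodes) {
        # Emit the total node count for this number of nodes per rack
        $RackCount * $nodesPerRack
        $nodesPerRack++
    }
}

# Four racks, as required for erasure coding with rack fault domains
Get-ValidClusterSize -RackCount 4   # 4, 8, 12, 16
```

With three racks (the minimum for three-way mirroring), the same helper returns 3, 6, 9, 12 and 15.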

Configure Fault Domain Awareness

This section shows how to configure fault domains in the cluster. It is strongly recommended to make this configuration before you enable Storage Spaces Direct!

By using PowerShell

In this example, I show how to configure fault domains in a two-node cluster. Fault domains are not really useful in a two-node cluster, but I just want to show how to create them in the cluster configuration.

Before running the cmdlets below, I initialized the $CIM variable with the following command (Cluster-Hyv01 is the name of my cluster):

$CIM = New-CimSession -ComputerName Cluster-Hyv01

Then I gather fault domain information with the Get-ClusterFaultDomain cmdlet.
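
A minimal sketch of that call, assuming the $CIM session created above is still open:

```powershell
# List the fault domains currently known to the cluster
Get-ClusterFaultDomain -CimSession $CIM
```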

By default, a fault domain is automatically created for each node. To create additional fault domains, use the New-ClusterFaultDomain cmdlet.
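
A minimal sketch of creating the fault domains used in the rest of this example. The names Lyon, Rack-22U and Chassis-Fabric come from the Set-ClusterFaultDomain commands in this article; the type given to each one and the -Location value are assumptions, and the parameter names are worth checking with Get-Help New-ClusterFaultDomain on your system:

```powershell
# Assumes the $CIM session created earlier.
# Site, rack and chassis are created first; parents are assigned afterwards.
New-ClusterFaultDomain -Name Lyon           -FaultDomainType Site    -Location "Lyon 8e" -CimSession $CIM
New-ClusterFaultDomain -Name Rack-22U       -FaultDomainType Rack    -CimSession $CIM
New-ClusterFaultDomain -Name Chassis-Fabric -FaultDomainType Chassis -CimSession $CIM
```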

If I run the Get-ClusterFaultDomain cmdlet again, it lists every fault domain, including the new ones.

Then I run the following cmdlets to set the parent of each fault domain:

Set-ClusterFaultDomain -Name Rack-22U -Parent Lyon
Set-ClusterFaultDomain -Name Chassis-Fabric -Parent Rack-22U
Set-ClusterFaultDomain -Name pyhyv01 -Parent Chassis-Fabric
Set-ClusterFaultDomain -Name pyhyv02 -Parent Chassis-Fabric

In Failover Cluster Manager, you can see the result by opening the Nodes tab: each node belongs to Rack-22U and to the site Lyon.

By using XML

You can also declare your physical infrastructure with an XML file such as the following:

<Topology>
    <Site Name="Lyon" Location="Lyon 8e">
        <Rack Name="Rack-22U" Location="Restroom">
            <Node Name="pyhyv01" Location="Rack 6U" />
            <Node Name="pyhyv02" Location="Rack 12U" />
        </Rack>
    </Site>
</Topology>
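
Before applying such a file, it may help to check that every rack declares the same number of nodes, as the design rules above require. The following is only a sketch: it parses the topology locally with the .NET XML parser and does not touch the cluster, and the variable names are illustrative:

```powershell
# The topology from this article, inlined so the sketch is self-contained
$topologyXml = @'
<Topology>
    <Site Name="Lyon" Location="Lyon 8e">
        <Rack Name="Rack-22U" Location="Restroom">
            <Node Name="pyhyv01" Location="Rack 6U" />
            <Node Name="pyhyv02" Location="Rack 12U" />
        </Rack>
    </Site>
</Topology>
'@

$topology = [xml]$topologyXml

# One summary object per rack: parent name, rack name, number of nodes
$rackSummary = foreach ($rack in $topology.SelectNodes('//Rack')) {
    [pscustomobject]@{
        Site      = $rack.ParentNode.GetAttribute('Name')
        Rack      = $rack.GetAttribute('Name')
        NodeCount = $rack.SelectNodes('Node').Count
    }
}
$rackSummary | Format-Table

# Warn when racks hold different numbers of nodes
$distinctCounts = @($rackSummary.NodeCount | Sort-Object -Unique)
if ($distinctCounts.Count -gt 1) {
    Write-Warning 'Racks do not all contain the same number of nodes.'
}
```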

Once your topology is written, you can configure your cluster with the XML file:

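# Replace the quoted placeholder with the path to your topology file
$xml = Get-Content -Path "<XML File>" | Out-String
# Apply the topology to the cluster
Set-ClusterFaultDomainXML -XML $xml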
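
To confirm what the cluster has stored, the topology can be read back. A minimal sketch, assuming Get-ClusterFaultDomainXML returns the current topology as XML text when run on a cluster node:

```powershell
# Read back the fault domain topology currently stored by the cluster
Get-ClusterFaultDomainXML
```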

Conclusion

Fault Domain Awareness is a great feature to improve the availability of your infrastructure, especially with Storage Spaces Direct. Fault domains can be based on racks instead of nodes, which means that you can lose a higher number of nodes and keep the service running. On the other hand, you must be careful during the design phase because the same number of nodes must be installed in each rack. If you need erasure coding, you need at least four racks.

About Romain Serre

Romain Serre works in Lyon as a Senior Consultant. He focuses on Microsoft technology, especially Hyper-V, System Center, storage, networking and Cloud OS technologies such as Microsoft Azure and Azure Stack. He is an MVP and holds the Microsoft Certified Solutions Expert certification (MCSE Server Infrastructure & Private Cloud), as well as certifications on Hyper-V and on Microsoft Azure (Implementing a Microsoft Azure Solution).

4 comments

  1. Is there any option to configure fault domains after creating the S2D cluster?

  2. In Windows Server 2019 there is a FaultDomainType called StorageNode. Where does it fit? Is it the parent of Node?
    And what if I have multiple JBODs per node? How can I ensure that copies of the data are not placed on several enclosures of the same node when I want FaultDomain = Node?

    • StorageNode refers to the node. If you have several JBODs connected to the server, S2D handles that by using SES enclosure identification to make sure it does not write a block and its copy to a single StorageNode.
