Design the network for a Storage Spaces Direct cluster

In a Storage Spaces Direct cluster, the network is the most important component. If the network is poorly designed or implemented, you can expect poor performance and high latency. All software-defined storage solutions rely on a healthy network, whether it is Nutanix, VMware vSAN or Microsoft S2D. When I audit an S2D configuration, most of the time the issue comes from the network. This is why I wrote this topic: how to design the network for a Storage Spaces Direct cluster.

Network requirements

The following statements come from the Microsoft documentation:

Minimum (for small scale 2-3 node)

  • 10 Gbps network interface
  • Direct-connect (switchless) is supported with 2-nodes

Recommended (for high performance, at scale, or deployments of 4+ nodes)

  • NICs that are remote-direct memory access (RDMA) capable, iWARP (recommended) or RoCE
  • Two or more NICs for redundancy and performance
  • 25 Gbps network interface or higher

As you can see, for an S2D cluster of four nodes or more, Microsoft recommends a 25 Gbps network. I think it is a good recommendation, especially for an all-flash configuration or when NVMe devices are implemented. Because S2D uses SMB for communication between nodes, RDMA can be leveraged (SMB Direct).
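If you want to check whether your network adapters can provide SMB Direct before going further, a quick PowerShell look is usually enough (no adapter names are assumed here; the output depends on your hardware and drivers):

Get-NetAdapterRdma                # shows which adapters expose RDMA and whether it is enabled
Get-SmbClientNetworkInterface     # SMB view of the interfaces: the RDMA Capable column must be True
Get-SmbServerNetworkInterface     # same check on the server side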

RDMA: iWARP and RoCE

Do you remember DMA (Direct Memory Access)? This feature allows a device attached to a computer (such as an SSD) to access memory without going through the CPU. Thanks to this feature, we achieve better performance and reduce CPU usage. RDMA (Remote Direct Memory Access) is the same thing but across the network: it allows a remote device to access local memory directly. Thanks to RDMA, CPU usage and latency are reduced while throughput is increased. RDMA is not mandatory for S2D, but it is recommended. Last year Microsoft stated that RDMA increases S2D performance by about 15% on average. So, I heavily recommend implementing it if you deploy an S2D cluster.
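Once the cluster is running, you can verify that SMB Direct is really in use between the nodes. This is only a quick sketch based on the standard SMB cmdlets:

Get-SmbMultichannelConnection | Format-Table ServerName, ClientRdmaCapable, ClientRSSCapable, Selected
# If ClientRdmaCapable is False, SMB falls back to TCP and you lose the RDMA benefit.
# RDMA traffic can also be watched with the "RDMA Activity" counters in Performance Monitor.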

Two RDMA implementations are supported by Microsoft: iWARP (Internet Wide-Area RDMA Protocol) and RoCE (RDMA over Converged Ethernet). And I can tell you one thing about these implementations: this is war! Microsoft recommends iWARP, while a lot of consultants prefer RoCE. In fact, Microsoft recommends iWARP because it requires less configuration than RoCE; misconfigured RoCE deployments generated a high number of Microsoft support cases. Consultants prefer RoCE because Mellanox is behind this implementation: Mellanox provides valuable switches and network adapters with great firmware and drivers, and each time a new Windows Server build is released, a supported Mellanox driver and firmware are also released.
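To give an idea of why RoCE requires more work, here is a minimal sketch of the DCB/PFC configuration usually applied on each node for RoCE. The adapter names, the priority 3 and the 50% bandwidth share are common example values, not requirements, and the same PFC settings must also be configured on the physical switches. None of this is needed with iWARP.

Install-WindowsFeature -Name Data-Center-Bridging

# Tag SMB Direct traffic (port 445) with priority 3
New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

# Enable Priority Flow Control for priority 3 only
Enable-NetQosFlowControl -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

# Reserve a bandwidth share for SMB and apply DCB on the RDMA pNICs (placeholder names)
New-NetQosTrafficClass "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
Enable-NetAdapterQos -Name "NIC1","NIC2"

# Do not accept DCBX configuration pushed by the switch
Set-NetQosDcbxSetting -Willing $false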

If you want more information about RoCE and iWARP, I suggest this series of topics from Didier Van Hoye.

Switch Embedded Teaming

Before choosing the right switches, cables and network adapters, it is important to understand the software story. In Windows Server 2012 R2 and prior, you had to create a team (LBFO). When the team was implemented, a tNIC was created. The tNIC is a sort of virtual NIC connected to the team. Then you were able to create the virtual switch bound to the tNIC. After that, the virtual NICs for management, storage, VMs and so on were added.

In addition to its complexity, this solution prevents the use of RDMA on virtual network adapters (vNICs). This is why Microsoft improved this part in Windows Server 2016. Now you can implement Switch Embedded Teaming (SET):

This solution reduces the network complexity and vNICs can support RDMA. However, there are some limitations with SET:

  • Each physical network adapter (pNIC) must be identical (same model, same firmware, same drivers)
  • Maximum of eight pNICs in a SET
  • Only the following load-balancing modes are supported: Hyper-V Port (for specific cases) and Dynamic. This limitation is a good thing because Dynamic is the appropriate choice in most cases (a creation sketch follows this list).
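Here is a minimal sketch of a SET creation and of the load-balancing check; the switch and adapter names are placeholders, not values from this article.

# Create a SET vSwitch over two identical pNICs
New-VMSwitch -Name "SW-SET" -NetAdapterName "NIC1","NIC2" -EnableEmbeddedTeaming $true -AllowManagementOS $false

# Review the teaming properties, including the load-balancing mode
Get-VMSwitchTeam -Name "SW-SET" | Format-List

# Set the load-balancing mode explicitly if needed (Dynamic or HyperVPort)
Set-VMSwitchTeam -Name "SW-SET" -LoadBalancingAlgorithm Dynamic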

For more information about load-balancing modes, Switch Embedded Teaming and its limitations, you can read this documentation. Switch Embedded Teaming brings another great advantage: you can create an affinity between a vNIC and a pNIC. Consider a SET where two pNICs are members of the team. On this vSwitch, you create two vNICs for storage purposes. You can create an affinity between the first vNIC and the first pNIC, and another between the second vNIC and the second pNIC. This ensures that both pNICs are used.
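As a sketch, and reusing the placeholder SET and adapter names from above, the affinity is created with Set-VMNetworkAdapterTeamMapping:

# Two storage vNICs on the SET
Add-VMNetworkAdapter -ManagementOS -SwitchName "SW-SET" -Name "Storage1"
Add-VMNetworkAdapter -ManagementOS -SwitchName "SW-SET" -Name "Storage2"

# Pin each storage vNIC to one pNIC so that both physical links carry traffic
Set-VMNetworkAdapterTeamMapping -ManagementOS -VMNetworkAdapterName "Storage1" -PhysicalNetAdapterName "NIC1"
Set-VMNetworkAdapterTeamMapping -ManagementOS -VMNetworkAdapterName "Storage2" -PhysicalNetAdapterName "NIC2"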

The designs presented below are based on Switch Embedded Teaming.

Network design: VMs traffics and storage separated

Some customers want to separate VM traffic from storage traffic. The first reason is that they want to connect VMs to a 1 Gbps network; because the storage network requires 10 Gbps, the two must be separated. The second reason is that they want to dedicate devices, such as switches, to storage. The following schema introduces this kind of design:

If you have 1 Gbps network ports for VMs, you can connect them to 1 Gbps switches while the network adapters for storage are connected to 10 Gbps switches.

Whatever you choose, the VMs will be connected to the Switch Embedded Teaming (SET), and you have to create a vNIC for management on top of it. So, when you connect to the nodes through RDP, you will go through the SET. The physical NICs (pNICs) dedicated to storage (those on the right of the schema) are not in a team. Instead, we leverage SMB MultiChannel, which allows multiple network connections to be used simultaneously. So, both network adapters will be used to establish SMB sessions.

Thanks to Simplified SMB MultiChannel, both pNICs can belong to the same network subnet and VLAN. Live-Migration is configured to use this network subnet and to leverage SMB.
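A minimal sketch of this design is shown below; the adapter names (VM1, VM2, STO1, STO2), IP addresses and subnet are placeholders I chose for the example.

# SET for VM and management traffic
New-VMSwitch -Name "SW-VM" -NetAdapterName "VM1","VM2" -EnableEmbeddedTeaming $true -AllowManagementOS $false
Add-VMNetworkAdapter -ManagementOS -SwitchName "SW-VM" -Name "Management"

# Storage pNICs: no teaming, one IP each in the same subnet (Simplified SMB MultiChannel)
New-NetIPAddress -InterfaceAlias "STO1" -IPAddress 10.10.0.11 -PrefixLength 24
New-NetIPAddress -InterfaceAlias "STO2" -IPAddress 10.10.0.12 -PrefixLength 24
Enable-NetAdapterRDMA -Name "STO1","STO2"

# Live-Migration over SMB so it benefits from SMB MultiChannel and SMB Direct
Set-VMHost -VirtualMachineMigrationPerformanceOption SMB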

Network Design: Converged topology

The following picture introduces my favorite design: a fully converged network. For this kind of topology, I recommend at least a 25 Gbps network, especially with NVMe or all-flash. In this case, only one SET is created with two or more pNICs. Then we create the following vNICs:

  • 1x vNIC for host management (RDP, AD and so on)
  • 2x vNIC for Storage (SMB, S2D and Live-Migration)

The vNICs for storage can belong to the same network subnet and VLAN thanks to Simplified SMB MultiChannel. Live-Migration is configured to use this network and the SMB protocol. RDMA is enabled on these vNICs, as well as on the pNICs if they support it. Then an affinity is created between the vNICs and the pNICs.
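A minimal sketch of this converged design, assuming two 25 Gbps pNICs named NIC1 and NIC2 (all names, the VLAN ID and the vEthernet aliases are placeholders):

# One SET over the two pNICs
New-VMSwitch -Name "SW-Converged" -NetAdapterName "NIC1","NIC2" -EnableEmbeddedTeaming $true -AllowManagementOS $false

# Management vNIC and two storage vNICs in the same storage VLAN
Add-VMNetworkAdapter -ManagementOS -SwitchName "SW-Converged" -Name "Management"
Add-VMNetworkAdapter -ManagementOS -SwitchName "SW-Converged" -Name "Storage1"
Add-VMNetworkAdapter -ManagementOS -SwitchName "SW-Converged" -Name "Storage2"
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "Storage1" -Access -VlanId 100
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "Storage2" -Access -VlanId 100

# Enable RDMA on the storage vNICs (the pNICs must be RDMA capable as well)
Enable-NetAdapterRDMA -Name "vEthernet (Storage1)","vEthernet (Storage2)"

# Then create the vNIC/pNIC affinity with the same Set-VMNetworkAdapterTeamMapping commands shown earlier.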

I love this design because it is really simple. You have one network adapter for the BMC (iDRAC, iLO, etc.) and only two network adapters for S2D and the VMs. So, the physical installation in the datacenter and the software configuration are easy.

Network Design: 2-node S2D cluster

Because we are able to direct-attach both nodes in a 2-node configuration, you don’t need switches for storage. However, the virtual machines and the host management vNIC require connectivity, so switches are required for these usages. But they can be 1 Gbps switches, which drastically reduces the solution cost.
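As a sketch of the storage side of this design (port names and subnets are placeholders), each direct-attached link gets its own subnet, one per port, as described in the comments below:

# On node 1
New-NetIPAddress -InterfaceAlias "STO1" -IPAddress 10.10.1.1 -PrefixLength 24
New-NetIPAddress -InterfaceAlias "STO2" -IPAddress 10.10.2.1 -PrefixLength 24

# On node 2
New-NetIPAddress -InterfaceAlias "STO1" -IPAddress 10.10.1.2 -PrefixLength 24
New-NetIPAddress -InterfaceAlias "STO2" -IPAddress 10.10.2.2 -PrefixLength 24

# RDMA stays available on the direct-attached ports
Enable-NetAdapterRDMA -Name "STO1","STO2"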

About Romain Serre

Romain Serre works in Lyon as a Senior Consultant. He is focused on Microsoft technologies, especially Hyper-V, System Center, storage, networking and Cloud OS technologies such as Microsoft Azure or Azure Stack. He is an MVP and is certified Microsoft Certified Solution Expert (MCSE Server Infrastructure & Private Cloud), on Hyper-V and on Microsoft Azure (Implementing a Microsoft Azure Solution).

11 comments

  1. Do you have the powershell commands (or links to other sites) to set up these different types of network designs?

  2. Great write up. I’m still getting to grips with Storage Spaces Direct. I’m soon going to buy two servers for a two-node setup. They will have at least dual 100 Gb Mellanox NICs. I’m unsure how to go about designing the network though. Would it be best to physically connect both servers together over the 100 Gb Mellanox NICs and create a virtual SET switch to handle both storage and migration traffic making use of RDMA? Or should I separate storage traffic away from everything else? If so, the servers will also be fitted with 10 Gb Intel NICs. Do I leave the Mellanox for storage without creating a vSwitch, and then directly cable the Intels together for migration, albeit losing RDMA functionality? I think the first option would be best?

    • Hello,

      The best way to achieve this installation is to attach both nodes directly (switchless) by using the 100G Mellanox Ethernet adapters. You configure two different network subnets on them (one for each port) and you use these NICs to handle storage and Live-Migration. Then you create a SET by using the 10G network adapters and a virtual network adapter for management. So the VMs will use the SET to communicate with the physical world.

      • Hi Romain,
        We’ve been setting up a hyper-converged S2D cluster with 2 interconnected nodes. The nodes are connected via a 1 Gb switch for host and VM internet connectivity, and interconnected with 25 Gb DAC cables. I’ve created a SET switch for the management on the 1 Gb NICs and also a SET team for the 25 Gb NICs for the Live Migration and storage. But reading your answer above it seems a SET team is not necessary for the 25 Gb NICs after all?

  3. Hello Romain,

    Great article thanks.

    The “Full converged” setup is attractive.

    I am not a security specialist so please forgive my lack of knowledge.

    May I have your opinion (and readers’ too) about DMZ VMs in such a scenario? I assume that they will be “kind of protected” (note the quotes) by different VLANs.

    Which techniques would you implement to really isolate DMZ VMs from the others ? (knowing that DMZ traffic will hit the same physical switches as legitimate traffic does.)

    Or would you simply recommend a dedicated host and physical network for DMZ ?

    Thank you

    Nico

    • Our DMZ was a standalone physical server with local disks, connected to a dedicated switch run off from our firewall. We recently moved it over to our Hyper-V cluster. Now we have VMs connected to a DMZ virtual switch which itself is connected to a dedicated NIC on the hosts. It’s all on its own VLAN, yet shares the same physical underlying infrastructure. Many businesses are going down this route now. There are pros and cons, but I think the pros far outweigh any cons. The benefit of this model is that your DMZ VMs can live migrate and access the same storage as everything else. In my mind it’s actually a less complex approach when stepping back and thinking about the overall design and workloads (server warranties, backups, DR, updates). I know it’s a bit ‘all eggs in one basket’ but the team can concentrate on the Hyper-V platform well rather than looking after that and a separate DMZ system.

    • Hello,

      A VLAN is a feature to isolate one subnet from another. By extrapolating, we can say a VLAN is just a “virtual switch” inside a physical switch. If you want to isolate subnets without using VLANs, you have to buy more and more physical switches connected to different ports on your firewall/router. To avoid that, we create VLANs and then we create a virtual network adapter on the firewall/router. Some security guys don’t trust this approach, either by lack of knowledge or simply because they don’t trust these technologies (bugs, vulnerabilities, 0-days, etc.).

      So, to be sure, you can dedicate hosts to the DMZ (highly secure environment), but most of the time you can host the DMZ VMs and LAN VMs together.

  4. Hello, I need some clarification about SET configuration.
    I’ve found no information about the physical connection limitations, settings and the behavior of the SET team switch. Does it have RSTP enabled by default?
    Is it possible to build (for high availability) a SET team and connect each of the members to a different switch (the switches are members of an RSTP infrastructure and uplinked, not stacked)? Your picture above (Network Design: 2-node S2D cluster) suggests that (but it is not possible with classic Windows team adapters).

    Should RSTP/STP be enabled on these switch ports?

    thx

  5. Hi, great article! Can you give us the PowerShell cmdlets to ensure that the storage and live migration are using the dedicated NICs? I have a similar setup but the LAN networks are 10 Gbps.
    Thanks!

  6. Romain,
    Enjoyed the article. I have been using servers with DAS for some time, now all SSD. I am considering moving to a Storage Spaces Direct cluster but have some questions. Ironically, the DAS solution is VERY FAST and HAS BEEN very reliable, but it is obviously a single point of failure: the server could die, though that has not happened.

    That said the questions:
    1. Do you find that a two-node cluster with Nested Resiliency and switchless direct-attached Ethernet (crossover/MDI-X) SFP+ cables gives really good uptime and is dependable and reliable enough for a production environment?

    2. Most 100 GbE or 40 GbE boards are dual-port. Do you recommend that we use one port as a backup for the other, or better, one IP network for storage and one IP network for migration?

    We already have a 10 GbE network (Intel X710-T4) for the main VM IP and management, 30 GbE LACP, with a 10 GbE backup switch (redundant).

    We already have 1 GbE LACP on stacked switches for external and DMZ functions, controlled by the firewall.

    I look forward to your input/response on the two questions and any other thoughts about moving from DAS to an S2D cluster, as it is a big move and a leap of faith.
