S2D – Tech-Coffee – https://www.tech-coffee.net

Design the network for a Storage Spaces Direct cluster
https://www.tech-coffee.net/design-the-network-for-a-storage-spaces-direct-cluster/
Mon, 07 Jan 2019 12:57:57 +0000

The post Design the network for a Storage Spaces Direct cluster appeared first on Tech-Coffee.

In a Storage Spaces Direct cluster, the network is the most important part. If the network is not well designed or well implemented, you can expect poor performance and high latency. All software-defined storage solutions rely on a healthy network, whether Nutanix, VMware vSAN or Microsoft S2D. When I audit an S2D configuration, most of the time the issue comes from the network. This is why I wrote this topic: how to design the network for a Storage Spaces Direct cluster.

Network requirements

The following statements come from the Microsoft documentation:

Minimum (for small scale 2-3 node)

  • 10 Gbps network interface
  • Direct-connect (switchless) is supported with 2-nodes

Recommended (for high performance, at scale, or deployments of 4+ nodes)

  • NICs that are remote-direct memory access (RDMA) capable, iWARP (recommended) or RoCE
  • Two or more NICs for redundancy and performance
  • 25 Gbps network interface or higher

As you can see, for a 4-node S2D cluster or larger, Microsoft recommends a 25 Gbps network. I think it is a good recommendation, especially for an all-flash configuration or when NVMe devices are implemented. Because S2D uses SMB to establish communication between nodes, RDMA can be leveraged (SMB Direct).

RDMA: iWARP and RoCE

Do you remember DMA (Direct Memory Access)? This feature allows a device attached to a computer (such as an SSD) to access memory without going through the CPU. Thanks to this feature, we achieve better performance and reduce CPU usage. RDMA (Remote Direct Memory Access) is the same thing, but across the network: it allows a remote device to access local memory directly. Thanks to RDMA, CPU usage and latency are reduced while throughput is increased. RDMA is not a mandatory feature for S2D, but it is recommended. Last year Microsoft stated that RDMA increases S2D performance by about 15% on average. So I strongly recommend implementing it if you deploy an S2D cluster.
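Before relying on SMB Direct, it is worth verifying that your adapters actually expose RDMA. A quick check from PowerShell (the output depends entirely on your hardware):

```powershell
# List network adapters and whether RDMA is enabled on them
Get-NetAdapterRdma

# Show the interfaces that SMB itself considers RDMA-capable
Get-SmbClientNetworkInterface | Where-Object RdmaCapable
```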

Two RDMA implementations are supported by Microsoft: iWARP (Internet Wide-Area RDMA Protocol) and RoCE (RDMA over Converged Ethernet). And I can tell you one thing about these implementations: this is war! Microsoft recommends iWARP while many consultants prefer RoCE. In fact, Microsoft recommends iWARP because it requires less configuration than RoCE; because of RoCE misconfigurations, the number of Microsoft support cases was high. Consultants prefer RoCE because Mellanox is behind this implementation. Mellanox provides valuable switches and network adapters with great firmware and drivers: each time a new Windows Server build is released, a supported Mellanox driver/firmware is also released.

If you want more information about RoCE and iWARP, I suggest this series of topics by Didier Van Hoye.

Switch Embedded Teaming

Before choosing the right switches, cables and network adapters, it's important to understand the software story. In Windows Server 2012 R2 and earlier, you had to create a team. When the team was implemented, a tNIC was created. The tNIC is a kind of virtual NIC bound to the team. Then you were able to create the virtual switch connected to the tNIC. After that, the virtual NICs for management, storage, VMs and so on were added.

In addition to this complexity, this solution prevents the use of RDMA on virtual network adapters (vNICs). This is why Microsoft improved this part in Windows Server 2016. Now you can implement Switch Embedded Teaming (SET):

This solution reduces the network complexity and vNICs can support RDMA. However, there are some limitations with SET:

  • Each physical network adapter (pNIC) must be identical (same firmware, same driver, same model)
  • Maximum of 8 pNICs in a SET
  • The following load-balancing modes are supported: Hyper-V Port (specific cases) and Dynamic. This limitation is a good thing, because Dynamic is the appropriate choice in most cases.

For more information about load-balancing modes, Switch Embedded Teaming and its limitations, you can read this documentation. Switch Embedded Teaming brings another great advantage: you can create an affinity between a vNIC and a pNIC. Consider a SET where two pNICs are members of the team. On this vSwitch, you create two vNICs for storage purposes. You can map the first vNIC to the first pNIC and the second vNIC to the second pNIC. This ensures that both pNICs are used.
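As a sketch, assuming a SET built on two pNICs named NIC1 and NIC2 with storage vNICs SMB01 and SMB02 (all names are examples), the affinity is created with Set-VMNetworkAdapterTeamMapping:

```powershell
# Pin each storage vNIC to one pNIC so that both physical links carry SMB traffic
Set-VMNetworkAdapterTeamMapping -ManagementOS -VMNetworkAdapterName "SMB01" -PhysicalNetAdapterName "NIC1"
Set-VMNetworkAdapterTeamMapping -ManagementOS -VMNetworkAdapterName "SMB02" -PhysicalNetAdapterName "NIC2"
```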

The designs presented below are based on Switch Embedded Teaming.

Network design: VM and storage traffic separated

Some customers want to separate VM traffic from storage traffic. The first reason is that they want to connect VMs to a 1 Gbps network: because the storage network requires 10 Gbps, the two must be separated. The second reason is that they want to dedicate devices, such as switches, to storage. The following schema introduces this kind of design:

If you have 1 Gbps network ports for VMs, you can connect them to 1 Gbps switches while the network adapters for storage are connected to 10 Gbps switches.

Whatever you choose, the VMs are connected to the Switch Embedded Teaming (SET) and you have to create a management vNIC on top of it. So, when you connect to the nodes through RDP, you go through the SET. The physical NICs (pNICs) dedicated to storage (those on the right of the scheme) are not teamed. Instead, we leverage SMB Multichannel, which uses multiple network connections simultaneously. So both network adapters are used to establish SMB sessions.

Thanks to Simplified SMB Multichannel, both pNICs can belong to the same network subnet and VLAN. Live migration is configured to use this network subnet and to leverage SMB.
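Configuring live migration to use SMB (and therefore benefit from SMB Direct and Multichannel) is a one-liner per node; a minimal sketch:

```powershell
# Use SMB as the live-migration transport on this host
Set-VMHost -VirtualMachineMigrationPerformanceOption SMB
```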

Network Design: Converged topology

The following picture introduces my favorite design: a fully converged network. For this kind of topology, I recommend at least a 25 Gbps network, especially with NVMe or all-flash. In this case, only one SET is created with two or more pNICs. Then we create the following vNICs:

  • 1x vNIC for host management (RDP, AD and so on)
  • 2x vNIC for Storage (SMB, S2D and Live-Migration)

The storage vNICs can belong to the same network subnet and VLAN thanks to Simplified SMB Multichannel. Live migration is configured to use this network and the SMB protocol. RDMA is enabled on these vNICs, as well as on the pNICs if they support it. Then an affinity is created between the vNICs and the pNICs.
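A possible implementation of this converged design (the switch name, adapter names and VLAN IDs are examples, not prescriptions):

```powershell
# One SET over both 25 Gbps pNICs
New-VMSwitch -Name "vSwitch" -NetAdapterName "NIC1","NIC2" -EnableEmbeddedTeaming $true -AllowManagementOS $false

# Management vNIC (RDP, AD and so on)
Add-VMNetworkAdapter -ManagementOS -SwitchName "vSwitch" -Name "Management"

# Two storage vNICs in the same subnet/VLAN (Simplified SMB Multichannel), with RDMA enabled
"SMB01","SMB02" | ForEach-Object {
    Add-VMNetworkAdapter -ManagementOS -SwitchName "vSwitch" -Name $_
    Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName $_ -Access -VlanId 20
    Enable-NetAdapterRdma -Name "vEthernet ($_)"
}
```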

I love this design because it is really simple. You have one network adapter for the BMC (iDRAC, iLO, etc.) and only two network adapters for S2D and VMs. So the physical installation in the datacenter and the software configuration are easy.

Network Design: 2-node S2D cluster

Because we are able to direct-attach both nodes in a 2-node configuration, you don't need a switch for storage. However, virtual machines and the host-management vNIC require connectivity, so switches are required for these usages. But they can be 1 Gbps switches, which drastically reduces the solution cost.

S2D Real case: detect a lack of cache
https://www.tech-coffee.net/s2d-real-case-detect-a-lack-of-cache/
Tue, 04 Dec 2018 18:11:39 +0000

The post S2D Real case: detect a lack of cache appeared first on Tech-Coffee.

Last week I worked for a customer who was experiencing a performance issue on an S2D cluster. The customer's infrastructure is composed of one compute cluster (Hyper-V) and one 4-node S2D cluster. First, I checked whether the issue was related to the network, and then whether a hardware failure was producing this performance drop. Then I ran the watch-cluster.ps1 script from VMFleet.

The following screenshot comes from the watch-cluster.ps1 script. As you can see, one CSV shows almost 25 ms of latency. High latency impacts overall performance, especially when IO-intensive applications are hosted. If we look at the cache, a lot of misses per second are registered, especially on the high-latency CSV. But why can misses per second produce high latency?

What happens in case of a lack of cache?

The solution I troubleshooted is composed of 2 SSDs and 8 HDDs per node. The cache ratio is 1:4 and the cache capacity is almost 6.5% of the raw capacity. The IO path in normal operation is depicted in the following schema:

Now, in the current situation, there are a lot of misses per second, which means the SSDs cannot handle these IOs because there is not enough cache. The schema below depicts the IO path for missed IOs:

You can see that in case of a miss, the IO goes directly to the HDD without being cached in the SSD. An HDD is really slow compared to an SSD, and each time IO hits this kind of storage device directly, latency increases. When latency increases, overall performance decreases.

How to resolve that?

To resolve this issue, I told the customer to add two SSDs to each node. These SSDs should be equivalent (or almost) to those already installed in the nodes. By adding SSDs, I improve the cache ratio to 1:2 and the cache capacity to 10% of the raw capacity.
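The improvement can be sanity-checked with a quick calculation. Only the drive counts below come from this case; the drive sizes are hypothetical, chosen for illustration:

```python
def cache_stats(ssd_count, ssd_gb, hdd_count, hdd_gb):
    """Return (capacity drives per cache drive, cache size as % of HDD raw capacity)."""
    ratio = hdd_count / ssd_count
    pct = 100 * (ssd_count * ssd_gb) / (hdd_count * hdd_gb)
    return ratio, pct

# Before: 2 SSDs + 8 HDDs per node -> 1:4 cache ratio
# (hypothetical sizes: 800 GB SSDs, 4 TB HDDs)
print(cache_stats(2, 800, 8, 4000))  # (4.0, 5.0)

# After adding 2 SSDs per node -> 1:2 ratio, cache percentage doubles
print(cache_stats(4, 800, 8, 4000))  # (2.0, 10.0)
```

Doubling the number of cache devices halves the number of capacity drives each SSD must serve, and doubles the cache-to-raw percentage.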

It's really important to size the cache tier carefully when you design your solution, to avoid this issue. As a fellow MVP said: storage is cheap, downtime is expensive.

Storage Spaces Direct: performance tests between 2-Way Mirroring and Nested Resiliency
https://www.tech-coffee.net/storage-spaces-direct-performance-tests-between-2-way-mirroring-and-nested-resiliency/
Wed, 17 Oct 2018 09:38:52 +0000

The post Storage Spaces Direct: performance tests between 2-Way Mirroring and Nested Resiliency appeared first on Tech-Coffee.

Microsoft has released Windows Server 2019 with a new resiliency mode called Nested Resiliency. This mode can handle two failures in a two-node S2D cluster. Nested Resiliency comes in two flavors: nested two-way mirroring and nested mirror-accelerated parity. I'm certain that nested two-way mirroring is faster than nested mirror-accelerated parity, but the first provides only 25% of usable capacity while the second provides 40%. After discussing with some customers, they prefer to improve usable capacity rather than performance. Therefore, I expect to deploy more nested mirror-accelerated parity than nested two-way mirroring.

Before Windows Server 2019, two-way mirroring (which provides 50% of usable capacity) was mandatory in a two-node S2D cluster. Now with Windows Server 2019, we have the choice. So I wanted to compare performance between two-way mirroring and nested mirror-accelerated parity. Moreover, I wanted to know whether compression and deduplication have an impact on performance and CPU usage.
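The capacity trade-off between the modes can be made concrete with a quick calculation. The efficiency figures are those quoted above; the nested mirror-accelerated parity value is approximate and volume-dependent:

```python
# Usable-capacity efficiency of each two-node resiliency mode
EFFICIENCY = {
    "two-way mirroring": 0.50,                 # pre-2019 two-node default
    "nested two-way mirroring": 0.25,
    "nested mirror-accelerated parity": 0.40,  # approximate, volume-dependent
}

def usable_tb(raw_tb, mode):
    """Usable capacity in TB for a given raw pool size and resiliency mode."""
    return raw_tb * EFFICIENCY[mode]

for mode in EFFICIENCY:
    print(f"{mode}: {usable_tb(40, mode):.0f} TB usable out of 40 TB raw")
```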

N.B.: I executed these tests in my lab, which is composed of do-it-yourself servers. What I want to show is a trend: what could be the bottleneck in some cases and whether Nested Resiliency has an impact on performance. So please, don't blame me in the comment section 🙂

Test platform

I ran my tests on the following platform, composed of two nodes:

  • CPU: 1x Xeon 2620v2
  • Memory: 64GB of DDR3 ECC Registered
  • Storage:
    • OS: 1x Intel SSD 530 128GB
    • S2D HBA: Lenovo N2215
    • S2D storage: 6x SSD Intel S3610 200GB
  • NIC: Mellanox Connectx 3-Pro (Firmware 5.50)
  • OS: Windows Server 2019 GA build

Both servers are connected to two Ubiquiti ES-16-XG switches. Even though they don't support PFC/ETS and so on, RDMA works (I tested it with the test-RDMA script). I don't have enough traffic in my lab to disturb RDMA without a proper configuration. Even though I implemented it this way in my lab, it is not supported and you should not configure production systems this way. On the Windows Server side, I added both Mellanox network adapters to a SET and created three virtual network adapters:

  • 1x management vNIC for RDP, AD and so on (routed)
  • 2x SMB vNICs for live migration and SMB traffic (not routed). Each vNIC is mapped to a pNIC.

To test the solution I used VMFleet. First I created volumes in two-way mirroring without deduplication, then I enabled deduplication. Next I deleted and recreated the volumes in nested mirror-accelerated parity without deduplication. Finally, I enabled compression and deduplication.

I ran VMFleet with a block size of 4 KB, an outstanding IO count of 30, and 2 threads per VM.

Two-Way Mirroring without deduplication results

First, I ran the test without write workloads to see the maximum performance I could get. My cluster is able to deliver 140K IOPS at 82% CPU usage.

In the following test, I added 30% write workloads. The total is almost 97K IOPS at 87% CPU usage.

As you can see, RSS and VMMQ are well configured because all cores are used.

Two-Way Mirroring with deduplication

First, you can see that deduplication is efficient: I saved 70% of total storage.

Then I ran a VMFleet test and, as you can see, there is a huge drop in performance. Looking closely at the screenshot below, you can see it's because my CPU reaches almost 97%. I'm sure that with a better CPU I could get better performance. So, first trend: deduplication has an impact on CPU usage, and if you plan to use this feature, don't choose a low-end CPU.

By adding 30% write, I can't expect better performance. The CPU still limits overall cluster performance.

Nested Mirror-Accelerated Parity without deduplication

After recreating the volumes, I ran a test with 100% read. Compared to two-way mirroring, there is a slight drop: I lost "only" 17K IOPS, reaching 123K IOPS. CPU usage is 82%. You can also see that the latency is great (2 ms).

Then I added 30% write, and we can see the performance drop compared to two-way mirroring. My CPU usage reached 95%, which limits performance (but latency is contained to 6 ms on average). So nested mirror-accelerated parity requires more CPU than two-way mirroring.

Nested Mirror-Accelerated Parity with deduplication

First, deduplication also works great on a nested mirror-accelerated parity volume: I saved 75% of storage.

As with two-way mirroring with deduplication, performance is poor because of my CPU (97% usage).

Conclusion

First, deduplication works great if you need to save space, at the cost of higher CPU usage. Secondly, nested mirror-accelerated parity requires more CPU, especially with write workloads. The following charts illustrate the CPU bottleneck. With deduplication, latency always increases, and I think it is because of the CPU bottleneck. This is why I recommend being careful about the CPU choice. Nested mirror-accelerated parity also consumes more CPU than two-way mirroring.

Another interesting finding is that nested mirror-accelerated parity produces a slight performance drop compared to two-way mirroring, but brings the ability to survive two failures in the cluster. With deduplication enabled, we can save space and increase the usable capacity. For two-node configurations, I recommend nested mirror-accelerated parity to customers, while paying attention to the CPU.

Storage Spaces Direct: Parallel rebuild
https://www.tech-coffee.net/storage-spaces-direct-parallel-rebuild/
Fri, 22 Jun 2018 15:39:59 +0000

The post Storage Spaces Direct: Parallel rebuild appeared first on Tech-Coffee.

Parallel rebuild is a Storage Spaces feature that enables a storage pool to be repaired even if the failed disk has not been replaced. This feature is not new to Storage Spaces Direct: it has existed since Windows Server 2012 with Storage Spaces. It is an automatic process which occurs if you have enough free space in the storage pool. This is why Microsoft recommends leaving some free space in the storage pool to allow parallel rebuild. This amount of free space is often forgotten when designing a Storage Spaces Direct solution, which is why I wanted to write this theoretical topic.

How parallel rebuild works

Parallel rebuild needs some free space to work; think of it as spare capacity. When you create a RAID 6 volume, a disk can be kept as a spare in case of failure. In Storage Spaces (Direct), instead of a spare disk, we have spare free space. Parallel rebuild occurs when a disk fails. If enough capacity is available, parallel rebuild runs automatically and immediately to restore the resiliency of the volumes. In fact, Storage Spaces Direct creates a new copy of the data that was hosted by the failed disk.

When you receive the new disk (4 hours later, because you took a 4-hour support contract :p), you can replace the failed one. The disk is automatically added to the storage pool if the auto-pool option is enabled. Once the disk is added to the storage pool, an automatic rebalance process runs to spread data across all disks for the best efficiency.

How to calculate the amount of free space

Microsoft recommends leaving free space equal to one capacity disk per node, up to 4 drives:

  • 2-node configuration: leave free the capacity of 2 capacity devices
  • 3-node configuration: leave free the capacity of 3 capacity devices
  • 4-node and more configuration: leave free the capacity of 4 capacity devices

Let's consider a 4-node S2D cluster with the following storage configuration per node. I plan to deploy 3-way mirroring:

  • 3x SSD of 800GB (Cache)
  • 6x HDD of 2TB (Capacity). Total: 48TB of raw storage.

Because I deploy a 4-node configuration, I should leave free space equivalent to four capacity drives. So, in this example, 8 TB should be reserved for parallel rebuild, leaving 40 TB available. Because I want to implement 3-way mirroring, I divide the available capacity by 3. So 13.3 TB is the usable storage.

Now I choose to add a node to this cluster. I don't need to reserve more space for parallel rebuild (per the Microsoft recommendation). So I add 12 TB of capacity (6x HDD of 2 TB) to the available capacity, for a total of 52 TB.
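The arithmetic above can be sketched with a small helper (a hypothetical function, applying the reserve rule quoted earlier):

```python
def s2d_capacity(nodes, drive_tb, drives_per_node, mirror_copies=3):
    """Raw, parallel-rebuild reserve, available and usable capacity in TB.

    The reserve is one capacity drive per node, capped at 4 drives,
    per the Microsoft recommendation quoted above.
    """
    raw = nodes * drives_per_node * drive_tb
    reserve = min(nodes, 4) * drive_tb
    available = raw - reserve
    return raw, reserve, available, available / mirror_copies

# 4-node cluster, 6x 2 TB HDDs per node, 3-way mirroring
print(s2d_capacity(4, 2, 6))  # 48 TB raw, 8 TB reserved, 40 TB available, ~13.3 TB usable

# Adding a 5th node adds no extra reserve: 52 TB become available
print(s2d_capacity(5, 2, 6))
```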

Conclusion

Parallel rebuild is an interesting feature because it restores resiliency even if the failed disk has not yet been replaced. But parallel rebuild has a cost in storage usage. Don't forget the reserved capacity when you are planning capacity.

Storage Spaces Direct and deduplication in Windows Server 2019
https://www.tech-coffee.net/storage-spaces-direct-and-deduplication-in-windows-server-2019/
Tue, 05 Jun 2018 14:58:14 +0000

The post Storage Spaces Direct and deduplication in Windows Server 2019 appeared first on Tech-Coffee.

When Windows Server 2016 was released, data deduplication was not available for the ReFS file system. With Storage Spaces Direct, volumes should be formatted in ReFS to get the latest features (accelerated VHDX operations) and the best performance. So, for Storage Spaces Direct, data deduplication was not available. Data deduplication reduces storage usage by removing duplicated blocks and replacing them with metadata.

Since Windows Server 1709, data deduplication is supported on ReFS volumes. That means it will also be available in Windows Server 2019. I have updated my S2D lab to Windows Server 2019 to show you how easy it is to enable deduplication on an S2D volume.

Requirements

To implement data deduplication on an S2D volume in Windows Server 2019, you need the following:

  • An up-and-running S2D cluster running Windows Server 2019
  • (Optional) Windows Admin Center: it will help to implement deduplication
  • Install the deduplication feature on each node: Install-WindowsFeature FS-Data-Deduplication

Enable deduplication

Storage Spaces Direct in Windows Server 2019 is fully manageable from Windows Admin Center (WAC). That means you can also enable deduplication and compression from WAC. Connect to your hyperconverged cluster from WAC and navigate to Volumes. Select the volume you want and enable deduplication and compression.

WAC raises an information pop-up explaining what deduplication and compression are. Click Start.

Select Hyper-V as the deduplication mode and click Enable deduplication.

Once it is activated, WAC should show you the percentage of saved space. Currently this is not working in WAC 1804.
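If you prefer PowerShell to WAC, the same result can be obtained with the built-in deduplication cmdlet (the volume path below is an example):

```powershell
# Enable deduplication in Hyper-V mode on an S2D CSV
Enable-DedupVolume -Volume "C:\ClusterStorage\Volume01" -UsageType HyperV
```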

Get information by using PowerShell

To get deduplication information on an S2D node, open a PowerShell prompt. You can get the list of data deduplication commands by running Get-Command *Dedup*.

If you run Get-DedupStatus, you get the following data deduplication summary. As you can see in the following screenshot, I have saved some space on my CSV volume.

By running Get-DedupVolume, you can get the saving rate of data deduplication. In my lab, data deduplication helps me save almost 50% of storage space. Not bad.

Conclusion

Data deduplication on S2D was expected by many customers. With Windows Server 2019, the feature will be available. Currently, when you deploy 3-way mirroring for VMs, only 33% of the raw storage is usable. With data deduplication, we can expect 50%. Thanks to this feature, the average cost of an S2D solution will be reduced.

Configure Dell S4048 switches for Storage Spaces Direct
https://www.tech-coffee.net/configure-dell-s4048-switches-for-storage-spaces-direct/
Thu, 26 Apr 2018 09:57:59 +0000

The post Configure Dell S4048 switches for Storage Spaces Direct appeared first on Tech-Coffee.

When we deploy Storage Spaces Direct (S2D), either hyperconverged or disaggregated, we have to configure the networking part. We usually work with Dell hardware to deploy Storage Spaces Direct, and one of the switches supported by the Dell reference architectures is the Dell S4048 (Force10). In this topic, we will see how to configure this switch from scratch.

This topic has been co-written with Frederic Stefani, Dell solution architect.

Stack or not

Usually, customers know the stacking feature, which is common to all network vendors such as Cisco, Dell, HP and so on. This feature lets you combine several identical switches into a single configuration managed by a master switch. Because all switches share the same configuration, network administrators see them as a single switch. So administrators connect to the master switch and edit the configuration on all switches that are members of the stack.

Even if stacking looks sexy on paper, there is a major issue, especially with a storage solution such as S2D. With an S4048 stack, when you run an update, all switches reload at the same time. Because S2D relies heavily on the network, your storage solution will crash. This is why the Dell reference architecture for S2D recommends deploying VLT (Virtual Link Trunking) instead.

With stacking you have a single control plane (you configure all switches from a single switch) and a single data plane in a loop-free topology. In a VLT configuration, you also have a single data plane in a loop-free topology, but several control planes, which allows you to reboot switches one by one.

For this reason, the VLT (or MLAG) technology is the preferred way for Storage Spaces Direct.

S4048 overview

An S4048 switch has 48x 10 Gb/s SFP+ ports, 6x 40 Gb/s QSFP+ ports, a management port (1 Gb/s) and a serial port. The management and serial ports are located on the back. In the diagram below, there are three kinds of connections:

  • Connection for S2D (in this example ports 1 to 16, but you can connect up to port 48)
  • VLTi connection
  • Core connection: the uplink to connect to core switches

In the architecture schema below, you can see both S4048 switches interconnected through the VLTi ports, and several S2D nodes (hyperconverged or disaggregated, it doesn't matter) connected to ports 1 to 16. In this topic, we will configure the switches for this topology.

Initial switch configuration

When you start the switch for the first time, you have to configure the initial settings such as the switch name, IP address and so on. Plug a serial cable from the switch into your computer and connect with a terminal emulator using the following settings:

  • Baud Rate: 115200
  • No Parity
  • 8 data bits
  • 1 stop bit
  • No flow control

Then you can run the following configuration:

Enable
Configure

# Configure the hostname
hostname SwitchName-01

# Set the IP address to the management ports, to connect to switch through IP
interface ManagementEthernet 1/1
ip address 192.168.1.1/24
no shutdown
exit

# Set the default gateway
ip route 0.0.0.0/0 192.168.1.254

# Enable SSH
ip ssh server enable

# Create a user and a password to connect to the switch
username admin password 7 MyPassword privilege 15

# Disable Telnet through IP
no ip telnet server enable
Exit

# We leave Rapid Spanning Tree Protocol enabled.
protocol spanning-tree rstp
no disable
Exit

Exit

# Write the configuration in memory
Copy running-configuration startup-configuration

After this configuration is applied, you can connect to the switch through SSH. Apply the same configuration to the other switch (except the name and IP address).

Configure switches for RDMA (RoCEv2)

N.B.: For this part we assume that you know how RoCE v2 works, especially DCB, PFC and ETS.

Because we are implementing the switches for S2D, we have to configure them for RDMA (the RDMA over Converged Ethernet v2 implementation). Don't forget that with RoCE v2, you have to configure DCB and PFC end to end (on both the server and switch sides). In this configuration, we assume that you use priority ID 3 for SMB traffic.

# By default the queue value is 0 for all dot1p (QoS) traffic. We enable this command globally to change this behavior.
service-class dynamic dot1p

# Enable Data Center Bridging. This makes it possible to configure lossless, latency-sensitive traffic in a Priority Flow Control (PFC) queue.
dcb enable

# Provide a name to the DCB buffer threshold
dcb-buffer-threshold RDMA
priority 3 buffer-size 100 pause-threshold 50 resume-offset 35
exit

# Create a DCB map to configure the PFC and ETS rules (Enhanced Transmission Selection)
dcb-map RDMA

# For priority group 0, we allocate 50% of the bandwidth and PFC is disabled
priority-group 0 bandwidth 50 pfc off

# For priority group 3, we allocate 50% of the bandwidth and PFC is enabled
priority-group 3 bandwidth 50 pfc on

# Priority group 3 contains traffic with dot1p priority 3.
priority-pgid 0 0 0 3 0 0 0 0

Exit

Exit
Copy running-configuration startup-configuration

Repeat this configuration on the other switch.

VLT domain implementation

First of all, we have to create a port channel with two QSFP+ ports (ports 1/49 and 1/50):

Enable
Configure

# Configure the port-channel 100 (make sure it is not used)
interface Port-channel 100

# Provide a description
description VLTi

# Do not apply an IP address to this port channel
no ip address

#Set the maximum MTU to 9216
mtu 9216

# Add port 1/49 and 1/50
channel-member fortyGigE 1/49,1/50

# Enable the port channel
no shutdown

Exit

Exit
Copy Running-Config Startup-Config

Repeat this configuration on the second switch. Then we have to create the VLT domain and use this port channel. Below is the configuration of the first switch:

# Configure the VLT domain 1
vlt domain 1

# Specify the port-channel number which will be used by this VLT domain
peer-link port-channel 100

# Specify the IP address of the other switch
back-up destination 192.168.1.2

# Specify the priority of each switch
primary-priority 1

# Give an unused MAC address for the VLT
system-mac mac-address 00:01:02:01:02:05

# Give an ID for each switch
unit-id 0

# Wait 10 s before applying the saved configuration after the switch reloads or the peer link is restored
delay-restore 10

Exit

Exit
Copy Running-Configuration Startup-Configuration

On the second switch, the configuration looks like this:

vlt domain 1
peer-link port-channel 100
back-up destination 192.168.1.1
primary-priority 2
system-mac mac-address 00:01:02:01:02:05
unit-id 1
delay-restore 10

exit

exit
copy running-config startup-config

Now the VLT is working. You don't have to specify VLAN IDs on this link: the VLTi carries both tagged and untagged traffic by itself.

S2D port configuration

To finish the switch configuration, we have to configure the ports and VLANs for the S2D nodes:

enable
configure
interface range tengigabitethernet 1/1 - 1/16

# No IP address assigned to these ports
no ip address

# Set the maximum MTU to 9216
mtu 9216

# Enable the management of untagged and tagged traffic on the same port
portmode hybrid

# Enable Layer 2 switching; the port joins the default VLAN to carry untagged traffic
switchport

# Configure the port as an edge port
spanning-tree 0 portfast

# Enable BPDU guard on these ports
spanning-tree rstp edge-port bpduguard

# Apply the DCB buffer-threshold policy to these ports
dcb-policy buffer-threshold RDMA

# Apply the DCB map to these ports
dcb-map RDMA

# Enable the ports
no shutdown

exit

exit
copy running-config startup-config

You can copy this configuration to the other switch. Now only the VLANs are missing. To create the VLANs and assign them to the ports, you can run the following configuration:

interface vlan 10
description "Management"
name "VLAN-10"
untagged tengigabitethernet 1/1-1/16
exit

interface vlan 20
description "SMB"
name "VLAN-20"
tagged tengigabitethernet 1/1-1/16
exit

[etc.]
exit
copy running-config startup-config

Once you have finished, copy this configuration to the second switch.
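Once the ports are configured, it is worth verifying from an S2D node that RDMA and jumbo frames are actually usable through the switches. A hedged set of read-only checks with the in-box cmdlets (the "Jumbo Packet" display name can vary by NIC vendor):

```powershell
# Quick sanity checks from an S2D node
Get-NetAdapterRdma                                          # RDMA enabled on the storage NICs?
Get-NetAdapterAdvancedProperty -DisplayName "Jumbo Packet"  # MTU applied on the NICs?
Get-SmbClientNetworkInterface | Where-Object RdmaCapable    # SMB sees RDMA-capable interfaces?
```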

The post Configure Dell S4048 switches for Storage Spaces Direct appeared first on Tech-Coffee.

Real Case: Implement Storage Replica between two S2D clusters
https://www.tech-coffee.net/real-case-implement-storage-replica-between-two-s2d-clusters/ – Fri, 06 Apr 2018 09:08:17 +0000

The post Real Case: Implement Storage Replica between two S2D clusters appeared first on Tech-Coffee.

This week, as part of my job, I deployed Storage Replica between two S2D clusters. I'd like to share with you the steps I followed to implement storage replication between two S2D hyperconverged clusters. Storage Replica replicates volumes at the block level. For my customer, Storage Replica is part of a Disaster Recovery Plan in case the first room goes down.

Architecture overview

The customer has two rooms. In each room, a four-node S2D cluster has been deployed. Each node has a Mellanox ConnectX-3 Pro (dual 10 Gbps ports) and an Intel network adapter for VMs. Currently the Mellanox network adapter is used for SMB traffic such as S2D and Live Migration. This network adapter supports RDMA, and Storage Replica can leverage SMB Direct (RDMA). So the goal is to use the Mellanox adapters for Storage Replica as well.

In each room, two Dell S4048S switches are deployed in VLT. The switches in both rooms are connected by two optical fiber links of around 5 km. The latency is less than 5 ms, so we can implement synchronous replication. The Storage Replica traffic must use the fiber links. Currently the storage traffic is in a VLAN (ID: 247), and we will use the same VLAN for Storage Replica.

Each S2D cluster has several Cluster Shared Volumes (CSV). Among all these CSVs, two CSVs in each S2D cluster will be replicated. Below are the names of the volumes that will be replicated:

  • (S2D Cluster Room 1) PERF-AREP-01 -> (S2D Cluster Room 2) PERF-PREP-01
  • (S2D Cluster Room 1) PERF-AREP-02 -> (S2D Cluster Room 2) PERF-PREP-02
  • (S2D Cluster Room 2) PERF-AREP-03 -> (S2D Cluster Room 1) PERF-PREP-03
  • (S2D Cluster Room 2) PERF-AREP-04 -> (S2D Cluster Room 1) PERF-PREP-04

For this to work, each volume pair (source and destination) must be strictly identical (same capacity, same resilience, same file system, etc.). I will create one log volume per replicated volume, so I'm going to deploy four log volumes per S2D cluster.

Create log volumes

First of all, I create the log volumes by using the following cmdlet. The log volumes must not be converted to Cluster Shared Volumes, and a drive letter must be assigned to each:

New-Volume -StoragePoolFriendlyName "<storage pool name>" `
           -FriendlyName "<volume name>" `
           -FileSystem ReFS `
           -DriveLetter "<drive letter>" `
           -Size <capacity>

As you can see in the following screenshots, I created four log volumes per cluster. The volumes are not CSV.

In the following screenshot, you can see that for each volume, there is a log volume.

Grant Storage Replica Access

You must grant security access between both clusters to implement Storage Replica. To grant the access, run the following cmdlets:

Grant-SRAccess -ComputerName "<Node cluster 1>" -Cluster "<Cluster 2>"
Grant-SRAccess -ComputerName "<Node cluster 2>" -Cluster "<Cluster 1>"

Test Storage Replica Topology

/!\ I didn’t manage to run the storage replica topology test successfully. There seems to be a known issue with this cmdlet.

N.B.: To run this test, you must move the CSV to the node which hosts the core cluster resources. In the example below, I moved the CSV to replicate onto HyperV-02.


To run the test, you have to run the following cmdlet:

Test-SRTopology -SourceComputerName "<Cluster room 1>" `
                -SourceVolumeName "c:\clusterstorage\PERF-AREP-01\" `
                -SourceLogVolumeName "R:" `
                -DestinationComputerName "<Cluster room 2>" `
                -DestinationVolumeName "c:\ClusterStorage\Perf-PREP-01\" `
                -DestinationLogVolumeName "R:" `
                -DurationInMinutes 10 `
                -ResultPath "C:\temp" 

As you can see in the screenshot below, the test is not successful because of a path issue. Even though the test didn’t work, I was able to enable Storage Replica between the clusters. So if you hit the same issue, try to enable the replication anyway (check the next section).

Enable the replication between two volumes

To enable the replication between the volumes, you can run the following cmdlets. With these cmdlets, I created the four replications.

New-SRPartnership -SourceComputerName "<Cluster room 1>" `
                  -SourceRGName REP01 `
                  -SourceVolumeName c:\ClusterStorage\PERF-AREP-01 `
                  -SourceLogVolumeName R: `
                  -DestinationComputerName "<Cluster Room 2>" `
                  -DestinationRGName REP01 `
                  -DestinationVolumeName c:\ClusterStorage\PERF-PREP-01 `
                  -DestinationLogVolumeName R:

New-SRPartnership -SourceComputerName "<Cluster room 1>" `
                  -SourceRGName REP02 `
                  -SourceVolumeName c:\ClusterStorage\PERF-AREP-02 `
                  -SourceLogVolumeName S: `
                  -DestinationComputerName "<Cluster Room 2>" `
                  -DestinationRGName REP02 `
                  -DestinationVolumeName c:\ClusterStorage\PERF-PREP-02 `
                  -DestinationLogVolumeName S:

New-SRPartnership -SourceComputerName "<Cluster Room 2>" `
                  -SourceRGName REP03 `
                  -SourceVolumeName c:\ClusterStorage\PERF-AREP-03 `
                  -SourceLogVolumeName T: `
                  -DestinationComputerName "<Cluster room 1>" `
                  -DestinationRGName REP03 `
                  -DestinationVolumeName c:\ClusterStorage\PERF-PREP-03 `
                  -DestinationLogVolumeName T:

New-SRPartnership -SourceComputerName "<Cluster Room 2>" `
                  -SourceRGName REP04 `
                  -SourceVolumeName c:\ClusterStorage\PERF-AREP-04 `
                  -SourceLogVolumeName U: `
                  -DestinationComputerName "<Cluster room 1>" `
                  -DestinationRGName REP04 `
                  -DestinationVolumeName c:\ClusterStorage\PERF-PREP-04 `
                  -DestinationLogVolumeName U: 

Now that replication is enabled, if you open the Failover Cluster Manager, you can see that some volumes are sources or destinations. A new tab called Replication is added where you can check the replication status. The destination volume is no longer accessible until you reverse the replication direction.

Once the initial synchronization is finished, the replication status is Continuously replicating.
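To follow the synchronization from PowerShell rather than from the console, the Storage Replica module exposes the per-volume replication state. A hedged example; the cluster name is a placeholder as in the cmdlets above:

```powershell
# Show the replication status and remaining bytes for each replicated volume
Get-SRGroup -ComputerName "<Cluster room 2>" |
    Select-Object -ExpandProperty Replicas |
    Select-Object DataVolume, ReplicationMode, ReplicationStatus, NumOfBytesRemaining
```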

Network adapters used by Storage Replica

In the overview section, I said that I wanted to use the Mellanox network adapters for Storage Replica (for RDMA). So I ran a cmdlet to check that Storage Replica was indeed using the Mellanox network adapters.
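The cmdlet output is shown as a screenshot in the original post. As a hedged sketch, you can observe the live SMB connections and, if necessary, pin Storage Replica to the Mellanox interfaces with a network constraint; the interface indexes below are examples (retrieve the real ones with Get-NetIPConfiguration):

```powershell
# See which interfaces carry the SMB (and therefore Storage Replica) traffic
Get-SmbMultichannelConnection

# Optionally constrain Storage Replica to specific interfaces
Set-SRNetworkConstraint -SourceComputerName "<Cluster room 1>" -SourceRGName "REP01" `
                        -SourceNWInterface 2 `
                        -DestinationComputerName "<Cluster room 2>" -DestinationRGName "REP01" `
                        -DestinationNWInterface 3

# Verify the constraint
Get-SRNetworkConstraint
```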

Reverse the Storage Replica way

To reverse the replication direction, you can use the following cmdlet:

Set-SRPartnership -NewSourceComputerName "<Cluster room 2>" `
                  -SourceRGName REP01 `
                  -DestinationComputerName "<Cluster room 1>" `
                  -DestinationRGName REP01   

Conclusion

Storage Replica replicates a volume at the block level to another volume. In this case, I have two S2D clusters where each cluster hosts two source volumes and two destination volumes. Storage Replica helps the customer implement a Disaster Recovery Plan.

Use Honolulu to manage your Microsoft hyperconverged cluster
https://www.tech-coffee.net/use-honolulu-to-manage-your-microsoft-hyperconverged-cluster/ – Wed, 21 Feb 2018 11:09:19 +0000

The post Use Honolulu to manage your Microsoft hyperconverged cluster appeared first on Tech-Coffee.

A few months ago, I wrote a topic about the next-gen Microsoft management tool called Project Honolulu. Honolulu provides management for standalone Windows Server, failover clustering and hyperconverged clusters. Currently, hyperconverged management works only on Windows Server Semi-Annual Channel (SAC) versions (I'm crossing my fingers for Honolulu support on Windows Server LTSC). I have upgraded my lab to the latest technical preview of Windows Server SAC to show you how to use Honolulu to manage your Microsoft hyperconverged cluster.

As part of my job, I have deployed dozens of Microsoft hyperconverged clusters and, to be honest, the main disadvantage of this solution is the management. The Failover Clustering console is archaic and you have to use PowerShell to manage the infrastructure. Even if the Microsoft solution provides high-end performance and good reliability, the day-to-day management is tricky.

Thanks to Project Honolulu, we now have a modern management tool which can compete with other solutions on the market. Currently Honolulu is still a preview version and some features are not yet available, but it's going in the right direction. Moreover, Project Honolulu is free and can be installed on your laptop or on a dedicated server, as you wish!

Honolulu dashboard for hyperconverged cluster

Once you have added the cluster connection to Honolulu, you get a new line with the type Hyper-Converged Cluster. By clicking on it, you can access a dashboard.

This dashboard provides a lot of useful information such as the latest alerts raised by the Health Service, the overall performance of the cluster, the resource usage, and information about servers, virtual machines, volumes and drives. You can see that currently the cluster performance charts indicate No data available. This is because the preview of Windows Server that I have installed doesn't provide this information yet.

From my point of view, this dashboard is pretty clear and provides global information about the cluster. At a glance, you get the overall health of the cluster.

N.B.: the memory usage indicates -35.6% because of a custom motherboard which doesn't report the memory installed on the node.

Manage Drives

By clicking on Drives, you get information about the raw storage of your cluster and your storage devices. You get the total number of drives (I know I don't follow the requirements because I have 5 drives on one node and 4 on another, but it is a lab). Honolulu also provides the drive health and the raw capacity of the cluster.

By clicking on Inventory, you get detailed information about your drives such as the model, the size, the type, the storage usage and so on. At a glance, you know if you have to run Optimize-StoragePool.

By clicking on a drive, you get further information about it. Moreover, you can act on it: for example, you can turn the indicator light on, retire the disk or update the firmware. For each drive you can get performance and capacity charts.

Manage volumes

By clicking on Volumes, you get information about your Cluster Shared Volumes. At a glance you get the health, the overall performance and the number of volumes.

In the inventory, you get further information about the volumes such as the status, the file system, the resiliency, the size and the storage usage. You can also create a volume.

By clicking on create a new volume, you get this:

By clicking on a volume, you get more information about it and you can take actions such as open, resize, take offline and delete.

Manage virtual machines

From Honolulu, you can also manage virtual machines. When you click on Virtual Machines | Inventory, you get the following information. You can also manage the VMs (start, stop, turn off, create a new one etc.). All chart values are in real time.

vSwitches management

From the Hyper-Converged Cluster pane, you have information about virtual switches. You can create a new one, or delete, rename and change the settings of an existing one.

Node management

Honolulu also provides information about your nodes in the Servers pane. At a glance you get the overall health of all your nodes and their resource usage.

In the inventory, you have further information about your nodes.

If you click on a node, you can pause the node for updates or hardware maintenance. You also have detailed information such as performance charts, drives connected to the node and so on.

Conclusion

Project Honolulu is the future of Windows Server management. This product provides great information about Windows Server, failover clustering and hyperconverged clusters in a web-based form. From my point of view, Honolulu eases the management of the Microsoft hyperconverged solution and can help administrators. Some features are missing, but Microsoft listens to the community. Honolulu is modular because it is based on extensions; without a doubt, Microsoft will add features regularly. I'm just crossing my fingers for Honolulu support on Windows Server 2016, released in October 2016, but I am optimistic.

Monitor S2D with Operations Manager 2016
https://www.tech-coffee.net/monitor-s2d-with-operations-manager-2016/ – Thu, 30 Nov 2017 18:24:17 +0000

The post Monitor S2D with Operations Manager 2016 appeared first on Tech-Coffee.

Storage Spaces Direct (S2D) is the Microsoft Software-Defined Storage solution. Thanks to S2D, we can deploy hyperconverged infrastructure based on Microsoft technologies such as Hyper-V. This feature is included in Windows Server 2016 Datacenter edition. You can find a lot of blog posts about S2D on this website. In this topic, I’ll talk about how to monitor S2D.

S2D is a storage solution and so it is a critical component. S2D availability can also affect the virtual machines and applications. Therefore, we have to monitor S2D to ensure availability but also performance. When you enable Storage Spaces Direct, a new cluster role is also enabled: the Health Service. This cluster role gathers metrics and alerts from all cluster nodes and provides them from a single pane of glass (an API). This API is accessible from PowerShell, .NET and so on. Even if the Health Service is a good idea, it is not usable for day-to-day administration because it provides only real-time metrics, with no history. Moreover, there is no GUI for the Health Service.

Microsoft has written a management pack for Operations Manager which gets information from the Health Service API on a regular basis. In this way, we are able to build charts based on this information. Moreover, SCOM is able to raise alerts regarding the S2D state. If you are using SCOM and S2D in your IT, I suggest you install the Storage Spaces Direct management pack 🙂

Requirements

The below requirements come from the management pack documentation. To install the Storage Spaces Direct management pack you need:

  • System Center Operations Manager 2016
  • A S2D cluster based on Windows Server 2016 Datacenter with KB3216755 (Nano Server not supported)
  • Enable agent proxy settings on all S2D nodes
  • A working S2D cluster (hyperconverged or disaggregated)

You can download the management pack from this link.

Management pack installation

After you have downloaded and installed the management pack, you get the following files:

  • Storage Spaces Direct 2016: Microsoft Windows Server 2016 Storage Spaces Direct Management Pack.
  • Storage Spaces Direct 2016 Presentation: adds views and dashboards for the management pack.
  • Microsoft System Center Operations Manager Storage Visualization Library: contains basic visual components required for the management pack dashboards.
  • Microsoft Storage Library: a set of common classes for Microsoft Storage management packs.

Then, open an Operations Manager console and navigate to Administration. Right-click on Management Packs and select Import Management Packs, then select Add from disk.

If your server has Internet access, you can select Yes in the following pop-up to resolve dependencies with the online catalog.

In the next window, click on Add and select the Storage Spaces Direct management pack files. Then click on Install.

In the Monitoring pane, you should get a Storage Spaces Direct 2016 “folder” as below. You may also get the following error: this is because the management pack is not yet fully initialized, and you have to wait a few minutes.

Operations Manager configuration

First, make sure that agent proxy is enabled for each S2D node. Navigate to Administration | Agent Managed. Then right-click on the node and select Properties. In Security, make sure that Allow this agent to act as a proxy and discover managed objects on other computers is enabled.
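Clicking through every node is tedious on a large cluster; the same setting can be applied in bulk from the Operations Manager Shell. A hedged sketch; the DisplayName filter is an example to scope it to the S2D nodes:

```powershell
# Enable the agent proxy setting on every agent matching the S2D node names
Get-SCOMAgent | Where-Object { $_.DisplayName -like "*hyperv*" } | Enable-SCOMAgentProxy
```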

Now I need to create a group for the S2D nodes. I'd like this group to be dynamic so that I don't have to populate it manually. To create a group, navigate to Authoring, right-click on Groups and select Create a new Group.

Provide a name for this group, then select a destination management pack. I have created a dedicated management pack for overrides, called Storage Spaces Direct – Custom.

In the next window, I just click next because I don’t want to provide explicit group members.

Next, I create a query for the dynamic inclusion. The rule is simple: each server whose Active Directory DN contains the word Hyper-V is added to the group.

As you can see in the screen below, my S2D nodes are located in a specific OU called Hyper-V. Each time I add a node, the node is moved to this OU, and so my Operations Manager group is populated.

In the next screen of the group wizard, I just click on next.

Then I click again on next because I don’t want to exclude objects from this group.

Then your group is added and should be populated with S2D nodes. Now navigate to Monitoring | Storage | Storage Spaces Direct 2016 | Storage Spaces Direct 2016. Click on the “hamburger” menu on the right and select Add Group.

Then select the group you have just created.

From this point, the configuration is finished. Now you have to wait quite a while (I waited 2 or 3 hours) before getting all the information.

Monitor S2D

After a few hours, you should get information as below. You get information about the storage subsystem, volumes, nodes and, for disaggregated infrastructures, file shares. You can click on each square to get more information.

In the screenshot below, you can see information about volumes. It is really valuable because you have the state, the total capacity, the IOPS, the throughput and so on. Active alerts on the volumes are also displayed.

The screenshot below shows information about the storage subsystem:

Conclusion

If you are already using Operations Manager 2016 and Storage Spaces Direct, you can easily monitor your storage solution. The management pack is free, so I really suggest you install it. If you are not using Operations Manager, you should find another solution to monitor S2D, because the storage layer is a critical component.

Dell R730XD bluescreen with S130 adapter and S2D
https://www.tech-coffee.net/dell-r730xd-bluescreen-with-s130-adapter-and-s2d/ – Thu, 09 Nov 2017 17:31:59 +0000

The post Dell R730XD bluescreen with S130 adapter and S2D appeared first on Tech-Coffee.

This week I worked for a customer who had an issue with his Storage Spaces Direct (S2D) cluster. When he restarted a node, Windows Server didn't start and a bluescreen appeared. This is because the operating system disks were plugged into the S130 while the Storage Spaces Direct devices were connected to the HBA330mini. It is an unsupported configuration, especially with the Dell R730XD. In this topic, I'll describe how I changed the configuration to be supported.

This topic was co-written with my fellow Frederic Stefani (Dell – Solution Architect). Thanks for the HBA330 image 🙂

Symptom

You have several Dell R730XD servers added to a Windows Server 2016 failover cluster where S2D is enabled. Moreover, the operating system is installed on two storage devices connected to an S130 in software RAID. When you reboot a node, you get the following bluescreen.

How to resolve issue

This issue occurs because S2D with the operating system connected to the S130 is not supported. You have to connect the operating system to the HBA330mini. This means that the operating system will no longer be installed on a RAID 1, but an S2D node can be redeployed quickly if you have written a proper PowerShell script.

To make the hardware change, you need a 50cm SFF-8643 cable to connect the operating system disk to the HBA330mini. Moreover, you have to reinstall the operating system (sorry about that). Then an HBA330mini firmware image must be applied, otherwise the enclosure will not be present in the operating system.

Connect the operating system disk to HBA330mini

First, place the node into maintenance mode. In the cluster manager, right-click on the node and select Pause, then Drain Roles.

Then stop the node. When the node is shut down, you can evict it from the cluster and delete the Active Directory computer object related to the node (you have to reinstall the operating system).
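The same maintenance steps can also be scripted. A hedged PowerShell equivalent, assuming a node named Node01 (the name is an example):

```powershell
# Pause the node and drain its roles, then shut it down and evict it from the cluster
Suspend-ClusterNode -Name "Node01" -Drain
Stop-Computer -ComputerName "Node01"
Remove-ClusterNode -Name "Node01"
```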

First we have to remove the cable whose connectors are circled in red in the picture below. This cable connects both operating system storage devices to the S130.

To connect the operating system device, you need an SFF-8643 cable as below.

So disconnect the SAS cable between the operating system devices and the S130.

Then we need to remove these fans to be able to plug the SFF-8643 cable into the backplane. To remove the fans, turn the blue items which are circled in red in the picture below.

Then connect the operating system device to the backplane with the SFF-8643 cable. Plug the cable into the SAS A1 port on the backplane. Also remove the left operating system device from the server (the top left one in the picture below). This is now your spare device.

Put the fans back in the server and turn the blue items.

Start the server and open the BIOS settings. Navigate to SATA Settings and set Embedded SATA to Off. Restart the server.

Enter the BIOS settings again and open Device Settings. Check that the S130 no longer appears in the menu, then select the HBA330mini device.

Check that the other physical disk is now connected to the HBA, as below.

Reinstall operating system and apply HBA330mini image

Now that the operating system disk is connected to the HBA330mini, it's time to reinstall the operating system. Use your favorite way to install Windows Server 2016. Once the OS is installed, mount your virtual media from iDRAC and attach this image as the removable virtual media:

Next, change the next boot device to Virtual Floppy.

On the next boot, the image is loaded and applied to the system. The server thinks that the HBA has been changed.

Add the node to the cluster

Now you can add the node to the cluster again. You should see the enclosure.

Conclusion

When you build an S2D solution based on Dell R730XD servers, don't connect your operating system disks to the S130: you will get a bluescreen on reboot. If you have already bought servers with the S130, you can follow this topic to resolve the issue. If you plan to deploy S2D on R740XD servers, you can connect your operating system disks to BOSS and the S2D devices to the HBA330+.
