S2D Real case: detect a lack of cache

Last week I worked for a customer who went through a performance issue on a S2D cluster. The customer’s infrastructure is composed of one compute cluster (Hyper-V) and one 4-node S2D cluster. First, I checked if it was related to the network and then if it’s a hardware failure that produces this performance drop. Then I ran the script watch-cluster.ps1 from VMFleet.

The following screenshot comes from watch-cluster.ps1 script. As you can see, a CSV has almost 25ms of latency. A high latency impacts overall performance especially when intensive IO applications are hosted. If we look into the cache, a lot of miss per second are registered especially on the high latency CSV. But why Miss/sec can produce a high latency?

What happens in case of lack of cache?

The solution I troubleshooted is composed of 2 SSD and 8 HDD per node. The cache ratio is 1:4 and its capacity is almost of 6,5% of the raw capacity. The IO path in normal operation is depicted in the following schema:

Now in the current situation, I have a lot Miss/Sec, that means that SSD cannot handle these IO because there is not enough cache. Below the schema depicts the IO path for miss IO:

You can see that in case of miss, the IO go to HDD directly without being cached in SSD. HDD is really slow compared to SSD and each time IO works directly with this kind of storage device, the latency is increased. When the latency is increased, the overall performance decrease.

How to resolve that?

To resolve this issue, I told to customer to add two SSD in each node. These SSD should be equivalent (or almost) than those already installed in nodes. By adding SSD, I improve the cache ratio to 1:2 and the capacity to 10% compared to raw capacity.

It’s really important to size kindly the cache tier when you design your solution to avoid this issue. As said a fellow MVP: storage is cheap, downtime is expensive.

About Romain Serre

Romain Serre works in Lyon as a Senior Consultant. He is focused on Microsoft Technology, especially on Hyper-V, System Center, Storage, networking and Cloud OS technology as Microsoft Azure or Azure Stack. He is a MVP and he is certified Microsoft Certified Solution Expert (MCSE Server Infrastructure & Private Cloud), on Hyper-V and on Microsoft Azure (Implementing a Microsoft Azure Solution).

One comment

  1. thanks for sharing…very helpfull

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.


Check Also

Storage Spaces Direct: Parallel rebuild

Parallel rebuild is a Storage Spaces features that enables to repair a storage pool even ...

Storage Spaces Direct and deduplication in Windows Server 2019

When Windows Server 2016 has been released, the data deduplication was not available for ReFS ...

Real Case: Implement Storage Replica between two S2D clusters

This week, in part of my job I deployed a Storage Replica between two S2D ...