Last week I worked for a customer who went through a performance issue on a S2D cluster. The customer’s infrastructure is composed of one compute cluster (Hyper-V) and one 4-node S2D cluster. First, I checked if it was related to the network and then if it’s a hardware failure that produces this performance drop. Then I ran the script watch-cluster.ps1 from VMFleet.
The following screenshot comes from watch-cluster.ps1 script. As you can see, a CSV has almost 25ms of latency. A high latency impacts overall performance especially when intensive IO applications are hosted. If we look into the cache, a lot of miss per second are registered especially on the high latency CSV. But why Miss/sec can produce a high latency?
What happens in case of lack of cache?
The solution I troubleshooted is composed of 2 SSD and 8 HDD per node. The cache ratio is 1:4 and its capacity is almost of 6,5% of the raw capacity. The IO path in normal operation is depicted in the following schema:
Now in the current situation, I have a lot Miss/Sec, that means that SSD cannot handle these IO because there is not enough cache. Below the schema depicts the IO path for miss IO:
You can see that in case of miss, the IO go to HDD directly without being cached in SSD. HDD is really slow compared to SSD and each time IO works directly with this kind of storage device, the latency is increased. When the latency is increased, the overall performance decrease.
How to resolve that?
To resolve this issue, I told to customer to add two SSD in each node. These SSD should be equivalent (or almost) than those already installed in nodes. By adding SSD, I improve the cache ratio to 1:2 and the capacity to 10% compared to raw capacity.
It’s really important to size kindly the cache tier when you design your solution to avoid this issue. As said a fellow MVP: storage is cheap, downtime is expensive.