Update – 26 Nov 2015 – Well, things can move very fast in the Linux world when they want to! Since I wrote this article, an improved, but still pre-production, version of the polling code for the block layer and NVMe driver has made it into the Linux kernel and will go mainline in 4.4. There is a really nice overview of how it works here, and Jens’ patch-set comments and some of his testing results can be found here. It is worth stressing that the results we present should only improve as the polling mode evolves. Stay tuned for updated performance results in due course!
I love SSDs! They have transformed the data center by providing high-performance, low-latency access to storage, and that low latency is now reshaping the rest of the data center stack. I will be digging into latency in my next few blog posts, starting with driver latency here.
Characterizing the latency of QD=1 (queue depth of one) random NVM Express reads is harder than one might expect. However, you can break it down into four components:
- Media latency
- Controller latency
- Fabric latency
- Driver/OS latency/CPU latency
In the next few blog posts I will expand on these four. In this post I’m going to start with point #4: Driver/OS latency/CPU latency. This gets really fun because there are so many variables. What OS? Assuming Linux, what version of the kernel? What type of CPU (x86, ARM, PowerPC, SPARC, etc.)? How many CPUs and how many HW threads? What does the memory subsystem look like? How is interrupt handling done, and are you polling or interrupt driven?
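To make "QD=1 random read latency" concrete, here is a minimal sketch of how you might time such reads from user space. It is not the method we used for the measurements in this post (those came from a PCIe analyzer). The 4 KiB block size, the function names and the use of plain `os.pread` are all assumptions for illustration. Note that without O_DIRECT the page cache will serve repeat reads, so treat this as the shape of the measurement rather than a benchmark.

```python
import os
import random
import statistics
import time

BLOCK_SIZE = 4096  # assumed I/O size; the article does not state one


def measure_qd1_read_latency(path, num_reads=1000, block_size=BLOCK_SIZE, seed=0):
    """Issue random reads one at a time (QD=1); return latencies in microseconds."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        # lseek to the end works for regular files and for block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        num_blocks = size // block_size
        if num_blocks == 0:
            raise ValueError("target is smaller than one block")
        latencies_us = []
        for _ in range(num_reads):
            offset = rng.randrange(num_blocks) * block_size
            start = time.perf_counter_ns()
            # pread blocks until this single read completes, so there is
            # never more than one I/O outstanding (queue depth of one).
            data = os.pread(fd, block_size, offset)
            elapsed_ns = time.perf_counter_ns() - start
            if len(data) != block_size:
                raise OSError("short read")
            latencies_us.append(elapsed_ns / 1000.0)
        return latencies_us
    finally:
        os.close(fd)


def summarize(latencies_us):
    """Average, minimum and maximum of a list of latencies."""
    return {
        "avg_us": statistics.fmean(latencies_us),
        "min_us": min(latencies_us),
        "max_us": max(latencies_us),
    }
```

Calling `summarize(measure_qd1_read_latency(path))` on a file or device you are allowed to read gives an average and a worst case. The number it reports includes every layer in the list above, which is exactly why it is hard to interpret on its own.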
The NVM Express Driver – It’s fast, but is it fast enough?
A bunch of companies, including PMC, have worked hard to design NVM Express to be fast. It is much faster than legacy storage protocols (have a look at how one PCIe SSD outperforms configurations of eight and four SATA drives in an OLTP database application), but is it fast enough for Next-Generation NVM (NG-NVM)? To test that, we stuck a PCIe Logic Analyzer between an Intel x86 CPU and a PMC Flashtec NVRAM card and did some measuring. We obtained some interesting results.
The latency of a random read is composed of multiple parts. Those fall into two bins: the bin the SSD controls and the bin the host controls.
- For the SSD, the latency is very well controlled. You can see in the plot below how we control that latency to be both quick (under 9us on average) and tightly bounded (always better than 11us).
- For the host, the latency is not nearly as well controlled. You can see the average for non-SSD latency is only 5us, but the maximum in the measured period was more than 30us.
Where does this non-SSD latency variability come from? After some digging, we discovered it comes from the handling of the MSI-X interrupt and the passing of that interrupt back to the OS. Many things can affect how long this takes, with implications for both latency and QoS.
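The two-bin split above can be sketched in a few lines. This assumes four timestamps per I/O: when the application submits, when the doorbell write appears on PCIe, when the completion appears on PCIe, and when the application sees the completion. Those four event names are my assumption for illustration, and the `IoTrace` type is hypothetical, not part of any analyzer tool.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class IoTrace:
    """Timestamps (in microseconds) for one I/O; field names are hypothetical."""
    app_submit_us: float    # application issues the read
    doorbell_us: float      # doorbell write seen on the PCIe link
    completion_us: float    # completion seen on the PCIe link
    app_complete_us: float  # application observes the completion

    @property
    def ssd_us(self):
        # The bin the SSD controls: doorbell to completion on the bus.
        return self.completion_us - self.doorbell_us

    @property
    def host_us(self):
        # The bin the host controls: the submission path before the
        # doorbell plus the completion path after the device is done.
        return (self.doorbell_us - self.app_submit_us) + (
            self.app_complete_us - self.completion_us
        )


def bin_stats(traces):
    """Average and maximum latency for each bin."""
    ssd = [t.ssd_us for t in traces]
    host = [t.host_us for t in traces]
    return {
        "ssd": {"avg_us": fmean(ssd), "max_us": max(ssd)},
        "host": {"avg_us": fmean(host), "max_us": max(host)},
    }
```

Reporting the maximum next to the average is the point: a host bin can look fine on average and still have a long tail, which is what hurts QoS.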
Fixing the Driver!
So how do we fix the interrupt issue? Interestingly, there are a couple of separate activities going on right now to address this:
Intel just launched the Storage Performance Development Kit (SPDK), which attempts to improve the performance of NVMe SSDs. You can learn more about SPDK here. SPDK tries to address the issue I raised above in two ways:
- It polls the completion queues rather than using MSI-X interrupts.
- It runs in user space, which avoids the context switches between user space and kernel space.
Separately, there is work going on within the Linux kernel block layer to add polling to block devices (including NVMe devices). You can view the current codebase for this here.
We compared these two methods (SPDK and the polling driver) with the traditional driver. The results for latency, CPU load and throughput (at QD=1) are summarized below.
As you can see, SPDK and the polling driver achieve better IOPS and latency QoS at the cost of increased CPU load. The polling driver performs slightly worse than SPDK but does have the advantage that it is tied into the block layer of the Linux kernel and can therefore provide services SPDK cannot. In addition, the kernel community is working to improve the polling driver prior to its integration into the upstream kernel.
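If you have not seen completion-queue polling before, the idea can be sketched with a small simulation. This is neither SPDK's code nor the kernel's. It is a toy model of an NVMe-style completion queue in which each entry carries a phase bit that flips on every pass through the ring, so the host can tell a fresh entry from a stale one without any interrupt. The class and function names are hypothetical, and the simulated device does not check for a full queue.

```python
import time
from dataclasses import dataclass


@dataclass
class CqEntry:
    command_id: int = 0
    phase: int = 0  # entries start at phase 0; the first pass writes 1


class CompletionQueue:
    """Toy completion ring shared by a simulated device and a polling host."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = [CqEntry() for _ in range(depth)]
        self._tail = 0           # device side: next slot to write
        self._device_phase = 1   # device side: phase written on this pass
        self.head = 0            # host side: next slot to inspect
        self.expected_phase = 1  # host side: phase that marks a new entry

    def device_post(self, command_id):
        """Simulated device: write a completion, then advance the tail."""
        entry = self.entries[self._tail]
        entry.command_id = command_id
        entry.phase = self._device_phase  # phase is written last
        self._tail += 1
        if self._tail == self.depth:
            self._tail = 0
            self._device_phase ^= 1  # flip phase on wrap-around

    def poll_once(self):
        """Host: return a completed command id, or None if nothing is new."""
        entry = self.entries[self.head]
        if entry.phase != self.expected_phase:
            return None  # stale entry from an earlier pass, or never written
        command_id = entry.command_id
        self.head += 1
        if self.head == self.depth:
            self.head = 0
            self.expected_phase ^= 1
        return command_id


def wait_by_polling(cq, timeout_s=1.0):
    """Spin on the queue until a completion arrives; returns (id, spins)."""
    deadline = time.monotonic() + timeout_s
    spins = 0
    while time.monotonic() < deadline:
        command_id = cq.poll_once()
        if command_id is not None:
            return command_id, spins
        spins += 1  # every spin is CPU time spent waiting
    raise TimeoutError("no completion before the deadline")
```

The `spins` counter is the trade-off in miniature: the host notices the completion as soon as it lands, with no interrupt delivery in the path, but it burns CPU cycles the whole time it waits.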
As SSDs get faster and faster and start to take advantage of NG-NVM (e.g. PMC Flashtec NVRAM or Intel’s Optane SSDs), the overhead of servicing the I/O in a driver and OS is becoming more marked. The things we used to do because storage was slow (interrupts) make less sense now that storage can be very fast. This has implications all across the compute stack, from caching to tiering to fast primary storage. This is a fundamental shift that offers unprecedented opportunity. This work is feeding its way into operating systems, applications and even computer hardware. I will touch on this more in the next blog post.