Last month I attended Flash Memory Summit (FMS) 2014 in Santa Clara, California. FMS is probably the biggest conference and exposition of NVM technology. It combines technical tracks with a huge exposition and is a great place to catch up and hobnob with like-minded experts in NVM.
PMC was very well represented at FMS. We presented eight technical papers, gave a keynote speech and launched our Flashtec NVRAM product. I gave a talk entitled “Accelerating Data Centers Using NVMe and CUDA”, based on a PMC CTO project codenamed Donard. In this blog post I want to dig a little deeper into the paper I presented and some of its implications for acceleration in data center (DC) environments.
As most of you know, Solid-State Drives (SSDs) change the dynamics of a server considerably. In the pre-SSD era, HDDs could, at best, read a few hundred MB/s. Now a single enterprise PCIe SSD, such as those based on the PMC Flashtec Controllers, can reach several GB/s, the equivalent of almost a million 4KB IOPS. This deluge of data will swamp any CPU as soon as it is asked to manipulate the data stream in any way. In addition, the CPU is required to stage this data in its DRAM prior to processing, as illustrated in Figure 1, and this has three undesirable consequences:
- A portion of the DRAM bandwidth is consumed solely with the staging and de-staging of this data stream. This is bandwidth that cannot be used for more useful tasks.
- The staging requires some volume of DRAM and that means that the system needs more DRAM than might have been desired. DRAM is expensive and is often a scarce resource in DC servers. Adding even a few extra GB per server across a large DC could add hundreds of thousands of dollars to the cost and requires more power and cooling.
- The CPU may need to be more powerful than desired, simply to provide the required DRAM capacity and bandwidth.
The Donard solution to this problem is to add some computation elements on the PCIe fabric itself and enable PCIe Peer-to-Peer (p2p) communication between all PCIe devices. This leads to the rebalanced situation given in Figure 2, where data no longer needs to flow into the CPU and be staged in DRAM. Instead, data can flow directly from the SSD or network device into the processing element.
The analogy I like to use for the Donard approach is that the CPU goes from being all the musicians in an orchestra to becoming the conductor. The CPU can still manage flows, impose security policies and quality of service, as well as provide management/reliability services. However, since it is no longer in the path of the data itself, the requirements upon it are much less onerous. This has several interesting consequences:
- The server may be able to use a less powerful CPU and hence save money, power and cooling.
- The server may be able to use less DRAM and hence save money, power and cooling.
- The server may be able to simplify its PCIe fabric and hence save money, power and cooling.
Enabling p2p between PCIe devices is not trivial, but it is very possible. In fact, companies like Nvidia have been doing this for some time with technologies like GPU Direct, which allows p2p communication between certain types of Nvidia GPU devices. In addition, Nvidia has worked with Mellanox to demonstrate that p2p can also work between heterogeneous devices.
What we did in our paper was to enable p2p communication to and from a block storage device, and to do it using an open-source driver (the NVM Express Linux driver) as opposed to a proprietary one. The advantage of using an open-source driver is that it can be shared and worked on by the entire developer community and is not locked to any specific hardware. In fact, the work we present here should work with any NVM Express (NVMe)-compliant device.
In order to enable p2p communication between NVMe SSDs and other PCIe devices, we needed to make some modifications to the Linux kernel and extend the NVMe driver to add new IOCTLs that tie into the p2p capabilities of other PCIe devices. Our first experiments connected an NVMe SSD (based on our Flashtec controllers) to an Nvidia K20c GPU card. The results of this p2p enablement versus the normal approach are shown in Figure 3.
As you can see from Figure 3, the p2p approach both increased bandwidth and reduced the DRAM requirements associated with transferring data from the NVMe SSD to the GPU.
My next blog post will continue to explore the Donard approach and add RDMA into the mix. Stay tuned for that and, in the meantime, feel free to reach out to me via a comment and follow me on Twitter @stepbates.