Adding Remote Direct Memory Access
In my last blog post I introduced a project called Donard that implements peer-to-peer (p2p) data transfers between NVM Express devices and PCIe GPU cards. We showed how p2p transfers can improve performance and offload the CPU, which can save power compared to classical non-p2p transfers.
In this article we extend Donard by adding RDMA-capable Network Interface Cards (NICs) to the mix of PCIe devices that can talk in a p2p fashion. This is important because it brings the third part of what I call the PCIe trifecta to the Donard program. The three elements of the trifecta are storage, compute and networking.
Donard and RDMA
RDMA capabilities come in many flavors, and we wanted to be as agnostic as possible to the choice of protocol underneath RDMA. At the time of writing, the three main flavors of RDMA are iWARP (which runs over TCP/IP), InfiniBand and RoCE (which runs on Converged Ethernet, a specially modified version of the base Ethernet protocol). I don’t want to get too involved in the pros and cons of these three flavors, but all of them implement the InfiniBand Verbs; as such, that API became a suitable point at which to tie in our p2p capabilities.
As we also aspired to use open standards as much as possible, we implemented our work using driver-level code from the OpenFabrics Alliance. This means the work presented here should work on any future fabric that is compliant with the OpenFabrics Enterprise Distribution (OFED) framework (e.g. Omni-Path from Intel).
Experimental Setup and Results
Implementing a p2p memory read or write from an RDMA-capable NIC into or out of an NVM Express 1.1 SSD is not possible because the NVMe SSD, unlike a GPU card, does not expose I/O memory (i.e. a PCIe Base Address Register, or BAR). However, as discussed in the last Donard blog post, PMC has a product called the Flashtec NVRAM Drive that does expose a BAR, so users can access its contents with either block semantics (using NVMe) or character semantics (using a driver provided by PMC). This is what we used for the experiment discussed below.
It is worth pointing out that the NVM Express organization ratified NVMe 1.2 in November 2014, and that version adds the ability to expose I/O memory (a Controller Memory Buffer). This standardizes what PMC currently does in its NVRAM card and should mean more NVM Express products with both block and character access modes will appear on the market in due course.
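To make the difference between the two access modes concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: a temporary file stands in for the device, the 4096-byte block size is hypothetical, and the helper names (`block_write`, `block_read`, `demo`) are mine rather than anything from the Donard code or the PMC driver. On real hardware the memory-mapped window would come from the BAR via the vendor driver, not from a regular file.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 4096  # hypothetical logical block size, for illustration only


def block_write(fd, lba, data):
    """Block semantics: move one whole block, addressed by block number."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("block I/O moves whole blocks only")
    os.pwrite(fd, data, lba * BLOCK_SIZE)


def block_read(fd, lba):
    """Block semantics: read one whole block, addressed by block number."""
    return os.pread(fd, BLOCK_SIZE, lba * BLOCK_SIZE)


def demo():
    # A temporary file stands in for the device (assumption, not real hardware).
    with tempfile.TemporaryFile() as f:
        f.truncate(4 * BLOCK_SIZE)
        fd = f.fileno()

        # Fill block 1 through the block interface.
        block_write(fd, 1, b"\xAA" * BLOCK_SIZE)

        # Memory semantics: map the same storage and update four bytes at an
        # arbitrary offset, with no read-modify-write of the whole block.
        with mmap.mmap(fd, 4 * BLOCK_SIZE) as window:
            offset = BLOCK_SIZE + 10
            window[offset:offset + 4] = b"p2p!"
            window.flush()

        # The block interface observes the byte-level update.
        return block_read(fd, 1)


if __name__ == "__main__":
    print(demo()[8:16])
```

The point of the sketch is that a memory-mapped window is byte-addressable, which is the property a p2p-capable NIC needs from its target, while the block path only ever moves whole blocks.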
Our experimental setup is given in Figure 3. We used Chelsio T520-CR iWARP cards as our RDMA-capable NICs and the PMC NVRAM card as our RDMA target. We also repeated the experiment with the GPU as the target device. Results are given in Table 1.
Note that since the Chelsio T520-CR is a 10GbE NIC, we saturate the physical link between the remote client and the server at 1170 MB/s (~9360 Mb/s), so we plan to repeat these experiments using 25GbE or 40GbE. We also plan to repeat the experiment using other fabric technologies, such as RoCE and InfiniBand, as soon as time and equipment allow.
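The link-saturation figure is easy to check by hand. The short Python sketch below does the unit conversion and compares the result against the nominal 10GbE line rate. It assumes decimal units (1 MB = 10^6 bytes, 10GbE = 10,000 Mb/s), and the function and constant names are mine, chosen for illustration.

```python
LINE_RATE_MBITS = 10_000  # nominal 10GbE line rate in Mb/s (decimal units assumed)
MEASURED_MBYTES = 1170    # throughput quoted in the text, in MB/s


def mbytes_to_mbits(mbytes_per_s):
    """Convert megabytes per second to megabits per second."""
    return mbytes_per_s * 8


def link_utilization(mbytes_per_s, line_rate_mbits=LINE_RATE_MBITS):
    """Fraction of the nominal line rate consumed by the measured throughput."""
    return mbytes_to_mbits(mbytes_per_s) / line_rate_mbits


if __name__ == "__main__":
    print(mbytes_to_mbits(MEASURED_MBYTES), "Mb/s")
    print(f"{link_utilization(MEASURED_MBYTES):.1%} of nominal line rate")
```

The remaining few percent is consistent with protocol overhead on the wire, which is why the link rather than the p2p path is the bottleneck in this setup.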
DRAM utilization was measured using the likwid-perfctr tool, and it is clear that the Donard approach has the lowest DRAM utilization of all three methods. Donard beats the GPU because NVIDIA requires certain overhead operations to be performed when accessing memory on its cards (pinning being an obvious example). In future experiments we plan to add latency measurements and look at RDMA reads as well as writes. We also aim to repeat the experiments using a PCIe switch to connect the PCIe devices rather than relying on the p2p capabilities of the root complex (in this case, Intel Sandy Bridge).
Donard Code Base
As previously mentioned, we implemented Donard with open-source paradigms in mind, and it would be remiss of us to not consider opening the Donard code to the community. As such, we have placed the Donard code on GitHub under a mix of Apache 2.0 and GPL 2.0 licensing. Any code we modified that was originally GPL we were required to keep GPL, but all our new code is available under Apache to allow others to take and use as they wish.
It is our hope that people in the community will use this code, make further improvements to it, and contribute it back to the code base. The git repository for the code is available here.
In this blog post we have added networking (via RDMA) to the existing Donard elements of storage and compute. We feel this makes the Donard concept much more powerful and showcases a simple application (an RDMA write cache) where the p2p access methods implemented in Donard lead to improvements in performance.
My next blog post will continue to explore the Donard approach. Stay tuned for that and, in the meantime, feel free to reach out to me via a comment here and follow me on Twitter @stepbates.