Storage Technology in Depth - DAFS
The emergence of network transports has given rise to new applications and services that take advantage of low-overhead access to the network interface from user or kernel address space; remote direct memory access (RDMA); transport protocol offloading; and, hardware support for event notification and delivery. The Direct Access File System (DAFS) is a new commercial standard for file access over this new class of networks. DAFS grew out of the DAFS Collaborative, an industrial and academic consortium led by Network Appliance and Intel. DAFS file clients are usually applications that link with user-level libraries that implement the file access Application Programming Interface (API). DAFS file servers are implemented in the kernel.
DAFS clients use a lightweight Radio Port Controller (RPC) to communicate file requests to servers. In direct read or write operations, the client provides virtual addresses of its source or target memory buffers; and, data transfer is done using RDMA operations. RDMA operations are always issued by the server.
Optimistic DAFS Clients
In DAFS direct read and write operations, the client always uses an RPC to communicate the file access request along with memory references to client buffers that will be the source or target of a server-issued RDMA transfer. The cost associated with always having to do a file access RPC is manifested as an unnecessarily high latency for small accesses from server memory. A way to reduce this latency is to allow clients to access the server file and virtual memory (VM) cache directly, rather than having to go each time through the server vnode interface via a file access RPC.
An Optimistic DAFS improves on the existing DAFS specification by reducing the number of file access RPC operations needed to initiate file I/O and replaces them with memory accesses using client-issued RDMA. Memory references to server buffers are given out to clients or other servers that maintain cache directories. They are then allowed to use those references to directly issue RDMA operations with server memory. To build cache directories, the server returns to the client a description of buffer locations in its VM cache. These buffer descriptions are returned either as a response to specific queries (i.e., client asks: give me the locations of all your resident pages associated with file'') or piggybacked in the response to a read or write request (i.e., server responds: ``here's the data you asked for, and by the way, these are their memory locations that you can directly use in the future''). In Optimistic DAFS, clients use remote memory references found in their cache directories, but accesses succeed only when directory entries have not become stale. For example, this is the result of actions of the server pageout daemon. There is no explicit notification to invalidate remote memory references previously given out on the network. Instead, remote memory access exceptions thrown around by the target NIC and caught by the initiator NIC, can be used to discover invalid references and a switch to a slower access path using file access RPC. Therefore, by maintaining the NIC memory management unit in the case where RDMA can be remotely initiated by a client at any time is tricky, and needs special NIC and OS support. Finally, an Optimistic DAFS requires maintenance of a directory on file clients (in user-space) and on other servers (in the kernel).
Kernel Support For DAFS Servers
Special capabilities and requirements of networking transports used by DAFS servers expose a number of kernel design and structure issues. In general, a DAFS file server needs to be able to:
- Do asynchronous file I/O.
- Integrate network and disk I/O event delivery.
- Lock file buffers while RDMA is in progress.
- Avoid memory copies.
Now, lets look at various kernel support mechanisms for DAFS servers. What follows is a description of each kernel support mechanism:
- Event-Driven Design Support: Consists of kernel asynchronous file I/O interfaces and integrating network and file event notification and delivery.
- Vnode Interface Support: P Consists of a vnode interface designed to address these needs.
- VM System Support: Consists of kernel support for memory management of the asymmetric multiprocessor system that consists of the NIC and the host CPU.
- Buffer Cache Locking: Consists of modifications to buffer cache locking.
- Device Driver Support: Consists of device driver requirements of memory-to-memory NIC.
Event-Driven Design Support
An area of considerable interest in recent years has been that of event-driven application design. Event-driven servers avoid much of the overhead associated with multithreaded designs, but require truly asynchronous interfaces coupled with efficient event notification and delivery mechanisms integrating all types of events. The DAFS server requires such support in the kernel.
Vnode Interface Support
Vnode/VFS is a kernel interface that separates generic file-system operations from specific file-system implementations. It was conceived to provide applications with transparent access to kernel file-systems, including network file-system clients such as a Network File System (NFS). The vnode/VFS interface consists of two parts: VFS defines the operations that can be done on a file-system. Vnode defines the operations that can be done on a file within a file-system.
VM System Support
Maintaining virtual/physical address mappings and page access rights, used by the main CPU memory-management hardware, is done by the machine-dependent physical mapping (pmap) module. Low level machine-independent kernel code such as the buffer cache, kernel malloc and the rest of the VM system, are using pmap to add or remove address mappings and alter page access rights.
Symmetric multiprocessor (SMP) systems sharing main memory can use a single pmap module as long as translation lookaside buffers (TLB) on each CPU are kept consistent. Pmap operations apply to page tables shared by all CPU. TLB miss exceptions thrown by a CPU, result in a lookup for mappings in the shared page tables. Invalidations of mappings are applied to all CPUs.
Memory-to-memory NIC, store virtual-to-physical address translations, and access rights for all user and kernel memory regions directly addressable and accessible by the NIC. Main CPUs use their on-chip translation look-aside buffer (TLB) to translate virtual to physical addresses. A typical TLB page entry includes a number of bits such as verification/validation (V) and Analog Control Channel (ACC)--signifying whether the page translation is valid, and what the access rights to the page are, along with the physical page number. A miss on a TLB lookup requires a page table lookup in main memory. NIC on the Protocol Capability Indicator (PCI--(or other I/O)) bus have their own translation and protection (TPT) tables. Each entry in the TPT includes bits enabling RDMA Read or Write (i.e., the W bit in the diagram) operations on the page; the physical page number; and, a Ptag value identifying the process that owns the pages (or the kernel). Whereas the TLB is a high-speed associative memory, the TPT is usually implemented as a dynamic random access memory (DRAM) module on the NIC board. To accelerate lookups on the TPT, remote memory access requests carry a Handle index that helps the NIC find the right TPT entry.
Buffer Cache Locking
In an RDMA-based data transfer, the server sets up the RDMA transfer in the context of the requesting RPC. Once issued, the RDMA proceeds asynchronously to the RPC. The latter does not wait for RDMA completion. To serialize concurrent access to shared files in the face of asynchrony, the vnode (vp) of a file needs to be locked for the duration of the RPC. However, the data buffers (bp's) transferred need to be locked for the full duration of the RDMA. Locking the vp (i.e., the entire file) for the duration of the RDMA would also work, but would limit performance in case of sharing, since requests for non-overlapping regions of a file would have to be serialized.
A multithreaded event-driven kernel server that directly uses the buffer cache and does event processing in kernel process context, faces problems in the following circumstances: When a thread tries to lock a buffer, it is already locking (because a transfer is in progress on that buffer) and expecting to block until that lock is released by some other thread; and, when a buffer is released from a different thread than the one that locked it.
Transferring lock ownership to the kernel during asynchronous network I/O, does not help, since the lock release is done by some kernel process (whichever happens to have polled for that particular event), rather than by the kernel itself. The solution presently used, is for the kernel process that issued an RDMA operation to wait until the transfer is done in order to release the lock. This also prohibits that process from trying to lock the same buffer again, thus causing a deadlock panic. A better solution is to enable recursive locking and allow lock release by any of the server threads.
Device Driver Support
Memory-to-memory network adapters virtualize the NIC hardware and are directly accessible from user space. One such example is the virtual interface (VI)--where the NIC implements a number of VI contexts. Each VI is the equivalent of a socket in traditional network protocols, except that a VI is directly supported by the NIC hardware and usually has a memory-mapped rather than a system call interface. The requirement to create multiple logical instances of a device, each with its own private state (separate from the usual device softcopy state), and to map those devices in user address spaces, requires new support from Berkeley Software Design (BSD) kernels.
Network Driver Model
Finally, network drivers in BSD systems are traditionally accessed through sockets and do not appear in the file-system name space (i.e., under /dev). User-level libraries for memory-to-memory network transports require these devices to be opened and closed multiple times, with each opened instance appearing as a separate logical device, maintaining a private state, and be memory-mapped.
Summary And Conclusions
As previously explained, the Direct Access File System (DAFS) is an emerging commercial standard for network-attached storage on server cluster interconnects. The DAFS architecture and protocol, leverage network interface controller (NIC) support for user-level networking, remote direct memory access, efficient event notification, and reliable communication. This article demonstrated how the current server structure can attain read throughput of more than 100 MB/s over a 1.25 Gb/s network, even for small (i.e., 4K) block sizes, when pre-fetching and using an asynchronous client API. Finally, to reduce multithreading overhead, you should integrate the NIC with the host virtual memory system.
About the Author :John Vacca is an information technology consultant and author. Since 1982, John has authored 36 technical books including The Essential Guide To Storage Area Networks, published by Prentice Hall. John was the computer security official for NASA's space station program (Freedom) and the International Space Station Program, from 1988 until his early retirement from NASA in 1995. John can be reached at firstname.lastname@example.org.
See All Articles by Columnist John Vacca