I recently saw the slides that Fujitsu is using for the Hot Chips conference and noted that Fujitsu is using the Lustre file system in its planned exascale project which will be competing with U.S. exascale plans. A number of fairly large companies, actually many leaders in our industry, are working on large or parallel file systems. Some of these industry storage leaders that are working on parallel file systems solutions include Intel, EMC, Seagate, Hitachi, NetApp all with the Lustre file system and IBM with their GPFS file system.
So what do these high-performance computing (HPC) file systems have to do with you and why should you care?
The Problem with REST and HDFS
The industry is rapidly moving to REST interfaces, but there are still some limitations on using REST. However, because REST protocol is not as rich as the POSIX framework, some current applications that have to rewrite data with system calls for writing data that has already been written.
Also, though it is really an application-specific file system, many people are using HDFS. The problem is, in my opinion, that HDFS is very limited in what can do beyond supporting MapReduce. HDFS is good a large block I/O, but other than that, it is pretty limited.
In addition, we have at least three decades of older applications that will need to be rewritten for new file system interfaces, such as those with a REST interface. Making the transition and porting code is going to take many billions of dollars and is not going to happen overnight.
Why HPC File Systems and What Is Missing from Network Storage
Many current NFS- and CIFS-based NAS systems lack scaling for both data and metadata. Some might support a petabyte or two or even 8 or maybe 100 million files, but what currently supported NFS- or CIFS-based NAS system supports more than 1 billion files, 1 TB/sec of sustained bandwidth and 50 PB of storage space? There are a few that might be able to do the 50 PB or and might even do the 1 billion+ files, but not a single one can meet the bandwidth numbers. And bandwidth performance is important, as is having all of the files in a single namespace.
Scale Up Performance
In order to use the system efficiently, scaling bandwidth performance with the number of PB is important. There are NAS systems that might be able to look up thousands of clients via NFS, but do they operate as efficiently with tens or hundreds of clients as they do with thousands? NFS and CIFS are limiting factors given how the protocols work and how much CPU and overhead they require.
Parallel file systems do not use NFS or CIFS but instead have native optimized client interfaces that allow them to scale the performance and efficiently use more than ten thousand disk drives, getting a high percentage of bandwidth from each drive. Supporting 10, 20 or 30 thousand drives is great, but you also need a file system and, of course,the underlying hardware need to scale nearly linearly with drive, controller and client count.
If you think you can solve the performance issue by using flash rather than disk drives then think again because file system scalability problems cannot be completely solved with flash.
File system developers need to address allocation issues, metadata consistency issues and data streaming issues. The streaming I/O issues can be solved with faster storage, but locking metadata and allocations are a function of design. They might be sped up somewhat with improved hardware, but hardware cannot solve the underlying problems.
Namespace management is not fun for administrators and is costly in terms of the overhead of managing lots of file systems. Supporting more than a billion files is no easy task, given how most if not all NAS file systems are designed. Additionally, NFS and CIFS were not designed to efficiently support say 50,000 open/create system calls for new files nor say 200,000 stat() system calls per second, all at the same time.
HPC file systems are designed with consistency in mind. You have had the experience writing data from one NFS or CIFS mount and trying to read from another. Because of the performance limitations of NFS and CIFS, most systems administrators use caching to improve the performance, which is fine if only one client is accessing a file or for read only. If you are doing reads and writes to the same file, then this becomes a problem. HPC file systems solve the problem; NFS and CIFS do not. With REST it is unclear to me what happens in the protocol if there are reads and rewrites to the same file.