If you are a regular reader, you know Jeff Layton and I have similar approaches to technology issues and problem solving. We also both like file systems and work on storage. We recently ran into each other at a file system user group meeting and were discussing Linux file systems and file systems in general over a great Cuban dinner. Jeff and I came to the conclusion that, as it is often said, "Houston, we have a file system scaling problem."
A few years ago, I wrote an article in which I made some disparaging remarks about the scalability of Linux file systems. I received more than 30 emails telling me I did not know what I was talking about, some of them explicit and nearly threatening. Back in April, while I was doing a bit of research for a customer on the maximum size support for a number of file systems, I found the following information on Red Hat's web site.
File Systems and Storage Limits
|Maximum filesize (Ext3)||2TB||2TB||2TB||2TB|
|Maximum filesystem size (Ext3)||2TB||8TB||16TB||16TB|
|Maximum filesize (Ext4)||--||--||16TB||16TB|
|Maximum filesystem size (Ext4)||--||--||16TB||16TB|
|Maximum filesize (GFS)||2TB||16TB/8EB||16TB/8EB||N/A|
|Maximum filesystem size (GFS)||2TB||16TB/8EB||16TB/8EB||N/A|
|Maximum filesize (GFS2)||--||--||25TB||100TB|
|Maximum filesystem size (GFS2)||--||--||25TB||100TB|
|Maximum filesize (XFS)||--||--||100TB||100TB|
|Maximum filesystem size (XFS)||--||--||100TB||100TB|
Clearly, the file system community as a group has not taken my concerns to heart. There has been progress in some areas, but the goal of a 500 TB single name space Linux file system still seems years away.
What Is the Problem?
When Jeff and I talked over dinner about the merits and demerits of the file systems listed above, the biggest issue we had was that the size limits of the file systems are unbelievably small given the storage sizes we have today.
With 3TB drives, ext3/4 maxes out at five disk drives. Jeff and I thought that was just insane, given you can buy five 3TB drives at Fry's and put them in your desktop. XFS maxes out at 33 3TB disk drives, and even that is far too small in our opinion. Clearly, supported file system sizes have not scaled with disk drive sizes or the demand for big data. I have a home NAS device with six drives, and it is a good thing I did not get an eight-drive NAS; XFS is not supported by my NAS vendor, and I could be over the ext3 limit and thus have to create multiple file systems.
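The drive counts above are simple division, and a minimal sketch of the arithmetic follows. It assumes the supported limits from the last column of the Red Hat table and counts raw drive capacity only, ignoring RAID parity, hot spares and the difference between TB and TiB. The names `SUPPORTED_LIMIT_TB`, `DRIVE_SIZE_TB` and `max_whole_drives` are made up for illustration.

```python
# Supported file system size limits in TB, taken from the last column
# of the Red Hat table quoted earlier in this article.
SUPPORTED_LIMIT_TB = {"ext3": 16, "ext4": 16, "GFS2": 100, "XFS": 100}

# Assumption: raw 3TB drives, with no RAID parity or spare overhead.
DRIVE_SIZE_TB = 3


def max_whole_drives(limit_tb, drive_tb):
    """Return how many whole drives fit under a supported size limit."""
    return limit_tb // drive_tb


for fs_name, limit_tb in SUPPORTED_LIMIT_TB.items():
    drives = max_whole_drives(limit_tb, DRIVE_SIZE_TB)
    print(f"{fs_name}: {limit_tb}TB limit -> {drives} x {DRIVE_SIZE_TB}TB drives")
```

Changing `DRIVE_SIZE_TB` shows how quickly the supported limits shrink, in drive-count terms, as drives get larger.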
Laying the Groundwork
We already know there will be a number of naysayers, so let's answer some of your comments upfront (the proverbial "pre-emptive strike," as it were). A first possible comment is, "Why do you guys need any stinking support? Just download it and go debug it yourself. You know how." That might be true, but the fact of the matter is, it is not about us -- it is about the market reality in business today (and remember, we are both proponents of Linux, and Jeff writes a weekly article series about Linux and storage).
A second possible response is, "You guys are stupid and should not want file systems that big." Our response to this is that the reason you tell us not to want these big file systems is that they do not scale. We want them, and our customers want them. When an eight-drive NAS must be broken into multiple file systems, I think we have a broken file system development and support model.
Jeff and I did some checking and we believe, ultimately, there are two problems.
- In the current file system model, the listed file systems have metadata scaling problems for large counts of files at these large sizes. Although XFS is supported to 100TB, what we were told and have seen is that performance degrades with large file counts, especially when the metadata gets fragmented.
- As these file systems grow near their limits, streaming performance does not seem to scale linearly with size, and it degrades with fragmentation.
Jeff and I decided to look at the first problem because without scalable metadata, sustained large block performance does not really matter much in our opinion. I have always been a big proponent of fsck performance. Now some vendors have stated in the past that all you must do is check the logs, and you never need to verify the file system.
That is total and complete garbage. In all the times Jeff or I have seen hardware have a problem, neither of us has ever seen an operational file system be able to recover from a RAID or storage hardware problem 100 percent of the time. It is the nature of POSIX file systems, and the fact of the matter is that only metadata is logged -- not data, given the performance implications. Jeff and I speculated during dinner that one of the main reasons Linux puts size limitations on file systems is the amount of time it takes to run an fsck after a hardware incident. It is necessary to run fsck against the metadata (e.g., superblock, inodes, extents and directories) after a crash, and it is critically important after a hardware incident on the storage.