Analysis of Metadata About Metadata

Tuesday Aug 20th 2013 by Jeff Layton
Gathering and analyzing metadata about your metadata can help administrators make important decisions about storage needs.

You may have heard that metadata is data about data (if that makes any sense). But how about metadata about metadata (data about data about data)?

Actually this is an important topic—understanding and monitoring how and when files are created, used, modified and removed. This information can tell you a great deal about what's happening with your data.

Quick questions: Can you quickly tell me the average file size on your storage system? Can you tell me the average file age?

Being able to answer these questions quickly can help you determine whether you need new storage, what type of storage you need, and whether you should archive old data, move it to something cheaper, or perhaps just get rid of it (I know, heresy, right?). Gathering information about the files on a file system, that is, gathering metadata about the metadata, is a critical and very neglected task in the storage world.

Suppose you are responsible for the budget for data storage. An admin or a user comes to you and says they are running out of space or that the performance of the storage is slow and they need new storage. I don't know about you, but I would want them to explain why. That means they need to explain what's going on with the storage. A friend of mine calls it "showing your math." If you are the administrator you will need to be able to present some understandable and useful information. For example,

  • How quickly is the capacity being consumed? Be sure to show this with some detail (not just a global view). For example, a simple chart that illustrates which user or group of users is consuming capacity the quickest is important.
  • How much "old" data is there? (Maybe create a histogram showing the age of the files.)
  • Who has the oldest files? (histogram by user and/or group)
  • If you want a faster storage solution, can you show why you need one? Maybe you could show that you have a large number of files, lots of them are really small or really large, etc.

Plus, if I were responsible for the purse strings for new storage I would not like to be blind-sided with such a request at the last minute. I would like to have some idea that this request is coming so I can prep my management and also prep the funding channels. Therefore, it is a good idea to present this information (trends) to management over time. Maybe once a quarter or so you can do a quick presentation of the information?

There are some commercial tools that can do some of this. For example, spaceobserver can provide a great deal of metadata information about Windows file systems with the ability to sort it and view it. But it can't describe everything and isn't flexible enough (plus it doesn't do non-Windows file systems).

There are some open-source tools that might help as well. For example, fsstats is a Perl script that walks file trees and gathers some file system information and presents it to the user with a breakdown of the file data.

These tools, while providing some information, didn't provide everything I wanted in a way that was useful to me. Therefore, I decided to write my own tools so that I can understand a bit more about my data. My goal is to help you learn a bit about how important metadata of your metadata can be.

Tools, Tools, Tools

I chose to divide the tool into two pieces. The first one just walks a file tree and gathers metadata. The second piece takes two or more of these collections and performs a statistical analysis on them and creates a report. These are not intended to be final tools for production use, but rather starting points for developing your own tools or your own approach.

I've chosen to write the tools in Python because there are a wide variety of libraries, modules and tools that can be used, making my life easier. A few years ago, I wrote a simple tool, FS_scan, that would walk a file tree, gather the file statistics and create a Comma-Separated Values (CSV) file that you could use in a spreadsheet. I'm going to start with that tool but separate the fundamental tasks of (1) gathering the data, and (2) processing the scan(s) for statistical information. The tools can be downloaded from this page.

Just a quick note about the tools and the programming style. I'm sure my Python coding style isn't really what is considered "pythonic." I've tried to use more modern features of Python such as iterators, but my overall style is not typical of Python developers. Plus, I use comments to indicate the end of a loop or an if statement. I've found that these help my coding. Comments and suggestions about the coding are always appreciated.

Gathering the Data

I won't spend too much time on the data gathering tool since much of the detail is in the older article. But for completeness' sake, let me explain a bit about it.

Python has a module called os that allows you to walk a file system and also gather "stat()" information on the file using the os.stat function (method). As a result, the tool can easily gather the following information:

  • The size of the file in bytes
  • The ctime of the file (change time)
  • The mtime of the file (modify time)
  • The atime of the file (access time)
  • The uid and gid of the file

The three "times" are output as seconds since the epoch but are easy enough to convert to something more meaningful using the Python time module.
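As a rough illustration of this step (a sketch, not the actual fsscan source), the walk and the stat() call can be combined in a few lines. The function names `scan_tree` and `epoch_to_text` and the dictionary layout of each record are assumptions made for this example.

```python
import os
import time


def scan_tree(root):
    """Walk root and return one metadata record (a dict) per file found."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                # Broken symlink, vanished file, or no permission: skip it.
                continue
            records.append({
                "path": path,
                "size": st.st_size,    # bytes
                "ctime": st.st_ctime,  # seconds since the epoch
                "mtime": st.st_mtime,
                "atime": st.st_atime,
                "uid": st.st_uid,
                "gid": st.st_gid,
            })
    return records


def epoch_to_text(seconds):
    """Convert seconds since the epoch to a readable local-time string."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(seconds))
```

Calling `scan_tree(os.getcwd())` would return a list with one dictionary per file under the current directory.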

The uid and gid are converted to "real" names, if possible, using the Python module/functions, pwd.getpwuid and grp.getgrgid. I like to do this because if you just store the numeric values and process the files on a different system, you may get different mappings from uid/gid to actual names. So I like to do the uid/gid "decoding" before I store the data.
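A minimal sketch of that decoding, assuming a Unix-like system where the pwd and grp modules are available. The helper names `owner_name` and `group_name` are made up for this example, and falling back to the numeric id as a string is one possible choice when no name is found.

```python
import grp
import pwd


def owner_name(uid):
    """Return the login name for uid, or the number as text if it is unknown."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def group_name(gid):
    """Return the group name for gid, or the number as text if it is unknown."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)
```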

The data gathering code, which I refer to as fsscan, stores the data in what is called a pickle file. This process takes Python objects and converts them to byte streams for actual writing. Reading the byte stream is just the reverse. Pickling allows you to take Python data structures and write them to a file in a single function.
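For readers who have not used pickle before, the round trip looks roughly like this. It is a sketch with hypothetical helper names (`save_scan`, `load_scan`), not the code from fsscan.

```python
import pickle


def save_scan(records, filename="file.pickle"):
    """Write the list of records to filename as a pickle byte stream."""
    with open(filename, "wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_scan(filename="file.pickle"):
    """Read the records back; the result equals what was saved."""
    with open(filename, "rb") as handle:
        return pickle.load(handle)
```

One caution: unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.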

Fsscan takes two options, "-d" and "-o", each followed by an argument. The first option allows you to specify the root directory for the scan by passing the full path to the starting directory. If you don't specify a directory, the code will use the current working directory (cwd). The second option specifies the name of the output pickle file. By default it uses "file.pickle" and puts it in the directory where the code is executed.
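One way to handle two options like these is with the standard argparse module. This is only a sketch: how fsscan actually parses its command line is not stated here, and the attribute names `root` and `output` are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse -d (scan root) and -o (output pickle file) with the stated defaults."""
    parser = argparse.ArgumentParser(
        description="Scan a file tree and pickle its metadata.")
    parser.add_argument("-d", dest="root", default=os.getcwd(),
                        help="root directory of the scan (default: cwd)")
    parser.add_argument("-o", dest="output", default="file.pickle",
                        help="name of the output pickle file")
    return parser.parse_args(argv)
```

With no arguments, `parse_args([])` returns the current working directory as the root and "file.pickle" as the output name.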

The advantage of this scanning code is that you can break up a file system into a number of pieces and either scan each piece at the same time or scan them one at a time to reduce the load on the file system hardware.

Processing the Data

This is where I want to spend a majority of the explanation in this article, discussing how I process the file system metadata and what type of metadata I want to create from it. As an example of what one could do, I wrote a simple postprocessing code in Python as well because I want an easy way to create plots. The analysis code, which I call mdpostp (metadata post-processing), reads in a file that contains a list of the pickle files to be analyzed. It then reads each pickle file in turn, doing a statistical analysis on each one (recall that a single pickle file is a file tree scan). At this time the following aspects are analyzed by mdpostp:

  • Mtime age statistics where mtime age is the time difference between when the analysis is run and the mtime (modify time) of the particular file. The age is presented in days.
    • The oldest file
    • The youngest file
    • The average file age
    • The standard deviation of mtime age
    • Intervals for file age (in days)
    • There is also a list of the Top 10 oldest files (the number of files in the "top" list is controllable in the script).
    • A histogram of the mtime age of all files
  • Ctime age statistics where ctime age is the time difference between when the analysis is run and the ctime (change time) of the particular file. The age is presented in days.
    • The oldest file
    • The youngest file
    • The average file age
    • The standard deviation of ctime age
    • Intervals for file age (in days)
    • There is also a list of the Top 10 oldest files (the number of files in the "top" list is controllable in the script).
    • A histogram of the ctime age of all files
  • Ctime–Mtime time difference statistics (difference between the two times). The differences are presented in days.
    • The oldest file based on ctime-mtime
    • The youngest file based on ctime-mtime
    • The average file age based on ctime-mtime
    • The standard deviation of ctime-mtime
    • Intervals for ctime-mtime file age (in days)
    • There is also a list of the Top 10 oldest files (the number of files in the "top" list is controllable in the script).
    • A histogram of the ctime-mtime age of all files
  • Atime age statistics where atime age is the time difference between when the analysis is run and the atime (access time) of the particular file. The age is presented in days.
    • The oldest file
    • The youngest file
    • The average file age
    • The standard deviation of atime age
    • Intervals for file age (in days)
    • There is also a list of the Top 10 oldest files (the number of files in the "top" list is controllable in the script).
    • A histogram of the atime age of all files
  • File size statistics:
    • The smallest file (in KB)
    • The largest file (in KB)
    • The average file size in KB
    • Intervals for file size (in KB)
    • A list of the Top 10 largest files (the number of files in the "top" list is controllable in the script).
    • A histogram of all the file sizes
  • Biggest users list. This is the Top 10 biggest users in terms of capacity (the number of files in the "top" list is controllable in the script).
  • Biggest group users list. This is the Top 10 biggest group users in terms of capacity (the number of files in the "top" list is controllable in the script).
  • Duplicate list. The analysis code will search the scan file for duplicate files. It determines if the files are the same by comparing the file name and the file size in bytes. If both are the same, then the file is said to be a duplicate. The output lists the "root file," which is the first file in the list, and then the duplicate files that match the "root file."
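Several of the items above reduce to "group the records and sum." As an example, a "biggest users" list can be sketched with collections.Counter. The record keys "owner", "group" and "size" and the function name `biggest_consumers` are assumptions for illustration and may not match what mdpostp does internally.

```python
from collections import Counter


def biggest_consumers(records, field="owner", count=10):
    """Sum file sizes per value of `field` and return the `count` largest totals."""
    usage = Counter()
    for rec in records:
        usage[rec[field]] += rec["size"]
    # most_common() sorts by total size, largest first.
    return usage.most_common(count)
```

Passing `field="group"` would give the corresponding list by group, and `count` plays the role of the adjustable "top" list length.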

This type of information is just a start of the things I like to initially examine, but it gives me a good snapshot of what is happening in the file system before I dive in deeper. There are lots of other statistics we could develop, but that starts to get a little more specific to your needs. I hope the Python scripts are easy to understand so that you can add your spin on things.

The code produces some output to stdout but it also creates an html file that has all of the same data, as well as some plots. The file, report.html, is in a subdirectory, HTML_REPORT. You can just open the file with your browser to read the report. While I like seeing stdout for immediate results I also like to create an html file for a more detailed report.

Let Loose the Hounds!

The point of this article is not to develop tools for analyzing the file system, but to actually start analyzing the file system. As an example of this, I wanted to analyze my home directory on my home system. It might not be very interesting in some respects because it's a single user, but I think it serves the purpose of exploring how one looks at file system information.

I have an external USB drive that I use for booting Linux on a laptop (I actually like it better than running Linux in a VM). I do some coding, article writing, etc. on the drive, so I will use it as my test example. I ran fsscan on my home directory and then processed it with mdpostp. The first bit of output from the analysis reveals that there were 54,036 files in my home directory (didn't know I had that many).

Then the analysis focused on mtime age. That is, the difference between the time when I ran mdpostp and the mtime of each file. The first output is a summary of the overall statistics.

  • Average mtime age in days: 433.871 days
  • Oldest mtime age file in days: 3,131.168 days
  • Youngest mtime age file in days: 1.665 days
  • Standard deviation mtime age in days: 112.1403 days

The newest file is only 1.665 days old, and the oldest is almost 8.57 years old.
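A summary like this can be computed from the raw timestamps with the standard statistics module. The sketch below is an assumption about one reasonable way to do it; in particular, it is not stated whether mdpostp uses the population or the sample standard deviation, and the population form (`pstdev`) is used here.

```python
import statistics
import time

SECONDS_PER_DAY = 86400.0


def age_summary(timestamps, now=None):
    """Return average, oldest, youngest and standard deviation of age in days."""
    now = time.time() if now is None else now
    ages = [(now - stamp) / SECONDS_PER_DAY for stamp in timestamps]
    return {
        "average": statistics.mean(ages),
        "oldest": max(ages),
        "youngest": min(ages),
        "stdev": statistics.pstdev(ages),  # population standard deviation
    }
```

The same function works for ctime and atime; only the list of timestamps passed in changes.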

The next section of the output counts the number of files in certain time intervals. For example, it will count the number of files that are 1 day or younger based on mtime. The "interval" output for the example is the following:

[   0-   1 days]:    43  (  0.08%)  (  0.08% cumulative)
[   1-   2 days]:     0  (  0.00%)  (  0.08% cumulative)
[   2-   4 days]:    81  (  0.15%)  (  0.23% cumulative)
[   4-   7 days]:    43  (  0.08%)  (  0.31% cumulative)
[   7-  14 days]:    30  (  0.06%)  (  0.36% cumulative)
[  14-  28 days]:     9  (  0.02%)  (  0.38% cumulative)
[  28-  56 days]:  2854  (  5.28%)  (  5.66% cumulative)
[  56- 112 days]:     8  (  0.01%)  (  5.68% cumulative)
[ 112- 168 days]:   450  (  0.83%)  (  6.51% cumulative)
[ 168- 252 days]:  1449  (  2.68%)  (  9.19% cumulative)
[ 252- 365 days]:   269  (  0.50%)  (  9.69% cumulative)
[ 365- 504 days]: 48754  ( 90.23%)  ( 99.91% cumulative)
[ 504- 730 days]:    12  (  0.02%)  ( 99.94% cumulative)
[ 730-1095 days]:    29  (  0.05%)  ( 99.99% cumulative)
[1095-1460 days]:     0  (  0.00%)  ( 99.99% cumulative)
[1460-1825 days]:     0  (  0.00%)  ( 99.99% cumulative)
[1825-2190 days]:     2  (  0.00%)  ( 99.99% cumulative)
[2190-2920 days]:     1  (  0.00%)  (100.00% cumulative)
[2920-3650 days]:     2  (  0.00%)  (100.00% cumulative)
[3650-4380 days]:     0  (  0.00%)  (100.00% cumulative)
[4380-5110 days]:     0  (  0.00%)  (100.00% cumulative)
[5110-5840 days]:     0  (  0.00%)  (100.00% cumulative)

The vast majority of the files—about 90 percent—are between 365 days and 504 days (1 to 1.5 years).
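The interval counts can be reproduced with a sorted list of edges and a binary search. This is a sketch under two assumptions: each interval includes its lower edge and excludes its upper edge (which side mdpostp treats as inclusive is not stated), and ages outside the listed range are ignored.

```python
from bisect import bisect_right

# Interval edges in days, taken from the listing above.
EDGES = [0, 1, 2, 4, 7, 14, 28, 56, 112, 168, 252, 365, 504, 730,
         1095, 1460, 1825, 2190, 2920, 3650, 4380, 5110, 5840]


def interval_counts(ages, edges=EDGES):
    """Count how many ages fall in each [edges[i], edges[i+1]) interval."""
    counts = [0] * (len(edges) - 1)
    for age in ages:
        index = bisect_right(edges, age) - 1
        if 0 <= index < len(counts):  # skip ages outside the listed range
            counts[index] += 1
    return counts


def format_intervals(counts, edges=EDGES):
    """Render counts with per-interval and cumulative percentages."""
    total = sum(counts)
    running = 0
    lines = []
    for i, count in enumerate(counts):
        running += count
        pct = 100.0 * count / total if total else 0.0
        cum = 100.0 * running / total if total else 0.0
        lines.append("[%4d-%4d days]: %5d  (%6.2f%%)  (%6.2f%% cumulative)"
                     % (edges[i], edges[i + 1], count, pct, cum))
    return lines
```

Tracking the trend then amounts to saving these counts for each scan and comparing them from one quarter to the next.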

Then the output lists the 10 oldest files. Table 1 below lists the 10 oldest files based on mtime (modify time).

Table 1 - Top 10 oldest files based on mtime

Rank  File  Mtime Age (days)
#1  /home/laytonjb/.libreoffice/3/user/database/biblio/biblio.dbt    3,131.167  
#2  /home/laytonjb/.libreoffice/3/user/database/biblio/biblio.dbf    3,050.282  
#3  /home/laytonjb/.libreoffice/3/user/database/biblio.odb    2,645.208  
#4  /home/laytonjb/Documents/FEATURES/HPC_021/padnums.py    2,174.615  
#5  /home/laytonjb/Documents/FEATURES/HPC_021/test_padnums.py    2,174.612  
#6  /home/laytonjb/.libreoffice/3/user/basic/Standard/Module1.xba    822.345  
#7  /home/laytonjb/.libreoffice/3/user/basic/Standard/script.xlb    822.345  
#8  /home/laytonjb/.libreoffice/3/user/basic/Standard/dialog.xlb    822.345  
#9  /home/laytonjb/.libreoffice/3/user/basic/dialog.xlc    822.345  
#10  /home/laytonjb/.libreoffice/3/user/basic/script.xlc    822.345  
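A "top N oldest" list does not require sorting every record; heapq can pick out the N smallest timestamps directly. The sketch below assumes each record is a dictionary with a "path" key and a timestamp key, which is an illustrative layout rather than the format fsscan writes.

```python
import heapq

SECONDS_PER_DAY = 86400.0


def top_oldest(records, now, key="mtime", count=10):
    """Return (path, age in days) pairs for the `count` oldest files, oldest first."""
    # The oldest files are the ones with the smallest timestamps.
    oldest = heapq.nsmallest(count, records, key=lambda rec: rec[key])
    return [(rec["path"], (now - rec[key]) / SECONDS_PER_DAY) for rec in oldest]
```

Changing `key` to "ctime" or "atime" would produce the corresponding lists for the other timestamps.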



These same results are repeated for ctime age (change time). The first set of results provides the overall statistical summary.

  • Average ctime age in days: 432.346 days
  • Oldest ctime age file in days: 468.926 days
  • Youngest ctime age file in days: 1.665 days
  • Standard deviation ctime age in days: 110.9462 days

The ctime file age interval counts look like this:

[   0-   1 days]:    46  (  0.09%)  (  0.09% cumulative)
[   1-   2 days]:     0  (  0.00%)  (  0.09% cumulative)
[   2-   4 days]:    91  (  0.17%)  (  0.25% cumulative)
[   4-   7 days]:   125  (  0.23%)  (  0.48% cumulative)
[   7-  14 days]:     0  (  0.00%)  (  0.48% cumulative)
[  14-  28 days]:     1  (  0.00%)  (  0.49% cumulative)
[  28-  56 days]:  2808  (  5.20%)  (  5.68% cumulative)
[  56- 112 days]:     8  (  0.01%)  (  5.70% cumulative)
[ 112- 168 days]:   460  (  0.85%)  (  6.55% cumulative)
[ 168- 252 days]:  1736  (  3.21%)  (  9.76% cumulative)
[ 252- 365 days]:   270  (  0.50%)  ( 10.26% cumulative)
[ 365- 504 days]: 48491  ( 89.74%)  (100.00% cumulative)
[ 504- 730 days]:     0  (  0.00%)  (100.00% cumulative)
[ 730-1095 days]:     0  (  0.00%)  (100.00% cumulative)
[1095-1460 days]:     0  (  0.00%)  (100.00% cumulative)
[1460-1825 days]:     0  (  0.00%)  (100.00% cumulative)
[1825-2190 days]:     0  (  0.00%)  (100.00% cumulative)
[2190-2920 days]:     0  (  0.00%)  (100.00% cumulative)
[2920-3650 days]:     0  (  0.00%)  (100.00% cumulative)
[3650-4380 days]:     0  (  0.00%)  (100.00% cumulative)
[4380-5110 days]:     0  (  0.00%)  (100.00% cumulative)
[5110-5840 days]:     0  (  0.00%)  (100.00% cumulative)

Notice that the majority of the files are in the range of 365 to 504 days old (1 to 1.5 years).

Finally, the Top 10 oldest files based on ctime age are listed below.



Table 2 - Top 10 oldest files based on ctime

Rank  File  Ctime Age (days)
#1  /home/laytonjb/.mozilla/firefox/profiles.ini    468.926  
#2  /home/laytonjb/.wine/run_configure-wine_if_no_directories_present    468.926  
#3  /home/laytonjb/.config/chromium/Default/Extensions/bjbifppfjimgbjjanagdkdgdpnlcfpcp/1.0_0/script.js    468.926  
#4  /home/laytonjb/.config/chromium/Default/Extensions/bjbifppfjimgbjjanagdkdgdpnlcfpcp/1.0_0/manifest.json    468.926  
#5  /home/laytonjb/.config/chromium/Default/databases/Databases.db    468.926  
#6  /home/laytonjb/.config/chromium/Default/User StyleSheets/Custom.css    468.926  
#7  /home/laytonjb/.config/chromium/Default/Web Data    468.926  
#8  /home/laytonjb/.config/chromium/Default/Current Session    468.926  
#9  /home/laytonjb/.config/chromium/Default/History    468.926  
#10  /home/laytonjb/.config/chromium/Default/Login Data    468.926  



Notice that the oldest ctime age is only about 469 days, whereas some of the mtime ages are much older: over 3,100 days (about 8.57 years).

The next set of data will require some explanation, but I will cover that later in the article. The data is the ctime-mtime. That is the difference between the change time (ctime) and the modify time (mtime). The fundamental statistics for ctime-mtime are as follows:

  • Number of non-zero difference files: 1,339 of 54,036 files: (2.48%)
  • Average ctime-mtime age in days: 1.000 days
  • Oldest ctime-mtime age file in days: 2,665.000 days
  • Youngest ctime-mtime age file in days: 0.000 days
  • Standard deviation ctime-mtime age in days: 27.9886 days

Notice that only 1,339 of the 54,036 files (2.48%) have a ctime that differs from their mtime; for all of the other files the two times are identical, so the difference is 0. Also notice that the average ctime-mtime is 1 day and the youngest file, the smallest ctime-mtime value, is 0.000 days (very small indeed).
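The difference itself is simple to compute from a scan. The sketch below uses hypothetical record keys "ctime" and "mtime" and returns the differences in days along with the number of files whose two timestamps are not identical. Whether mdpostp rounds the differences before counting is not stated, so no rounding is applied here.

```python
SECONDS_PER_DAY = 86400.0


def ctime_mtime_differences(records):
    """Return (differences in days, count of files with a non-zero difference)."""
    diffs = [(rec["ctime"] - rec["mtime"]) / SECONDS_PER_DAY for rec in records]
    nonzero = sum(1 for diff in diffs if diff != 0)
    return diffs, nonzero
```

For the scan above, 1,339 non-zero differences out of 54,036 files works out to 1339 / 54036, or about 2.48 percent.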

The ctime-mtime file age interval counts look like this:

[   0-   1 days]: 52891  ( 97.88%)  ( 97.88% cumulative)
[   1-   2 days]:   641  (  1.19%)  ( 99.07% cumulative)
[   2-   4 days]:    69  (  0.13%)  ( 99.19% cumulative)
[   4-   7 days]:     5  (  0.01%)  ( 99.20% cumulative)
[   7-  14 days]:    11  (  0.02%)  ( 99.22% cumulative)
[  14-  28 days]:    61  (  0.11%)  ( 99.34% cumulative)
[  28-  56 days]:     3  (  0.01%)  ( 99.34% cumulative)
[  56- 112 days]:     2  (  0.00%)  ( 99.35% cumulative)
[ 112- 168 days]:   291  (  0.54%)  ( 99.89% cumulative)
[ 168- 252 days]:    11  (  0.02%)  ( 99.91% cumulative)
[ 252- 365 days]:    38  (  0.07%)  ( 99.98% cumulative)
[ 365- 504 days]:     8  (  0.01%)  ( 99.99% cumulative)
[ 504- 730 days]:     0  (  0.00%)  ( 99.99% cumulative)
[ 730-1095 days]:     0  (  0.00%)  ( 99.99% cumulative)
[1095-1460 days]:     0  (  0.00%)  ( 99.99% cumulative)
[1460-1825 days]:     0  (  0.00%)  ( 99.99% cumulative)
[1825-2190 days]:     3  (  0.01%)  (100.00% cumulative)
[2190-2920 days]:     2  (  0.00%)  (100.00% cumulative)
[2920-3650 days]:     0  (  0.00%)  (100.00% cumulative)
[3650-4380 days]:     0  (  0.00%)  (100.00% cumulative)
[4380-5110 days]:     0  (  0.00%)  (100.00% cumulative)
[5110-5840 days]:     0  (  0.00%)  (100.00% cumulative)

Notice that a vast majority of the files have a ctime and an mtime that differ by less than 1 day.

Finally, the Top 10 files with the largest ctime-mtime difference are listed below:



Table 3 - Top 10 oldest files based on ctime-mtime

Rank  File  Ctime-Mtime Difference (days)
#1  /home/laytonjb/.libreoffice/3/user/database/biblio/biblio.dbt    2,665.394  
#2  /home/laytonjb/.libreoffice/3/user/database/biblio/biblio.dbf    2,584.508  
#3  /home/laytonjb/.libreoffice/3/user/database/biblio.odb    2,179.434  
#4  /home/laytonjb/Documents/FEATURES/HPC_021/padnums.py    2,169.680  
#5  /home/laytonjb/Documents/FEATURES/HPC_021/test_padnums.py    2,169.678  
#6  /home/laytonjb/.local/share/akonadi/mysql.conf    467.252  
#7  /home/laytonjb/CLUSTERBUFFER/OTHER2/(file)    418.744  
#8  /home/laytonjb/CLUSTERBUFFER/OTHER2/(file)    412.095  
#9  /home/laytonjb/Videos/(file)    411.090  
#10  /home/laytonjb/Videos/(file)    411.089  



Some file names have been changed so that you are not subjected to my musical tastes.

After ctime-mtime, the next set of results are for atime age. The fundamental statistics for atime age are:

  • Average atime age in days: 344.213 days
  • Oldest atime age file in days: 676.095 days
  • Youngest atime age file in days: 1.665 days
  • Standard deviation atime age in days: 151.4849 days

The atime file age interval counts look like this:


[   0-   1 days]:    66  (  0.12%)  (  0.12% cumulative)
[   1-   2 days]:     0  (  0.00%)  (  0.12% cumulative)
[   2-   4 days]:   264  (  0.49%)  (  0.61% cumulative)
[   4-   7 days]:   147  (  0.27%)  (  0.88% cumulative)
[   7-  14 days]:     0  (  0.00%)  (  0.88% cumulative)
[  14-  28 days]:     1  (  0.00%)  (  0.88% cumulative)
[  28-  56 days]:  2938  (  5.44%)  (  6.32% cumulative)
[  56- 112 days]:    57  (  0.11%)  (  6.43% cumulative)
[ 112- 168 days]:  1195  (  2.21%)  (  8.64% cumulative)
[ 168- 252 days]: 17521  ( 32.42%)  ( 41.06% cumulative)
[ 252- 365 days]:   260  (  0.48%)  ( 41.54% cumulative)
[ 365- 504 days]: 31585  ( 58.45%)  (100.00% cumulative)
[ 504- 730 days]:     2  (  0.00%)  (100.00% cumulative)
[ 730-1095 days]:     0  (  0.00%)  (100.00% cumulative)
[1095-1460 days]:     0  (  0.00%)  (100.00% cumulative)
[1460-1825 days]:     0  (  0.00%)  (100.00% cumulative)
[1825-2190 days]:     0  (  0.00%)  (100.00% cumulative)
[2190-2920 days]:     0  (  0.00%)  (100.00% cumulative)
[2920-3650 days]:     0  (  0.00%)  (100.00% cumulative)
[3650-4380 days]:     0  (  0.00%)  (100.00% cumulative)
[4380-5110 days]:     0  (  0.00%)  (100.00% cumulative)
[5110-5840 days]:     0  (  0.00%)  (100.00% cumulative)

Finally, the Top 10 oldest files based on atime age are listed below:



Table 4 - Top 10 oldest files based on atime

Rank  File  Atime Age (days)
#1  /home/laytonjb/.libreoffice/3/user/extensions/bundled/registry/com.sun.star.comp.deployment.bundle.PackageRegistryBackend/backenddb.xml    676.094  
#2  /home/laytonjb/.libreoffice/3/user/extensions/bundled/registry/com.sun.star.comp.deployment.component.PackageRegistryBackend/backenddb.xml    676.094  
#3  /home/laytonjb/.wine/run_configure-wine_if_no_directories_present    468.926  
#4  /home/laytonjb/.config/chromium/Default/Extensions/bjbifppfjimgbjjanagdkdgdpnlcfpcp/1.0_0/script.js    468.926  
#5  /home/laytonjb/.config/chromium/Default/Extensions/bjbifppfjimgbjjanagdkdgdpnlcfpcp/1.0_0/manifest.json    468.926  
#6  /home/laytonjb/.config/chromium/Default/databases/Databases.db    468.926  
#7  /home/laytonjb/.config/chromium/Default/User StyleSheets/Custom.css    468.926  
#8  /home/laytonjb/.config/chromium/Default/Web Data    468.926  
#9  /home/laytonjb/.config/chromium/Default/Current Session    468.926  
#10  /home/laytonjb/.config/chromium/Default/History    468.926  



After the data ages, the next section of results summarizes the file sizes. The overall statistics for the file sizes are:

  • Average file size in KB: 1,223.000 KB
  • Largest file in KB: 1,964,525.000 KB
  • Smallest file size in KB: 0.000 KB
  • Standard deviation file size in KB: 28,756.4219 KB

The average file size is about 1.2 MB with the smallest file being virtually zero. The largest file is 1.964 GB. The file size intervals are listed below:


[      0-      1 KB]: 17343  ( 32.10%)  ( 32.10% cumulative)
[      1-      2 KB]:  8924  ( 16.51%)  ( 48.61% cumulative)
[      2-      4 KB]:  2635  (  4.88%)  ( 53.49% cumulative)
[      4-      8 KB]:  8574  ( 15.87%)  ( 69.35% cumulative)
[      8-     16 KB]:  2261  (  4.18%)  ( 73.54% cumulative)
[     16-     32 KB]:  4688  (  8.68%)  ( 82.21% cumulative)
[     32-     64 KB]:  1929  (  3.57%)  ( 85.78% cumulative)
[     64-    128 KB]:  1758  (  3.25%)  ( 89.04% cumulative)
[    128-    256 KB]:  1186  (  2.19%)  ( 91.23% cumulative)
[    256-    512 KB]:  1000  (  1.85%)  ( 93.08% cumulative)
[    512-   1024 KB]:  1021  (  1.89%)  ( 94.97% cumulative)
[   1024-   2048 KB]:   544  (  1.01%)  ( 95.98% cumulative)
[   2048-   4096 KB]:   659  (  1.22%)  ( 97.20% cumulative)
[   4096-   8192 KB]:   585  (  1.08%)  ( 98.28% cumulative)
[   8192-  16384 KB]:   521  (  0.96%)  ( 99.24% cumulative)
[  16384-  32768 KB]:   207  (  0.38%)  ( 99.63% cumulative)
[  32768-  65536 KB]:   101  (  0.19%)  ( 99.81% cumulative)
[  65536- 131072 KB]:    47  (  0.09%)  ( 99.90% cumulative)
[ 131072- 262144 KB]:    15  (  0.03%)  ( 99.93% cumulative)
[ 262144- 524288 KB]:    10  (  0.02%)  ( 99.95% cumulative)
[ 524288-1048576 KB]:    28  (  0.05%)  (100.00% cumulative)

Notice that about a third of the files are 1 KB or smaller and nearly half are 2 KB or smaller, but about 5 percent of the files are larger than 1 MB.

The 10 largest files are listed below in Table 5.

Table 5 - Top 10 largest files

Rank  File  Size (KB)
#1  /home/laytonjb/CLUSTERBUFFER/STRACENG/EXAMPLES/eclipse/strace.out8044    1,964,525  
#2  /home/laytonjb/CLUSTERBUFFER/STRACENG/EXAMPLES/eclipse/strace.out8042    1,930,658  
#3  /home/laytonjb/CLUSTERBUFFER/STRACENG/EXAMPLES/lsdyna-strace.tar.gz    1,645,833  
#4  /home/laytonjb/CLUSTERBUFFER/STRACE_PY/EXAMPLES/cesm-strace/strace.janus018.tar.bz2    1,613,158  
#5  /home/laytonjb/CLUSTERBUFFER/STRACE_PY/EXAMPLES/cesm-strace/strace.janus023.tar.bz2    1,593,503  
#6  /home/laytonjb/CLUSTERBUFFER/STRACE_PY/EXAMPLES/cesm-strace/strace.janus020.tar.bz2    1,574,873  
#7  /home/laytonjb/CLUSTERBUFFER/STRACE_PY/EXAMPLES/cesm-strace/strace.janus021.tar.bz2    1,567,006  
#8  /home/laytonjb/CLUSTERBUFFER/STRACE_PY/EXAMPLES/cesm-strace/strace.janus024.tar.bz2    1,457,958  
#9  /home/laytonjb/CLUSTERBUFFER/STRACE_PY/EXAMPLES/cesm-strace/strace.janus017.tar.bz2    1,453,175  
#10  /home/laytonjb/CLUSTERBUFFER/STRACE_PY/EXAMPLES/cesm-strace/strace.janus019.tar.bz2    1,387,286  



There are three sections after file sizes. The first one lists the Top 10 biggest users (most file space) and the second one lists the Top 10 biggest group users (most file space). The third section, which is enabled with the "-dup" option, checks for duplicate files. It compares the file name and the size of the files. If both match, the file is considered a duplicate even though it may not actually be one (you would need to compare checksums to be sure). Since this example is only a single user, I will skip these sections.
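The name-and-size comparison, followed by a checksum to confirm, might look like the sketch below. The checksum step is an addition for illustration and is not something mdpostp is described as doing; the function names and the record keys "path" and "size" are likewise assumptions.

```python
import hashlib
import os
from collections import defaultdict


def duplicate_candidates(records):
    """Group paths by (file name, size in bytes); keep groups with 2+ members."""
    groups = defaultdict(list)
    for rec in records:
        key = (os.path.basename(rec["path"]), rec["size"])
        groups[key].append(rec["path"])
    # The first path in each group is the "root file".
    return [paths for paths in groups.values() if len(paths) > 1]


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def confirmed_duplicates(paths):
    """Within one candidate group, keep only files whose contents match."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[file_digest(path)].append(path)
    return [same for same in by_digest.values() if len(same) > 1]
```

Checksumming reads every candidate file in full, so it makes sense to run it only on the groups that the cheap name-and-size test has already flagged.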

Interpretation of Results

After first examining the results I like to compare the ages to better understand what's happening in the file system.

Recall that ctime, change time, reports the last time a change was made to the file. This includes the data and/or the metadata. But the mtime of a file is only changed when the data itself is changed (not the metadata). I like to use both times when examining the ages of the files.

In general, I look at the mtime data first because it tells me about the data itself. I use it as the primary measure of the "age" of the data as well as the trends in the data. The mtime data can be used for answering questions such as the average age of the data, how quickly the data is growing, etc.

For the example, the vast majority of the files have their mtime between 365 and 504 days (1 to 1.5 years). But there are a few files, 2,854, that have mtimes between 28 and 56 days. And finally there are some files that are between 2,920 and 3,650 days old (8 to 10 years). One has to wonder if a number of the files couldn't be archived or maybe moved to slower storage.

After examining the mtime age data, I like to examine the ctime age data. Moreover, comparing the difference between the ctime age and mtime age, we can get an idea of how often and how quickly the file metadata changes. This information is important because it indicates the relative age of file metadata changes. If you measure this information over time, it can help you see the pace of metadata changes.

For the example, the vast majority of the files have a ctime age of 365 to 504 days (1 to 1.5 years). Additionally, if we look at the ctime-mtime ages, we see that almost all files have not changed much because the difference is 0 to 1 day (only 1,339 of the 54,036 files have a ctime that differs from the mtime). This tells me that there is little changing of the metadata of files (i.e., permissions, etc.). Consequently, I don't think there is much worry that this user is thrashing metadata.

The atime age information can tell us about file access age. For example, if a file has a very old atime age, then the file hasn’t been accessed for some time. Maybe files like this are good candidates for archiving to reduce the capacity used. To avoid having their files archived, users may try to use the "touch" command to update the time stamps on the file. By default, touch sets both the atime and the mtime to the current time, and the ctime is updated as a side effect, so a touched file looks freshly accessed and freshly modified even though its data has not changed.
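The effect of touch on the timestamps can be checked with a short experiment. The sketch below uses os.utime with no explicit times, which is roughly what a plain touch does, on a scratch file that has first been backdated. The function name and the 400-day figure are arbitrary choices for illustration.

```python
import os
import tempfile
import time


def touch_experiment():
    """Backdate a scratch file, "touch" it, and report which timestamps moved."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scratch.txt")
        with open(path, "w") as handle:
            handle.write("data\n")
        old = time.time() - 400 * 86400  # pretend the file is 400 days old
        os.utime(path, (old, old))       # set atime and mtime explicitly
        before = os.stat(path)
        os.utime(path)                   # no times given: both become "now"
        after = os.stat(path)
        # ctime cannot be set by os.utime; the kernel updates it on each call,
        # so it is not compared here.
        return {
            "atime_changed": after.st_atime > before.st_atime,
            "mtime_changed": after.st_mtime > before.st_mtime,
        }
```

Both entries come back True, which is why atime and mtime ages alone cannot distinguish a touched file from one that was really used.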

For the example scan, about one third of the files have an atime age in the 168-252 day range (6-9 months) and about 58 percent of the files are between 365 and 504 days (1 to 1.5 years). It appears as though a reasonable number of files are being accessed but over half are likely not being accessed, or at least not having their atime changed. I think this is most likely expected behavior in a user's home directory.

The file size distribution for the scan is also very interesting. The average file size is 1.22 MB, but the files range in size from virtually 0 KB to 1.964 GB. The standard deviation of the file size is about 28.8 MB, which is more than 23 times the average, indicating large variation in file size. This is also supported by examining the distribution of file sizes. About 32 percent of the files are 1 KB or smaller, but the distribution appears to have a reasonable "tail" into the MB range.

This data and its interpretation are just a snapshot in time of the state of the files (metadata of the metadata). There is some interesting information in the analysis, indicating that there are a large number of files that are between 1 and 1.5 years old that might be good candidates for archiving, if possible. Plus, there are a large number of smaller files on the file system but a few fairly large ones too (up to 1.9 GB). Smaller file access tends to really beat on metadata performance of a file system, but in this particular case the vast majority of files have not been accessed in over a year, so the metadata performance is perhaps not a key issue at this time.

Summary

Paying attention to what's happening with the files on your storage system is a key task of administrators. Monitoring and understanding how and when the files on the storage are created, used, modified and removed by all of the users helps you understand how the storage is being used.

From this, one can get an idea of trends over time. How much capacity is being used? Who are the big users of the storage? Is there much old data on the storage? And ultimately, when and how much storage will I need in the future? The overall theme is really metadata about metadata.

In this article I've tried to illustrate that it's possible to write simple tools to gather information about the state of the file system. But, more importantly, I hope I've illustrated what kind of information is useful to gather and how you can use that information to begin to understand what is happening on the file system.
