A Trip Down the Data Path: RAID and Data Layout
This month we are going to put everything together for the entire data path. If you haven't read the earlier articles in the series, it might be best to review them to get a better understanding of the data path and the equipment/software involved. The data path is the path I/O takes across the:
Volume Manager and File System
The hypothesis I originally made when this series started was that if you want to tune I/O performance, you first have to understand the entire data path, so this month we are going to cover file system layout with respect to the applications and RAID hardware used (the data path).
Where to Start
You might think the place to start is building RAID volumes, but as you will see, that actually is the worst place to start. The first thing you need to do is understand how the application(s) will be performing I/O and how that will affect the file system and/or volume manager. Factors that need to be accounted for include:
- The number of files within the file system
- The size of the files within the file system
- Request size(s) to the files
- Ratio of read to write operations
- Access patterns of files that will be used the most (for databases, these are often the index files)
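The factors above can be gathered into a small workload profile before any LUNs are built. The following is a minimal sketch, not a measurement tool: the trace format (a list of operation/size pairs) and the name `summarize_io` are assumptions made for illustration, and the sample trace is invented.

```python
from collections import Counter

def summarize_io(trace):
    """Summarize a trace given as (op, size_bytes) tuples, op being 'R' or 'W'."""
    reads = sum(1 for op, _ in trace if op == "R")
    writes = sum(1 for op, _ in trace if op == "W")
    sizes = Counter(size for _, size in trace)
    return {
        "reads": reads,
        "writes": writes,
        # Fraction of requests that are reads (0.0 for an empty trace)
        "read_fraction": reads / len(trace) if trace else 0.0,
        # The request size seen most often, or None for an empty trace
        "most_common_size": sizes.most_common(1)[0][0] if sizes else None,
    }

# Hypothetical trace: three reads and one write
sample_trace = [("R", 8192), ("R", 8192), ("W", 8192), ("R", 65536)]
profile = summarize_io(sample_trace)
```

In practice the trace would come from whatever I/O tracing facility the operating system or database provides; the point is only that request sizes and the read/write ratio should be known before the storage is laid out.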
Before we get too far along, you might want to review the first Storage and I/O article, which discusses application I/O.
Let's say you have a database application and a file system that stripes all of the data (please see "Choosing a File System or Volume Manager" for a discussion of striped file systems and round-robin file systems). Some of the things you will need to consider are:
- How often are indexes created and/or rebuilt
- The size and number of the index file(s)
- The amount of data in indexes versus data within the database
Keeping these things in mind, you might want to consider having different file systems for different types of data. It is often beneficial to have several file systems. For example, you might have:
- One file system for the index files
- One file system for the data within the database
- One file system for the installation of the database (I suggest this as the database application itself usually has many small files, and creating a file system with small allocations reduces the size needed for the installation and allows the other file system to be used with a larger allocation)
- One or more file systems or raw devices for database logs (these are often not under file system control, as raw devices are used to ensure the data is not cached in memory in case the system crashes)
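The point about small allocations for the installation file system can be illustrated with a rough space calculation. This is a sketch under a simplifying assumption: every file occupies a whole number of allocation units and metadata overhead is ignored. The file sizes and the function name `allocated_bytes` are hypothetical.

```python
def allocated_bytes(file_sizes, allocation_size):
    """Space consumed when each file is rounded up to whole allocation units."""
    total = 0
    for size in file_sizes:
        units = -(-size // allocation_size)  # integer ceiling division
        total += units * allocation_size
    return total

# Hypothetical set of small installation files (sizes in bytes)
small_files = [1000, 5000, 12000]

with_4k_units = allocated_bytes(small_files, 4096)         # 24576 bytes
with_1m_units = allocated_bytes(small_files, 1024 * 1024)  # 3145728 bytes
```

With these made-up sizes, the same 18,000 bytes of file data consume 24 KB with 4 KB allocations and 3 MB with 1 MB allocations, which is why many small files are better kept away from a file system tuned for large allocations.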
If performance is an issue, then having a number of file systems and tuning each file system and LUN within the RAID will provide the best performance for each type of file and I/O associated with it. Let's step through the "whys" for each file system type using the database example.
Page 2: Index Files
The first thing to note about index files is that they are very often much smaller than the database files themselves. Index files in many databases are two Gigabytes in size, so if the file system supports large allocations and all you have within the file system are index files, you could set the allocation size equal to two Gigabytes and have one allocation per file. Most file systems do not support allocations that large, but you can make them as large as possible. On the other hand, index files, though cached in memory, are often searched with small random I/O requests. Small random I/O requests perform much better on RAID-1 than on RAID-5 with 8+1 or even 4+1.
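One commonly cited reason small random I/O favors RAID-1 is the small-write penalty. The sketch below uses the textbook model rather than anything stated in this article: a small write costs two back-end I/Os on RAID-1 (one per mirror) and four on RAID-5 (read old data, read old parity, write new data, write new parity). Cache effects are ignored, and the request counts are invented.

```python
# Back-end disk I/Os generated per small host write (textbook model, no cache)
WRITE_PENALTY = {"raid1": 2, "raid5": 4}

def backend_ios(host_reads, host_writes, raid_level):
    """Estimate back-end I/Os for a mix of small random host requests."""
    return host_reads + host_writes * WRITE_PENALTY[raid_level]

# Hypothetical mix: 700 small reads and 300 small writes
raid1_ios = backend_ios(700, 300, "raid1")  # 1300
raid5_ios = backend_ios(700, 300, "raid5")  # 1900
```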
If you can assign a RAID cache to a LUN and the index files are able to fit within that cache, you can significantly reduce the latency to complete the searches. This is often done with enterprise RAID systems, as they support very large caches. How many LUNs do you need? How should they be laid out within the RAID, and how do you tune the volume manager and/or file system? Consider the following example.
Let's say you need 200 GB of index file space and you have 72 GB disk drives in your RAID. Holding 200 GB takes three drives (200/72 = ~2.8, rounded up to 3), and since it is RAID-1, every drive is mirrored, so you need six drives in total. For most RAIDs you have two choices in laying out the LUNs across those six drives. You can either:
- Create a LUN with 6 drives and let the RAID controller manage the striping across the 6 devices
- Create 3 RAID-1 LUNs and let the volume manager or file system manage the 3 LUNs
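The drive count in this example is simple arithmetic, sketched here so it can be repeated with other capacities. The function name `raid1_drive_count` is made up for illustration, and the calculation ignores formatting overhead and hot spares.

```python
import math

def raid1_drive_count(capacity_gb, drive_gb):
    """Return (data drives, total drives) for a mirrored layout."""
    data_drives = math.ceil(capacity_gb / drive_gb)  # round up to whole drives
    return data_drives, data_drives * 2              # each data drive has a mirror

data_drives, total_drives = raid1_drive_count(200, 72)
# data_drives is 3 and total_drives is 6, giving 3 * 72 = 216 GB of usable space
```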
Things are starting to get complex now, as you also have to determine the internal allocation or RAID block size for each of the LUN(s) that will be created. Let's start with the internal allocation, sometimes called segment size and/or element size. If your database indexes are actually small block random, these I/Os are often 8 KB requests, so you want to make the internal block size match that number as closely as possible. This ensures that the RAID is not reading data it does not use, as the RAID reads and writes on these internal block sizes. Additionally, if you turn off the RAID controller readahead cache, assuming that the I/O is truly random, performance will improve. The problem is that I/O is often somewhat sequential, or sequential with a skip increment, so having a readahead cache which encompasses the skip increment will improve performance.
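The cost of a mismatched segment size can be estimated with a short calculation. This sketch assumes, as the paragraph above describes, that the controller always transfers whole segments; real controllers vary, and the 64 KB segment size is only a hypothetical comparison point.

```python
def bytes_moved(offset, length, segment_size):
    """Bytes the RAID transfers if it always moves whole segments (assumption)."""
    first_segment = offset // segment_size
    last_segment = (offset + length - 1) // segment_size
    return (last_segment - first_segment + 1) * segment_size

KB = 1024
matched = bytes_moved(0, 8 * KB, 8 * KB)          # 8 KB request, 8 KB segment
oversized = bytes_moved(0, 8 * KB, 64 * KB)       # 8 KB request, 64 KB segment
straddling = bytes_moved(60 * KB, 8 * KB, 64 * KB)  # request crosses a segment boundary
```

Under this model the matched case moves 8 KB, the oversized segment moves 64 KB for the same request, and a request that straddles a segment boundary moves 128 KB.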
OK, we know what the RAID settings should be, but we still need to build the LUNs. I generally suggest doing things in hardware rather than software. So for most RAIDs, I would let the RAID controller stripe the data, but there is a very important gotcha. The data is now striped across 3 disks (6 total for the mirror), so sequential I/O to a single index file is not physically sequential on any one disk: the I/O goes to the first, second, and third disks in turn before it comes back to the first one.
How much this matters depends on the stripe size within the volume. Striping is an advantage if the index files are searched randomly and nothing is sequential, as it statistically spreads the I/O across the three disks.
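The way controller striping breaks up sequential access can be seen by mapping logical offsets to disks. This is a simplified sketch of plain striping across three data disks (the mirrors are ignored); the 64 KB stripe unit and the name `disk_for_offset` are hypothetical.

```python
def disk_for_offset(offset, stripe_unit, n_disks):
    """Index of the data disk that holds a given logical byte offset."""
    return (offset // stripe_unit) % n_disks

STRIPE_UNIT = 64 * 1024  # hypothetical stripe unit
N_DISKS = 3              # three data disks, each with a mirror

# Six consecutive stripe-unit-sized requests to one file
sequence = [disk_for_offset(i * STRIPE_UNIT, STRIPE_UNIT, N_DISKS) for i in range(6)]
# sequence is [0, 1, 2, 0, 1, 2]
```

A logically sequential stream visits each disk in turn, so each disk sees only every third stripe unit of the file.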
On the other hand, you could have three LUNs and use a volume manager or file system to either stripe or round-robin the access to the data. One trick that I have used for volume managers that only stripe data is to set the stripe size in the volume manager equal to the database file size (2 Gigabytes in many cases). You can effectively round-robin the index file access, as each index file will be allocated on a separate device. This works only because the index files are of a fixed size, which is often the case for databases. Volume managers and file systems that support round-robin access will also work. The advantage here is that if the indexes are searched sequentially and the files are allocated sequentially, you can match your readahead cache and cache usage to the index file usage, significantly improving the performance.
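The stripe-size trick can be checked with the same kind of mapping. The sketch below assumes fixed-size 2 GB files allocated back to back from the start of the volume, which is an idealization; `luns_touched` is a made-up helper name.

```python
GIB = 1024 ** 3

def luns_touched(file_index, file_size, stripe_size, n_luns):
    """LUNs that hold any part of the file, assuming back-to-back allocation."""
    start = file_index * file_size
    end = start + file_size - 1
    first_stripe = start // stripe_size
    last_stripe = end // stripe_size
    return sorted({s % n_luns for s in range(first_stripe, last_stripe + 1)})

# Stripe size equal to the file size: each file lands on exactly one LUN
round_robin = [luns_touched(i, 2 * GIB, 2 * GIB, 3) for i in range(4)]
# round_robin is [[0], [1], [2], [0]]

# A small (hypothetical 64 KB) stripe size spreads one file over every LUN
striped = luns_touched(0, 2 * GIB, 64 * 1024, 3)
# striped is [0, 1, 2]
```

If the files were not all the same size, or not aligned to the stripe size, a file would spill onto a second LUN and the round-robin effect would be lost.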
I have seen cases where this process eliminated over 80% of the I/O from cache to disk, which translates into better response time and lets the system support more users.
Page 3: Data Files
Going through the same set of steps is necessary for the actual data itself, but there are often numerous differences between how the data is accessed and how the index files are accessed. The process, however, is the same. Here are the steps:
- Determine how to lay out the LUNs based on:
  - The file system and volume manager that you plan to use. Think about allocation sizes, round-robin, and striping
  - The size of the files. Think about how the files will be allocated on the physical disk devices and how the RAID cache will be used
- Determine the I/O request size that will be used to read and write the data
  - If it is large, then RAID-5 might be a good choice, as with RAID-1 you:
    - Have to write out more data using more cache to disk bandwidth than with RAID-5
    - Use more disks for the same amount of data space
- What amount of new data will be created (read/write ratio for setup of the cache)
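The RAID-1 versus RAID-5 trade-off for large requests in the steps above can be put into numbers. This sketch assumes full-stripe writes (so RAID-5 needs no read-modify-write) and a single parity group; the function names are hypothetical.

```python
def raid1_cost(data_disks):
    """Disks needed and bytes written to disk per byte of host data on RAID-1."""
    return {"total_disks": 2 * data_disks, "write_factor": 2.0}

def raid5_cost(data_disks):
    """Same figures for one RAID-5 group of data_disks + 1 parity (full-stripe writes)."""
    return {
        "total_disks": data_disks + 1,
        "write_factor": (data_disks + 1) / data_disks,
    }

mirror = raid1_cost(8)   # 16 disks, 2.0 bytes written per host byte
parity = raid5_cost(8)   # 9 disks, 1.125 bytes written per host byte (8+1)
```

For eight drives' worth of data, mirroring uses 16 disks and writes every byte twice, while an 8+1 RAID-5 group uses 9 disks and writes 1.125 bytes per host byte, provided the writes really are large enough to cover full stripes.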
If you understand the above information, you will be able to create the LUNs and set up the file system and/or volume manager based on the LUN creation. Of course, you will still have to tune the database internals, but in terms of I/O on the storage, it will be as efficient as it can possibly be. I define efficiency as the highest possible cache utilization and the lowest possible data latency.
This process can be used for any other application type and is not restricted to databases.
When I started this column late last year, I stated that the key to I/O performance is understanding the data path from end-to-end. Through the series of articles culminating with this column, I believe that we have completely covered this end-to-end understanding. RAID controllers have no knowledge of how the data will be accessed or how the files are mapped to the physical devices, yet they have built-in algorithms to cache the data, improve the I/O latency, and reduce the amount of I/O from disk to cache.
It is up to the architects, storage team, and/or administrators with knowledge of the volume manager and file system to assist the RAID controller with its caching algorithms. The RAID, the volume manager, and the file system do not play well together, as they have no real communication with one another. Maybe that will change in the future, but not for some time.
Now that we have completed the data path, next month we will start reviewing the "hows and whys" of benchmarking. If anyone has any suggestions, please let me know.
See All Articles by Columnist Henry Newman