Henry Newman's second article in a monthly column on storage hardware, software, and applications I/O focuses on the operating system and the associated system calls for I/O.
In my last article we covered the I/O path for C library I/O and how applications use the C library data path to reach the operating system. This month we cover the operating system and its associated system calls for I/O.
Over my nearly 23 years in the computing industry, some of the biggest mistakes I've seen occurred when architects, system administrators, and vendors made critical purchasing, optimization, and design decisions without fully understanding how the whole system worked together. I strongly believe that you cannot purchase, optimize, and design storage systems (hardware and software) without fully understanding the file sizes, number of files, access patterns, and, most importantly, the I/O request sizes that the hardware will see from the operating system and file system. So this month we will look at operating system calls and how they pass data to and from the file system.
I/O Path (Using POSIX System Calls)
Last month we discussed the I/O data path using the C library package. The next step in the data path is the operating system and POSIX system calls for performing I/O. We'll also take a look at some of the implications of various operating systems and, in some cases, the file system implementation. Almost all operating systems today are POSIX compliant.
We will also discuss the data path for system calls and the C library I/O when requests do not begin and end on 512-byte boundaries. The path for system calls and the path from the C library to the system is the same.
A large number of variables impact both the data path and the I/O and system performance when using system calls, but first we need to review some of the important system calls for I/O and how they are used. There are two types of I/O supported on most POSIX-compliant systems. They are:
- Synchronous I/O -- Each I/O request waits for the completion of the last request to that file descriptor before the next request is allowed to execute. With synchronous I/O, you will wait for the I/O request to complete before execution of the next instruction in the application.
- Asynchronous I/O (AIO) -- I/O requests are sent to the system and then the synchronization is requested by the application. With asynchronous I/O, the requests are issued to the operating system and the next instruction is immediately executed.
It should be noted that some applications that use POSIX threads implement asynchronous I/O via threads, but the I/O within each thread is synchronous.
System Call for Synchronous I/O
The following is a list of common system calls and their meaning:
open -- Opens a file descriptor. Important options for some systems include:
- Large files over 2GB
- Synchronous I/O for data integrity
- Direct I/O -- I/O which moves directly from user space to the device without using any system caching
lseek -- Sets the file descriptor to the byte position specified.
- Some systems require the use of lseek64 for files larger than 2GB
read -- Reads data from a file descriptor into the user data area (buffer)
- Allows the application to check for errors
pread -- Reads data from a file descriptor into the user data area (buffer) from a specific location in the file. This is the equivalent of an lseek followed by a read, but it does not move the file pointer to that position as the lseek system call does.
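The difference between lseek plus read and pread can be sketched in a few lines of C. This is only an illustration under the assumption of a POSIX system; the helper name `compare_read_and_pread` is hypothetical and not part of any library.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read `len` bytes at `offset` twice, once with
 * lseek + read and once with pread, and report where the file pointer
 * sits after each. Returns 0 on success, -1 on any error. */
int compare_read_and_pread(const char *path, off_t offset, size_t len,
                           char *buf_a, char *buf_b,
                           off_t *pos_after_read, off_t *pos_after_pread)
{
    assert(buf_a != NULL && buf_b != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* lseek moves the file pointer; read then advances it by `len`. */
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1 ||
        read(fd, buf_a, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    *pos_after_read = lseek(fd, 0, SEEK_CUR);

    /* Rewind to 0, then pread from the same offset: the pointer stays at 0. */
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1 ||
        pread(fd, buf_b, len, offset) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    *pos_after_pread = lseek(fd, 0, SEEK_CUR);

    return close(fd);
}
```

Both buffers end up with the same data, but only the lseek plus read path leaves the file pointer at `offset + len`. Every call returns a value the application can check for errors, which is what the list above means by error checking.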
System Call for Asynchronous I/O
POSIX standards for real-time systems document the required support for asynchronous I/O (POSIX 4). The concept of asynchronous I/O was originally developed by the benchmarking group at Control Data Corporation back in the late 1960s. They needed a way to read data from FORTRAN programs (punch cards) while still being able to execute the programs because the disks were far slower than the CPU (not much has changed in 35 years).
The application was based on reading data at the beginning of a time period. By using asynchronous I/O, they were able to have the data available when they were ready to use it in the application. What was implemented was a FORTRAN I/O statement called BUFFER IN/OUT and the associated operating system changes. But enough with the history; today, most operating systems support POSIX asynchronous system calls within the operating system.
aio_read -- Asynchronous read request
- Allows the application to check for errors
aio_write -- Asynchronous write request
- Allows the application to check for errors
lio_listio -- A special call that allows you to issue a list of reads or writes with a single system call. This is very useful when reading or writing a number of records in a file at the same time. I believe a good friend of mine from Cray Research, Larry Schermer, originally created the list I/O call in the 1980s during Cray's transition to UNIX.
- Allows the application to check for errors
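The issue-then-synchronize pattern can be sketched with the POSIX `aio_read`, `aio_suspend`, `aio_error`, and `aio_return` calls. This is a minimal sketch, not a definitive implementation: the helper names `start_async_read`, `finish_async_read`, and `async_read_file` are hypothetical, and on some systems (older glibc, for example) the program may need to be linked with `-lrt`.

```c
#define _XOPEN_SOURCE 700
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: queue an asynchronous read and return at once. */
int start_async_read(struct aiocb *cb, int fd, void *buf, size_t len, off_t offset)
{
    assert(cb != NULL);
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = offset;
    cb->aio_sigevent.sigev_notify = SIGEV_NONE;  /* no signal on completion */
    return aio_read(cb);
}

/* Hypothetical helper: wait for the request, then collect its status.
 * Returns the number of bytes read, or -1 with errno set. */
ssize_t finish_async_read(struct aiocb *cb)
{
    const struct aiocb *const list[1] = { cb };

    while (aio_error(cb) == EINPROGRESS) {
        if (aio_suspend(list, 1, NULL) != 0 && errno != EINTR)
            return -1;
    }
    int err = aio_error(cb);      /* this is where errors are checked */
    ssize_t n = aio_return(cb);   /* must be called exactly once */
    if (err != 0) {
        errno = err;
        return -1;
    }
    return n;
}

/* Hypothetical helper tying the two steps together for one request. */
ssize_t async_read_file(const char *path, void *buf, size_t len, off_t offset)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct aiocb cb;
    ssize_t n = -1;
    if (start_async_read(&cb, fd, buf, len, offset) == 0) {
        /* The application is free to compute here while the read proceeds. */
        n = finish_async_read(&cb);
    }
    close(fd);
    return n;
}
```

In a real application the work between the start and finish calls is what hides the I/O latency; `lio_listio` follows the same pattern but submits an array of such control blocks in one call.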
How Does It All Work?
The most important thing to remember is the basic rule for storage hardware: currently, all physical requests must start on 512-byte block boundaries and end on 512-byte block boundaries. The picture below illustrates this:
The first request begins on a 512-byte boundary and ends on a 512-byte boundary, while in the second example the I/O request does not begin and end on 512-byte boundaries. So what happens in the system when you do not make I/O requests on 512-byte boundaries? In this case, the system must convert the requests to read and write on 512-byte boundaries for you, as requests can only be made on physical hardware boundaries. The overhead of this conversion is extremely high.
What the System Does
There are a large number of N-cases that I will try to cover, but let's start with the simplest case and work to the most complex. Much of the information here will be covered in more detail in future columns as it relates to file system implementations and issues with direct I/O. So here is what happens in a few cases:
Reads: As the data does not begin and end on 512-byte boundaries, the system reads the data into a system buffer cache and transfers the data that the user asked for. More data is read in than is required. For requests that use C library I/O (fread(3)), the library manages requests on 512-byte boundaries as long as the buffer size is a multiple of 512 bytes plus the 8 bytes needed for the pointers into the buffer.
Writes: As the data does not begin and end on 512-byte boundaries, on many if not most implementations the operating system and file system must first read the data from the disk/RAID device into a system buffer cache. The data read in is the size of the request rounded out to 512-byte boundaries. So, for example, if you started writing at byte 5 and wrote through byte 131072 (128 KB), the system would read into cache from byte 0 to byte 131584 (131072 + 512 bytes), as that is the next multiple of 512 bytes. The system then copies the user's data into the buffer and writes the request to the disk/RAID device.
This is extremely inefficient and is often called read-modify-write: the data is read in, the record is modified in memory, and the result is written out. The system overhead, the amount of data transferred, and the I/O wait time are far greater than if you made requests on 512-byte boundaries. For requests that use C library I/O (fwrite(3)), the library manages requests on 512-byte boundaries as long as the buffer size is a multiple of 512 bytes plus the 8 bytes needed for the pointers into the buffer.
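The rounding the system performs can be worked through with a little arithmetic. The sketch below assumes the 512-byte physical block size used throughout this article and reads the example as a write covering bytes 5 through 131072 inclusive (131068 bytes); the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u  /* physical block size assumed in this article */

/* Byte range [start, end) that the hardware actually transfers. */
struct io_range {
    uint64_t start;
    uint64_t end;   /* exclusive */
};

uint64_t round_down_to_sector(uint64_t off)
{
    return off - off % SECTOR_SIZE;
}

uint64_t round_up_to_sector(uint64_t off)
{
    return (off + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
}

/* Nonzero when the request begins and ends on 512-byte boundaries. */
int is_sector_aligned(uint64_t offset, uint64_t length)
{
    return offset % SECTOR_SIZE == 0 && (offset + length) % SECTOR_SIZE == 0;
}

/* Physical range the system must read (and, for writes, write back). */
struct io_range physical_range(uint64_t offset, uint64_t length)
{
    assert(length > 0);
    struct io_range r;
    r.start = round_down_to_sector(offset);
    r.end = round_up_to_sector(offset + length);
    return r;
}

/* Bytes transferred beyond what the application asked for. */
uint64_t extra_bytes(uint64_t offset, uint64_t length)
{
    struct io_range r = physical_range(offset, length);
    return (r.end - r.start) - length;
}
```

For the example in the text, `physical_range(5, 131068)` gives a start of 0 and an end of 131584. The extra 516 bytes look small, but for a write they force a read of the whole range before the write can be issued, which is where the real cost lies.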
Applications Programming Issues
Making requests that are not on 512-byte boundaries can cause serious performance problems for the system. There are a few very important "dos and don'ts" to ensure system performance when using system calls and/or the C library package. Every major database uses system calls for I/O, and each has internal caching mechanisms that bypass the buffer cache, so there's no need to worry about databases. The following table summarizes I/O types and the tradeoffs between these methods:
| I/O request and/or structure | System Calls | C Library Package |
|---|---|---|
| Sequential on 512-byte boundaries | If you can make the request size large (>256 KB) this is the best method | Only use if you cannot use system calls. You should use setvbuf(3) function to make user buffer size large so that requests are larger (>512 KB) than the record |
| Sequential not on 512-byte boundaries | Do not use unless you can modify code to pad data or change the requests to be on 512-byte boundaries | Shines over system calls if you can make buffer size at least 2x greater than request size rounded to the next 512-byte boundaries. Even for 1x the buffer size |
| Random I/O on 512-byte boundaries | System calls are the fastest, as a single request is done in one I/O | Never make buffer size larger than request size, as you will be making I/O requests in the buffer size reading data you will not use. If you need to use the type of I/O, ensure that the buffer is exactly the size of the request |
| Random I/O not on 512-byte boundaries | Do not use unless you can modify code to pad data or requests to be on 512-byte boundaries | Shines over system calls if the buffer size is exactly the size of the I/O request rounded to the next multiple of 512-bytes. Do not make the buffer size greater than the request size rounded up |
In general, changing the request size for a program that does sequential I/O using the C library package is far simpler than changing programs that make direct system calls. All the user must do is add a single line after the open that calls the setvbuf(3) library function. For programs that make system calls, this often requires major rewrites to the program and data restructuring to make larger requests. Often this has implications for the program's computational structure.
This is typically a much more difficult problem, though on rare occasions the solution is very simple. Sometimes the file being opened is small by today's standards and can actually fit in memory. A few years ago I worked on a program that used a 40 MB random I/O file and opened and closed it through the C library package hundreds of thousands of times, so it was never kept in memory. A simple change used setvbuf(3) to keep the file in memory and removed the repeated open/close. Application performance improved by more than 300x, and CPU time was significantly reduced.
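The single-line change for sequential C library I/O might look like the following sketch. The function name `read_sequential` and its parameters are hypothetical; the only essential line is the `setvbuf` call, which must come after the open and before any other operation on the stream.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a file sequentially in `record_size` pieces
 * while the C library issues requests of `buffer_size` bytes underneath.
 * Returns the total number of bytes read, or -1 on error. */
long read_sequential(const char *path, size_t record_size, size_t buffer_size)
{
    assert(record_size > 0 && buffer_size > 0);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    char *io_buffer = malloc(buffer_size);
    char *record = malloc(record_size);
    long total = -1;

    /* The one added line: give the stream a large, fully buffered area. */
    if (io_buffer != NULL && record != NULL &&
        setvbuf(fp, io_buffer, _IOFBF, buffer_size) == 0) {
        size_t n;
        total = 0;
        while ((n = fread(record, 1, record_size, fp)) > 0)
            total += (long)n;
        if (ferror(fp))
            total = -1;
    }

    fclose(fp);        /* close before freeing the buffer the stream uses */
    free(record);
    free(io_buffer);
    return total;
}
```

A call such as `read_sequential(path, 100, 1u << 20)` keeps the application's 100-byte records unchanged while the library is free to fetch up to 1 MB per request. The buffer size here is an illustrative value, not a recommendation for any particular system.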
The issues with random I/O and large files are a bit more complex. You never want to make requests larger than the physical request from the application. If the requests begin and end on 512-byte boundaries, then system calls are your best choice. If they do not begin and end on 512-byte boundaries, then using the C library and setting the buffer to the request size rounded up to the next 512-byte boundary is a far better choice, given the read-modify-write that will otherwise be required for the system calls.
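For random requests that are not on 512-byte boundaries, the buffer-sizing rule can be sketched as follows. The names `stdio_buffer_size_for` and `read_random_records` are hypothetical, and the sketch assumes fixed-size records; how a given C library maps the buffer onto physical requests varies by implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define SECTOR_SIZE 512u  /* physical block size assumed in this article */

/* Request size rounded up to the next multiple of 512 bytes. */
size_t stdio_buffer_size_for(size_t request_size)
{
    return (request_size + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
}

/* Hypothetical helper: read one `record_size` record at each offset into
 * `out` (which must hold count * record_size bytes), with the stream
 * buffer sized to the rounded request and no larger.
 * Returns 0 on success, -1 on error. */
int read_random_records(const char *path, const long *offsets, size_t count,
                        size_t record_size, char *out)
{
    assert(record_size > 0);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    size_t buffer_size = stdio_buffer_size_for(record_size);
    char *io_buffer = malloc(buffer_size);
    int status = -1;

    if (io_buffer != NULL &&
        setvbuf(fp, io_buffer, _IOFBF, buffer_size) == 0) {
        status = 0;
        for (size_t i = 0; i < count && status == 0; i++) {
            if (fseek(fp, offsets[i], SEEK_SET) != 0 ||
                fread(out + i * record_size, 1, record_size, fp) != record_size)
                status = -1;
        }
    }

    fclose(fp);
    free(io_buffer);
    return status;
}
```

A 100-byte record gets a 512-byte buffer and a 513-byte record gets 1024 bytes, so the library never reads far more than the application will use.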
It is important to note that random I/O is not always as random as you think. Often applications perform what I call randomly sequential I/O. A number of requests are made sequentially and a seek request is made, and I/O is then done sequentially again. From what I have seen, this is very common in databases, search engines, and a number of scientific applications. We will discuss this further when we talk about RAID cache in a few months.
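One rough way to see how sequential a supposedly random workload is would be to count the sequential runs in a trace of requests. The sketch below is only an illustration: the `io_request` structure and `count_sequential_runs` function are hypothetical, and it assumes you already have a list of offsets and lengths in the order they were issued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One I/O request from a trace: where it starts and how long it is. */
struct io_request {
    uint64_t offset;
    uint64_t length;
};

/* Count sequential runs: a new run begins whenever a request does not
 * start exactly where the previous request ended (that is, after a seek). */
size_t count_sequential_runs(const struct io_request *reqs, size_t n)
{
    assert(reqs != NULL || n == 0);
    if (n == 0)
        return 0;

    size_t runs = 1;
    for (size_t i = 1; i < n; i++) {
        if (reqs[i].offset != reqs[i - 1].offset + reqs[i - 1].length)
            runs++;
    }
    return runs;
}
```

If a trace of thousands of requests collapses into a few hundred runs, the workload is closer to randomly sequential than to truly random.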
I/O at the operating system level is not overly difficult to understand given the current hardware restriction that I/O must begin and end on 512-byte boundaries. It is a very binary rule -- either the application does I/O on those boundaries or the system does it for you. Performance plays a major role in determining the most efficient use of system calls and how they pass data to and from the file system.