Cloud Storage Will Be Limited By Drive Reliability, Bandwidth

Friday May 14th 2010 by Henry Newman

Drive reliability and bandwidth limitations make cloud storage a near impossibility for very large data stores.

We've all probably heard more than we want to hear about clouds this week, thanks to EMC World, but there are some things you need to think about if you're considering adopting a cloud model as part of your storage networking architecture.

Clouds have a place in data storage architecture planning, as do applications that might use clouds, such as Hadoop. The standard cloud approach is to use low-cost hardware and replicate the data. The theory is that because the data is replicated, it remains reliable when hardware fails. As most of the work I do is in large storage environments, and given what I know about drive failure rates, I have some huge misgivings about using this method to manage petabytes of data that need to be highly reliable.

So what I want to do is take you through a step-by-step analysis of the low-cost hardware used in most clouds. I did not look at the failure rates of the blade, just the storage. As part of this analysis, I went to the Web sites of all the major disk manufacturers and used the best values across all vendors, so my analysis is likely best case and your mileage may vary. Let's go through this thought process step-by-step.


Hard Errors Per Petabyte of Data Moved

The hard error rate, also known as BER (bit error rate), has a big effect on reliability. All the disk vendors I reviewed specified the BER in terms of non-recoverable read errors per bits read (1 sector per 10EXX bits).


Drive Type        One sector per X bits (hard error rate)   Byte Equivalent   PByte Equivalent
Consumer SATA     10E14                                     1.25E+14          0.11
Enterprise SATA   10E15                                     1.25E+15          1.11
Enterprise SAS    10E16                                     1.25E+16          11.10
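
The byte and petabyte columns follow directly from the bit counts. Here is a minimal Python sketch of that conversion. It assumes that "10E14" in the table means 10 x 10^14 bits and that a petabyte is counted as 2^50 bytes; both are inferences from the table's values, not vendor statements, and the function names are illustrative.

```python
# Convert "one unreadable sector per N bits read" into bytes and petabytes.
# Assumptions (inferred from the table, not stated by the vendors):
#   "10E14" means 10 * 10**14 bits, and 1 PB = 2**50 bytes.
BITS_PER_HARD_ERROR = {
    "Consumer SATA": 10 * 10**14,
    "Enterprise SATA": 10 * 10**15,
    "Enterprise SAS": 10 * 10**16,
}

def bytes_per_error(bits):
    """Bytes read, on average, per non-recoverable read error."""
    return bits / 8

def pbytes_per_error(bits):
    """Petabytes read, on average, per non-recoverable read error."""
    return bytes_per_error(bits) / 2**50

for drive, bits in BITS_PER_HARD_ERROR.items():
    print(f"{drive}: {bytes_per_error(bits):.2E} bytes, "
          f"{pbytes_per_error(bits):.2f} PB")
```

Under these assumptions the consumer SATA figure works out to roughly a tenth of a petabyte moved per hard error.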

I am not aware of anyone using enterprise SAS drives in a cloud architecture or with Hadoop, given the huge cost difference between enterprise SAS and SATA drives. Most installations are using the cheapest hardware.


Time to Read a 2TB Drive

You will see why this is important later in the article; for now, just note the time required to read the data on a drive.


Drive Type        Time to read 2 TB drive (in seconds)
Consumer SATA     24390.2
Enterprise SATA   24390.2
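
As a rough check on the read time, the figure is consistent with streaming a full 2 TB drive at a sustained 82 MB/sec, the per-drive rate that appears later in the article. The sketch below assumes 2 TB = 2,000,000 MB; the constant and function names are illustrative.

```python
# Best-case time to stream an entire drive from end to end.
# Assumptions: 2 TB = 2,000,000 MB and a sustained rate of 82 MB/sec.
DRIVE_CAPACITY_MB = 2_000_000
SUSTAINED_MB_PER_SEC = 82

def read_time_seconds(capacity_mb=DRIVE_CAPACITY_MB, rate=SUSTAINED_MB_PER_SEC):
    """Seconds needed to read the whole drive at the sustained rate."""
    return capacity_mb / rate

seconds = read_time_seconds()
print(f"{seconds:.1f} seconds, about {seconds / 3600:.1f} hours")
```

That is close to seven hours per drive before any network or contention overhead is counted.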

Number of Drives to Saturate a Channel

It is important to understand the number of drives needed to saturate different speed SONET channels. I have estimated the performance of the channels by derating the channel for TCP/IP and other packetization and retry overhead, being very conservative at 90 percent of channel rate and operating at full duplex at these speeds in both directions.


OC Channel   Estimated MB/sec   Consumer SATA drives in bandwidth   Enterprise SATA drives in bandwidth
OC-48        276                3.37                                3.04
OC-192       1106               13.49                               12.15
OC-384       4424               53.95                               48.61
OC-768       17695              215.79                              194.45

Clearly, it does not take many failed disk drives being re-replicated at full speed to saturate the network bandwidth.
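
The consumer SATA column can be approximated by dividing each channel's estimated rate by a single drive's streaming rate. This is a minimal sketch that takes the channel rates from the table as given and assumes 82 MB/sec per consumer drive; the names are illustrative.

```python
# Number of drives whose combined streaming rate fills a channel.
# Channel rates are the table's estimated MB/sec values, taken as given.
# Assumption: one consumer SATA drive streams at 82 MB/sec.
CHANNEL_MB_PER_SEC = {
    "OC-48": 276,
    "OC-192": 1106,
    "OC-384": 4424,
    "OC-768": 17695,
}
DRIVE_MB_PER_SEC = 82

def drives_to_saturate(channel_rate, drive_rate=DRIVE_MB_PER_SEC):
    """How many drives, streaming flat out, fill the channel."""
    return channel_rate / drive_rate

for channel, rate in CHANNEL_MB_PER_SEC.items():
    print(f"{channel}: {drives_to_saturate(rate):.2f} drives")
```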


Disk Drive Failure Per Year

There are two parts to the drive failure formula. The first part is based on the hard error rate. If you move 111 TB of data on consumer SATA drives, you can expect to hit data that was written but cannot be read back. The number for enterprise SATA is 1.1 PB. The other component of failure is the annualized failure rate (AFR). This is a yearly percentage of the total number of drives and is an estimate provided by the drive vendor. It should be noted that very few drive vendors provide an AFR for consumer SATA drives. The next table shows the number of 2 TB SATA drives needed for various storage amounts and the expected number of failures per year.


Drive Type        AFR in %   1 PB                  5 PB                  10 PB                 25 PB
                             Drives  Failures/yr   Drives  Failures/yr   Drives  Failures/yr   Drives  Failures/yr
Consumer SATA     1.24%      500     6.2           2500    31            5000    62            12500   155
Enterprise SATA   0.73%      500     3.65          2500    18.25         5000    36.5          12500   91.25
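
The AFR figures are simply the drive count multiplied by the vendor's annual percentage. A short sketch, assuming 1 PB = 1000 TB (which gives the table's 500 drives per petabyte); the function names are illustrative.

```python
# Expected drive failures per year from the vendor AFR alone.
DRIVE_TB = 2
AFR = {"Consumer SATA": 0.0124, "Enterprise SATA": 0.0073}

def drives_needed(petabytes, drive_tb=DRIVE_TB):
    """Drive count for a given capacity, assuming 1 PB = 1000 TB."""
    return petabytes * 1000 // drive_tb

def afr_failures_per_year(petabytes, afr):
    """Drive count times the annualized failure rate."""
    return drives_needed(petabytes) * afr

for drive, afr in AFR.items():
    for pb in (1, 5, 10, 25):
        print(f"{drive}, {pb} PB: {drives_needed(pb)} drives, "
              f"{afr_failures_per_year(pb, afr):.2f} failures per year")
```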

The other aspect of this is failure based on the BER, and since this is based on data movement, I will again choose a conservative number for usage and estimate that the drive will use 5 percent of its total bandwidth year-round.

Number of Failures Per Year
Drive Type        1 PB   5 PB   10 PB   25 PB
Consumer SATA     542    2712   5423    13558
Enterprise SATA   61     304    608     1519
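
The consumer SATA row can be approximated by dividing a year's worth of data movement by the bytes read per hard error. The sketch below is one way to arrive at those values, not necessarily the exact method used for the table. It assumes 500 drives per petabyte, 82 MB/sec per drive, and binary megabytes (2^20 bytes); the last of these is a guess that happens to land on the table's numbers.

```python
# Expected hard-error (BER) failures per year for consumer SATA.
SECONDS_PER_YEAR = 365 * 24 * 3600
DRIVES_PER_PB = 500             # 2 TB drives
DRIVE_MB_PER_SEC = 82           # assumed consumer SATA streaming rate
BYTES_PER_MB = 2**20            # assumption: binary megabytes
BYTES_PER_HARD_ERROR = 1.25e14  # consumer SATA, from the hard error table

def ber_failures_per_year(petabytes, usage=0.05):
    """Bytes moved in a year divided by bytes moved per hard error."""
    drives = DRIVES_PER_PB * petabytes
    bytes_moved = (drives * DRIVE_MB_PER_SEC * usage
                   * SECONDS_PER_YEAR * BYTES_PER_MB)
    return bytes_moved / BYTES_PER_HARD_ERROR

for pb in (1, 5, 10, 25):
    print(f"{pb} PB: {ber_failures_per_year(pb):.0f} failures per year")
```

Because the result scales linearly with both capacity and usage, doubling either one doubles the expected hard-error count.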

To determine total failures, you need to add the BER-based failures at 5 percent usage to the AFR-based failures.

Number of Failures Per Year
Drive Type        1 PB   5 PB   10 PB   25 PB
Consumer SATA     549    2743   5485    13713
Enterprise SATA   64     322    644     1611

If you take the total failures for the 5 percent case and divide by 365, you get this number of failures per day:

Number of Daily Failures
Drive Type        1 PB   5 PB   10 PB   25 PB
Consumer SATA     1.5    7.5    15      37.6
Enterprise SATA   0.2    0.9    1.8     4.4

A small increase to 7.5 percent usage of total bandwidth yields this number of failures per day for each of the storage volumes:

Number of Daily Failures
Drive Type        1 PB   5 PB   10 PB   25 PB
Consumer SATA     2.2    11.2   22.5    56.1
Enterprise SATA   0.3    1.3    2.6     6.5
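
Both daily-failure tables can be approximated from the yearly figures: scale the BER-based failures with usage, add the AFR-based failures (which do not depend on usage), and divide by 365. This sketch uses the rounded consumer SATA values from the tables as inputs, so it is a plausibility check rather than a reconstruction of the original spreadsheet; the names are illustrative.

```python
# Daily failures for consumer SATA from the rounded yearly table values.
BER_FAILURES_AT_5_PERCENT = {1: 542, 5: 2712, 10: 5423, 25: 13558}
AFR_FAILURES = {1: 6.2, 5: 31, 10: 62, 25: 155}

def daily_failures(petabytes, usage=0.05):
    """(BER failures scaled to the given usage + AFR failures) / 365."""
    # BER failures scale linearly with data moved; AFR failures do not.
    ber = BER_FAILURES_AT_5_PERCENT[petabytes] * (usage / 0.05)
    return (ber + AFR_FAILURES[petabytes]) / 365

for usage in (0.05, 0.075):
    row = ", ".join(f"{pb} PB: {daily_failures(pb, usage):.1f}"
                    for pb in (1, 5, 10, 25))
    print(f"{usage:.1%} usage -> {row}")
```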

Total Amount of Data to be Moved for Failures

Now to the meat of the issue: For the 5 percent use case and 10 PB of storage, you will have an average of 15 consumer-grade SATA drives failing per day. Each drive takes, best case, approximately 24,390 seconds to be read and rewritten over the network. An OC-48 channel carries at most the full bandwidth of 3.37 drives, a total of 276 MB/sec for 24 hours. So using some simple math, 276 MB/sec*3600*24 gives the total MB the channel can move per day. Doing the same math for the disk drives, you need 82 MB/sec for 24,390 seconds for each of the 15 drive failures. Here is how that math works out for a few scenarios:


Consumer SATA: MB per day of channel capacity left after drive replication
Usage   OC Channel   1 PB         5 PB         10 PB         25 PB
5%      OC-48        20,882,319   8,860,106    -6,167,659    -51,250,956
7.5%    OC-48        19,840,727   3,652,149    -16,583,574   -77,290,742
5%      OC-192       92,545,935   80,523,722   65,495,957    20,412,660
7.5%    OC-192       91,504,343   75,315,765   55,080,042    -5,627,126

Any negative number means that the drive replication requirement exceeds the channel bandwidth. So, for example, if you have 10 PB and OC-48 and 5 percent drive usage, replication needs 6,167,659 MB per day more than the channel can carry, or about 71 MB/sec over the 24-hour period. Obviously, this becomes a bigger and bigger problem over time, as you cannot replicate the data as fast as it's lost. It is statistically likely that you are going to eventually lose data if you have 10 PB, and it will not take long. The only architectural option is a third copy of the data, which is very costly. The crossover point for an OC-48 channel with 5 percent usage of the storage system is between 5 PB and 10 PB, and with 7.5 percent usage you only have 42 MB/sec (3,652,149/(3600*24)) of spare bandwidth at 5 PB of storage space. What is needed is much faster networking, which comes at a cost, or more reliable storage, which also isn't cheap.
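
The daily budget can be sketched as channel capacity minus the data that must be moved to re-replicate failed drives. The example below uses the rounded figures quoted in the text (276 MB/sec, 82 MB/sec, 24,390 seconds), so its results land close to, but not exactly on, the table's values. The function name is illustrative.

```python
# Daily bandwidth budget: what the channel can move minus what
# re-replicating failed drives requires. Inputs are the rounded
# figures quoted in the text, so results only approximate the table.
SECONDS_PER_DAY = 3600 * 24
CHANNEL_MB_PER_SEC = 276      # OC-48 estimate
DRIVE_MB_PER_SEC = 82
DRIVE_READ_SECONDS = 24_390   # best-case time to copy one 2 TB drive

def spare_mb_per_day(failures_per_day, channel_rate=CHANNEL_MB_PER_SEC):
    """Negative means replication demand exceeds channel capacity."""
    capacity = channel_rate * SECONDS_PER_DAY
    needed = DRIVE_MB_PER_SEC * DRIVE_READ_SECONDS * failures_per_day
    return capacity - needed

# 10 PB at 5 percent usage: about 15 consumer SATA failures per day.
shortfall = spare_mb_per_day(15)
print(f"{shortfall:,.0f} MB per day, "
      f"{shortfall / SECONDS_PER_DAY:.0f} MB/sec")
```

With 7.5 failures per day (5 PB at 5 percent usage) the same function returns a positive number, which is the crossover between 5 PB and 10 PB described above.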

I am sure cloud companies trade these costs off every day and figure out what the best method is for optimizing the costs. Is it possible that some of them don't understand some of the basic hardware issues? I sure hope that is not the case. Clearly, cloud storage works just fine for less than 5 PB for an OC-48 channel and consumer SATA storage. How many clouds have more than that much storage today? I have no idea, but certainly some do, and 10 to 20 PB archives are common for large storage users.

Cloud architecture is far more complex than architecting for local storage. Cloud storage could be designed with a RAID back end, eliminating much of the problem, but most clouds I see do not use RAID because of the cost. The bottom line is that cloud architecture and design is not easy, and for large data volumes I cannot see how clouds can be cheaper than local storage.

Drive reliability and bandwidth will limit cloud adoption, and it's a problem that may never get solved. Bandwidth will continue to get cheaper, but drive reliability hasn't improved much, and data will likely continue to grow faster than bandwidth anyway. Perhaps network-based deduplication could help — assuming the data can be deduped. But for now at least, there doesn't seem to be much of an alternative to good old-fashioned data centers for very large data stores.


Henry Newman, CTO of Instrumental Inc. and a regular Enterprise Storage Forum contributor, is an industry consultant with 28 years experience in high-performance computing and storage.

Copyright 2018 © QuinStreet Inc. All Rights Reserved