I wasn’t impressed last week when I saw Brian Beach’s blog on what disk drive to buy. I wasn’t impressed due to the lack of intellectual rigor in the analysis of the data he presented. In my opinion, clearly Beach has something else going on or lacks understanding of how disk drives and the disk drive market work.
Let me preface this article with the following full disclosure: I own no stock in Seagate, WD, or Toshiba, nor do I have family or close friends working at any of those companies. I do not buy disk storage, as in my consulting role I am not allowed to resell hardware or software by agreement. I do know people in two of the three companies and have for years, but I have not been given free stuff nor would I take it. Basically, the only agenda I have is a comprehensive factual analysis, which in my opinion is lacking in Beach’s blog post.
Let’s start at the second table in Beach’s article. I have added a few columns in green that were not part of the original, but the information in these columns can be found on the web with a bit of work, and as you will see are pretty important.
Let’s talk about the release data first. The oldest drive in the list is the Seagate Barracuda 1.5 TB drive from 2006. A drive that is almost 8 years old! Since it is well known in study after study that disk drives last about 5 years and no other drive is that old, I find it pretty disingenuous to leave out that information. Add to this that the Seagate 1.5 TB has a well-known problem that Seagate publicly admitted to, it is no surprise that these old drives are failing.
Now for the other end of the spectrum, new drives. Everyone knows that new drives have infant mortality issues. Drive vendors talk about it, the industry talks about it, RAID vendors talk about it, but there is not a single mention of what, if anything, is done in the Backblaze environment or how that figures into any of the calculations. This is never discussed in terms of: does Backblaze have a burn in period? Are they buying drives from someone that has burned in some and not others? There are lots of questions here.
Next let’s move to the hard error rate in bits. One of the definitions of consumer drives as compared to enterprise drives is that hard error rate is 1 bit in 10E14 bits for consumer drives and 1 bit in 10E15 bits for enterprise drives. The following table shows how many bits are moved before the storage vendors say there will be an error on the drive. So move about 11.3 TiB of data on a consumer drive and expect a failure.
Let’s analyze a few things that were stated and the implications. First the blog stated, “At the end of 2013, we had 27,134 consumer-grade drives spinning in Backblaze Storage Pods. The breakdown by brand looks like this:
I’ve noted that we just found that the Seagate 1.5 TB drives are about 8 years old since release, for the failure rate, but the average age of the Seagate drives in use are 1.4 years old. Averages are pretty useless statistic, and if Seagate drives are so bad then why buy so many new drives? Yet this is not the big issue. There is nothing in the blog about how much data is written to a set of drives that has had failures. If the drive exceeds the manufacturer’s specification and it fails, while another drive that is not being read or written to as much does not fail, is that really a problem?
My answer is that you cannot discuss drive failures unless you state clearly that the amount of I/O being done to the drives is the same. You should expect that an 8-year-old drive – besides being beyond its life expectancy – has had more data read and written to it, so it’s approaching the hard error rate for the drive.
On a side note, it is interesting that the claimed failure rate of the Seagate Barracuda Green ST1500DL003 drives is 120%. Is that because 100% of the drives failed and were replaced with another 100% and then 20% failed, or what? Is the cause of the failure because, as was stated, “The drives that just don’t work in our environment are Western Digital Green 3TB drives and Seagate LP (low power) 2TB drives. Both of these drives start accumulating errors as soon as they are put into production. We think this is related to vibration. The drives do somewhat better in the new low-vibration Backblaze Storage Pod, but still not well enough.”
A simple read of the drive specification compared to other drives shows that this drive should not be used in high vibration environments. Does this mean that Backblaze engineering does not spend the time to read the drive specifications for each drive they purchase?
The next column is the vendor supplied AFR (Annualized Failure Rate), which Seagate supplies for most of their drives and WD supplies for a single drive. These numbers are the best case values for the drive to respond to a command. For the 1 drive that Seagate does not provide an AFR, the ST4000DM000, Seagate states in their manual that: Average rate of <55TB/year. The MTBF specification for the drive assumes the I/O workload does not exceed the average annualized workload rate limit of 55TB/year. Workloads exceeding the annualized rate may degrade the drive MTBF and impact product reliability. The average annualized workload rate limit is in units of TB per year, or TB per 8760 power-on hours. Workload rate limit = TB transferred × (8760/recorded power-on hours).