Design for failure: Durability and its practical impact on your business

For anyone who wants their data to be reliably stored (and that’s everyone), the concept of storage durability is a serious matter. You might be backing up to a hard drive, but what happens when that hard drive dies? All hard drives will fail; it’s not a question of “if” but one of “when”.

Today, to help mitigate that risk, most people are sending their data off to a “cloud” service. But the challenge still exists: cloud service providers, like Backblaze, are all storing data on hard drives. And hard drives fail. Which leads us back to the core question: What happens to my data when the drives it’s stored on fail?

To alleviate fears, the storage industry has introduced the concept of “Data Durability”. It is supposed to be interpreted by people as “the probability that my data will remain recoverable”. All major providers claim at least 99.999999999% durability. That’s “11 nines” worth of durability: read as an annual, per-object figure, if you stored 1 million objects for 100,000 years, you would expect to lose about one. There’s a higher likelihood of an asteroid destroying Earth within a million years. Great! We should all sleep safely tonight.
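To make the “nines” arithmetic concrete, here is a minimal sketch. It assumes the durability figure is quoted per object, per year, which is how such numbers are commonly stated; the function name and parameters are illustrative and not part of any provider’s tooling.

```python
def expected_losses(objects, years, nines=11):
    """Expected number of objects lost, assuming each object is lost
    independently with probability 10**-nines in any given year."""
    annual_loss_probability = 10.0 ** -nines
    return objects * years * annual_loss_probability


if __name__ == "__main__":
    # 10 million objects held for 10,000 years at 11 nines
    print(expected_losses(10_000_000, 10_000))
```

Under that assumption, 10 million objects held for 10,000 years works out to roughly one expected loss. The figure scales linearly with both the object count and the time span, so it says nothing about how the losses would be distributed.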

But there’s a catch. Until recently, no provider had published the math or the underlying assumptions of how that failure rate is being calculated. While magicians don’t tell you how they do a trick, trusting your data to magic is generally not recommended.

What does the math look like?

There are two key components to understand when calculating durability:

  1. Annualized failure rate of a drive. Since all drives will fail, the question is how often they will fail. Currently, across all of our (Backblaze) data centers, we see an annualized failure rate of 0.81%.
  2. Average rebuild time for a failed drive. Taking a step back, cloud storage providers do store your data on hard drives, but any given file is actually stored across multiple hard drives. At Backblaze, we split up files into 20 “shards” and store each shard on a different drive. We can lose any 3 shards and still recover a file, so a file is at risk only if a fourth shard is lost before the others are restored. That means when a drive fails, we have time to rebuild it without any service interruption. At Backblaze, it takes about 6.5 days to rebuild a drive.
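The two inputs above can be combined in a rough back-of-the-envelope model. The sketch below is not the published calculation. It is a simplified binomial estimate that assumes drives fail independently at a constant rate, that the year divides into non-overlapping rebuild windows, and that a file is lost only when more than 3 of its 20 shards fail within the same window. All function and parameter names are illustrative.

```python
from math import comb, expm1, log1p


def window_loss_probability(afr, rebuild_days, total_shards=20, tolerated=3):
    """Probability that more than `tolerated` shards of one file fail
    within a single rebuild window (simple binomial model)."""
    # Chance that one given drive fails during one rebuild window.
    p = afr * rebuild_days / 365.0
    # Binomial tail: tolerated + 1 or more failures out of total_shards.
    return sum(
        comb(total_shards, k) * p**k * (1.0 - p) ** (total_shards - k)
        for k in range(tolerated + 1, total_shards + 1)
    )


def annual_loss_probability(afr, rebuild_days, total_shards=20, tolerated=3):
    """Probability of losing a file within a year, treating the year as
    365 / rebuild_days independent windows."""
    p_window = window_loss_probability(afr, rebuild_days, total_shards, tolerated)
    windows_per_year = 365.0 / rebuild_days
    # 1 - (1 - p_window) ** windows_per_year, computed stably for tiny values.
    return -expm1(windows_per_year * log1p(-p_window))


if __name__ == "__main__":
    # Inputs quoted in the text: 0.81% annualized failure rate, 6.5-day rebuild.
    print(annual_loss_probability(afr=0.0081, rebuild_days=6.5))
```

In this model the loss probability grows roughly with the fourth power of the per-window failure chance, so small changes in the failure rate or the rebuild time move the result a lot. Treat the output as an order-of-magnitude illustration of why both inputs matter, not as a reproduction of any provider’s stated durability.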

Those are the basics. From there, the math gets very complex, even for people with a background in statistics. For a complete breakdown, including math symbols that you may never have seen in your life, please visit this post.

Designing for failure

While the storage industry tries to point at the durability metric to convince customers “everything is fine, no need to look further”, it’s close to sleight of hand. Yes, hard drive failure will not bite you on any scaled system. Eleven nines are good odds. But that doesn’t account for a far simpler issue: what if an armed conflict or Act of God takes out the data storage facility? Whether you are storing your data yourself or using a third party, floods, earthquakes, and similar disasters do happen.

Even more likely: what if your credit card fails and your email provider filters the billing emails into the spam folder?

While the odds of those events are harder to calculate, they can happen. So customers need to take precautions for themselves: multiple users on the account and, ideally, storage with multiple vendors in multiple locations.

Reliably storing data requires building fault-tolerant systems and processes that help mitigate the impact of failure scenarios. This means not only choosing providers that store data reliably, but also making sure your own systems are designed with that in mind. From a hard drive perspective, Backblaze has you covered. Our architecture uses erasure coding to reliably store any given file across multiple physical locations (mitigating specific types of failures, like a faulty power strip).
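For readers who have not met erasure coding before, the toy sketch below shows the basic idea using a single XOR parity shard. It is deliberately much simpler than a production scheme: it tolerates the loss of only one shard, whereas the layout described earlier tolerates three. The function names are illustrative, and nothing here represents Backblaze’s actual implementation.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, data_shards):
    """Split data into equal-size shards and append one XOR parity shard."""
    shard_len = -(-len(data) // data_shards)  # ceiling division
    padded = data.ljust(shard_len * data_shards, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(data_shards)]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def recover(shards):
    """Rebuild at most one missing shard (marked as None) from the others."""
    missing = [i for i, shard in enumerate(shards) if shard is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can rebuild only one lost shard")
    repaired = list(shards)
    if missing:
        present = [shard for shard in shards if shard is not None]
        # XOR of all surviving shards equals the missing one.
        repaired[missing[0]] = reduce(xor_bytes, present)
    return repaired


if __name__ == "__main__":
    original = b"durability matters"
    stored = encode(original, data_shards=4)
    damaged = list(stored)
    damaged[2] = None  # simulate one failed drive
    restored = recover(damaged)
    print(b"".join(restored[:4])[:len(original)] == original)
```

Each shard would live on a different drive, so a single drive failure removes only one shard and the file can still be rebuilt from the rest. Schemes that survive several simultaneous losses follow the same principle with more parity shards and more involved arithmetic.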

Beyond the hardware, Backblaze’s business model is profitable and self-sustaining, and it provides us with the resources and wherewithal to make the right decisions. We also choose to publish our hard drive failure rates, our cost structure, and the durability math, all in an attempt to provide evidence about the reliability of our systems. And we have a number of ridiculously intelligent, hardworking people dedicated to improving our systems. Why? Because the obligation around protecting your data goes far beyond the academic calculation of “durability” as defined by hard drive failure rates.

To learn more, visit us in the Expo Hall at the 2018 Jamf Nation User Conference. Oh, you haven’t registered yet? No problem. Head here to register for the largest Apple IT event on the planet.

Ahin Thomas is the VP of Marketing at Backblaze.