Guest Author: Per Brashers, Founder, Yttibrium LLC
A consultancy focused on efficient storage and computing
Cold Storage Demystified
Most of the Internet moguls have been asking for a longer service life. This may mean an MTBF going from 3,000,000 to 4,000,000 hours, or a 7-year warranty. Either way, there are some tricky bits to deal with when talking about an archive-class hard drive. The point of this article is to identify the top factors that matter to a ColdStorage system.
If you just want to jump to the conclusions, there is a section at the end for that, as well as the asks for both providers and consumers.
While still at Facebook, I coined the phrase ColdStorage. I also regret coining the term ColdFlash, but more on that later.
The intention of ColdStorage at the time was to differentiate from the ‘Archival’ and ‘Backup’ use-cases. Both of those use-cases indicate an extreme latency on retrieval, and often imply a heavy burden on ingest. The ColdStorage use-case was to have extremely low operating costs, extremely high data durability, and, in time of need, the ability to do restores or serve live traffic at a barely degraded rate.
1. Service life
There are a myriad of issues associated with how service life in the hyper scale is derived. Many use the 60% of warranty mark as the point where the bathtub effect starts to hurt, and where the drives have 50% of their initial value on the grey market. Others will run to end of warranty and do a fleet swap at that time. We have yet to see anyone willing to run past warranty and on to the death of the fleet.
We all know that the formula for MTBF -> AFR is inaccurate, and customers differ slightly in how they adjust from ‘perfect bench test’ to ‘real world’. Multipliers range from 5X to 10X the published rate. Let's get to brass tacks for a second here. For instance, the formula assumes a constant failure rate, which is impossible for sure. But to play it out, a drive listing 2M hrs MTBF calculates out to an AFR of 0.44% (8,760 hours per year divided by the MTBF). Empirical evidence in an ambient datacenter indicates that a mixed and aged (not over warranty) fleet of drives yields 4.4% AFR. So we can take a 10X multiplier and be moderately happy with that accuracy. Thus a drive of 800K hrs MTBF would calculate out at an AFR of 1.1%, or, given the above adjustment, 11% AFR. YIKES!
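The arithmetic above can be sketched in a few lines of Python. This is a minimal sketch assuming the simple linear approximation (hours per year divided by MTBF) that the quoted figures appear to use; the function name and the `real_world_multiplier` parameter are illustrative, not an industry standard.

```python
HOURS_PER_YEAR = 8760  # 24 hours * 365 days


def afr_from_mtbf(mtbf_hours, real_world_multiplier=1.0):
    """Constant-failure-rate approximation: AFR ~= hours per year / MTBF.

    real_world_multiplier is the bench-to-field adjustment discussed in
    the text (5X to 10X); it is an empirical fudge factor, not physics.
    """
    return HOURS_PER_YEAR / mtbf_hours * real_world_multiplier


for mtbf in (2_000_000, 800_000):
    bench = afr_from_mtbf(mtbf)
    field = afr_from_mtbf(mtbf, real_world_multiplier=10)
    print(f"{mtbf:>9,} h MTBF: bench AFR {bench:.2%}, adjusted AFR {field:.1%}")
```

Swapping the multiplier between 5 and 10 shows how sensitive the field estimate is to that one assumption.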
Now we come to another ambiguity: what affects service life? We understand well that there are physical factors (bytes transferred, load/unload cycles, start/stop cycles) which are well tested and understood. However, the ambient data center has added new challenges that are less documented, for example the change in temperature and relative humidity (RH) per hour, as measured 9” in front of a drive. The inner comedian in me wants to make some sort of ‘releasing the pituitary gland’ joke about how to make this measurement, but let us stay on track. The specification is to take a reading per hour and report the delta. So it is totally plausible for an ambient DC to turn on the evaporative cooling, causing a rapid shift in RH and temperature, but then hold this new level for the rest of the hour, thus staying inside the specification. Said differently, there is no slope specified, only implied. There is a real possibility that operators who simply do not have the necessary information may unwittingly abuse the drives within the very coarse one-hour window the specification allows.
Are temperature and humidity swings the major contributing factors? (Keep in mind this is about archive class storage.) There have been a series of research papers on this topic, specifically in the context of Cold. It seems that components that are not able to self-maintain some level of temperature stability (and thus some resistance to humidity) are subject to all the problems of the larger system. If that system allows for rapid changes in ‘conditioning’ then the effect may be large. To de-obfuscate what I am talking about, there is evidence that an idle or inactive drive may have a higher failure rate than an active drive in an ambient data center. Or, to answer the question directly: yes.
2. UBER (not drivers): Uncorrectable Bit Error Rate
This first paragraph is my personal rant. If you do not want to hear why we as an industry have failed to serve ourselves, please go to the next paragraph, where the data-driven discussion continues. Uncorrectable errors happen all the time in drives. The drives drive us crazy in their herculean efforts to not post a failed read/write, taking up to 3,000 ms to do it!! Who took their eyes off the road? We are not on the right path. Problem is, the OS driver guys are at fault here. For how many years has storage been the ‘pull the chain and it becomes someone else’s problem’ of IT? Now we are in an era that is all about how the data can derive better information. “Revenge of the Geeks V2”? No, the OS guys have to wake up, mop up and deal with the fact that they have an analog device under a digital system, and thus deal with reports of missing blocks, unreadable sectors, etc. Erasure Codes do well at the macro level, and LEC is promising to bring it to the median level, but there is still no way to know if the data retrieved is the data stored (T10 DIF is a feeble but good first step).
The effect of UBER on operational conditions. The Uncorrectable Bit Error Rate of a drive has some significant impacts, even ignoring drives being marked bad unnecessarily (the 80% No Trouble Found issue) because the OS cannot get a grip on reality. ‘Soft’ errors occur at some reasonable rate, like 1×10^-15, or one error per 10^15 bits. This means that for every 125 TB of data transferred, an error will have been posted during normal operations. The effect of this in the hyper scale datacenters is to trade off storage for network/CPU. If we have an N-1 generation erasure coded system, without local protection, then for each 125 TB transferred we end up with at least one segment at fault. In a [10,14] EC, this fault means we have to find 10 valid segments, recalculate the 4 parity segments, write a new stripe of 14 chunks, and set garbage collection on the 13 original (remaining) segments. So when you wanted to do a single read of 10, it ended up doing 10 + CPU + 14 + 13 I/Os, or 37 disk operations plus the CPU, and oh yeah, filling the network pipes in the meantime. So relaxing UBER by an order of magnitude (to 1×10^-14) is a 10 × 37(+) cost effect on the system at large. Said differently, every 12.5 TB of data transferred will incur a 37X I/O and network penalty.
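For readers who want to check the units, here is a rough sketch of the two calculations in this paragraph. It assumes UBER is quoted in errors per bit read and uses decimal units (1 TB = 10^12 bytes); the function names are hypothetical, and the I/O tally simply follows the read, rewrite, and garbage-collect steps described above.

```python
TB = 10**12  # decimal terabyte, in bytes


def bytes_per_uncorrectable_error(uber):
    """Average bytes transferred per posted error, for UBER in errors per bit."""
    bits_per_error = 1 / uber
    return bits_per_error / 8


def repair_disk_ops(data_segments=10, total_segments=14):
    """Disk operations for one read that hits a bad segment in a [10,14] code."""
    reads = data_segments          # find 10 valid segments
    writes = total_segments        # write a fresh stripe of 14 chunks
    cleanup = total_segments - 1   # garbage-collect the 13 surviving originals
    return reads + writes + cleanup


for uber in (1e-15, 1e-14):
    volume_tb = bytes_per_uncorrectable_error(uber) / TB
    print(f"UBER {uber:.0e}: one error per ~{volume_tb:,.1f} TB transferred")

print("disk ops per faulted read:", repair_disk_ops())
```

The CPU cost of recomputing parity and the network traffic are deliberately left out; they come on top of the disk operations counted here.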
3. Usage model
The purpose of a Cold Storage system is to prevent total loss of data. The threat we plan to protect against is generally ‘User Silliness’ (a code push corrupts all the online sources, etc.). To provide any level of business continuance in this respect, the system must employ some level of fairness. The other main goal is to make it cost as little as humanly possible, both in the enclosure design and in operational costs. The latter offers the largest remaining impact, as we have already squeezed every penny we could out of the enclosure designs, given the constraints we have to date.
The ColdStorage systems get nice creative names, like Pelican, SubZero, etc. But they all employ some sort of fairness system. Thermal and power fairness is what is employed by the Pelican system, and SubZero uses time slicing. Either way, any given data set that needs to be restored will span multiple fairness zones, so a staging location must be employed to get the data sorted out before returning it to production. This was the first instance of ColdFlash, a phrase that has warped and morphed for every industry in a different way. My apologies.
Retention times of data in ColdStorage may vary. There are cases where a lawyer asks to have records kept for the duration of a case. There is a 7-year need for corporate data; there is a 99-year need for social and medical records, etc. But none of them falls below 5 years. This is why Blu-ray has gotten a shot in the arm, is off life-support, and back in the race.
Some of the HyperScale folks went off and started using tape, for the low media cost and low maintenance costs. However, when it comes to servicing the rapid restore capability, things are not quite so good. It turns out that if you buy a library and fill it, never ejecting the media (in order to hit the rapid restore needs), the loaded cost after year 0 gets pretty close to disk costs! (The tape guys are great sales folks; they make us think that the really expensive thing holding down our floor is amortized, so do not count it in the cost of storage.) Tape also presents interesting challenges in an ambient data center: it stretches when it gets warm, and it glues itself together when it gets moist. It is also so low energy that it cannot do thermal management of its own. Thus you have to build a separate low-efficiency enclosure inside the DC to house this one thing. Not cool.
4. Putting it all together, at scale
OK, now on to the summary. Let us first fix a few variables. Let us stay with a standard 7 MW feed, yielding roughly 20,000 sqft of datacenter, or 1,000 racks. If we claim a reasonable ratio of compute to storage, both to manage I/O and to handle the inevitable EC repairs, we can claim 4% of the environment is compute and the rest storage, yielding roughly 260,000 drive slots. If we take the flat rates as yielded by “the formula” and adjust for real world, we get a 5% failure rate for a 2M hr MTBF class drive, and 11% for an 800K hr MTBF drive. Multiply by the fleet size and we get 12,975 failed drives for a 2M hr MTBF and 28,545 failed drives for an 800K hr MTBF, EVERY YEAR! Since we know the bathtub effect sits above this rate, the first year will be ~3X that, and so will the last year of warranty. The 800K MTBF drive has a 3-year warranty on it, making the replacement count over a single lifecycle 85,635 drives in the flat model and 199,815 in the bathtub-adjusted model. That is 77% of the fleet, so we must assume that work has been done to reduce the tips of the tub (assume, even if the recent article from Backblaze indicates differently). Still, YIKES again.
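The fleet numbers can be reproduced with the short sketch below. Two things in it are my reading rather than stated facts: the exact slot count of 259,500 is inferred by working backwards from the failure figures quoted above (the text rounds it to 260,000), and the bathtub is modeled as the first and last warranty years running at 3X the flat rate with any years in between at 1X.

```python
FLEET = 259_500  # drive slots; inferred from the quoted figures, rounded to 260,000 in the text


def flat_failures(afr, years=1, fleet=FLEET):
    """Failed drives over `years` at a constant annual failure rate."""
    return round(fleet * afr * years)


def bathtub_failures(afr, warranty_years, edge_multiplier=3, fleet=FLEET):
    """Failed drives over the warranty period with elevated first and last years.

    Assumes warranty_years >= 2: the first and last years fail at
    edge_multiplier times the flat rate, the years between at the flat rate.
    """
    weights = [edge_multiplier] + [1] * (warranty_years - 2) + [edge_multiplier]
    return round(fleet * afr * sum(weights))


print("2M hr MTBF, per year:        ", flat_failures(0.05))
print("800K hr MTBF, per year:      ", flat_failures(0.11))
print("800K hr MTBF, 3 yrs flat:    ", flat_failures(0.11, years=3))
print("800K hr MTBF, 3 yrs bathtub: ", bathtub_failures(0.11, warranty_years=3))
```

Lowering `edge_multiplier` toward 1 is the quickest way to see how much "reducing the tips of the tub" is worth in replaced drives.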
Now let us go back to that question of data integrity and longevity. Take the flat rate of 28,545 drive failures in a year, with an erasure code over the top of these drives. We established earlier that a hole in the data results in a 37X I/O rate on the network/disk subsystems, so this will cause over a million moves per year (read the remaining bits over the network, calculate new parity, write out the new stripe over the network, garbage collect the old). Even with something as ridiculously small as a 1 TB drive, that yields over an Exabyte of data moved per year. If we use a large capacity drive, well, the number exceeds the capacity of the facility quite quickly. Data is at risk for sure.
Let us play this little game the other way around. Yttibrium has been advising providers and consumers of these types of systems to look at a design point of 4M hr MTBF and a 7-year life. We also designed an erasure code assist (LEC) in hardware, so that the 37X problem is handled in the storage enclosure and not on the network. Given the Yttibrium design point, we see an adjusted 2.1% failure rate (still using the 10X multiplier for the lab-to-real adjustment), or 5,450 drives out of the 260,000 in this hypothetical fleet. We also no longer need to migrate the data at years 3 or 5 (not taken into account in the network traffic study above). As we come to the end of the drive's life, we also come to the end of the SEC hold period, so the drives and data terminate together, assuming you are not a medical or social site. In addition to all this goodness, if you do need to keep the data longer than 7 years, the migration can happen transparently in the background by use of the LEC algorithm, as long as the enclosure has a chance to catch up between swaps.
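As a back-of-the-envelope check on the traffic estimate, the sketch below multiplies out the figures. It assumes, as the arithmetic in the text seems to, that each of the 37 operations per failed drive moves one drive's worth of data; that is a deliberate simplification and the function names are hypothetical.

```python
TB_PER_EB = 1_000_000  # decimal units: 1 EB = 10^6 TB


def repair_moves_per_year(failed_drives, ops_per_repair=37):
    """Total repair operations per year across the fleet."""
    return failed_drives * ops_per_repair


def data_moved_eb(failed_drives, drive_tb, ops_per_repair=37):
    """Exabytes moved per year, treating each move as one drive capacity."""
    return repair_moves_per_year(failed_drives, ops_per_repair) * drive_tb / TB_PER_EB


failed = 28_545  # flat-rate annual failures for the 800K hr MTBF fleet
print(f"moves per year: {repair_moves_per_year(failed):,}")
for capacity_tb in (1, 4, 8):
    print(f"{capacity_tb} TB drives: ~{data_moved_eb(failed, capacity_tb):.2f} EB moved per year")
```

The 4 TB and 8 TB rows are only there to illustrate how quickly the total grows with drive capacity.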
Providers: Please provide the industry with a drive optimized for LEC in the enclosure, with high RV tolerance so we can further de-cost the enclosure, an MTBF of 4M hrs, and a usable life of 7 years, at a price point that makes our TCO positive in year 3.
Consumers: Please publish more data about drives, reliability, operating environment, etc. Mimic what Backblaze has been doing and be more open about what the needs and usage conditions really are. We have a hard time convincing the manufacturers that this added service life and lower error rate are important; your data will help.