SSD Reliability Demystified

By May 1, 2016 April 14th, 2019 Blog

Something has been sorely lacking in the world of SSDs: real-world reliability metrics.  We have used Mean Time Between Failures (MTBF) as a measure of hard drive reliability for years, and it has proven to be a largely useless metric.  In theory, a higher MTBF means a higher quality drive, and drives are typically rated somewhere between 100,000 hours and more than 1 million hours.  Manufacturers assign MTBF numbers to their drives after internal testing.  SSDs have MTBF ratings anywhere from 2x to 10x higher than magnetic disks, which often gives people the impression that flash is far more reliable than traditional hard disks.  That doesn’t tell the whole story, since NAND cells have a finite lifespan determined by their Program/Erase (P/E) cycle endurance, which manufacturers most often express as Drive Writes per Day (DWPD).  SSD vendors use a combination of MTBF and DWPD to estimate the lifespan of a specific combination of NAND and controller.
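To make those two ratings a little more concrete, here is a rough sketch that converts an MTBF figure into an approximate annual failure rate and a DWPD rating into total rated writes.  It assumes a constant failure rate (the usual simplification behind an MTBF number), and the function names and the 400 GB, 5 year example drive are hypothetical, chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def mtbf_to_afr(mtbf_hours: float) -> float:
    """Approximate annual failure rate implied by an MTBF rating.

    Assumption: failures arrive at a constant rate (exponential model).
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def dwpd_to_tbw(dwpd: float, capacity_gb: float, warranty_years: float) -> float:
    """Total terabytes written that a DWPD rating allows over the warranty."""
    return dwpd * capacity_gb * 365 * warranty_years / 1000.0


if __name__ == "__main__":
    # The two ends of the MTBF range mentioned above.
    for mtbf in (100_000, 1_000_000):
        print(f"MTBF {mtbf:>9,} h -> AFR of about {mtbf_to_afr(mtbf):.2%}")
    # Hypothetical drive: 400 GB, rated for 1 DWPD over 5 years.
    print(f"1 DWPD on 400 GB for 5 years -> {dwpd_to_tbw(1, 400, 5):.0f} TB written")
```

Under that assumption a 1,000,000 hour MTBF works out to an annual failure rate a little under 1%, and the hypothetical 1 DWPD drive is rated for 730 TB of writes.  Neither number says anything about how a particular controller and NAND pairing behaves in the field.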

Getting accurate life cycle data, whether for flash or traditional disks, has always been difficult.  Online backup provider Backblaze gathered data from a sample of approximately 27,000 consumer-grade hard drives and found some interesting results.  Certain models in their collection had failed at a rate of over 25% by the ~4 year mark, despite a manufacturer-rated annual failure rate (AFR) of 0.32% and an MTBF of approximately 750,000 hours, while others were as low as 4% over the same time span.  If we learn anything from Backblaze’s data, it is that manufacturers’ reliability numbers should be taken with a grain of salt.
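A quick back-of-the-envelope calculation shows how wide that gap is.  The sketch below assumes failures are spread evenly across the years (a constant AFR), which real drives do not strictly follow, and the helper names are hypothetical.

```python
def cumulative_failure(afr: float, years: float) -> float:
    """Fraction of drives expected to have failed after `years` at a constant AFR."""
    return 1.0 - (1.0 - afr) ** years


def implied_afr(cumulative: float, years: float) -> float:
    """Constant AFR that would produce the observed cumulative failure fraction."""
    return 1.0 - (1.0 - cumulative) ** (1.0 / years)


if __name__ == "__main__":
    # What the rated 0.32% AFR would predict after 4 years.
    print(f"Rated 0.32% AFR over 4 years: {cumulative_failure(0.0032, 4):.2%} failed")
    # What the observed failure fractions imply per year.
    for observed in (0.25, 0.04):
        print(f"{observed:.0%} failed by year 4 -> implied AFR {implied_afr(observed, 4):.2%}")
```

A 0.32% rated AFR predicts roughly 1.3% of drives failed after four years, while a 25% observed failure rate implies close to 7% per year, about twenty times the rated figure.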

When it comes to SSDs, very few organizations have enough of them to draw any statistically meaningful conclusions.  Surprisingly, Amazon is now the largest consumer of flash storage in the world.  Their implementation is fairly recent and they’re incredibly tight-lipped about their AWS environment, so don’t expect much information to come out of that.  Facebook and Carnegie Mellon University (CMU) collaborated to bring us the first large-scale study of flash memory failures in the field, and probably the best independent look into real-world NAND failure rates that we’ll see for a while.

Facebook and CMU found several key points that haven’t previously been discussed in SSD reliability trends.  First, flash-based SSDs go through several distinct reliability periods that correspond to how failures emerge and are detected, during which failure rates can vary by up to roughly 82%.  Age is not necessarily the determining factor in the likelihood of SSD failure, as was previously thought.  Read disturbance errors (errors induced in neighboring pages by a read) are not prevalent in the field, and high fill rates do not correspond to any significant change in failure rates.  Sparse logical data layout across an SSD’s physical address space (non-contiguous data in NAND), combined with small sparse writes (which cause a large number of costly P/E operations), negatively affects SSD reliability.  Something that was already known and proven is that higher temperatures lead to higher failure rates, unless the SSD has a controller that throttles based on internal temperature.  Lastly, the amount of data that an operating system writes to the SSD is not the same as the amount of data that is eventually written to the NAND.  This is due to system-level buffering and to wear reduction techniques employed in the SSD’s software stack at the controller level.  Controllers that minimize writes to the NAND effectively lengthen the lifespan of the SSD.  Another interesting takeaway from the data is that SSD NAND failures are fairly common events, with between 4.2% and 34.1% of the SSDs reporting uncorrectable errors, which lends further weight to the argument for eMLC overprovisioning to reduce SSD failure rates.  When uncorrectable errors occur in NAND cells, those cells are marked bad and other cells from the spare pool take their place.  eMLC drives have more spare cells available for this than consumer-grade SSDs.
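The spare pool idea can be pictured with a toy model.  This is only a sketch: real controllers do this in firmware at a much finer grain, and the class name, block counts and spare counts below are hypothetical numbers picked to show why a larger spare pool tolerates more bad blocks.

```python
class SpareBlockPool:
    """Toy model of bad-block remapping inside an SSD (illustrative only)."""

    def __init__(self, user_blocks: int, spare_blocks: int):
        self.remap = {}  # retired block -> spare block that replaced it
        # Spare blocks are numbered after the user-visible blocks.
        self.free_spares = list(range(user_blocks, user_blocks + spare_blocks))

    def mark_bad(self, block: int) -> bool:
        """Retire a failed block; return False once the spare pool is exhausted."""
        if block in self.remap:
            return True  # already retired and remapped
        if not self.free_spares:
            return False  # nothing left to substitute
        self.remap[block] = self.free_spares.pop()
        return True

    def resolve(self, block: int) -> int:
        """Return the block that actually backs this address."""
        return self.remap.get(block, block)


if __name__ == "__main__":
    # Hypothetical drives: same user capacity, different spare pools.
    for name, spares in (("small spare pool", 7), ("large spare pool", 28)):
        pool = SpareBlockPool(user_blocks=100, spare_blocks=spares)
        survived = sum(pool.mark_bad(b) for b in range(10))
        print(f"{name}: remapped {survived} of 10 failed blocks")
```

With ten failed blocks, the drive with 7 spares can only remap seven of them, while the drive with 28 spares absorbs all ten and still has room left.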

This proves yet again that not all All Flash Arrays (AFAs) are created equal.  If you are not tuning your software stack for these inherent characteristics of flash, you will experience higher SSD failure rates.  EMC’s XtremIO has some interesting features to combat these characteristics of SSDs.  First of all, sparse logical data layout is handled at the array level with a couple of neat tricks to avoid some of the problems that crop up.  In-memory metadata operations reduce the number of I/Os that must go to flash when performing standard data movement functions, such as defragmenting database tables and indices, updating data that’s already written, and provisioning new VMs.  Combined with inline global data reduction (deduplication), thin provisioning and XDP (flash-specific data protection), this means only unique data is written to the SSDs in the array.  It’s important to think about how non-contiguous data is handled when these three features are in play.  Data interspersed with “white space” becomes contiguous, since blank blocks are not stored and are skipped in the write sequence to the internal SSDs.  Another important contributor to SSD failure is the number of P/E (program and erase) cycles the NAND endures.  XtremIO reduces wear by reducing the P/E cycles the NAND is exposed to in everyday use.  By keeping metadata in memory, it avoids exposing NAND cells to the constant wear of metadata updates, as happens in traditional AFAs that store this metadata on flash.  Further, inline deduplication spares the NAND the constant erase and write cycles that occur when a block of data is changed slightly (since NAND can’t simply overwrite an already written page without first erasing the block that contains it), further lengthening the life of the internal SSDs.  Lastly, XtremIO further reduces NAND P/E cycle wear by pairing inline deduplication with full VMware VAAI integration to reduce the number of storage operations occurring at the array.
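To illustrate the general idea of inline deduplication with in-memory metadata, here is a minimal sketch.  It is not XtremIO’s implementation: the class, the 4 KB block size and the choice of SHA-256 fingerprints are all assumptions made for the example.  It only shows how duplicate and blank blocks can be resolved in memory so that they never reach flash.

```python
import hashlib

BLOCK_SIZE = 4096  # bytes; illustrative block size


class DedupStore:
    """Toy content-addressed block store (hypothetical, for illustration)."""

    def __init__(self):
        self.fingerprints = {}  # fingerprint -> physical index (in-memory metadata)
        self.physical = []      # stands in for blocks actually written to flash
        self.logical = {}       # logical address -> physical index, or None if blank

    def write(self, lba: int, data: bytes) -> bool:
        """Store one block; return True only if flash had to be written."""
        if not any(data):  # all-zero "white space" is recorded, not stored
            self.logical[lba] = None
            return False
        fp = hashlib.sha256(data).digest()
        idx = self.fingerprints.get(fp)
        if idx is None:  # first time this content is seen
            self.physical.append(data)
            idx = len(self.physical) - 1
            self.fingerprints[fp] = idx
            wrote = True
        else:  # duplicate: a metadata update only
            wrote = False
        self.logical[lba] = idx
        return wrote

    def read(self, lba: int) -> bytes:
        idx = self.logical.get(lba)
        return bytes(BLOCK_SIZE) if idx is None else self.physical[idx]


if __name__ == "__main__":
    store = DedupStore()
    blocks = [b"A" * BLOCK_SIZE, b"A" * BLOCK_SIZE, bytes(BLOCK_SIZE), b"B" * BLOCK_SIZE]
    flash_writes = sum(store.write(lba, data) for lba, data in enumerate(blocks))
    print(f"{len(blocks)} logical writes -> {flash_writes} flash writes")
```

In this example four logical writes result in only two writes to the simulated flash, because the duplicate block and the blank block are handled entirely in the in-memory maps.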

Some consumer-class drives don’t properly implement thermal protection in their software, whereas most enterprise-grade SSDs mitigate thermal effects by implementing safeguards to protect the controller and NAND cells.  Using SSDs that lack these protection mechanisms negatively impacts SSD life and data integrity.  Furthermore, in the case of an SSD failure, XDP in XtremIO has a few tricks to make your life easier.  XDP allows you to leave failed SSDs in place until a convenient time to replace them, without negatively impacting the rest of the array.  When you do replace the failed SSD, XDP provides incredibly fast rebuild times with minimal impact on your running workloads.

Every bit of information we’re learning about flash technology tells us that we can’t rely on old industry axioms when leveraging flash.  Over time it will become more and more apparent that not all AFAs (All Flash Arrays) are created equal.  Vendors who build to flash’s strengths and avoid the pitfalls flash introduces will deliver better-performing, more reliable arrays.  Simple things like the location of metadata on physical flash can greatly reduce the lifespan of the NAND inside the SSDs in an AFA, and therefore negatively impact the reliability of the entire array.  The biggest takeaway from the great work done by Facebook and Carnegie Mellon University is that flash isn’t inherently more reliable than traditional disks, but it can be.  If you want to make sure your investment in flash gives you all the benefits without the pitfalls, come talk to us about XtremIO and what it can do for your data needs.

Jose Adams, Engineer