Enterprise Reliability: Fusion's Adaptive Flashback Protection
Posted: 02/08/2012
Enterprises have long sought to take advantage of the speed, size, low power requirements, and high performance of NAND flash technology. A typical objection to NAND flash has been the perceived unreliability of the medium. Fusion-io has eliminated this barrier by inventing a self-healing technology known as Adaptive Flashback Protection. This technology continuously protects data from being lost in the flash-based storage subsystem.
Like all electronic technology, raw NAND flash periodically experiences failure. Random bit flips, whether permanent or transient, and whole-die failures can and do occur, and must be effectively addressed in storage subsystem design. In mission-critical environments, an uncorrected error could crash an operating system or compromise the integrity of highly sensitive data.
The RAID Band-Aid Approach
Most SSD or PCIe flash vendors handle this problem in the same way as in disk drive-based storage systems – through RAID redundancy (either RAID 1 or RAID 5), at the system-level. While physical RAID can provide redundancy, RAID 1 implementations sacrifice half the usable capacity, and RAID 5 implementations reduce performance to a fraction of the potential. However, performance specifications are generally quoted using a RAID 0 setup, which provides no redundancy at all and is unrealistic in any kind of enterprise workload.
Further, when an unprotected failure occurs, the entire device is corrupted and must be repaired or replaced, causing server and storage downtime that enterprises can ill afford, along with a replacement process that is often complicated, difficult, and expensive (or at least inconvenient).
Legacy RAID was designed with disk-drive failure semantics in mind. As a result, most RAID technology is not optimized for NAND Flash: not for its performance, not for the types of failures that may occur in it, not for recovery mechanisms better suited to solid-state storage subsystems, and not for the lower failure rates of non-electro-mechanical storage compared with electro-mechanical devices such as HDDs. RAID adds unnecessary complexity, “long-in-the-tooth” software stacks that are prime candidates for “re-factoring,” additional points of failure in functions irrelevant to NAND Flash-based media, and constraints on system performance and capacity.
How Adaptive Flashback is Different
That’s why we created Adaptive Flashback Protection. Fusion’s patent-pending technology includes built-in redundancy that adds layers of protection against NAND failure in the media itself, not at the system level. Protection operates at the Erase Block level, in NAND Flash parlance. This fine-grained retirement capability lets Adaptive Flashback retire very small fractions of the storage media (e.g., one-four-hundred-thousandth of the capacity!). As portions of NAND (Erase Blocks, for example) fail for whatever reason, we can retire individual Erase Blocks, and there are a large number of them in our design. The net result is very fine-grained retirement with very little impact on device capacity. Contrast this with, for example, a RAID stripe built from disk drives. If one drive fails, a large chunk of the media must be replaced. In a 7+1 RAID 5 configuration, one of the eight drives (one-eighth of the raw media, equivalent to one-seventh of the usable capacity) must be replaced and rebuilt. This large rebuild contributes to an excessive "reconstruction window" that provides an opportunity for a second, catastrophic failure to occur.
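To put the granularity comparison in numbers, here is a small back-of-the-envelope sketch. It assumes a device with 400,000 equally sized Erase Blocks (inferred from the "one-four-hundred-thousandth" figure above, not a published specification) and counts all eight drives of a 7+1 stripe as raw media. The function name is purely illustrative.

```python
from fractions import Fraction


def retired_fraction(units_total: int, units_failed: int = 1) -> Fraction:
    """Fraction of raw media taken out of service when `units_failed` units fail."""
    if units_total <= 0 or not 0 <= units_failed <= units_total:
        raise ValueError("need 0 <= units_failed <= units_total and units_total > 0")
    return Fraction(units_failed, units_total)


# Assumption: 400,000 equally sized Erase Blocks per device.
ERASE_BLOCKS = 400_000
# A 7+1 RAID 5 stripe: seven data drives' worth of capacity plus one of parity.
RAID_DRIVES = 8

block_loss = retired_fraction(ERASE_BLOCKS)  # one Erase Block retired
drive_loss = retired_fraction(RAID_DRIVES)   # one drive failed
ratio = drive_loss / block_loss              # how much coarser the drive failure is

print(f"One Erase Block retired: {float(block_loss):.6%} of raw media")
print(f"One RAID drive failed:   {float(drive_loss):.2%} of raw media")
print(f"The drive failure removes {ratio} times as much media")
```

Under these assumptions a single drive failure takes out 50,000 times as much media as a single Erase Block retirement, which is the gap the reconstruction-window argument rests on.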
Block- and Chip-level Redundancy
Because Adaptive Flashback is implemented as an integrated hardware and software solution, it can readily distinguish between correctable and uncorrectable failures in the raw flash and employ self-healing techniques (i.e., operator free) to maintain system health. Block-level redundancy allows Fusion-io to identify, isolate, and correct localized failure in individual chips. Chip-level redundancy allows Fusion-io to address entire chip failures as well, by removing failed chips from service and substituting with a new chip from Fusion’s reserve buffer as needed. Whole chip failures do not result in the loss of the device or loss of data.
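The decision flow described above can be pictured with a toy model. This is only a sketch under stated assumptions: the class, its method names, and the bookkeeping are all hypothetical and do not represent Fusion-io's firmware or driver interfaces. It shows correctable errors being fixed in place, uncorrectable block errors leading to retirement of just that block, and a whole-chip failure being absorbed by a spare.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FlashDevice:
    """Hypothetical model of block retirement and chip substitution."""

    active_chips: list[str]
    spare_chips: list[str]
    retired_blocks: set[tuple[str, int]] = field(default_factory=set)

    def handle_block_error(self, chip: str, block: int, correctable: bool) -> str:
        if correctable:
            # ECC repairs the data in place; the block stays in service.
            return "corrected in place"
        # Assumption: data is rebuilt from redundancy elsewhere on the device,
        # then only the failing Erase Block is taken out of service.
        self.retired_blocks.add((chip, block))
        return "rebuilt from redundancy, block retired"

    def handle_chip_failure(self, chip: str) -> str:
        if chip not in self.active_chips:
            raise ValueError(f"{chip} is not an active chip")
        if not self.spare_chips:
            raise RuntimeError("no spare chips left in reserve")
        replacement = self.spare_chips.pop()
        # Swap the failed chip for the spare; no operator action is modeled.
        self.active_chips[self.active_chips.index(chip)] = replacement
        return replacement


device = FlashDevice(active_chips=["chip0", "chip1"], spare_chips=["spare0"])
print(device.handle_block_error("chip0", 17, correctable=True))
print(device.handle_block_error("chip0", 42, correctable=False))
print("chip1 replaced by", device.handle_chip_failure("chip1"))
```

The point of the model is that each failure class costs only what it must: nothing for a correctable error, one block for a localized failure, and one spare for a whole-chip failure.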
Powerful ECC at the Speed of Flash
Underlying these fault-tolerance mechanisms lie best-in-class error correction codes (ECC) that incorporate wire-speed correction of multi-bit errors. This data encoding can correct the substantial number of bit errors that arise as a natural function of the signal-to-noise ratio inherent in all digital data storage and communication media. Fusion’s second-generation Adaptive Flashback Protection extends its protection over virtual blocks in the event of any media failure, and maintains protection continuously over the product life.
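For readers unfamiliar with how an error correction code repairs a flipped bit, the sketch below uses the textbook Hamming(7,4) code. This is a deliberately simple stand-in chosen for illustration: it corrects only one bit error per 7-bit codeword, whereas the ECC described above corrects multi-bit errors, and nothing here reflects the actual code used in Fusion-io products.

```python
def hamming74_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword: list[int]) -> tuple[list[int], int]:
    """Return (data bits, syndrome); a nonzero syndrome is the 1-based error position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


original = [1, 0, 1, 1]
stored = hamming74_encode(original)

corrupted = list(stored)
corrupted[4] ^= 1  # simulate a single bit flip in the medium (position 5)

recovered, error_position = hamming74_decode(corrupted)
print("recovered:", recovered, "error at position:", error_position)
```

The decoder locates the flipped bit from the parity checks alone and restores the original data without a second copy, which is the basic idea that stronger multi-bit codes extend.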
Unlike traditional disk-based RAID approaches, Fusion’s flashback protection maintains peak performance and capacity, while simultaneously extending the life of the device far beyond what is possible with other approaches. Flashback protection is one of the many reasons why Fusion-io continues to be the leading provider of enterprise-grade solid-state storage solutions. No other flash vendor offers this level of redundancy, sophistication, and protection. Additionally, this fine-grained, block-level retirement capability assures maximum device capacity even with the conservative retirement policies used in Fusion-io products.