This Insane Server-Attached Flash Thing: PART 2

By September 1, 2016 June 19th, 2017 Blog

STOP! Before you read any further, make sure you check out Part 1 of this series.

I left off the last post with a cliffhanger about some of the limitations we face with Flash storage. The most significant limitation is simply the interface to the SSD. We’re largely still using serial interfaces designed for chaining together spinning disks. The attached devices were always seen as the primary bottleneck, so the design favored signaling, ease of use, and expandability over raw throughput and latency. SAS and SATA (with AHCI as the host controller interface for SATA) were designed to combat some of the inherent limitations of the parallel SCSI and PATA buses they replaced: namely, 16-bit interfaces with limited expandability, signal degradation over distance, and anemic throughput due to legacy bus connectors. Microsecond-scale latency and tens of gigabytes per second of throughput simply weren’t on the radar, and frankly we will need new interfaces going forward to truly see SSDs stretch their legs in the data center.

So what are these interfaces that will help SSDs realize their true potential? The one that seems to have won is NVMe, with a couple of connector options depending on the implementation. The problem here is expandability. NVMe isn’t born from SCSI and wasn’t built with the inherent intent of becoming a multi-target shared protocol. While it shares some similarities with legacy SCSI, it’s no more a SCSI interface than a mule is a horse. They share some genetic code but they aren’t really the same species. Like a mule, the NVMe spec is a hybrid of sorts: it adheres to some of the basic tenets of disk access protocols, but it does things quite a bit differently. This table from Wikipedia highlights some of the stark differences between AHCI and NVMe.

High-level comparison of AHCI and NVMe[4]

Maximum queue depth
    AHCI: One command queue; 32 commands per queue
    NVMe: 65536 queues; 65536 commands per queue

Uncacheable register accesses (2000 cycles each)
    AHCI: Six per non-queued command; nine per queued command
    NVMe: Two per command

MSI-X and interrupt steering
    AHCI: A single interrupt; no steering
    NVMe: 2048 MSI-X interrupts

Parallelism and multiple threads
    AHCI: Requires synchronization lock to issue a command
    NVMe: No locking

Efficiency for 4 KB commands
    AHCI: Command parameters require two serialized host DRAM fetches
    NVMe: Gets command parameters in one 64-byte fetch
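
To make the comparison above a little more concrete, here is a minimal Python sketch that turns the queue and register-access figures into rough numbers. The constant and function names are my own, purely for illustration, and the results are upper bounds derived from the listed figures, not measurements from real hardware.

```python
# Figures copied from the AHCI vs. NVMe comparison above.
# All names here are illustrative, not part of any spec or driver API.
CYCLES_PER_REGISTER_ACCESS = 2000  # cost of one uncacheable register access

AHCI = {
    "queues": 1,
    "commands_per_queue": 32,
    "register_accesses_non_queued": 6,
    "register_accesses_queued": 9,
}
NVME = {
    "queues": 65536,
    "commands_per_queue": 65536,
    "register_accesses": 2,
}


def max_outstanding_commands(queues, commands_per_queue):
    """Upper bound on the number of commands that can be in flight at once."""
    return queues * commands_per_queue


def register_overhead_cycles(accesses_per_command):
    """Cycles spent on uncacheable register accesses for a single command."""
    return accesses_per_command * CYCLES_PER_REGISTER_ACCESS


if __name__ == "__main__":
    print("AHCI max in flight:",
          max_outstanding_commands(AHCI["queues"], AHCI["commands_per_queue"]))
    print("NVMe max in flight:",
          max_outstanding_commands(NVME["queues"], NVME["commands_per_queue"]))
    print("AHCI cycles per queued command:",
          register_overhead_cycles(AHCI["register_accesses_queued"]))
    print("NVMe cycles per command:",
          register_overhead_cycles(NVME["register_accesses"]))
```

By these figures AHCI tops out at 32 commands in flight while NVMe allows over four billion, and a queued AHCI command spends 18,000 cycles on register accesses against 4,000 for NVMe.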

The biggest improvements you’ll see are the maximum queue depth and the lack of locking across threads. Like everything in this world, though, all is not perfect: NVMe has a very significant Achilles heel, and that is expandability. There is no fabric technology for it today, or is there?

EMC has been not so quietly working on a new technology called DSSD. To be fair, a couple of the Flash storage vendors are saying that NVMeF (F for Fabrics) is the right way to go in the future, but only one company has it working today: EMC. So what is DSSD? Well, it’s a few things: a new flash interface, a new networking technology, and possibly a new era in the data center. Currently DSSD will support any RDMA fabric, such as InfiniBand, iWARP, or OmniFabric, and potentially Fibre Channel in the very near future. If you noticed anything from that list, it’s that DSSD currently supports primarily super-scalable HPC fabrics, though I’ll go on record to say that InfiniBand is no longer a prohibitively expensive protocol to implement and is supported extensively not only in Windows and Linux but in VMware ESXi as well. DSSD has quite a few similarities with traditional scale-up array architecture. It has a layer of I/O modules connected to a pair of Control Modules, similar to VMAX architecture, with engines that can communicate with any disk at any time.

Performance numbers are pretty closely guarded by EMC at this time, but they did speak with The Register about the array back in August, and here’s what they had to say:

“Huffman describes a 64-node, hyper-converged system set-up with a 500TB all-flash virtual SAN array. It powered 6,400 VMware virtual machines and achieved 6.7 million random read IOPS, 2.4 million random write IOPS, 70GB/sec sequential reads and 33GB/sec sequential writes.”

That’s about 20x the maximum throughput of a single XtremIO X-Brick, and about 3x the theoretical maximum throughput of 8 X-Bricks. In that same article they show a 64GB copy operation completing in under 1 second. This thing isn’t only fast, it’s revolutionary. NAND promised to bring storage near the performance of in-memory operations, given time. Moore’s law means we will see a continued and drastic increase in performance year over year until manufacturing process limits are hit, and that means this is only going to get faster. DSSD is the future, and the future is now. EMC seemingly just finished the Flash storage race before any other vendors got to the starting line.
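
For a sense of scale, the quoted figures can be broken down per node and per VM. This is a back-of-the-envelope Python sketch only: it assumes the load was spread evenly across all 64 nodes and 6,400 VMs, which the article does not state, and the names are mine.

```python
# Figures from The Register quote above.
# Even distribution across nodes and VMs is an assumption, not a reported fact.
NODES = 64
VMS = 6400
RANDOM_READ_IOPS = 6_700_000
RANDOM_WRITE_IOPS = 2_400_000
SEQ_READ_GB_PER_SEC = 70
SEQ_WRITE_GB_PER_SEC = 33


def per_unit(total, units):
    """Average share of an aggregate figure for one node or one VM."""
    return total / units


if __name__ == "__main__":
    print("VMs per node:", per_unit(VMS, NODES))
    print("Random read IOPS per node:", per_unit(RANDOM_READ_IOPS, NODES))
    print("Random write IOPS per node:", per_unit(RANDOM_WRITE_IOPS, NODES))
    print("Random read IOPS per VM:", per_unit(RANDOM_READ_IOPS, VMS))
    print("Sequential read GB/sec per node:", per_unit(SEQ_READ_GB_PER_SEC, NODES))
    print("Sequential write GB/sec per node:", per_unit(SEQ_WRITE_GB_PER_SEC, NODES))
```

Under that even-split assumption, each node hosts 100 VMs and serves a little over 100,000 random read IOPS and roughly 1GB/sec of sequential reads.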

The last thing I want to touch on in this already long blog post is how to actually benefit from flash storage. You certainly can just move data and take advantage of the far lower access times provided by SSDs, but that’s only skimming the surface of what’s possible with flash. The more you can parallelize workloads on your flash arrays, the more benefit you get from flash today. With Fibre Channel and iSCSI having been designed primarily for spinning disks, single-client performance is still going to be largely bottlenecked at the network. If you aren’t scaling out your intensive workloads, you aren’t leveraging one of the greatest benefits of flash storage: parallelism in workloads. Things like data cubes in OLAP, SQL Always On Active/Active-Read-Only for OLAP reporting, shards for OLTP workloads across distributed clusters, load-balanced front-end servers with Citrix NetScalers, and a variety of other techniques can be leveraged to further improve the end-user experience with flash storage.
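
As a small illustration of what parallelizing looks like at the application level, here is a minimal Python sketch that issues many 4 KB reads concurrently instead of one at a time, which keeps more requests outstanding against the device. The function names, block size, and worker count are my own assumptions for the example. It relies on os.pread, which is available on Linux and other Unix-like systems but not on Windows, and because the demo file will be served from the page cache it shows the access pattern only and is not a benchmark.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # illustrative 4 KB block size


def read_block(fd, index):
    """Read one block at the given block index without moving the file offset."""
    return os.pread(fd, BLOCK_SIZE, index * BLOCK_SIZE)


def parallel_read(path, num_blocks, workers=8):
    """Read num_blocks blocks concurrently; results come back in block order."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda i: read_block(fd, i), range(num_blocks)))
    finally:
        os.close(fd)


def make_demo_file(num_blocks):
    """Create a temporary file in which block i is filled with the byte i % 256."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(num_blocks):
            f.write(bytes([i % 256]) * BLOCK_SIZE)
        return f.name


if __name__ == "__main__":
    path = make_demo_file(64)
    try:
        blocks = parallel_read(path, 64)
        print("blocks read:", len(blocks))
    finally:
        os.remove(path)
```

The same idea applies at larger scales: shards, read-only replicas, and load-balanced front ends all spread requests so that many of them are in flight against the flash at once.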


Jose Adams, Engineer