Part 1 … the server and cluster
Since time immemorial, we have used the SCSI-based file stack to define how we talk to drives. Mature but verbose, it was an ideal match for single-core CPUs and slow interfaces to even slower hard drives. With that stack, it was perfectly acceptable to initiate an I/O and then switch to another process, since the I/O took many milliseconds to complete.
The arrival of flash drives upset this applecart completely. IOPS per drive grew by 1000X in short order and neither SCSI-based SAS nor SATA could keep up. The problem continues to get worse, with the most recent flash card leader, Smart IOPS, delivering 1.7 million IOPS, a 10-fold further increase.
The industry’s answer to this performance problem is to replace SAS and SATA with PCIe, and the protocol with NVMe. This gives us a solution where multiple ring buffers hold queues of storage operations, with queues dedicated to individual cores or even applications. A batch of operations can be pulled from a queue and processed by the drive using DMA techniques. On the return side, response queues are likewise built up and serviced by the appropriate host context. Interrupts are coalesced so that one interrupt services many responses.
NVMe is a huge step forward in storage interfacing and has already largely displaced SAS as the interface in fast servers. But it only fixes the I/O stack south of the file-system layer, leaving us with many outmoded structures that beg for cleanup.
Let’s take a common example. RAID arrays used to sit at the top of the storage tree. You’d connect via Fibre Channel to a box with a controller fronting a lot of drives. Faster CPUs led to soft-RAID solutions, which connected drives directly to the server and used an additional layer in the file stack to provide mirroring and striping.
With NVMe, the soft-RAID approach can be much simpler, since data moves directly to and from application space via DMA. This removes the multiple address redirections and command chains of SCSI and speeds up I/O even further.
But why stop there? Object storage systems distribute their data blocks in a similar fashion to RAID, but the result is far more robust, since it invokes appliance redundancy as well as drive redundancy. Why not have the server determine the end storage points for data blocks, instead of an intermediate appliance such as a Ceph node? In fact, Ceph supports the structure to do this, though it isn’t clear whether the tie-ins to the OS are optimized, since Ceph runs a REST interface.
But let’s swallow the whole enchilada! We are moving to an era of huge data pools, created through object stores or hyper-converged code such as Nutanix. One of our challenges is how to partition tens of petabytes of storage into useful spaces, each characterized as a file system, a NAS extent, or an object storage pool. All of this requires automated orchestration to be agile enough for future system clusters.
Most of the storage code would now run in containers on the server cluster, while I/O operations could be chained together to deliver much more complex data transformations (think indexing, compression, deduplication, and so on).
Storage code in appliances and drives would migrate to the server farm, too. Microsoft’s Project Denali, which just surfaced, conceives of near-bare SSDs, with much of today’s drive code running in host instances. Drives would handle only primitive block translation, failure detection, and the like. Drives should become cheaper, and the time to write and certify drivers much shorter.
Host-based compression will reduce data transmission and storage by a typical factor of five, effectively boosting I/O tremendously without buying more hardware. Source-side encryption will make data much more secure, while erasure coding in the server will reduce network traffic even further.
Here’s where storage makes a quantum leap! Byte-addressable persistent memory is coming fast to market. Intel and Micron expect to ship later in 2018, with the ability to write a single persistent word from a CPU instruction. This is a game-changer: we are talking about one instruction replacing thousands in the I/O stack, as well as all the data movement of a 4KB block.
The problem, of course, is how to create a transportable structure to hold these single words, one that can move from app1 to app2, be copied to other storage spaces, and so on. This is a massive overhaul of software, from applications to compilers and operating systems. The benefits are potentially huge, but I suspect the industry will be working on the code for the best part of the next decade.
You can see that change is in the air! One thing is certain – what I’ve described will differ from the final result, since there are just too many places that value-added changes can occur. We do live in interesting times!