It’s not often I can write about two dissimilar views of the same technology, but recent moves in the industry on the A.I. front mean that not only does storage need to align with A.I.’s needs better than any traditional storage approach can, but the rise of software-defined storage concepts makes A.I. an inevitable choice for solving advanced storage problems. The result is this article on “Storage for A.I.” and the second part of the story on “A.I. for Storage”.
The issue is delivery. A.I. is very data-hungry: the more data it sees, the better its results. Traditional storage, the world of RAID and SAN, iSCSI and arrays of drives, is a world of bottlenecks, queues and latencies. There’s the much-layered file stack in the requesting server, protocol latency, and then the ultimate choke point, the array controller.
That controller can talk to 64 drives or more, via SATA or SAS, but typically has output equivalent to perhaps 8 SATA ports. This didn’t matter much with HDDs, but SSDs can deliver data much faster than spinning rust, so we have a massive choke point just in funneling all those drive streams down to the array’s output-port capability.
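To put rough numbers on that choke point, here’s a back-of-the-envelope sketch. The figures are round, hypothetical values chosen for illustration, not a measurement of any real array:

```python
# Hypothetical round numbers: aggregate drive bandwidth vs. controller egress.
DRIVES = 64
SSD_MBPS = 500      # assume a SATA SSD can stream roughly this much
PORTS = 8
PORT_MBPS = 600     # assume a SATA 3 port, ~600 MB/s theoretical

ingest = DRIVES * SSD_MBPS   # what the drives can collectively deliver
egress = PORTS * PORT_MBPS   # what the controller can pass through

oversubscription = ingest / egress   # how badly the drives outrun the ports
```

With these assumed numbers the drives can source 32 GB/s while the controller can only pass about 4.8 GB/s, a choke of nearly 7x before any protocol or stack latency is counted.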
There’s more! That controller is in the data path and data is queued up in its memory, adding latency. Then we need to look at the file stack latency. That stack is a much-patched solution with layer upon layer of added functionality and virtualization. In fact, the “address” of a block of data is transformed no less than 7 times before it reaches the actual bits and bytes on the drive. This was very necessary for the array world, but solid state drives are fundamentally different and simplicity is a possibility.
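To make that layering concrete, here’s a toy sketch of the kind of translation chain involved. The layer names, block sizes and offsets below are all invented for illustration; they do not enumerate the actual seven transformations of any particular stack:

```python
BLOCK = 4096  # illustrative 4 KB block size used by every layer

# Each layer re-maps the address handed to it. All offsets are made up.
def file_to_fs_block(file_offset):        # file offset -> filesystem block
    return file_offset // BLOCK

def fs_to_volume(fs_block):               # fs block -> logical-volume block
    return fs_block + 2048                # skip a (made-up) metadata region

def volume_to_raid(vol_block, stripe=8):  # volume block -> (disk, stripe block)
    return vol_block % stripe, vol_block // stripe

def raid_to_lba(disk, blk):               # stripe block -> per-drive LBA
    return blk + 34                       # made-up partition offset

def lba_to_flash_page(lba):               # drive LBA -> FTL physical page
    return (lba * 7919) % (1 << 20)       # toy flash-translation remap

def resolve(file_offset):
    """Walk the whole chain, as the file stack does on every I/O."""
    disk, blk = volume_to_raid(fs_to_volume(file_to_fs_block(file_offset)))
    return disk, lba_to_flash_page(raid_to_lba(disk, blk))
```

Every hop is cheap by itself; it is the sum of the hops, run on every single I/O, that adds up.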
Parsing all of this adds processing delays. It’s fundamentally cumbersome, and it reflects the industry’s unwillingness to look at the big picture frequently enough, a problem we see with data structures, protocols and topologies everywhere.
Back to A.I.’s insatiable need for data! Let’s throw the tradition of the SAN structure aside and look elsewhere for an answer to a high-throughput, low-latency storage solution. A good starting point is to move to an NVMe interface, using RDMA to take stacks out of the data-moving path. This is a roughly four-year-old technology and it’s pretty mature, with huge gains in throughput and latency. NVMe has all but killed off the SAS enterprise drive business.
Now, we are moving to NVMe-over-Ethernet as a way to make all drives in a cluster shareable to all servers. This is another big step in architectures, bolstering the hyper-converged model for the datacenter no end. Companies like Excelero are pushing hard to make this a preferred solution and it’s generally being very well received.
The problem is that it isn’t enough! NVMe still reports back to the file stack, which is still pretty cumbersome. It does avoid the SCSI layer, which is a considerable saving, but addresses still need translation.
We need to tackle the file stack itself. This is becoming an urgent problem outside of the A.I. space. The collision of NVDIMM technology with ultrafast Optane or Z-NAND NV memory will bring us byte addressability for stored data. To get any advantage from that capability, we need to look at the implications a bit.
Byte-addressed NV memory is persistent storage that can be read or written with a single simple CPU instruction such as move-register-to-memory. That’s a few nanoseconds per I/O operation. The file stack, on the other hand, thinks in virtualized 4KB blocks, so not only is there a long transfer time for the block itself; to change a single byte we must read the whole block, modify it, and write it back.
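A minimal sketch of that difference, using a plain mmap’ed file as a stand-in for byte-addressable NV memory. The file, path and values are invented for the demo, and the flush call is doing the job a real persistence barrier (CLWB/SFENCE on an NVDIMM) would do:

```python
import mmap, os, tempfile

BLOCK = 4096  # the file stack thinks in 4 KB blocks

# A scratch file standing in for a persistent-memory device (hypothetical).
path = os.path.join(tempfile.mkdtemp(), "pmem.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * BLOCK)

# Byte-addressable model: map the "device" and store a single byte in place.
with open(path, "r+b") as f:
    mem = mmap.mmap(f.fileno(), BLOCK)
    mem[100] = 0x42    # one store's worth of work
    mem.flush()        # persistence barrier
    mem.close()

# Block model: to change one byte, read the whole 4 KB block,
# modify it in memory, and write the whole block back.
with open(path, "r+b") as f:
    block = bytearray(f.read(BLOCK))
    block[101] = 0x43
    f.seek(0)
    f.write(block)
```

The first path touches one byte; the second moves 8 KB of data and runs the full file stack twice to achieve the same result.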
File I/O takes a few microseconds and is pedestrian compared with the register-memory operation, but the thousands of instructions in the file stack also add latency, while the current practice is to disconnect from the I/O until it flags completion, a holdover from HDD times. Disconnects involve a state change in the CPU which takes quite a while to complete, yet another time-waster. Clearly, for the ultimate in performance, we need some sort of memory addressing scheme that uses global byte addresses.
We are beginning to see the first steps in that thought process with an architectural concept promulgated by the Gen-Z consortium (and others). Here, the server memory is a resource on a fabric, along with all the (RDMA) storage devices. While this is still an evolving concept, making the physical addressing schema a flat memory space becomes a possibility, with 2^64 discrete addresses initially.
I suspect that this is way too simplistic, though. Memory devices (CPU cache, DRAM, Optane NVDIMM, Optane SSD and NAND) all have different characteristics and latencies. You wouldn’t have an F1 driver race in a production sedan, or put a refrigerator in the back of a Ferrari! Device and locality knowledge are likely part of any scheme … maybe some of the address bits are used for a device ID, or we go to a 128-bit address scheme up front. That’s all speculation, of course.
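To make the device-ID speculation concrete, here is one hypothetical way to carve a flat 64-bit global address into a device field and a byte offset. The 8/56 split, and the functions themselves, are arbitrary assumptions for illustration:

```python
DEVICE_BITS = 8                       # hypothetical: 8 bits of device ID...
OFFSET_BITS = 56                      # ...and 56 bits of byte offset
OFFSET_MASK = (1 << OFFSET_BITS) - 1

def pack(device_id: int, offset: int) -> int:
    """Encode a (device, offset) pair as one flat 64-bit global address."""
    assert 0 <= device_id < (1 << DEVICE_BITS)
    assert 0 <= offset <= OFFSET_MASK
    return (device_id << OFFSET_BITS) | offset

def unpack(addr: int) -> tuple[int, int]:
    """Recover the device ID and byte offset from a global address."""
    return addr >> OFFSET_BITS, addr & OFFSET_MASK

# e.g. device 3 (say, an Optane NVDIMM tier), byte offset 0x1000
addr = pack(3, 0x1000)
```

The fabric could then route on the device field alone, while 56 bits still leaves 64 petabytes of byte-addressable space per device.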
Beyond physical addresses, the challenge is to keep the “file system” as thin as possible. This is the mechanism for keeping objects in tidy bundles, but it also separates tenants and provides a vehicle for multi-tenant sharing of the storage pool when combined with access management such as Active Directory. The S3 REST model is perhaps a good starting point. It’s a flat system, but can handle buckets and devices.
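A thin, flat S3-style namespace might look something like the sketch below. The class, method names and tenant checks are all hypothetical, with the simple ACL map standing in for Active Directory-style access management:

```python
class FlatStore:
    """Minimal sketch of a thin, flat S3-like namespace: no directory
    hierarchy, just (bucket, key) -> object, with per-tenant bucket
    ownership providing the multi-tenant separation."""

    def __init__(self):
        self._objects = {}   # flat map of (bucket, key) -> bytes
        self._acl = {}       # tenant -> set of buckets it owns

    def create_bucket(self, tenant, bucket):
        self._acl.setdefault(tenant, set()).add(bucket)

    def put(self, tenant, bucket, key, data: bytes):
        if bucket not in self._acl.get(tenant, set()):
            raise PermissionError(f"{tenant} cannot write {bucket}")
        self._objects[(bucket, key)] = data

    def get(self, tenant, bucket, key) -> bytes:
        if bucket not in self._acl.get(tenant, set()):
            raise PermissionError(f"{tenant} cannot read {bucket}")
        return self._objects[(bucket, key)]
```

There are no nested directories to walk and no per-level address translation, just one lookup and one access check, which is the thinness the paragraph above is after.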
Keeping track of all of this, while troubleshooting and optimizing the storage pool, will be a complex problem that requires high levels of automation and very agile, rapid response to any issues. That’s the subject of the second part in this series … A.I. for Storage. See you there!