Enmotus Blog

Information Storage – A truly novel concept

Posted by Jim O'Reilly on Oct 17, 2017 9:37:29 AM

When you see “storage” mentioned, it’s often as “data storage”. The implication is that there is nothing informational in the “data”, which even on a literal reading is clearly no longer true. Open the storage up, of course, and the content is a vast source of information, both mined and unmined, but our worldview of storage has been to treat objects as essentially dumb, inanimate things.

This 1970s view of storage’s mission is beginning to change. The dumb storage appliance is turning into smart software-defined storage (SDS) services running in virtual clusters or clouds, with direct access to storage drives. As this evolution to SDS has picked up momentum, pioneers in the industry are taking a step beyond and looking at ways to extract useful information from what is stored. They want to use that information to manage the information lifecycle in new ways, to protect integrity and security, and to provide information-centric guidance that assists processing and steers the other activities around the object.

This information storage idea stems from a concept called extended metadata. In fact, we’ve always had metadata. It’s the set of values that define a file’s attributes, for example, but that set is defined by the operating system. What distinguishes the extended metadata concept is that users and apps can add to the metadata, usually in a key/value format, and that there is no formal boundary to what can be added.
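To make the distinction concrete, here is a minimal Python sketch. The `StoredObject` class and its field names are hypothetical, invented for illustration rather than taken from any real storage product. It simply shows fixed system metadata sitting alongside open-ended key/value pairs that users and apps can add.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Hypothetical stored object: system metadata plus extended metadata."""
    name: str
    data: bytes
    # System metadata: a fixed set of attributes the platform defines.
    system_meta: dict = field(default_factory=dict)
    # Extended metadata: arbitrary key/value pairs added by users and apps.
    extended_meta: dict = field(default_factory=dict)

    def __post_init__(self):
        # The platform, not the user, fills in attributes such as size.
        self.system_meta.setdefault("size", len(self.data))

    def tag(self, key, value):
        """Attach or overwrite one extended metadata entry."""
        self.extended_meta[key] = value


doc = StoredObject("contract.txt", b"terms and conditions")
doc.tag("retention_days", 2555)
doc.tag("keywords", ["contract", "legal"])
```

Nothing restricts which keys `tag` accepts, which is the whole point: the boundary of the metadata is set by the users and apps, not by the operating system.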

We’ve already seen retention and compression policies attached to storage, and we can expect coverage to expand to include tiering and access control in the near future. Using metadata to control the objects allows much more automation, while policies ensure consistent automatic application. Clearly, with storage expected to expand dramatically to fill all those 100TB SSDs, this is a necessary step in storage evolution.
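A rough sketch of what metadata-driven policy might look like, in Python. The key names `compress` and `retention_days` and the function `plan_actions` are assumptions made up for this example; real products define their own policy vocabularies.

```python
def plan_actions(extended_meta, age_days):
    """Decide what to do with an object purely from its metadata.

    `extended_meta` is a dict of hypothetical policy keys;
    `age_days` is how long the object has been stored.
    """
    actions = []
    if extended_meta.get("compress", False):
        actions.append("compress")
    retention = extended_meta.get("retention_days")
    # Objects with no retention key are kept indefinitely.
    if retention is not None and age_days > retention:
        actions.append("delete")
    return actions
```

Because the decision depends only on the metadata, the same function can be run over every object in the store and will apply the policy the same way each time.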

Why stop at data service policies, though? Suppose, for example, you are a legal firm. Indexing all of a disclosed data set is standard procedure. It’s done on a monolithic scale, with the result being a report on the whole dataset, but it can take a long time to process. A better model might be to make the index an attribute of each folder of files and process folders in parallel. By concatenating the indices, we end up with the same result, but we also have the option of much faster searches of the smaller datasets.
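The folder-at-a-time approach can be sketched in a few lines of Python. This is a toy under stated assumptions: the folders are in-memory dicts, the "index" is a simple word-to-files map, and the names `index_folder` and `merge_indices` are invented for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_folder(folder):
    """Build an inverted index (word -> set of paths) for one folder."""
    name, files = folder
    index = defaultdict(set)
    for file_name, text in files.items():
        for word in text.lower().split():
            index[word].add(f"{name}/{file_name}")
    return name, dict(index)


def merge_indices(folder_indices):
    """Combine per-folder indices into one index for the whole data set."""
    merged = defaultdict(set)
    for index in folder_indices.values():
        for word, paths in index.items():
            merged[word] |= paths
    return dict(merged)


# Toy data set standing in for a disclosed document collection.
folders = {
    "emails": {"a.txt": "merger terms attached", "b.txt": "lunch on friday"},
    "contracts": {"c.txt": "Merger agreement draft"},
}

# Each folder is indexed independently, so the work can run in parallel.
with ThreadPoolExecutor() as pool:
    folder_indices = dict(pool.map(index_folder, folders.items()))

global_index = merge_indices(folder_indices)
```

A search for "merger" can go to `global_index` for the full picture or to `folder_indices["contracts"]` alone when only that folder matters.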

This concept might make even more sense in a corporate environment using email or Word. If the index metadata is created when the document is saved, it is available immediately to be merged into the search base. Moreover, as metadata, it is actually part of the file and is retained with it until the file itself is deleted.

A real problem with documents is finding them at a later time. Extended metadata provides a mechanism for keyword definition, including a primary filing location. Applied across all employees, this gives a standard way to place the reference copy and to find the file.

Security gives us another example of the use of extended metadata. We need to know when a file is “touched” by a user. It would be useful to first identify whether that user can read or write to the file (a metadata list of authorizations), and then to log every subsequent change. In an environment where data is deduplicated, this gives us a single-point history of the file. Add in snapshot recording of the actual changes, and we can recover any version quickly.
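Here is one way that flow might look in Python. The `AuditedFile` class is hypothetical and kept deliberately small: the authorization lists and the access log live in the file's metadata, and each successful write first snapshots the previous content.

```python
from datetime import datetime, timezone


class AuditedFile:
    """Hypothetical file whose metadata holds authorizations and a log."""

    def __init__(self, name, readers, writers):
        self.name = name
        self.content = ""
        self.versions = []  # snapshots of earlier content, oldest first
        self.meta = {"readers": set(readers), "writers": set(writers), "log": []}

    def _log(self, user, action, allowed):
        # Every touch is recorded, whether or not it was permitted.
        self.meta["log"].append({
            "user": user,
            "action": action,
            "allowed": allowed,
            "time": datetime.now(timezone.utc).isoformat(),
        })

    def read(self, user):
        allowed = user in self.meta["readers"]
        self._log(user, "read", allowed)
        if not allowed:
            raise PermissionError(f"{user} may not read {self.name}")
        return self.content

    def write(self, user, new_content):
        allowed = user in self.meta["writers"]
        self._log(user, "write", allowed)
        if not allowed:
            raise PermissionError(f"{user} may not write {self.name}")
        self.versions.append(self.content)  # snapshot before overwriting
        self.content = new_content


report = AuditedFile("plan.doc", readers={"ann", "bob"}, writers={"ann"})
report.write("ann", "draft 1")
report.write("ann", "draft 2")
try:
    report.write("bob", "tampered")
except PermissionError:
    pass  # the denied attempt is still in the log
```

After this runs, `report.meta["log"]` holds three entries, the last one a denied write by bob, and `report.versions` holds the earlier contents for recovery.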

There’s more, and we’ll get to that, but you might be saying at this point that the cost of storage is too high, or the data set is going to really grow. Adding extended metadata like this does indeed expand the storage required, but storage is morphing fast and drive capacities are looking to the stratosphere. 100 TB SSD units have been announced, likely aiming for mid-2018, and I have an Israeli buddy who has a technology for doubling even that.

Intel is discussing a 1U box that will hold 32 “ruler” drives (an elongated M.2 format) and that will hold a petabyte or maybe more. But there is much more! All that extra bandwidth in SSDs allows background compression of data. Compression is a repetitive process that looks for common strings of bytes in a pile of data. With commercial data, that usually yields a 5X reduction in space needed, so that 1U box is now 5 petabytes!
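As a back-of-envelope check on those numbers, here is the arithmetic in Python. It assumes decimal units (1 PB = 1000 TB) and takes the petabyte and 5X figures at face value. The per-drive capacity is just what those figures imply, not a published spec.

```python
DRIVES_PER_BOX = 32        # "ruler" drives in the 1U box
RAW_CAPACITY_PB = 1.0      # "a petabyte or maybe more"; lower bound used here
COMPRESSION_RATIO = 5      # typical reduction quoted for commercial data

# Capacity each drive must supply for the box to reach 1 PB raw.
per_drive_tb = RAW_CAPACITY_PB * 1000 / DRIVES_PER_BOX

# Effective capacity once data is stored compressed.
effective_pb = RAW_CAPACITY_PB * COMPRESSION_RATIO

print(f"{per_drive_tb} TB per drive, {effective_pb} PB effective")
```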

The bottom line is that storage will be at cornucopia levels! That opens up a new vista of use for extended metadata. First, the library of compression primitives becomes a metadata attribute of the folder and when any file is used it points back to the original folder, maintaining consistency. This is way better than global libraries, both for compression and rehydration. A benefit of this compression is that reading the object is much faster and uses, say, a fifth of the bandwidth.
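A small Python sketch of the per-folder idea, using the preset-dictionary feature of the standard `zlib` module as a stand-in for a "library of compression primitives". The `folder_meta` dict and the helper names are assumptions for illustration. A real system would build its dictionary from the folder's actual contents.

```python
import zlib


def compress_with_dict(data, folder_dict):
    """Compress data against the folder's shared dictionary."""
    compressor = zlib.compressobj(zdict=folder_dict)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob, folder_dict):
    """Rehydrate data; the same folder dictionary must be supplied."""
    decompressor = zlib.decompressobj(zdict=folder_dict)
    return decompressor.decompress(blob) + decompressor.flush()


# Hypothetical folder metadata holding byte strings common to its files.
folder_meta = {
    "compression_dict": b"Dear customer, thank you for your order. Regards, Sales"
}

letter = b"Dear customer, thank you for your order of 3 widgets. Regards, Sales"
packed = compress_with_dict(letter, folder_meta["compression_dict"])
restored = decompress_with_dict(packed, folder_meta["compression_dict"])
```

Each compressed file has to point back to its folder because `decompress_with_dict` cannot rehydrate the data without the exact dictionary used to pack it.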

Second, the availability of a huge amount of space allows us to focus on optimizing storage flow and use. Suppose we capture statistics about access times, the apps using the data, errors that occur and bottlenecks in the network. These are all points of knowledge for traffic analytics, allowing rough spots in the system to be identified and corrected. This would be the difference between the Los Angeles rush hour as it is today and a future with autonomously-driven vehicles and a central AI guidance system. In the future system, traffic is diverted around chokepoints and cars are kept much closer to each other.
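As a simple illustration of that kind of analytics, the Python sketch below averages captured access latencies per object and flags the slow ones. The log format, the `find_hotspots` name and the threshold are all assumptions for the example.

```python
from collections import defaultdict
from statistics import mean


def find_hotspots(access_log, threshold_ms):
    """Return objects whose average access latency exceeds the threshold.

    `access_log` is a list of dicts with hypothetical keys
    "object" and "latency_ms", as might be captured in metadata.
    """
    by_object = defaultdict(list)
    for entry in access_log:
        by_object[entry["object"]].append(entry["latency_ms"])
    averages = {obj: mean(values) for obj, values in by_object.items()}
    return {obj: avg for obj, avg in averages.items() if avg > threshold_ms}


sample_log = [
    {"object": "a", "latency_ms": 5},
    {"object": "a", "latency_ms": 15},
    {"object": "b", "latency_ms": 200},
]
hotspots = find_hotspots(sample_log, threshold_ms=50)
```

The objects it returns are the chokepoints a guidance layer would want to route around, for instance by moving them to a faster tier.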

There is no reason to draw a limit to what can be done with extended metadata. App-dependent helper information could be added, for example. Comments on documents can be captured by the approach too.

All in all, we will move to an information, rather than data, storage model. This has radical implications that emphasize information-centricity in IT.

Topics: big data, SSD, Data Center