Enmotus Blog

Content-driven tiering using storage analytics

Posted by Adam Zagorski on Aug 9, 2017 10:05:00 AM

IT has used auto-tiering for years as a way to move data from expensive fast storage to cheaper and slower secondary bulk storage. The approach was at best a crude approximation, being only able to distinguish between objects on the basis of age or lack of use. This meant, for instance, that documents and files stayed much longer in expensive storage than was warranted. There simply was no mechanism for sending such files automatically to cheap storage.

Now, to make life even more complicated, we’ve added a new tier of storage at each end of the food chain. At the fast end, we now have ultra-fast NVDIMM, offering an even more expensive and, more importantly, space-limited way to boost access speed, while at the other end of the spectrum the cloud is reducing the need for in-house long-term storage even further. Simple auto-tiering doesn’t do enough to optimize the spectrum of storage in a four-tier system like this. We need to get much savvier about where we keep things.

The successor to auto-tiering has to take into account traffic patterns for objects and plan their lifecycle accordingly. For example, a Word document may be stored as a fully editable file in today’s solutions, but the reality is that most of these documents, once fully edited, become read-only objects that are fetched in their entirety whenever they are read. If changes occur, a new, renamed version of the document is created and the old one is kept intact.

Clearly, most of these documents should go to an object store characterized by rapid retrieval of whole objects. Ceph, with its parallelism, is a good model for this store. Documents still being edited might want to be local in the primary SSD layer, but the editing tools (MS Word) might want to reside in NVDIMM as a ready-loaded image to provide “instantaneous” start-up.
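As a rough illustration of placement driven by traffic patterns, the sketch below maps an object's recent read and write counts to one of four tiers. The tier names, the 30-day window, the `hot_reads` threshold and the `is_app_image` flag are all assumptions made for the example, not part of any particular product.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Hypothetical four-tier layout, fastest first.
    NVDIMM = 0
    SSD = 1
    OBJECT_STORE = 2
    CLOUD = 3


@dataclass
class AccessStats:
    reads_last_30d: int
    writes_last_30d: int
    is_app_image: bool = False  # e.g. an editing tool kept ready-loaded


def place(stats: AccessStats, hot_reads: int = 100) -> Tier:
    """Pick a tier from recent traffic; thresholds are illustrative only."""
    if stats.is_app_image and stats.reads_last_30d >= hot_reads:
        return Tier.NVDIMM        # frequently launched tool image
    if stats.writes_last_30d > 0:
        return Tier.SSD           # still being edited
    if stats.reads_last_30d > 0:
        return Tier.OBJECT_STORE  # finished, read as a whole object
    return Tier.CLOUD             # untouched in the window
```

With these assumptions, a finished document that is only read lands in the object store, while one with any recent writes stays on SSD.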

Apply all of this to a cloud of micro-services, apps and applets, all of them transient in nature, and the complexity of the whole problem looms large. This isn’t a trivial planning issue. Millions of files may be accessed in a given work week, while very active databases might need optimization at the record level. Moreover, with hyperconverged infrastructure gaining steam, managing data positioning can be a cluster-wide operation.

“Advanced” auto-tiering is the solution to this problem. It brings a combination of content and context to the selection process, with the objective of having useful data ready for apps while making space for new data needs by down-tiering. Of course, other than setting policy in advance of operations, the process is fully automated.

The first step in automated data management is gathering statistics of use rates, combined with file type information. Next, it is necessary to look beyond this simple data and add object tagging capability to the metadata system. This is a significant break with tradition, where metadata is very simple and more or less standardized by the operating system.
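A minimal sketch of such a catalog might look like the following. It counts reads and writes per object, derives the file type from the path suffix, and allows free-form tags to be attached. The `UsageCatalog` class and its method names are hypothetical, and an in-memory dictionary stands in for a real metadata store.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class ObjectRecord:
    file_type: str
    reads: int = 0
    writes: int = 0
    tags: dict[str, str] = field(default_factory=dict)


class UsageCatalog:
    """In-memory stand-in for an extended metadata store (illustrative)."""

    def __init__(self) -> None:
        self._records: dict[str, ObjectRecord] = {}

    def lookup(self, path: str) -> ObjectRecord:
        # Create the record on first sight, keyed by path.
        if path not in self._records:
            suffix = PurePosixPath(path).suffix.lower()
            self._records[path] = ObjectRecord(file_type=suffix or "(none)")
        return self._records[path]

    def record_access(self, path: str, op: str) -> None:
        if op not in ("read", "write"):
            raise ValueError(f"unknown operation: {op!r}")
        rec = self.lookup(path)
        if op == "read":
            rec.reads += 1
        else:
            rec.writes += 1

    def tag(self, path: str, key: str, value: str) -> None:
        self.lookup(path).tags[key] = value


catalog = UsageCatalog()
catalog.record_access("/docs/Plan.DOCX", "write")
catalog.record_access("/docs/Plan.DOCX", "read")
catalog.record_access("/docs/Plan.DOCX", "read")
catalog.tag("/docs/Plan.DOCX", "state", "final")
```

After these calls, the record for the document holds two reads, one write, the file type `.docx` and a `state` tag that a tiering policy could consult.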

Extended metadata will be a necessary tool in software-defined storage (SDS). As SDS moves away from rigid sequences of operations, it needs a vehicle for informing the environment about the actions required on an object. Metadata meets this need, so extending the concept to control tiering falls within the SDS philosophy. Within auto-tiering, metadata can drive decisions at a finer granularity than file type, age or folder location.
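To make the idea of tag-driven decisions concrete, here is a small sketch of a policy table evaluated against an object's tags, with the first matching rule winning. The tag keys (`role`, `state`, `retention`), the tier names and the rule order are invented for illustration; a real policy would be set by the administrator in advance.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

Tags = Mapping[str, str]


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Tags], bool]
    tier: str


# Hypothetical policy, evaluated top to bottom; the first match wins.
POLICY = [
    Rule("pinned tool image", lambda t: t.get("role") == "app-image", "nvdimm"),
    Rule("work in progress", lambda t: t.get("state") == "editing", "ssd"),
    Rule("archive retention", lambda t: t.get("retention") == "archive", "cloud"),
    Rule("finalized document", lambda t: t.get("state") == "final", "object-store"),
]


def choose_tier(tags: Tags, default: str = "ssd") -> str:
    """Return the tier of the first rule whose predicate accepts the tags."""
    for rule in POLICY:
        if rule.matches(tags):
            return rule.tier
    return default
```

Because rules are ordered, two documents of the same type and age can land in different tiers: one tagged `state=final` goes to the object store, while one that also carries `retention=archive` goes to the cloud.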

The cloud-like configuration of the future also needs mechanisms that can reduce the amount of data movement that occurs in systems. This is a balancing act between opening service or app instances where data is already stored (the new virtual systems approach) and moving data to where an app is already running (the traditional computing model). With a little extension, events can be planned along a timeline, with data or instances being shuffled in advance of changing workloads.
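One very simplified way to frame that balancing act is to compare the estimated time to ship the data with the time to start a new instance next to it. The function below is a sketch under that single-criterion assumption; the function name, parameters and return values are hypothetical, and a real planner would also weigh link contention, cost and scheduling.

```python
def plan_placement(data_bytes: int,
                   link_bytes_per_s: float,
                   instance_start_s: float) -> str:
    """Choose between moving data and starting an instance near the data.

    Compares estimated transfer time with instance start-up time only.
    """
    if link_bytes_per_s <= 0:
        raise ValueError("link_bytes_per_s must be positive")
    transfer_s = data_bytes / link_bytes_per_s
    if transfer_s < instance_start_s:
        return "move_data_to_app"
    return "start_instance_near_data"


# 10 GB over a 10 Gb/s link (1.25e9 bytes/s) takes about 8 s,
# which is longer than an assumed 2 s instance start.
decision = plan_placement(10 * 10**9, 1.25e9, 2.0)
```

Under these example numbers the sketch prefers starting an instance where the data already lives; for a 1 MB object the same comparison favors moving the data.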

Gathering all this data loosely falls under the umbrella of a new discipline, storage analytics. A given cluster is instrumented to feed all of the data needed for decision-making to a repository. The repository holds sets of differently structured database entries, and these can be mined by micro-services to deliver advanced auto-tiering on a parallel, scalable model.

Big Data approaches can also be used to glean information from the storage pool. These can look at more complicated data interactions, such as the creation of physical bottlenecks either by not having the correct data in the correct place or by jamming a VLAN or physical LAN connection by moving too much data. Latency measurements, for example, can serve as a tracker for defective connections.
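As a small example of latency used as a tracker, the sketch below flags links whose median latency is well above the median across all links. The `suspect_links` name, the sample format and the factor of 3 are assumptions chosen for illustration, not a recommended threshold.

```python
from statistics import median


def suspect_links(samples: dict[str, list[float]],
                  factor: float = 3.0) -> list[str]:
    """Return links whose median latency exceeds factor x the fleet median.

    `samples` maps a link name to latency measurements in milliseconds.
    """
    per_link = {link: median(vals) for link, vals in samples.items() if vals}
    if not per_link:
        return []
    baseline = median(per_link.values())
    return sorted(link for link, m in per_link.items() if m > factor * baseline)


latencies_ms = {
    "node1-node2": [1.0, 1.2, 0.9],
    "node1-node3": [1.1, 1.0, 1.3],
    "node2-node3": [9.0, 10.0, 11.0],
}
flagged = suspect_links(latencies_ms)
```

Here the per-link medians are 1.0, 1.1 and 10.0 ms, so only `node2-node3` exceeds three times the fleet median of 1.1 ms and is flagged.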

Storage analytics is still in its infancy. Companies such as Enmotus are working out the optimal approaches, and the very nature of creating mineable data means that the segment will grow and surprise everyone with its scope and capabilities over the next few years.



Topics: autotiering, big data, Data Center, NVMe over Fibre, enmotus, data analytics