For the last three decades of computer storage use, we’ve operated essentially blindfolded. What we’ve known about performance has been gleaned from artificial benchmarks such as IOMeter and from guesstimates of IOPS requirements, informed mostly by a rough sense of how fast an application is running.
The result is something like steering a car without a speedometer ... it’s a mess of close calls and inefficient operations.
On the whole, though, we muddled through. That’s no longer adequate in the new age of storage. Storage performance is stellar compared with those early days, with SSDs raising IOPS per drive by a factor of as much as 1,000. Wait, you say, tons of IOPS ... why do we have problems?
The issue is that we share much of our data across clusters of systems, while the IO demand of any given server has jumped in response to virtualization, containers and the horsepower of the latest CPUs. In fact, that huge jump in data moving between nodes makes driving blind impossible even for small virtualized clusters, never mind scaled-out clouds.
All of this is happening against a background of application-based resilience. System uptime is no longer measured in how long a server runs. The key measurement is how long an app runs properly. Orchestrated virtual systems recover from server failures quite quickly. The app is restarted on another instance in a different server.
Where this concept falls down is when the fault is not a hard failure (those are easy to detect) but a system hang or a throughput bottleneck. Let’s take an example. SSDs are designed to handle imperfections in the underlying flash die, and those imperfections change over time. Even a “good” cell can return wrong data in the presence of electrical noise or ambient temperature excursions.
The SSD controller handles these errors, but some failures cause many retries, with data eventually reconstituted using error-correction codes. If the problem gets bad enough, the drive runs slowly, but it won’t report a problem. This kind of bottleneck may dramatically slow down an app, and the trouble will eventually show up as long runtimes or sluggish response, but detection can lag long after the problem first arises.
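Because a drive in this state never reports an error, the only visible symptom is a latency tail. A minimal sketch of how such a drive could be spotted from the outside, assuming we can sample per-drive read latencies (the drive names and numbers below are illustrative, not from any real fleet):

```python
from statistics import median

def flag_slow_drives(samples, factor=3.0):
    """samples: dict of drive id -> list of recent read latencies (microseconds).
    Returns drive ids whose median latency exceeds `factor` times the
    fleet-wide median. A drive that is quietly retrying reads and falling
    back to error correction shows up here long before it fails hard."""
    per_drive = {d: median(v) for d, v in samples.items()}
    fleet = median(per_drive.values())
    return sorted(d for d, m in per_drive.items() if m > factor * fleet)

# Simulated fleet: one drive with a retry-induced latency tail.
fleet = {
    "sda": [90, 100, 110, 95, 105],
    "sdb": [85, 95, 100, 90, 110],
    "sdc": [400, 900, 1200, 800, 1500],  # silently re-reading and ECC-correcting
    "sdd": [100, 92, 108, 97, 101],
}
print(flag_slow_drives(fleet))  # -> ['sdc']
```

Comparing each drive against the fleet, rather than a fixed threshold, keeps the check meaningful as workloads and hardware generations change.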
Moreover, in an agile virtual environment, with apps and micro-services coming and going, the problem may continue to wreak havoc for a long time. The symptoms likely won’t reproduce if an app instance is stopped and restarted, for example.
There are many of these types of soft fault, from flaky networks to server memory issues, and they will become more serious as NVDIMM memories go mainstream. Most of us don’t want to become experts in detecting these faults; we want an “expert system” to spot them for us and to interact with orchestration for prompt remediation.
Enter the world of storage analytics. The storage industry is currently busting a gut to make storage smarter. That’s what software-defined storage is all about, with the idea of embedding storage software in virtual instances that can be treated in the same agile and scalable manner as any other instance in the virtual cluster.
Storage analytics gathers data on the fly from a wide range of “virtual sensors” and builds a picture not only of the physical storage devices and connections, but also of the performance of the compute instances and VLANs in the cluster.
This data is continually crunched in search of aberrant behavior. Typical analytics tools, such as the software from Enmotus, have a web-based GUI that displays health problems in an easy-to-read form and allows drilling down into connections and nodes for more information. The better tools can attach IO information to app instances, too.
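“Crunching for aberrant behavior” usually reduces to comparing each new sample against a rolling baseline. A minimal sketch of that idea, assuming one metric stream per node; this is illustrative only, not how Enmotus or any particular product implements it:

```python
from collections import deque
from statistics import mean, stdev

class MetricWatch:
    """Rolling-window watch on one metric stream (e.g. per-node IOPS).
    Flags samples more than `nsigma` standard deviations from the
    recent mean -- a simple stand-in for the aberrant-behavior
    detection a real analytics engine performs."""
    def __init__(self, window=50, nsigma=4.0):
        self.history = deque(maxlen=window)
        self.nsigma = nsigma

    def observe(self, value):
        aberrant = False
        if len(self.history) >= 10:  # wait for a usable baseline
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(value - mu) > self.nsigma * sigma:
                aberrant = True
        if not aberrant:
            self.history.append(value)  # don't let outliers poison the baseline
        return aberrant

watch = MetricWatch()
for sample in [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]:
    watch.observe(sample)          # steady-state samples build the baseline
print(watch.observe(250))          # -> True (a sudden IOPS collapse or spike)
```

The real work in production tools is running thousands of these watches in parallel and correlating the flags across drives, nodes and VLANs; the per-stream logic stays this simple.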
Typically, these tools show hard problems, bottlenecks in throughput, low IOPS and long latencies, among many other metrics. These metrics are open-ended, insofar as new measurement types can be added, something that is of real value as the face of IT changes and the private-public cloud interface blurs.
Beyond the GUI, these tools have two components. Gathering and storing the metrics is a real-time operation that has to run in many places in parallel. Analyzing those stored metrics is a compute-intensive application. Both areas are evolving rapidly, with analytics a candidate for artificial-intelligence approaches in the future. Another aspect is tying the conclusions of the analysis to automated orchestration events, which is also in its infancy today.
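Tying analysis to orchestration amounts to mapping each analytics verdict to an action on the cluster. A hedged sketch of that wiring; the `Orchestrator` class, its method names and the verdict strings are all hypothetical stand-ins (a real deployment would call Kubernetes, OpenStack or a vendor API here):

```python
class Orchestrator:
    """Hypothetical orchestration endpoint; records the actions taken."""
    def __init__(self):
        self.actions = []

    def mark_unschedulable(self, node):
        self.actions.append(("cordon", node))   # keep new work off the node

    def migrate_workloads(self, node):
        self.actions.append(("migrate", node))  # restart apps elsewhere

def remediate(findings, orch):
    """Map each analytics finding to an orchestration event. A 'slow_device'
    verdict cordons the node and migrates its workloads before the soft
    fault turns into long app runtimes."""
    for node, verdict in findings:
        if verdict == "slow_device":
            orch.mark_unschedulable(node)
            orch.migrate_workloads(node)

orch = Orchestrator()
remediate([("node7", "slow_device"), ("node3", "healthy")], orch)
print(orch.actions)  # -> [('cordon', 'node7'), ('migrate', 'node7')]
```

The point of the indirection is that the analytics engine only ever emits verdicts; what remediation means for a given verdict stays a policy decision in one place.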
Sophisticated metrics collection and analytics will deliver a much more stable and predictable operating environment as large clusters or clouds are built out. It’s worth getting an early start on the approach, since it will shape IT operating processes able to stand up to large-scale build-out and agile, automated operation.