Reading current blogs on clouds and storage, it’s impossible not to conclude that most cloud users have given up on tuning system performance and are simply ignoring the topic. The reality is that our cloud models struggle with performance issues. For example, a server can hold roughly 1,000 virtual machines.
With an SSD giving 40K IOPS, that’s just 40 IOPS per VM. This is on the low side for many use cases, but now let’s move to Docker containers, using the next generation of server. The compute power and, more importantly, DRAM space increased to match the 4,000 containers in the system, but IOPS dropped to just 10/container.
This is the best that we can get with typical instances, which have one local instance drive and rely on networked I/O for everything else. The problem is that network storage is also pooled, and this limits storage availability to any instance. The numbers are not brilliant!
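The arithmetic behind those per-instance figures is easy to reproduce. The sketch below assumes the drive's IOPS budget is split evenly across every instance on the server, which is a simplification; the function name is illustrative only.

```python
def iops_per_instance(drive_iops: int, instances: int) -> float:
    """Evenly divide one drive's IOPS budget across the instances sharing it."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    return drive_iops / instances


# Figures quoted above: a 40K IOPS SSD shared by 1,000 VMs, then by 4,000 containers.
vm_share = iops_per_instance(40_000, 1_000)
container_share = iops_per_instance(40_000, 4_000)

print(f"per VM: {vm_share:.0f} IOPS")
print(f"per container: {container_share:.0f} IOPS")
```

Real workloads are rarely this uniform, so a busy instance can take far more than its even share while its neighbors starve.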
We see potential bottlenecks everywhere. Data can be halfway across a datacenter instead of localized to a rack where compute instances are accessing it. Ideally, the data is local (possible with a hyper-converged architecture) so that it avoids crossing multiple switches and routers. This may be impossible to achieve, especially if diverse datasets are being used for an app.
Networks choke and that is true of VLANs used in cloud clusters. The problem with container-based systems is that the instances and VLANs involved are often closed down by the time you get a notification. That’s the downside of agility!
Apps choke, too, and microservices likewise. The fact that these often exist only for short periods makes debugging both a glorious challenge and very frustrating. Understanding why a given node or instance runs slower than the rest of the pack can expose a hidden bottleneck that slows completion of the whole job stream.
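One simple way to spot the slow member of a pack is to compare each node's runtime against the median for the group. The sketch below is only an illustration of that idea: the node names, the sample runtimes and the 1.5x threshold are all assumptions, not figures from any real monitor.

```python
from statistics import median


def find_stragglers(runtimes: dict[str, float], factor: float = 1.5) -> list[str]:
    """Return the nodes whose runtime exceeds `factor` times the group median."""
    if not runtimes:
        return []
    typical = median(runtimes.values())
    return sorted(name for name, secs in runtimes.items() if secs > factor * typical)


# Hypothetical per-node task runtimes, in seconds.
sample = {"node-a": 61.0, "node-b": 58.5, "node-c": 140.2, "node-d": 60.3}
stragglers = find_stragglers(sample)
print(stragglers)
```

The median is used rather than the mean so that a single very slow node does not drag the baseline up and hide itself.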
Hybrid clouds add a new complexity. Typically, these are heterogeneous. The cloud stack in the private segment is likely OpenStack, though Azure Stack promises to be an alternative. The public cloud will most likely be one of AWS, Azure or Google. This means two separate environments, very different from each other in operation, syntax and billing, and an interface between the two.
In fact, that interface is today's greatest challenge. Apart from syntactical issues, where interoperability is making major progress, moving data smoothly and in a timely fashion is the great unsolved mystery of hybrid clouds. Cloudbursting only works if the data needed is present, and latencies between clouds are so high that even relatively small storage datasets take tens of minutes to copy over.
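A back-of-the-envelope estimate shows how quickly copy times add up. This sketch models only link bandwidth, so it gives a lower bound that ignores round-trip latency and protocol behavior. The 1 TB dataset, the 10 Gbit/s link and the 70% efficiency factor are assumed values chosen for illustration.

```python
def transfer_minutes(dataset_bytes: float, link_bits_per_sec: float,
                     efficiency: float = 0.7) -> float:
    """Estimate minutes to copy a dataset over a link at a given usable fraction."""
    if link_bits_per_sec <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = dataset_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 60


# Assumed example: 1 TB (10**12 bytes) over a 10 Gbit/s inter-cloud link.
minutes = transfer_minutes(1e12, 10e9)
print(f"estimated copy time: {minutes:.1f} minutes")
```

Under these assumptions the copy takes roughly 19 minutes, and a slower or shared link stretches that proportionally.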
One fix to the problem is to keep the primary copy of data in the public cloud and use special gateway computers with very large (SSD and DRAM) caches to support the in-house copies. This may sound backwards, but in-house gear can be customized … just try asking Google to do that!
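The gateway idea amounts to a read-through cache sitting in front of the public-cloud copy. The sketch below is a toy illustration under stated assumptions: the class name, the `fetch_remote` callable and the least-recently-used eviction policy are all hypothetical choices, not a description of any vendor's gateway.

```python
from collections import OrderedDict
from typing import Callable


class GatewayCache:
    """Toy read-through LRU cache standing in for an in-house gateway."""

    def __init__(self, fetch_remote: Callable[[str], bytes], capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._fetch = fetch_remote      # hypothetical call to the public-cloud copy
        self._capacity = capacity
        self._items = OrderedDict()     # key -> data, oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        data = self._fetch(key)               # slow path: go to the primary copy
        self._items[key] = data
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)   # evict the least recently used entry
        return data


# A dict stands in for the remote primary copy in this example.
remote = {"a": b"alpha", "b": b"beta", "c": b"gamma"}
cache = GatewayCache(remote.__getitem__, capacity=2)
for key in ["a", "b", "a", "c", "b"]:
    cache.read(key)
print(cache.hits, cache.misses)
```

A cache that is too small for the working set, as in this access pattern, spends most of its time on the slow path, which is why the caches need to be very large.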
Most of us are still flying blind, though. Today’s monitors just don’t give the speed or visibility needed to react to the agile environment we’ve created. In fact, tuning is becoming such a complex and potentially sophisticated issue that automation is the only viable approach.
We can turn to companies like Enmotus, a leader in tuning for data availability, to resolve the issue. They are building toolsets that include very fast and extensible monitors that not only track the traditional metrics but also give visibility at the app, VLAN, VM, container or OS level, as well as tracking physical gear. This type of approach is crucial as we move to software-defined infrastructure, where all the sins of hardware can reappear in a different form in the virtual analogs.
With the right API, these tools can create a pluggable virtual ecosystem of microservices that allows third-party vendors to add monitors. The result is an unstructured database that can be mined or queried using Big Data techniques, as well as accessed using more structured data approaches such as SQL.
At the simple end of the spectrum, Enmotus envisions apps similar to traditional apps, capable of searching a set of metrics for, say, hardware failures or slow links. Beyond that, Big Data steps in and allows more complex and unstructured analysis. These analyses are as open-ended as the data that underpins them, so again, there is plenty of room to build a rich ecosystem of tools.
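To make the simple end of that spectrum concrete, here is a sketch of a SQL search for slow links over collected metrics. It uses Python's built-in sqlite3 module. The table layout, metric names, sample rows and the 5 ms threshold are all assumptions made for this example and do not reflect the Enmotus schema or API.

```python
import sqlite3

# In-memory database standing in for a metrics store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (source TEXT, level TEXT, name TEXT, value REAL)")
rows = [
    ("vlan-12", "vlan", "latency_ms", 0.4),
    ("vlan-17", "vlan", "latency_ms", 9.8),
    ("vm-203", "vm", "latency_ms", 1.1),
    ("ctr-88", "container", "latency_ms", 12.5),
]
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", rows)


def slow_links(db: sqlite3.Connection, threshold_ms: float = 5.0) -> list[tuple[str, float]]:
    """Return (source, latency) pairs above the threshold, slowest first."""
    cur = db.execute(
        "SELECT source, value FROM metrics "
        "WHERE name = 'latency_ms' AND value > ? ORDER BY value DESC",
        (threshold_ms,),
    )
    return cur.fetchall()


slow = slow_links(conn)
for source, latency in slow:
    print(f"{source}: {latency} ms")
```

The same query shape works whether the row came from a VLAN, a VM or a container, which is the appeal of collecting every level into one store.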
The ultimate goal is to extend the analytics to an AI approach. While this is likely a few years out, the payback is much better data management and the ability to respond automatically to a changing workload or environment.
All of these tools will need to talk to orchestration in each connected cloud. There are no real standards yet for this, so that will be one challenge for the industry to overcome. The end-point will be a true software-defined infrastructure, with software managing almost all of the administrative tasks.