A few months ago, the CTO of one of the companies involved in the system design asked me how I would handle and store the flood of data expected from the Square Kilometre Array, the world’s most ambitious radio-astronomy program. The answer came in two parts. First, the data needs compression, which isn’t trivial with astronomy data and implies processing at speeds upwards of 100 gigabytes per second. Second, the hardware platforms become very parallel in design, especially in communications.
We are talking about designs where each ultra-fast NVMe drive has access to enough network bandwidth to keep up with its stream. That implies at least one 50GbE link per drive, since drives with 80 Gbit/s streaming speeds are entering volume production today.
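As a back-of-the-envelope check on those numbers: the drive streaming rate and link speed are the figures above, while the drive count per server is purely an illustrative assumption.

```python
# Rough arithmetic: how many 50GbE links does one fast NVMe drive need,
# and what does that imply per server? Drive rate and link speed are the
# figures from the text; the per-server drive count is an assumption.

DRIVE_STREAM_GBPS = 80   # Gbit/s sustained streaming per NVMe drive
LINK_GBPS = 50           # one 50GbE port

links_per_drive = -(-DRIVE_STREAM_GBPS // LINK_GBPS)  # ceiling division
print(f"50GbE links needed to keep up with one drive: {links_per_drive}")

DRIVES_PER_SERVER = 24   # assumption: a typical 2U NVMe bay count
total_gbps = DRIVES_PER_SERVER * DRIVE_STREAM_GBPS
print(f"Aggregate streaming bandwidth per server: {total_gbps} Gbit/s")
```

Even under these conservative assumptions, a single 2U box of such drives would want close to two terabits per second of fabric capacity, which is why the communications side of the design has to go so parallel.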
In many ways, this is the tip of the iceberg: server architectures are morphing towards much higher memory bandwidths and modular designs that make drives in hyperconverged (HCI) or HPC configurations directly addressable through RDMA without going through the CPU.
When we add the extensions to DRAM space coming from NVDIMMs and the Hybrid Memory Cube approach to the CPU/memory complex, coupled with CPU core counts more than doubling, it’s clear that 2018 will see a large step up in system performance at the top of the server range.
Even though these are expensive systems, the workload boost and reduced runtimes will make them very attractive to IT operations. Fewer licenses are involved and there is a smaller server farm to manage, while the move of many workloads to an in-memory model, though gradual, will deliver very large gains in throughput.
Let’s get to particulars. A 2U two-CPU system that today supports perhaps 800 virtual machines could instead handle 3,000 containers, the only additions being more memory and some SSDs. The next-generation system could handle anywhere from 10,000 to as many as 20,000 containers.
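The density figures above are worth spelling out, since the ratios are the whole story. This quick calculation uses only the counts from the text:

```python
# Density math from the text: today's 2U system vs. the next generation.
# All counts come from the article; only the formatting is ours.

vms_today = 800              # VMs a 2U 2-CPU box supports today
containers_today = 3000      # same box with more memory and SSDs
containers_next_low = 10_000 # next-gen system, low estimate
containers_next_high = 20_000  # next-gen system, high estimate

print(f"Containers vs. VMs on today's box: {containers_today / vms_today:.1f}x")
print(f"Next-gen vs. today's containers: "
      f"{containers_next_low / containers_today:.1f}x to "
      f"{containers_next_high / containers_today:.1f}x")
```

Roughly a 4x gain just from switching to containers, then another 3x to 7x from the next hardware generation: an order-of-magnitude jump in instance count per box within a couple of years.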
That explosion of horsepower means transactions and events occurring at much faster rates. VLAN blockages, bottlenecks, slow links, traffic congestion and plain outages will occur in the networks, while apps will choke or stall completely, CPUs will lose their brains and memory will drop bits. All of this will create traffic jams worthy of Los Angeles on a Friday afternoon.
Automated discovery and remediation will be an absolute requirement at this level of performance. After all, no one is planning to orchestrate their cloud by hand! Discovery needs a real-time but lightweight monitoring tool that creates a flow of sensor-type data to be analyzed as a Big Data stream.
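To make "sensor-type data as a Big Data stream" concrete, here is a minimal sketch of such a feed. Every field name, metric and value is an illustrative assumption, not any particular product's schema:

```python
# Sketch of a lightweight monitor emitting sensor-style records for
# downstream stream analysis. Field names and values are illustrative
# assumptions, not a real product's telemetry schema.
import json
import time

def sample_metrics(host: str) -> dict:
    """Collect one telemetry sample (values stubbed for the sketch)."""
    return {
        "ts": time.time(),
        "host": host,
        "nic_util_pct": 72.5,     # stub: would come from NIC counters
        "nvme_queue_depth": 14,   # stub: would come from the drive
        "dropped_frames": 0,
    }

def emit(record: dict) -> str:
    """Serialize one sample as a JSON line, ready for a stream pipeline."""
    return json.dumps(record, sort_keys=True)

print(emit(sample_metrics("node-17")))
```

The point of the one-JSON-line-per-sample shape is that it drops straight into any stream-processing pipeline without a schema negotiation step, which is what keeps the monitor itself lightweight.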
Using a Big Data approach means that complex analyses can be staged to run in parallel, which, with masses of containers, shouldn’t be a problem. In fact, this architecture could extend to an artificial intelligence engine that effectively manages the storage platforms and the networks connecting them. AI is not an easy step, though … perhaps it sits on a five-to-seven-year horizon.
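Staging analyses in parallel can be sketched very simply: partition the telemetry by source and fan the same analysis out across the partitions. The anomaly rule here (latency over a fixed threshold) is a made-up stand-in, not a real detection algorithm:

```python
# Sketch: shard per-host metric streams and run the same analysis on
# each shard in parallel, as a Big Data pipeline would. The threshold
# rule is an illustrative assumption, not a real detector.
from concurrent.futures import ThreadPoolExecutor

def find_anomalies(samples):
    """Flag samples whose latency exceeds a (made-up) threshold."""
    THRESHOLD_MS = 50
    return [s for s in samples if s["latency_ms"] > THRESHOLD_MS]

# One stream of samples per host; values are stand-ins.
streams = [
    [{"host": "a", "latency_ms": 12}, {"host": "a", "latency_ms": 81}],
    [{"host": "b", "latency_ms": 7}],
]

with ThreadPoolExecutor() as pool:
    flagged = [hit for hits in pool.map(find_anomalies, streams) for hit in hits]

print(flagged)  # only the sample over threshold survives
```

Because each shard is analyzed independently, adding containers just adds shards; the analysis itself never has to change, which is what makes the approach scale with the container counts discussed above.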
Big Data approaches are inherently extensible, which fits the evolving world of servers, storage and containers very well. We simply don’t know yet what a given configuration will need to measure to be optimizable, and we certainly don’t know where the evolution will take us.
Any analytics tool needs to be easily extensible so that new metrics can be added and stored. The tool’s control dashboard will need to allow that, perhaps even letting user admins control the content. That implies a policy-control system and identity management to govern access, too.
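A toy sketch of that combination — runtime-extensible metrics gated by a policy check — might look like the following. The role names and policy model are illustrative assumptions, not any specific product's design:

```python
# Sketch of an extensible metric registry: new metrics can be added at
# runtime, but only by roles the policy allows. Roles and the policy
# model are illustrative assumptions.

class MetricRegistry:
    def __init__(self):
        self._metrics = {}          # metric name -> description
        self._writers = {"admin"}   # roles allowed to add metrics

    def register(self, role: str, name: str, description: str) -> None:
        """Add a new metric, enforcing the (toy) access policy."""
        if role not in self._writers:
            raise PermissionError(f"role '{role}' may not add metrics")
        self._metrics[name] = description

    def names(self) -> list:
        return sorted(self._metrics)

registry = MetricRegistry()
registry.register("admin", "nvme_queue_depth", "Outstanding NVMe commands")
registry.register("admin", "rdma_retries", "RDMA retransmit count")
print(registry.names())
```

The identity-management piece would sit in front of this, mapping authenticated users to roles; the registry itself only needs to know which roles may write.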
Published APIs will allow third-party analytics to expand the ecosystem of tools available and, perhaps, also speed up maturity of the toolset.
If you are interested in a deeper dive into these approaches to future storage, I commend Enmotus Inc. as a thought leader in the field. They have a good deal of real-time storage control and metrics experience and are already working with leading vendors to build a solution.