Redefining the standard for system availability
How Nimble Storage uses predictive analytics to achieve more than six nines availability across its entire installed base
Businesses in every sector are increasingly reliant on applications to handle everything from back-end operations to the delivery of new products, services, and customer experiences. That is why infrastructure system availability and the elimination of unplanned downtime are more important than ever before. Recent research has shown that the average cost of an hour of downtime is about half a million dollars,1 and this will only increase with the continued digitization of industries. For far too long, superior storage availability has only been possible through expensive, on-site service contracts on excessively redundant hardware models. Since its founding, Nimble, a Hewlett Packard Enterprise company, has been on an ambitious mission to break the mold: not only to build better availability into its products but also to enable continuous improvement over time.
In 2014, Nimble announced what was then a breakthrough: over five nines of measured availability. Just two years later, Nimble has further distanced itself from the pack with over six nines (99.999928%) of measured availability across its entire installed base. This translates to an impact of less than 25 seconds of downtime annually, a 4X improvement in just over two years.
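To make the arithmetic behind the "less than 25 seconds" figure concrete, the short sketch below converts an availability percentage into expected downtime per year. It assumes a 365-day year and treats availability as a simple fraction of total time; the function name is illustrative and not taken from any Nimble tooling.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, assuming a non-leap year


def annual_downtime_seconds(availability_percent: float) -> float:
    """Expected unavailable seconds per year for a given availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


measured = annual_downtime_seconds(99.999928)
print(f"{measured:.1f} seconds per year")  # about 22.7, under the 25 seconds cited
```

Under these assumptions, 99.999928% availability corresponds to roughly 22.7 seconds of downtime per year, consistent with the figure quoted above.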
It is important to understand that published availability values are not all created equal—many are just theoretical measurements. The details on how availability is delivered distinguish one from the other and reduce business risk. With respect to availability from Nimble:
- It is measured and based on real, achieved values, not theoretical projections.
You can be confident about future availability levels only when metrics about past performance are transparent and proven by actual data and customers.
- It is measured for the entire installed base, including every model and OS release.
Showing improvement on the latest products and releases is easy. The challenge is delivering complete system availability including systems that have been in operation for over six years.
- It is continuously improving.
It already starts out more reliable than others and keeps improving with over six years of installed-base learning and insights.
- It is standard for all products, not requiring special terms or service.
Building best-in-class availability into every product without charging a premium or requiring a special service contract or configuration is fundamental to Nimble.
This innovation raises the question: how does Nimble do it?
The basis for system reliability at Nimble starts with the architecture of the storage platform. There is no single point of failure (fault tolerance with redundant components). Dual controllers allow for nondisruptive upgrades with no performance impact in the case of controller failure. Moreover, the software architecture is fault tolerant and delivers extremely robust data integrity including Triple+ Parity RAID and end-to-end integrity validation.
However, there are degrees of unpredictability that can’t be engineered out through system design, due to complexity across infrastructure layers. This has not stopped Nimble from continuing to improve significantly and progress towards a zero-downtime lifecycle. The measured availability of Nimble arrays keeps getting better through predictive analytics, installed-base learning, and our commitment to a transformed support experience. Nimble is redefining the standard.
The following sections of this paper dive into the details, revealing the unique approach that has enabled Nimble to continuously improve and exceed six nines of measured availability across the entire installed base.
Preventing downtime with InfoSight Predictive Analytics
Since its inception, Nimble has incorporated advanced analytics into the core architecture of every system to radically improve operational system reliability, not only for the storage arrays but also for infrastructure layers beyond storage. The complexity and variability across applications, infrastructure, and configurations have made downtime-inducing problems all but inevitable.
To combat this longstanding issue, Nimble took a unique approach and began embedding diagnostic sensors into every module of code from day one, building a foundation for real-time, deep health and performance analytics. To date, each system contains thousands of sensor collectors, and InfoSight Predictive Analytics collects and correlates millions of sensor data points per second across its installed base, enabling global visibility and learning.
Infrastructure that learns
InfoSight applies data science to identify, predict, and prevent problems across infrastructure layers. For any new problem experienced in the installed base, a predictive health signature is assigned, and InfoSight uses pattern-matching algorithms to search continuously for that signature across all systems.
If a signature is detected, InfoSight either prevents the problem from occurring or proactively resolves it with a prescriptive resolution, even if the problem is outside of storage. There are no false alerts as machine learning normalizes performance behavior across the installed base. Each system continually gets smarter, learning from the installed base, and downtime events are increasingly prevented.
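The idea of searching the installed base for a known signature can be illustrated with a small sketch. This is not InfoSight's implementation, which the paper does not describe; it assumes, purely for illustration, that a signature is a set of conditions over telemetry fields and that each system reports a snapshot as a dictionary. All names, fields, and values below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Telemetry = Dict[str, object]  # one system's latest sensor snapshot (hypothetical shape)


@dataclass
class Signature:
    """A known problem, described as conditions over telemetry fields."""
    name: str
    conditions: Dict[str, Callable[[object], bool]]
    remedy: str

    def matches(self, snapshot: Telemetry) -> bool:
        # Every condition must hold; a missing field never matches.
        return all(
            field in snapshot and check(snapshot[field])
            for field, check in self.conditions.items()
        )


def scan_fleet(fleet: Dict[str, Telemetry],
               signatures: List[Signature]) -> Dict[str, List[str]]:
    """Return, per system ID, the names of the signatures it currently matches."""
    hits: Dict[str, List[str]] = {}
    for system_id, snapshot in fleet.items():
        matched = [s.name for s in signatures if s.matches(snapshot)]
        if matched:
            hits[system_id] = matched
    return hits


# Made-up signature and fleet data, for illustration only.
signatures = [
    Signature(
        name="host-multipath-misconfig",
        conditions={
            "os_version": lambda v: v == "os-1.1",
            "active_paths": lambda n: n < 2,
        },
        remedy="Enable a second I/O path on the host.",
    )
]
fleet = {
    "array-a": {"os_version": "os-1.1", "active_paths": 1},
    "array-b": {"os_version": "os-1.1", "active_paths": 4},
    "array-c": {"os_version": "os-1.0", "active_paths": 1},
}
print(scan_fleet(fleet, signatures))  # {'array-a': ['host-multipath-misconfig']}
```

In this toy version, a signature learned from one system is checked against every other system, so a match on `array-a` could trigger the remedy before the problem causes downtime there.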
Non-storage factors, such as misconfigurations and host, network, or VM problems, can impact the I/O path. InfoSight correlates sensor data across the infrastructure and resolves problems beyond storage, uncovering the root causes of issues affecting data delivery from storage to virtual machines (VMs). In fact, 54% of the issues InfoSight resolves are outside of storage. Because Nimble has been at this for over six years, it has accumulated more diagnostic sensor data and predictive insights than any other vendor.
With InfoSight and the power of predictive analytics, measured availability is greater than six nines today and continues improving for all systems. This availability value is not limited to the latest model or software version as it is for other vendors, but instead is representative of the entire Nimble installed base.
Guiding principle for preventing issues
If Nimble has seen or knows about a problem, no customer should experience the same problem in their environment—regardless of the complexity or location of the root cause. This guiding principle has created a methodical focus on clearly understanding the root cause of every issue and case, even those outside of storage, to prevent any customer from experiencing the same issue.
See once, prevent for all
InfoSight enables a new and better support experience, one that applies data science and intelligent case automation to help minimize the possibility of a known issue ever being experienced in the installed base. Integral to this support experience are the PEAK engineers—a special team with expertise across the infrastructure layers. These engineers are responsible for case assessment, rapid and definitive root cause analysis, defining case automation rules, and overseeing problem resolution before problems can affect customers. The following figure outlines the team’s standard operating procedure.
- Data analysis: InfoSight continuously monitors and analyzes sensor telemetry from the global installed base: millions of sensor data points per second from over 10,000 customers.
- Case creation: InfoSight predicts a potential problem or a customer creates a case (Note: Ninety percent of cases are auto-created and 86% of cases are auto-resolved and closed before the customer knows of an issue).
- Root cause analysis: For complex issues, a dedicated PEAK engineer is assigned and works with engineering and InfoSight to quickly diagnose the root cause, including problems outside of storage. A signature is created identifying the parameters, including OS, performance
- Problem resolution: The PEAK engineer develops the resolution plan, verifies the completion of fixes, and closes the case.
- Installed-base prevention: InfoSight applies pattern-matching algorithms on the signature to identify, predict, and prevent other systems from experiencing the same problem.
Customized upgrade paths
The PEAK engineers can invoke a blacklist mechanism that prevents customers from upgrading to specific NimbleOS versions associated with a problem that has been identified in other environments with similar configurations. InfoSight, in turn, creates customized upgrade paths for each customer. This means customers can know with certainty that the upgrades available are safe, as identified problems have been mitigated.
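A blacklist-driven upgrade path of this kind can be sketched as a filter over the release list. The sketch below is an assumption-laden illustration, not the actual InfoSight mechanism: the rule structure, the configuration fields, and the version labels (`os-1.0` and so on) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

SystemConfig = Dict[str, str]  # hypothetical description of one customer environment


@dataclass
class BlacklistRule:
    """Blocks one release for every configuration the predicate accepts."""
    version: str
    applies_to: Callable[[SystemConfig], bool]
    reason: str


def allowed_upgrades(current: str, releases: List[str], config: SystemConfig,
                     rules: List[BlacklistRule]) -> List[str]:
    """Releases newer than `current` that no rule blocks for this configuration.

    `releases` must be ordered from oldest to newest and contain `current`.
    """
    newer = releases[releases.index(current) + 1:]
    return [
        version for version in newer
        if not any(r.version == version and r.applies_to(config) for r in rules)
    ]


# Made-up releases and rule, for illustration only.
releases = ["os-1.0", "os-1.1", "os-1.2"]
rules = [
    BlacklistRule(
        version="os-1.1",
        applies_to=lambda cfg: cfg.get("hypervisor") == "hv-x",
        reason="Problem seen in other environments running hv-x.",
    )
]

print(allowed_upgrades("os-1.0", releases, {"hypervisor": "hv-x"}, rules))  # ['os-1.2']
print(allowed_upgrades("os-1.0", releases, {"hypervisor": "hv-y"}, rules))  # ['os-1.1', 'os-1.2']
```

Because each rule carries a predicate over the configuration, the same release can be hidden from one customer and offered to another, which is the sense in which the upgrade path is customized.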
Nimble’s laser focus on preventing known issues, combined with InfoSight Predictive Analytics, has resulted in a 19.3% year-over-year decrease in customer-involved support cases.3 Nimble achieved this while growing its customer base 900% over the same period. Net result: Downtime events are prevented and valuable customer time can be spent driving business value rather than on maintenance, troubleshooting, and problem resolution.
Infrastructure is an investment. Rather than choosing a depreciating asset, you can choose one that actually improves over time.
Businesses are increasing their reliance on software applications and even the smallest amount of downtime can have tremendous consequences. A robust design that incorporates flash technology is a requirement today. However, system design alone cannot overcome the complexity in infrastructure that causes unplanned downtime.
Nimble combines robust system design with predictive analytics to deliver the highest measured availability in the storage industry and a transformed support experience. Building predictive analytics into the core architecture from day one allows infrastructure to learn, no matter how long it has been deployed. This is reflected in the following:
- Measured availability greater than six nines (99.999928%) across more than 10,000 customers, providing uptime for customers
- Over 86% of support cases are automatically resolved by InfoSight, saving the time and money otherwise spent on diagnosis and troubleshooting
- Fifty-four percent of issues that InfoSight resolves are outside of storage, addressing a full spectrum of issues that impact infrastructure uptime
- Intuition says that reliability will go down and the likelihood of problems will increase as systems age. However, Nimble Storage has flipped that paradigm with InfoSight Predictive Analytics.