Today’s Forecast: Cloudy with a Chance of Sudden Alert Storms

By Lee LaFrese

 

I live in Tucson, AZ. The joke here is that for most of the year the weather is so stable you could publish the forecast on a billboard – sunny, clear skies. Sounds nice, doesn’t it? But every July we have monsoon season and the weather turns “interesting”. I suppose you could still put the forecast on a billboard – sunny with a chance of wild and crazy storms! These storms are unpredictable and sometimes quite violent. Unfortunately, weather science has not reached the point where it can clear up the uncertainty. Thankfully, in the world of storage performance we can do better.

Typical IT shops have various real-time monitors designed to raise alerts when something goes wrong. In theory, this sounds like a good arrangement. If you get an alert, you can take action and fix things quickly, right? In reality, it is not always as effective as it seems. On the one hand, alerts may point to symptoms only after you have already felt the impact. Do you want to hear that it is raining when you are already soaked to the bone? It would be far better to have advance warning before the problem manifests itself. On the other hand, sometimes you get more alerts than you know what to do with. It may be unclear whether these alerts indicate a real problem or are just a bunch of false alarms. This is the classic “alert storm”, and it can sometimes be as disruptive as a real problem. You don’t want your team to scramble because of a bunch of false positives. However, you can’t just discount alerts either, because if there really is a problem, you ignore them at your own peril.

I was once talking performance with a storage manager who casually mentioned that they had over 60,000 alerts in their monitoring system. Needless to say, having 60,000 alerts is the same as having 6,000 alerts or no alerts at all, because nobody can act on that many. Ideally, a monitoring system will give you a manageable number of alerts that are actionable. Otherwise, why do performance monitoring at all?

This is where a solution like IntelliMagic Vision can help. Vision is designed to be proactive and give you advance warning when issues are starting to pop up, well before they impact users. Vision comes preset with thresholds based on the capabilities of your specific storage infrastructure. In many cases these thresholds are dynamic and adjust from interval to interval based on the workload attributes. These factors all help mitigate false alarms and alert storms. Thresholds may be adjusted to include service-level objectives if desired, but often this is unnecessary.

Traditional performance monitors report on key performance indicators (KPIs) and may alert you when there is a statistical deviation from the norm. However, short-term changes in KPIs may or may not tell you anything useful. Alerts in Vision are based on knowledge of the hardware capabilities and are presented as key risk indicators (KRIs) that reflect sustained over-utilization of internal components. Thus, KRIs are inherently more meaningful than KPIs.
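To make the difference between a momentary spike and sustained over-utilization concrete, here is a minimal Python sketch. It is only an illustration of the general idea, not how IntelliMagic Vision computes its KRIs. The function names, the 0.8 utilization threshold, the three-interval window, and the sample data are all assumptions made up for this example.

```python
from typing import List


def spike_alerts(utilization: List[float], threshold: float = 0.8) -> List[int]:
    """Naive approach: flag every interval whose utilization exceeds the threshold."""
    return [i for i, u in enumerate(utilization) if u > threshold]


def sustained_alerts(
    utilization: List[float],
    threshold: float = 0.8,    # hypothetical limit for the component
    min_intervals: int = 3,    # hypothetical definition of "sustained"
) -> List[int]:
    """Flag only the interval at which utilization has stayed above the
    threshold for min_intervals consecutive intervals (one alert per run)."""
    alerts = []
    run = 0  # length of the current run of over-threshold intervals
    for i, u in enumerate(utilization):
        run = run + 1 if u > threshold else 0
        if run == min_intervals:
            alerts.append(i)
    return alerts


# Made-up utilization samples for one internal component, one per interval.
samples = [0.5, 0.9, 0.4, 0.85, 0.6, 0.82, 0.88, 0.91, 0.95, 0.7]

print(spike_alerts(samples))      # [1, 3, 5, 6, 7, 8]
print(sustained_alerts(samples))  # [7]
```

With this sample data, the naive check raises six alerts while the sustained check raises one. That is the flavor of reduction that keeps a monitoring system actionable.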

With IntelliMagic Vision you are able to avoid alert storms. The deep insight and analysis tools that IntelliMagic Vision provides help you quickly determine the root cause if a storm should develop. Vision can also help you decide if the alert storm is truly a problem or just a red herring.

The bottom line here is that real-time is too late. Alert storms turn into fire drills that just suck up resources. If you supplement your real-time monitoring with the right tools, you can find shelter from the storm.
