z/OS Performance Monitors – Why Real-Time is Too Late

By Morgan Oats

Real-time z/OS performance monitors are often advertised as the top tier of performance management. Real-time monitoring means just that: system and storage administrators can view performance data and/or alerts indicating service disruptions continuously as they happen. In theory, this enables administrators to quickly fix the problem. For some companies, service disruptions may not be too serious if they are resolved quickly enough. Even though those disruptions could be costing them a lot more than they think, they believe a real-time monitor is the best they can do to meet their business needs.

For leading companies, optimal z/OS performance is essential for day-to-day operations: banks with billions of transactions per day; global retailers, especially on Black Friday or Cyber Monday; government agencies and insurance companies that need to support millions of customers at any given time; transportation companies with 24/7 online delivery tracking; the list goes on and on. For these organizations and many others, real-time performance information is, in fact, too late. They need information that enables them to prevent disruptions – not information that simply tells them when something is already broken.

The Critical Nature of Availability in Mainframes

Mainframes are the backbone of the application infrastructure in many leading enterprises because they offer better security, scalability, reliability, availability, serviceability, and compatibility. 23 of the top 25 US retailers, 96 of the world’s top 100 banks, and 9 of the world’s 10 largest insurance companies count on mainframe technologies for their core business needs every hour of every day.

For these organizations, a minute of downtime can cost more than $5,000, and if a disruption occurs on a transaction-critical day, the result can be millions in lost revenue on top of damaged credibility and future revenue losses. A real-time monitor's alert about a service disruption that is already occurring does not provide the preventative information critical to these organizations. It is the equivalent of an alert in your car that you’ve just been hit while driving on a busy highway. Wouldn’t it have been better if your car had alerted you that your current trajectory would lead to a crash if you did nothing, and suggested you slow down by 10 mph and move one lane over to stay safe? That’s meaningful intelligence.

Transforming RMF & SMF Data into Intelligence

z/OS generates volumes of rich measurement data via SMF and RMF. Prediction and prevention require information to be created from all that raw data, and acting on that information is crucial to maintaining availability. By unlocking the intelligence inside the data for the specific workloads in your environment and combining it with knowledge of the hardware's capability, it becomes possible to see trouble coming before service is affected. Reducing availability incidents is then no longer a matter of resolving them with a workaround after the fact, but of preventing them before they can occur. The challenge is interpreting the data in an efficient and meaningful way.
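To make the idea of prediction concrete, here is a minimal sketch of one very simple approach: fit a straight line through recent utilization samples and estimate how long until the trend reaches a known hardware limit. Everything in it is an assumption for illustration, including the function name `minutes_until_limit`, the 15-minute sample interval, the 90% limit, and the sample values. It does not read SMF or RMF records, and it is not how any particular product works.

```python
def minutes_until_limit(samples, limit, interval_minutes=15):
    """Estimate minutes until a linear trend through `samples` reaches `limit`.

    `samples` are utilization values (e.g. percent busy) taken at a fixed
    interval, oldest first. Returns 0.0 if the latest sample is already at or
    over the limit, None if the trend is flat or falling, otherwise the
    estimated minutes remaining after the latest sample.
    All names and defaults here are hypothetical.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a trend")
    if samples[-1] >= limit:
        return 0.0

    # Ordinary least-squares fit of utilization against elapsed minutes.
    xs = [i * interval_minutes for i in range(len(samples))]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(samples) / len(samples)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        return None  # no upward trend, so no projected crossing

    crossing = (limit - intercept) / slope  # minutes from the first sample
    return max(0.0, crossing - xs[-1])


if __name__ == "__main__":
    # Made-up utilization readings, one every 15 minutes.
    print(minutes_until_limit([50, 55, 60, 65], limit=90))
```

A straight-line projection is deliberately naive. Real workloads have daily and weekly cycles, and the limit that matters depends on the hardware and configuration in question, which is exactly the context the raw numbers alone do not carry.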

Existing performance reporting solutions that have been in use for years are unable to create this kind of intelligence from the data. Often, the preventative information is well hidden and only understandable to an absolute expert doing a thorough analysis. A performance expert with 20 years of experience may know where to look when problems occur and is usually adept at optimizing the environment to reduce incidents. However, without broad and deep visibility and intelligence, even experts often end up working reactively. In fact, because there are often too many problems to work on, there is no time for proactive analysis, and neither the time nor the inclination to build the analytics into the IT management process. Furthermore, continuously analyzing the relevant details of every subsystem quickly becomes overwhelming, and experts with that depth of skill are scarce.

Preventing z/OS Performance Disruptions Before They Occur

Modernized solutions with automatic expert analytics embed intelligence into the software. This intelligence is more effective than regular IT operations analytics (ITOA) solutions that rely on statistical anomaly detection. Automated analysis is vital for huge amounts of data, but statistical methods alone lack context-specific insight.

As Forrester Research put it: “The [monitoring] tools present us with the raw data, and lots of it, but sufficient insight into the actual meaning buried in all that data is still remarkably scarce.” To unlock the hidden predictive and preventative value, the data should not be treated as unstructured ‘big data’, but interpreted using detailed knowledge of the IT architecture and the meaning of each metric. This is where embedding expert knowledge in the software sets itself apart.
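The difference between the two approaches can be illustrated with a small sketch. The first function below is a plain z-score check, a common form of statistical anomaly detection. The second is a hypothetical stand-in for interpreting a metric against what the hardware can actually deliver. The function names, thresholds, the 1,000 MB/s capacity, and the sample values are all assumptions made up for this example, not measurements or any vendor's rules.

```python
from statistics import mean, pstdev


def zscore_alert(history, latest, threshold=3.0):
    """Flag `latest` if it is more than `threshold` standard deviations
    away from the mean of `history`. Knows nothing about the hardware."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


def headroom_alert(latest, capacity, min_headroom=0.2):
    """Flag `latest` if less than `min_headroom` (as a fraction of
    `capacity`) remains. `capacity` is the context the statistics lack."""
    return (capacity - latest) / capacity < min_headroom


if __name__ == "__main__":
    # Made-up throughput samples (MB/s) on a link assumed to top out at 1000.
    history = [900, 905, 910, 895, 902]
    latest = 904
    print("statistical anomaly:", zscore_alert(history, latest))
    print("running out of headroom:", headroom_alert(latest, capacity=1000))
```

In this made-up case the workload is steady, so the statistical check sees nothing unusual, while the capacity-aware check reports that less than 20% headroom remains. A resource that is consistently close to its limit looks normal to a method that only compares the data with its own past.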

Once we realize that incidents can be prevented, the myth that real-time monitors are the top tier of z/OS performance management is debunked. The value of this predictive intelligence is the ability to avoid service incidents, which frees your z/OS experts to implement new features, improve support for new applications, reduce costs, and preserve the reliability and availability that mainframes are known for.
