5 Reasons IBM z/OS Infrastructure Performance & Capacity Problems are Hard to Predict and Prevent

By Brent Phillips

Solving z/OS infrastructure performance and capacity problems is difficult. Getting ahead of those problems and preventing them before they occur is more difficult still. This is why it takes years, even decades, for performance analysts and capacity planners to become experts.

This difficulty in becoming an expert, combined with the rapid retirement of the current experts, is why the performance and capacity discipline for the mainframe is experiencing a significant skills gap. It is simply too difficult and time consuming to understand what the data means for availability, let alone to derive predictive intelligence about upcoming production problems within the complex IBM z Systems infrastructure.

The primary root causes of this performance and capacity management problem are:

1)  z/OS Infrastructure Performance and Capacity Experts are in Short(er) Supply

The evidence of the performance and capacity skills gap is all around us. Even organizations that do not believe they have a skills shortage still show the signs of one. Performance and capacity teams are small as it is, even in larger shops, and most sites have significant gaps in expert knowledge about critical parts of the infrastructure.

There are simply too many areas to master, and too few staff and too little time to be an expert in everything. Even the most experienced and knowledgeable performance analyst will have trouble staying up to date with every current and developing area. And as environments grow, recognizing issues early and optimizing each area of the z Systems infrastructure becomes impossible.

2)  New Recruits Cannot Become Experts Fast Enough

Even when new recruits are proactively brought on board before the current experts retire, so they can benefit from years of mentorship from the team, it is still proving infeasible to replace the depth of expert knowledge needed. The required time is too long because the infrastructure is broad and complex, while the analytics are antiquated, with no built-in knowledge about what is good or bad. Mentoring new recruits also takes valuable time from the current experts, who are then less able to do proactive performance and capacity planning.

3)  There is Too Much Complex Data to Figure Out What It Means

RMF (or CMF) and SMF data is a rich source of valuable information. The richer the data, the smarter the analytics can be, provided the data is combined with z/OS expert knowledge specific to the infrastructure components in use. Status quo reporting on this data produces hundreds or thousands of static reports that are unrated and uncorrelated with one another. Trying to interpret what they all mean, so that current or upcoming problems can be recognized and remediated, is completely infeasible with the reporting in use today.

The complexity surrounding the data continues to grow with the addition of multiple new record types covering CPU processor efficiency, Pervasive Encryption and cryptography, SMT, data compression, TCP/IP, MQ, and more. This data is valuable and important for continuous availability, but it has also proven difficult for resource-constrained sites to gain the required visibility and understanding.

4)  Relying on Technology that Only Generates Static Reports

The reporting products sites most often use for this data were created over 30 years ago, when the infrastructure was smaller and less complex. Today, RMF or CMF and SMF analysis is probably the only area of the mainframe infrastructure that still requires custom coding to produce new charts and views of the data.

This antiquated approach offers insufficient visibility, costs precious time, and provides no predictability for the availability of the mission-critical applications running on the platform.

Smart analytics enable the computer to derive accurate performance and Availability Intelligence from the raw data with minimal false positives and false negatives. This lets the human analyst spend more time on analysis and on designing problem remediation, rather than digging through cumbersome reports to identify what is most important.

Making the computer more helpful to the human analyst requires algorithms designed with artificial intelligence techniques that can assess and rate hundreds of performance conditions using built-in or derived z/OS infrastructure expert knowledge.
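As an illustration only, and not any vendor's actual algorithm, the idea of rating a performance condition against expert-derived knowledge can be sketched as a simple rule check. All metric names and threshold values below are hypothetical assumptions:

```python
# Hypothetical sketch: rate an observed metric against expert-derived thresholds,
# standing in for the kind of built-in knowledge described above.
# Metric names and threshold values are illustrative, not real product logic.

def rate_metric(value, warn, critical):
    """Return a simple traffic-light rating for one observed value."""
    if value >= critical:
        return "red"      # exceeds the level experts consider a problem
    if value >= warn:
        return "yellow"   # worth a closer look before it becomes a problem
    return "green"        # within expected bounds

# Example: rate CPU-busy percentages per LPAR with assumed thresholds.
samples = {"LPAR1": 62.0, "LPAR2": 88.5, "LPAR3": 97.2}
ratings = {name: rate_metric(v, warn=85.0, critical=95.0)
           for name, v in samples.items()}
print(ratings)  # {'LPAR1': 'green', 'LPAR2': 'yellow', 'LPAR3': 'red'}
```

In a real product, such ratings would of course be derived per component type and workload, and there would be hundreds of such conditions; the sketch only shows the shape of the idea.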

5)  Lack of Visibility and Collaboration

Maintaining the infrastructure required to generate visibility into the data is a time-consuming task that significantly reduces the time available for analysis. These solutions also tend to keep the true status of the infrastructure hidden in difficult-to-share repositories of knowledge, locked within different silos of expertise in the organization.

z/OS infrastructure performance analytics delivered via secure cloud addresses these issues. A cloud-based SaaS offering enables speed of implementation, a low-risk commitment, ease of maintenance, immediate availability of new features, and full access to world-class experts.

Conclusions

  • Not More Reports – Your team does not need a solution that generates more reports. You already have more reports than the team can effectively and proactively review.
  • More Intelligence – Solving this problem is about better intelligence, not more reports. What does the data mean for current and future delivery of required application service levels? What z/OS infrastructure components are being used inefficiently or about to be saturated and cause service time delays? What z/OS performance and configuration best-practices are being violated?
  • Make the Computer Do More of the Work – A large catalogue of daily static reports is the least effective way to understand performance and availability. New types of analytics that apply statistical techniques to machine-generated data are becoming popular, but these alone are not enough to monitor and understand root causes, or to predict and prevent problems. What is needed, in addition to statistical techniques, is expert knowledge about the specific infrastructure in use and how it responds to the workloads you are running.
  • Force Multiplier & Outsourcing Avoidance – Outsourcing of the mainframe continues to grow, and one reason is the difficulty of finding expert staff. Yet outsourcing success stories are rare. A better approach is to give the performance and capacity team the ability to be predictive and to see root causes of problems more quickly.
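The point about statistical techniques being necessary but not sufficient can be made concrete with a minimal sketch. A purely statistical check, such as the z-score test below on made-up response-time data, can flag that a value deviates from its historical baseline, but it cannot by itself say whether the deviation threatens service levels; that judgment requires the infrastructure-specific expert knowledge discussed above:

```python
# Minimal sketch of a purely statistical (z-score) check on a machine-generated
# metric. It flags deviation from history but cannot say whether the deviation
# actually matters for service levels. All data here is made up.
import statistics

def is_statistical_outlier(history, latest, threshold=3.0):
    """Return True if 'latest' deviates from 'history' by more than 'threshold' standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

# A response-time series (milliseconds) that has been stable, then two new samples:
baseline = [4.1, 3.9, 4.0, 4.2, 3.8, 4.0, 4.1, 3.9]
print(is_statistical_outlier(baseline, 9.5))   # True: spike is statistically anomalous
print(is_statistical_outlier(baseline, 4.05))  # False: within normal variation
```

Note that the check says nothing about whether a 9.5 ms response time is actually harmful; on some devices and workloads it would be fine, on others it would breach service levels. That mapping from anomaly to impact is exactly the expert knowledge the statistical technique lacks.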

IntelliMagic Vision for z/OS has been designed from the ground up using artificial intelligence techniques to solve these problems. Hundreds of man-years have been invested in the product so that customers can benefit from automatic assessment and rating of application infrastructure performance, with superior visibility and drill-down capabilities. This advances the state of the art in the mainframe performance and capacity discipline, making problems easier to understand, resolve, predict, and prevent.
