Bridging the z/OS Mainframe Performance & Capacity Skills Gap

By Brent Phillips

Many, if not most, organizations that depend on mainframes are experiencing the effects of the mainframe skills gap, or shortage. This gap results from a largely baby-boomer workforce retiring without a new generation of experts in place who have the same capabilities. At the same time, the scale, complexity, and pace of change in the mainframe environment continue to increase. Performance and capacity teams perform a mission-critical function, and this performance skills gap represents a great risk to ongoing operations. It demands both immediate attention and a new, more effective approach to bridging the gap.

In our connected and technology-dependent world, deep performance and capacity management skills are essential. Applications and back-end transactions are often accessed throughout the day and night, making workloads less predictable. At the same time, the infrastructure, as well as the analysis of performance and configuration data required to maintain availability, is more complex than it used to be. New features such as Pervasive Encryption and hardware data compression (zEDC) must be measured and monitored to ensure they do not impact required service levels. New cross-platform applications with Web front-ends and mainframe transaction back-ends are now common and create new requirements for predictive and prescriptive monitoring of TCP/IP, MQ, and other network parts of the infrastructure.

This dynamic environment, coupled with the shortage of performance and capacity experts, represents a significant risk to mainframe operations and affects most of the world’s largest organizations. The mainframe performance skills gap is, in fact, one of the significant issues causing IT executives to question the future role of the mainframe within their organization.

The Future of Mainframe

Compuware’s 2014 survey of 350 large-company CIOs revealed the scope of the mainframe skills gap: 66% of CIOs fear that the impending retirement of the mainframe workforce will hurt their business, and 40% have no formal plans for dealing with the key risks of the skills shortage. Some might suggest that the answer is to move away from the mainframe, but in that same study, 81% of CIOs believe the mainframe will remain a key business asset over the next decade.

Regardless of what some say, reality seems to lean in favor of the mainframe.

Will the mainframe remain viable?

Some still think that the mainframe is dying and that companies will all shift to different platforms. In fact, at the 2016 Amazon Web Services conference, AWS VP & Distinguished Engineer James Hamilton led a chant of “death to the mainframe”. In most cases, however, replacing the mainframe just isn’t feasible given the cost, migration pain, and risk to custom business applications. And besides the difficulty of migrating, the platform itself remains the world-class leader for scalability and security, especially for transaction processing and other systems-of-record functionality. Given that these organizations are not going to be moving off the mainframe for the foreseeable future, solving the skills shortage in the meantime is a priority.

At face value, it seems that there are only a few options to solve the problem created by the skills gap:

  • Train new workers so they can gain sufficient expertise
  • Find and hire experienced talent elsewhere, which cannot work for every shop
  • Outsource operations and hope the outsourcing company can find the skills
  • Move off the mainframe

Steps are being taken to train the next generation of mainframe users, and IBM’s Academic Initiative has gained some traction on this front, but this is not a short-term solution. IntelliMagic often hears, for example, that even performance staff who have worked for years on the platform are able to significantly deepen their expertise only when serious problems occur. That method of learning is too slow and too costly. There is, however, a lower-cost and lower-risk solution: make all staff, experienced and new, more productive and effective in far less time by modernizing analysis of the mainframe application infrastructure with processes enhanced by artificial intelligence.

Modernizing Mainframe Performance and Capacity Analysis for Superior Results

The mainframe infrastructure has many components that are each complex in and of themselves. The relationships and interdependencies between components only increase the difficulty of understanding what is affecting performance. There has been significant innovation in recent years, and each new advance in the physical, virtual, and logical infrastructure components requires mastery of new, complicated technology.

While z/OS mainframe infrastructure technology has advanced and adapted to modern application requirements, the processes for performance and capacity operations have not kept up. The vast majority of sites still depend on a process designed four decades ago to understand the performance and configuration data and what it means for their application workloads on their specific infrastructure configuration.

This process generates hundreds or thousands of reports once per day. These reports are most often static and unrated. In other words, you cannot navigate intelligently through the data or change the views, and there is no indication of whether the metrics are good or bad, important or unimportant. Consequently, reviewing them requires many hours from staff with deep expertise in different parts of the infrastructure, which in turn requires multiple team members.

The net effect is that the data is not reviewed proactively for predictive views of developing problems. Typically it is reviewed only for forensic studies after a significant performance problem has already occurred, or for long-term capacity planning purposes. Inefficiencies, for example in CPU cache usage, and other issues can significantly and unnecessarily increase mainframe software license charges, and they are also very difficult to spot with this status quo reporting process.

“Machines are for answers; humans are for questions”

Artificial intelligence (AI) can be defined as a machine capability that would otherwise require human intelligence to perform. The discussion above makes it clear that reviewing many different performance measurements, and evaluating how those are or are not stressing the various infrastructure components, requires human intelligence and expertise.

Computer algorithms can provide a huge leap forward in carrying out these processes, if they are properly designed with the right types of built-in expert knowledge. Computers are far more effective than humans at this kind of automated assessment and rating to find what is important out of reams of measurement data. Kevin Kelly famously said, “Machines are for answers; humans are for questions.” Humans are good at asking questions and devising plans based on accurate intelligence, and not as good at sorting through reams of data looking for elusive answers.

Historic attempts at using computers for this type of problem have been fraught with false positives, causing analysts to ignore the alerts. False negatives, where the solution fails to detect a real issue, also create distrust and disuse of the solution, for the opposite reason.

Fortunately, the algorithms have now advanced such that they can add real value both to deep expert users, and to new users learning the platform. Proper artificial intelligence makes the entire team more productive and proactive through capabilities such as:

  • Continuous, automated assessment of infrastructure performance risk
  • Continuous, automated assessment of infrastructure cost-efficiency issues
  • Adaptive ratings of assessment severity that vastly reduce false positives and false negatives
  • Exception tables summarizing and prioritizing all identified issues even for very large sites
  • Built-in recommendations that lead to understanding root causes of assessment issues
  • Intelligent correlation and normalization of the data for multi-dimensional navigation by users
  • Automated compare functions and chart customization to accelerate resolution
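
As a rough illustration of what adaptive rating and an exception table could look like, the Python sketch below rates each metric against its own recent history rather than a fixed threshold, then lists only the rated exceptions in priority order. It is not how IntelliMagic Vision or any other product works; the metric names, sample values, and the 2 and 3 standard deviation cutoffs are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Finding:
    metric: str
    value: float
    rating: float   # standard deviations above the metric's own history
    severity: str   # "ok", "warning", or "exception"


def rate(metric: str, value: float, history: list[float],
         warn: float = 2.0, crit: float = 3.0) -> Finding:
    """Rate a value relative to its own history (cutoffs are illustrative)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Flat history: any increase at all is treated as maximally unusual.
        z = 0.0 if value <= mu else float("inf")
    else:
        z = (value - mu) / sigma
    if z >= crit:
        severity = "exception"
    elif z >= warn:
        severity = "warning"
    else:
        severity = "ok"
    return Finding(metric, value, z, severity)


def exception_table(samples: dict[str, tuple[float, list[float]]]) -> list[Finding]:
    """Keep only non-ok findings, most severe first."""
    findings = [rate(name, value, hist) for name, (value, hist) in samples.items()]
    flagged = [f for f in findings if f.severity != "ok"]
    return sorted(flagged, key=lambda f: f.rating, reverse=True)


# Hypothetical metrics: (current value, recent history for the same interval).
SAMPLES = {
    "cpu_busy_pct": (93.0, [88.0, 90.0, 91.0, 89.0, 92.0]),
    "disk_response_ms": (2.5, [1.0, 1.2, 0.9, 1.1, 1.0]),
    "mq_queue_depth": (135.0, [100.0, 120.0, 110.0, 90.0, 105.0]),
}

for f in exception_table(SAMPLES):
    print(f"{f.severity:<10} {f.metric:<18} {f.value:>7.1f}  z={f.rating:.1f}")
```

With these sample values, CPU utilization in the low 90s is rated normal because it matches that system's own history, while the disk response time and queue depth are surfaced, most severe first. A fixed 85% CPU threshold would have raised an alert on every interval, which is the kind of false positive that trains analysts to ignore alerts.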

Delivery of the solution via Cloud Services is also important from a skills gap perspective. First, it removes from the local team the tasks of implementing and maintaining the infrastructure to run the solution, applying software upgrades, and ensuring that continuous processing occurs. Second, it can provide easy access to experts at a vendor who see many different sites, which can accelerate training and problem remediation plans and reduce risk.

Solving the Mainframe Performance Skills Gap

z/OS mainframes today look nothing like they did in the 1980s, yet performance and capacity analysis processes have remained largely unchanged. Artificial intelligence (AI) is being used in many areas of large organizations today for new business value, and the mainframe performance skills gap is quickly making the use of AI on the performance and configuration data and analysis processes a requirement. IntelliMagic Vision has been designed to help your existing and new performance and capacity staff successfully meet the challenge in today’s operating environments.

If you would like to learn more about using modernized analytics to bridge the z/OS performance and capacity planning skills gap, view our webinar on the topic here. 
