Root Cause Analysis for z Systems Performance – Down the Rabbit Hole

By Morgan Oats

Finding the root cause of z Systems performance issues often feels like falling down a dark and endless rabbit hole. There are many paths you can take, each leading to further possibilities, but clear indicators of where you should really be heading to resolve the problem are typically lacking. Performance experts tend to rely on experience to judge where the problem most likely is, but this may not always be adequate, and in the case of disruptions, time is money.

Performance experts with years of experience are likely to resolve problems faster than newer members of the performance team. But with the performance and capacity skills gap the industry is experiencing, an approach is needed that doesn’t require decades of experience.

Rather than aimlessly meandering through mountains of static reports, charts, and alerts that do more to overwhelm our senses than assist in root cause analysis, performance experts need a better approach. An approach that not only shines a light down the rabbit hole, but tells us which path will lead us to our destination. Fortunately, IntelliMagic Vision can be your guide.

Root Cause Analysis in z Systems Performance – Where to Begin?

Perhaps the most crucial step in uncovering the root cause of a z Systems performance disruption is determining where to begin. You may have a hunch or underlying assumption about the source of a problem, but a mistake at this critical first step may plunge you deep into the rabbit hole for hours, only to discover that you have wasted all that time.

IntelliMagic Vision begins the root cause analysis process with a high-level “Warnings and Exceptions” table, which gets us started right where it matters most. The example below is for Coupling Facilities.

Root Cause Analysis for z systems performance

This view alerts us to all CF components that have exceptions (highlighted red), indicating likely disruptions, and warnings (highlighted yellow) about potential issues we may not even know about.

In addition to the prioritized exception table, we are offered Observations & Recommendations on what is likely causing the issue and what we can do about it. The Observations & Recommendations for the first report, “Dr Reclaims”, read: “Directory reclaims happen regularly. A directory reclaim occurs when there are not enough directory entries to describe all referenced data. The coupling facility will reclaim an existing slot and cross-invalidate the data as needed. This should be avoided. Increase the directory space for the structure.”

From this single view we immediately see all current and upcoming issues, and specifically what is causing them. Much of the root cause analysis is done in the first step.

By clicking on any of the reports, we are taken to the report with the exception or warning. Let’s drill down to the directory reclaims report.

Directory entry reclaim rate

In this view we can see the directory entry reclaim rate for the structure HSMCACHE1, and indeed, the reclaim rate is significantly exceeding the threshold.
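To make the quoted recommendation concrete, here is a minimal sketch of why too few directory entries leads to reclaims. It is a toy model, not the coupling facility's actual algorithm: the least-recently-referenced replacement policy, the function name `count_reclaims`, and the workload are all assumptions made for illustration.

```python
from collections import OrderedDict


def count_reclaims(references, directory_entries):
    """Count reclaims in a toy directory with a fixed number of entries.

    Hypothetical model: when a new item needs a slot and the directory is
    full, the least recently referenced entry is reclaimed.
    """
    directory = OrderedDict()  # keys are data item names, oldest first
    reclaims = 0
    for item in references:
        if item in directory:
            directory.move_to_end(item)  # already tracked, mark as recent
            continue
        if len(directory) >= directory_entries:
            directory.popitem(last=False)  # reclaim the oldest entry
            reclaims += 1
        directory[item] = True
    return reclaims


# Illustrative workload: 6 distinct items referenced round-robin, 10 passes.
workload = [f"page{i}" for i in range(6)] * 10

small = count_reclaims(workload, directory_entries=4)
large = count_reclaims(workload, directory_entries=6)
print(small, large)  # prints "56 0"
```

In this toy workload, a directory that cannot describe all six referenced items reclaims an entry on almost every reference, while one with enough entries never reclaims. That is the intuition behind "increase the directory space for the structure".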

From Overall Health to Specific Issues

Another way to determine the root cause of this performance issue is to begin at the Overall Health Dashboard for Coupling Facility Cache Usage. Rather than always needing to return to the exception dashboard to resolve issues, it’s important to be able to see issues while examining overall health and performance. The view below represents all of the cache structures in our environment and gives a good indication of our overall health.

Coupling Facility Cache Usage

Using the basic color scheme (green is good, yellow is a warning, and red is bad), we immediately see that there are performance issues occurring right now. The color of the bubbles, as well as the box surrounding the dashboard, comes from the rating, in this case 0.42. We don’t need to interpret the rating at this stage because the color scheme does that for us. (Blue bubbles indicate activities that don’t warrant a rating.)

Looking at the key metrics on top of the rated bubbles, we see that there is clearly an issue in Directory Reclaims for the structures HSMCACHE1 and HSMCACHE2.

The rated dashboard turned some lights on in this rabbit hole, and the red bubbles tell us exactly which paths to go down – clicking on the red bubble takes us to the next step.

The Paths Begin to Multiply

Root cause analysis is so tricky because of the sheer number of possibilities. z Systems performance analysis has continued to grow in complexity while the old tools and methods for monitoring and managing these systems have remained static and stagnant.

After clicking on the red bubble in the previous chart, we see a set of minicharts representing all of the various components and key metrics of the structure HSMCACHE1. And like the overview dashboard, each of these charts is rated to tell us where to investigate further.

CF Cache Structures Minicharts

Dynamic Thresholds

An interesting point to note in this view is the difference between the red minichart in the top right and the green minichart second to last on the bottom row. The green chart has dynamic thresholds: you can see this in the white, light, and dark backgrounds, which rise and fall with the workload. Compare this to the red chart, which has a static threshold. Let’s drill into the green chart with the dynamic threshold.

Dynamic Thresholds

By clicking on any of the rated minicharts, you are taken to a chart for that particular metric – in this case the “Directory entry counter snapshot” chart. If the chart had used a static threshold, it would have triggered an exception rating.
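The difference between the two kinds of threshold can be sketched in a few lines. This is an illustration only: the article does not describe how IntelliMagic Vision derives its dynamic thresholds, so the linear scaling with workload, the function names, and all numbers below are assumptions.

```python
def static_exceptions(values, limit):
    """Flag every interval whose value exceeds one fixed limit."""
    return [v > limit for v in values]


def dynamic_exceptions(values, workload, limit_per_unit):
    """Flag intervals whose value exceeds a limit that scales with workload.

    Assumption: the allowed value grows linearly with the workload level.
    """
    return [v > w * limit_per_unit for v, w in zip(values, workload)]


# Illustrative numbers only: a metric that rises and falls with the workload.
workload = [100, 200, 400, 800, 400, 200]  # hypothetical activity per interval
metric = [10, 20, 40, 80, 40, 20]          # hypothetical metric per interval

static_flags = static_exceptions(metric, limit=50)
dynamic_flags = dynamic_exceptions(metric, workload, limit_per_unit=0.15)
print(static_flags)   # the peak interval is flagged
print(dynamic_flags)  # nothing is flagged; the metric tracks the workload
```

With the fixed limit, the busiest interval is flagged even though the metric is only following the workload. With the workload-scaled limit, the same data raises no exception.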

IntelliMagic Vision knows when your hardware can handle a fluctuating metric and when it can’t. This is because IntelliMagic Vision’s deep intelligence considers not only the configuration settings of your mainframe, but also the specific capabilities of your workload. This predictive intelligence is applied to the dashboards and ratings that guide your way to root cause discovery.

Drill Downs Result in Root Cause Discovery

By returning to the minicharts from the previous step we can continue with the root cause analysis by drilling down into the chart with the red border.

Directory entry reclaim rate

This is the same chart that we arrived at after clicking on the report in the Exception dashboard. During the time period seen at the bottom of the chart, HSMCACHE1 stayed above the warning threshold for long enough to generate the red border and high rating.
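The idea that a rating reflects how long a metric stays above its threshold, rather than a single spike, can be sketched as follows. This is not IntelliMagic Vision's rating formula, which the article does not give: the fraction-of-intervals measure, the color cutoffs (`yellow_at`, `red_at`), and the sample data are all hypothetical.

```python
def exceedance_fraction(samples, threshold):
    """Return the share of intervals in which the metric exceeded the threshold."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s > threshold) / len(samples)


def color_for(fraction, yellow_at=0.1, red_at=0.3):
    """Map an exceedance fraction to a color using hypothetical cutoffs."""
    if fraction >= red_at:
        return "red"
    if fraction >= yellow_at:
        return "yellow"
    return "green"


# Hypothetical reclaim rates per interval, with a threshold of 10.
sustained = [0, 2, 12, 15, 18, 9, 14, 1]          # above threshold in 4 of 8
brief = [0, 1, 25, 0, 2, 1, 0, 0, 1, 0, 0, 0]     # one spike in 12 intervals

print(color_for(exceedance_fraction(sustained, 10)))  # prints "red"
print(color_for(exceedance_fraction(brief, 10)))      # prints "green"
```

Under these assumed cutoffs, a sustained overrun is rated red while an isolated spike is not, which keeps attention on problems that persist.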

And thanks to the recommendations from IntelliMagic Vision, we know that to resolve this problem we should create more directory space. A z Systems performance expert may take this knowledge for granted, but a novice to the industry may greatly benefit not only from quick drill-down capabilities, but also from the Observations and Recommendations that point towards the resolution.

Optimize z Systems Performance Root Cause Analysis – Eliminate the Rabbit

Root cause analysis can be a painful process when you’re running blindly down a rabbit hole of data. With the right tools and a modernized approach to monitoring and managing your infrastructure, most of your time is spent resolving the problem rather than finding it. Or even better, preventing problems altogether. This lets you get back to what’s important in your job and away from constant fire-fighting.

