It is common in today’s challenging business environments to find IT organizations intensely focused on expense reduction. For mainframe departments, this typically results in a high-priority initiative to reduce IBM Monthly License Charge (MLC) software expense, which usually represents the single largest line item in their budget.
This article begins a four-part series focusing largely on a topic that has the potential to generate significant cost savings but which has not received the attention it deserves, namely processor cache optimization. The magnitude of the potential opportunity to reduce CPU consumption and thus MLC expense available through optimizing processor cache is unlikely to be realized unless you understand the underlying concepts and have clear visibility into the key metrics in your environment.
Subsequent articles in the series will focus on ways to improve cache efficiency through optimizing LPAR weights and processor configurations, and finally on the value of additional visibility into the data commonly viewed only through the IBM Sub-Capacity Reporting Tool (SCRT) report. Insights into the potential impact of various tuning actions will be brought to life with data from numerous real-life case studies, drawn from analysis of detailed processor cache data from 45 sites across 5 countries.
Processor cache utilization plays a significant role in CPU consumption for all z processors, but that role is more prominent than ever on z13 and z14 models. Achieving the rated 10% capacity increase on a z13 processor versus its zEC12 predecessor (despite a clock speed that is 10% slower) is very dependent on effective utilization of processor cache. This article will begin by introducing the key processor cache concepts and metrics that are essential for understanding the vital role processor cache plays in CPU consumption.
Key Metric #1 – Cycles Per Instruction (CPI)
The first key metric to consider is Cycles Per Instruction (CPI). CPI represents the average number of processor cycles spent per completed instruction. Conceptually, processor cycles are spent either productively, executing instructions and referencing data present in Level 1 (L1) cache, or unproductively, waiting to stage data due to L1 cache misses. Keeping the processor productively busy executing instructions is highly dependent on effective utilization of processor cache.
Figure 1 breaks out CPI into the “productive” and “unproductive” components just referenced. The space above the blue line and below the red line reflects the cycles productively spent executing instructions for a workload. This would be the CPI value if all required data and instructions were always present in L1 cache, and it reflects the mix of simple and complex machine instructions in a business workload.
But since the amount of L1 cache is very limited, many machine cycles are spent waiting for data and instructions to be retrieved from processor cache or memory into L1 cache. Note that these “waiting on cache” cycles, represented by the area under the blue line, represent a significant portion of the total, typically 35-55% of total CPI. This highlights the magnitude of the potential opportunity if improvements in processor cache efficiency can be achieved and the importance of having clear visibility into key cache metrics.
Key Metric #2 – Relative Nest Intensity (RNI)
A second key metric, and one that correlates closely with the unproductive cycles represented by that blue line, is Relative Nest Intensity (RNI). RNI quantifies how deep into the shared processor cache and memory hierarchy (called the nest) the processor needs to go to retrieve data and instructions when they are not present in L1 cache (see Figure 2). This is very important, because access time increases significantly for each subsequent level of cache and thus results in more waiting by the processor.
The formula to calculate RNI as provided by IBM is processor dependent and reflects the relative access times for the various levels of cache. The z13 RNI formula appears here [Kyne2017].
2.3 * (0.4*L3P + 1.6*L4LP + 3.5*L4RP + 7.5*MEMP) / 100
L3P = % of L1 misses sourced from the shared chip-level L3 cache
L4LP = % of L1 misses sourced from L3 or L4 cache in the same (local) node
L4RP = % of L1 misses sourced from L3 or L4 cache in a remote node or drawer
MEMP = % of L1 misses sourced from memory
Note that retrievals from memory (MEMP) have a weighting factor almost nineteen times higher than the factor for L3 cache (7.5 vs. 0.4), reflecting how many more machine cycles are lost waiting for data when it is not found anywhere in processor cache and must be retrieved from memory.
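As a minimal sketch, the z13 formula above can be expressed as a small Python function. The function name and the sample percentages are hypothetical and chosen only for illustration. In practice the inputs would be derived from SMF 113 counter data.

```python
def z13_rni(l3p, l4lp, l4rp, memp):
    """Relative Nest Intensity using the z13 weights quoted above.

    Each argument is a percentage (0-100) of L1 misses sourced from
    that level of the nest.
    """
    return 2.3 * (0.4 * l3p + 1.6 * l4lp + 3.5 * l4rp + 7.5 * memp) / 100


# Hypothetical workload: 30% of L1 misses sourced from L3, 8% from local L4,
# 1% from remote L4, and 3% from memory.
rni = z13_rni(l3p=30, l4lp=8, l4rp=1, memp=3)
print(f"RNI = {rni:.2f}")  # prints "RNI = 1.17"
```

In this made-up example the 3% of misses sourced from memory contribute more to the result than the 30% sourced from L3 (22.5 versus 12 before scaling), which shows how heavily the formula penalizes trips to memory.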
IBM defines a threshold value for RNI of 1.0, above which a workload is considered to place a high demand on cache and thus may not achieve the rated capacity of a processor. But no matter the current RNI value, reducing it means fewer unproductive waiting cycles and thus less CPU consumption.
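The size of that CPU benefit can be roughed out with the rule of thumb noted in this article: because waiting cycles are approximately half of total CPI, a 10% reduction in RNI correlates to about a 5% reduction in CPU consumption. The sketch below generalizes that arithmetic. The function name is hypothetical, and the linear scaling between RNI and waiting cycles is an assumption behind the rule of thumb, not a measured relationship.

```python
def estimated_cpu_reduction_pct(rni_reduction_pct, waiting_fraction=0.5):
    """Rough estimate of CPU savings (%) from a given RNI reduction (%).

    Assumes waiting cycles shrink in proportion to RNI, so the saving is
    the RNI reduction scaled by the share of CPI spent waiting on cache.
    """
    if not 0.0 <= waiting_fraction <= 1.0:
        raise ValueError("waiting_fraction must be between 0 and 1")
    return rni_reduction_pct * waiting_fraction


print(estimated_cpu_reduction_pct(10))        # prints 5.0
print(estimated_cpu_reduction_pct(10, 0.25))  # prints 2.5
```

The default of 0.5 reflects the "approximately half" figure. A site whose measured waiting share is lower or higher would pass its own value.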
A key technology to understand when seeking to reduce RNI is HiperDispatch (HD). HD was first introduced in 2008 with z10 processors, but it plays an even more vital role on z13 and z14 models where cache performance has such a big impact.
With HD, the PR/SM and z/OS dispatchers interface to establish affinities between units of work and logical CPs, and between logical and physical CPs. This is important because it increases the likelihood of units of work being re-dispatched on the same logical CP and executing on the same (or nearby) physical CP. This optimizes the effectiveness of processor cache at every level, by reducing the frequency of processor cache misses and by reducing the distance (into the nest) required to fetch data.
When HD is active, PR/SM assigns logical CPs as Vertical Highs (VH), Vertical Mediums (VM), or Vertical Lows (VL) based on LPAR weights and the number of physical CPs. VH logical CPs have a one-to-one relationship with a physical CP. VMs have at least a 50% share of a physical CP, while VLs have a low share of a physical CP (and are subject to being “parked” when not in use).
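To make the relationship between weights and vertical CP assignments concrete, here is a simplified sketch of how a configuration might be estimated. The rules in the docstring are assumptions made for illustration. They are not taken from this article, and the actual PR/SM algorithm handles additional cases. The function name, weights, and CP counts are all hypothetical.

```python
import math


def vertical_cp_config(lpar_weight, total_weight, physical_cps, logical_cps):
    """Estimate VH/VM/VL counts for one LPAR (simplified, assumed rules).

    - Guaranteed share = weight / total weight * physical CPs,
      capped at the number of logical CPs defined to the LPAR.
    - Each whole CP of share becomes a VH.
    - A fractional remainder of 0.5 or more becomes one VM.
    - A smaller remainder is pooled with one VH into two VMs, so each
      VM keeps at least a 50% share.
    - Any remaining logical CPs are VLs.
    """
    share = min(lpar_weight / total_weight * physical_cps, logical_cps)
    share = round(share, 6)  # guard against floating-point noise
    whole = math.floor(share)
    frac = share - whole

    if frac < 1e-9:
        vh, vm, vm_share = whole, 0, 0.0
    elif frac >= 0.5:
        vh, vm, vm_share = whole, 1, frac
    elif whole >= 1:
        vh, vm, vm_share = whole - 1, 2, (1 + frac) / 2
    else:
        # Share below half a CP: a single VM with less than a 50% share.
        vh, vm, vm_share = 0, 1, frac

    return {"VH": vh, "VM": vm, "VL": logical_cps - vh - vm,
            "VM share": round(vm_share, 3)}


# Hypothetical LPAR: weight 430 of 1000 on 10 physical CPs, 6 logical CPs.
print(vertical_cp_config(430, 1000, 10, 6))
```

Under these assumed rules, a share of 4.3 physical CPs yields three VHs and two VMs at 65% each, with the one remaining logical CP a VL. Raising the weight so the share reaches 5.0 would turn all five into VHs.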
The Benefit of Maximizing Work Executing on Vertical Highs
Work executing on VHs optimizes cache effectiveness, because its one-to-one relationship with a physical CP means it will consistently access the same processor cache. On the other hand, VMs and VLs may be dispatched on various physical CPs where they will be contending for processor cache with workloads from other LPARs, making the likelihood of their finding the data they need in cache significantly lower.
This intuitive understanding of the benefit of maximizing work executing on VHs is confirmed by multiple data sources. One such source is calculated estimates of the life of data in various levels of processor cache (derived from SMF 113 data). Commonly, cache working set data remains in L1 cache for less than 0.1 millisecond (ms), in L2 cache for less than 2 ms, and in L3 cache around 10 ms. This means that by the time an LPAR gets re-dispatched on a CP after another LPAR executed there for a typical PR/SM time-slice of 12.5 ms, its data in L1, L2 and L3 caches will all be gone. The good news is those working sets will be rebuilt quickly from L4 cache, but the bad news is that each access to L4 cache may take 100 or more cycles.
This thesis that work executing on VHs experiences better cache performance is further substantiated by analyzing RNI data at the logical CP level. At the beginning of the case study presented in Figure 3, the Vertical CP configuration for this system consisted of three VHs and two VMs. Note that the RNI values for the two VM logical CPs (CPs 6 and 8) were higher than the RNIs for the VHs.
After the deployment of additional hardware caused the Vertical CP configuration to change to five VHs (Figure 4), the greatest reductions in RNI occurred for CPs 6 and 8. Now that they had also become VHs and were no longer experiencing cross-LPAR contention for processor cache, their RNIs decreased to a level comparable to the other VHs.
Optimizing Processor Cache to Lower MLC Software Costs
Up to this point, we have seen that CPI decreases when there is a reduction in unproductive machine cycles waiting for data and instructions to be staged into L1 cache. Reducing CPI translates directly to decreased CPU consumption, which when occurring at peak periods results in reduced IBM Monthly License Charge (MLC) software expense.
Opportunities to optimize processor cache are worth investigating, because those unproductive waiting cycles typically represent at least one third of overall CPU, and often one half or more. Fortunately, the mainframe is a very metric-rich environment, and RNI is a readily available metric that correlates with those unproductive waiting cycles.
Note that optimizing processor cache can have a particularly big payoff on z13 and z14 processors, which are more dependent than ever before on effective processor cache utilization. In the next article in this series we will explore tuning actions to reduce RNI and thereby optimize processor cache.
Read part 2 here: Reduce MLC Software Costs by Optimizing LPAR Configurations
[Sinram2015] Horst Sinram, z/OS Workload Management (WLM) Update for IBM z13, z/OS V2.2 and V2.1, SHARE Session #16818, March 2015.
[Kyne2017] Frank Kyne, Todd Havekost, and David Hutton, CPU MF Part 2 – Concepts, Cheryl Watson’s Tuning Letter 2017 No. 1.
 While “waiting to stage data,” z processors leverage out-of-order execution and other pipeline optimizations behind the scenes to minimize unproductive waiting.
 Another small component of CPI, Translation Lookaside Buffer (TLB) misses while performing Dynamic Address Translation, is represented by the yellow line on Figure 1.
 The fact that waiting cycles are approximately half of total CPI leads to the rule of thumb that a 10% reduction in RNI correlates to a 5% reduction in CPU consumption.