A prominent theme among IT organizations today is an intense focus on expense reduction. For mainframe departments, this routinely involves seeking to reduce IBM Monthly License Charge (MLC) software expense, which commonly represents the single largest line item in their budget.
This is the second article in a four-part series focusing largely on a topic that has the potential to generate significant cost savings but which has not received the attention it deserves, namely processor cache optimization. (Read part one here). Without an understanding of the vital role processor cache plays in CPU consumption and clear visibility into the key cache metrics in your environment, significant opportunities to reduce CPU consumption and MLC expense may not be realized.
This article focuses on changes to LPAR configurations that can improve cache efficiency, as reflected in lower RNI values. The two primary aspects covered will be optimizing LPAR topology, and increasing the amount of work executing on Vertical High (VH) CPs through optimizing LPAR weights. Restating one of the key findings of the first article, work executing on VHs optimizes processor cache effectiveness, because its 1-1 relationship with a physical CP means it will consistently access the same processor cache.
PR/SM dynamically assigns LPAR CPs and memory to hardware chips, nodes, and drawers seeking to optimize cache efficiency. This topology can have a very significant impact on processor performance because remote cache accesses can take hundreds of machine cycles. Figure 1 provides a framework for the following discussion.
A z13 (or z14) processor has one to four CPU drawers (along with some number of I/O drawers), and two nodes in each drawer. Each active physical CP (labeled “PU” on Figure 1) has its own dedicated L1 and L2 cache.
When a unit of work executing on a CP on a given Single Chip Module (SCM) accesses data in L3 cache (the first level that is shared and thus part of the “nest” and RNI metric), that access can be on its own SCM chip, “on node” (on a different SCM within this node), “on drawer” (on the other node in this drawer), or “off drawer”. The more remote the access, the greater the number of machine cycles required, often hundreds of cycles. Similarly, access to L4 cache and to memory can be “on node”, “on drawer”, or “off drawer.”
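The four locality levels described above can be expressed as a small classification routine. This is only an illustrative sketch: the `Location` type and the `classify_access` function are hypothetical names introduced here, not part of any IBM interface, and the drawer, node, and chip numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical position of a CP or cache in the z13 hierarchy."""
    drawer: int
    node: int   # node number within the drawer
    chip: int   # SCM chip number within the node


def classify_access(requester: Location, source: Location) -> str:
    """Label an L3 access by how far the data is from the requesting CP."""
    if requester.drawer != source.drawer:
        return "off drawer"   # most remote, most machine cycles
    if requester.node != source.node:
        return "on drawer"    # other node in the same drawer
    if requester.chip != source.chip:
        return "on node"      # different SCM within the same node
    return "on chip"          # same SCM as the requesting CP


# Hypothetical CP in drawer 3 referencing data held near CPs in drawer 2
print(classify_access(Location(3, 1, 1), Location(2, 1, 1)))
```

For L4 cache and memory, which are not chip-level resources, only the last three labels apply.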
A first example showing the impact of LPAR topology on RNI begins with Figure 2.
Each entry in this diagram represents a general purpose logical CP and indicates the system identifier, polarity (VH, VM, or VL), and relative logical CP number, starting from 0. VHs are highlighted in bold. zIIP CPs have been removed because this analysis is focused on general purpose CPs.
In this scenario, the logical CPs from several LPARs are competing for physical CPs on three chips. Green and orange shading has been added to compare two primary Production systems that execute similar data sharing workloads, each having three VHs and two VMs.
Note that the VHs and VMs for System 3 (shaded green) are collocated on the same chip. The “RNI by Logical CP” chart for this system appears in Figure 3 below. The RNIs for the two VM CPs on that system (CPs 6 and 8, the green and light blue lines at the top of the chart) were higher than for the VHs, but not dramatically so.
Contrast that with the topology of System 18 (in orange), which has three VHs in Drawer 2, but its two VMs reside in a different drawer, Drawer 3. These VMs are not located on a different chip in the same node, nor on a different node in the same drawer, but in the most remote possible location, in a different drawer.
Figure 4 shows the extent of the negative impact to RNI that results from cache references for the VMs on System 18 routinely travelling across drawers. The gap in RNI between the two VMs (CPs 6 and 8) and the three VHs is much larger on this system with the adverse LPAR topology than it was on System 3 (Figure 3), where the VHs and VMs were collocated on the same chip. Figure 5 shows the dramatic improvement in RNI when CPs 6 and 8 changed from VMs to VHs, because it is very likely that all five of those VHs are now collocated on the same chip.
Additional analysis breaking out the waiting cycles of CPI by the various causes shows the impact of the far slower off-drawer accesses on System 18’s CPI (Figure 6). Note the significant increases in cycles spent on off-drawer accesses for System 18 for L3 cache (gray), L4 cache (light blue), and memory (dark blue). The cumulative impact of the L3 and L4 off-drawer accesses added more than half a cycle (0.57) per instruction to System 18.
Drilling further into this cache miss data on System 18 by logical CP (Figure 7) demonstrates how those off-drawer accesses predominantly occurred on the two VM logical CPs, CPs 6 and 8, as expected. Note that work executing on those two VMs required two full additional cycles per instruction compared to work executing on the VHs.
When there is a significant disparity between the RNIs for VMs and VHs on the same system, an investigation of the LPAR topology is warranted. This analysis also highlights another benefit of having more work executing on VHs. In addition to avoiding contention for cache with other LPARs (as identified earlier), since PR/SM is very likely to collocate VHs, the incremental work now executing on these VHs is unlikely to be subject to the cross-node or cross-drawer access times often experienced on VMs.
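One way to apply this rule of thumb is to compare the average RNI of the VM logical CPs with that of the VHs and flag systems where the gap is large. The sketch below assumes per-CP RNI values have already been derived for an interval. The function names, the sample values, and especially the 25% threshold are hypothetical choices for illustration, not figures from the analysis above.

```python
from statistics import mean


def vm_vh_rni_gap(rni_by_cp, polarity_by_cp):
    """Relative excess of mean VM RNI over mean VH RNI, or None if either is absent."""
    vh = [rni for cp, rni in rni_by_cp.items() if polarity_by_cp.get(cp) == "VH"]
    vm = [rni for cp, rni in rni_by_cp.items() if polarity_by_cp.get(cp) == "VM"]
    if not vh or not vm:
        return None
    return mean(vm) / mean(vh) - 1.0


def needs_topology_review(rni_by_cp, polarity_by_cp, threshold=0.25):
    """Flag a system whose VM RNI exceeds its VH RNI by more than `threshold`."""
    gap = vm_vh_rni_gap(rni_by_cp, polarity_by_cp)
    return gap is not None and gap > threshold


# Hypothetical interval data: three VHs and two VMs (CPs 6 and 8)
polarity = {0: "VH", 2: "VH", 4: "VH", 6: "VM", 8: "VM"}
collocated = {0: 1.00, 2: 1.02, 4: 0.98, 6: 1.10, 8: 1.12}
cross_drawer = {0: 1.00, 2: 1.02, 4: 0.98, 6: 1.60, 8: 1.70}

print(needs_topology_review(collocated, polarity))
print(needs_topology_review(cross_drawer, polarity))
```

A flagged system is only a candidate for review; the topology data itself is what confirms whether the VMs have been placed on a remote node or drawer.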
A second LPAR topology scenario involves an entirely different kind of opportunity. In the use case depicted in Figure 8, PR/SM configured all ten VH CPs from two Production LPARs in the same node on a single drawer. The outcome of this configuration was that the two LPARs were sharing the 480 MB L4 local cache of that single node between them.
Increasing the memory allocation for both LPARs caused them to exceed the memory available in a single drawer and forced PR/SM to assign them to separate drawers and thus separate nodes. Now that these two LPARs had the L4 local cache of two nodes available to them (960 MB), we would expect more misses to be resolved from there. (The z13 RNI formula is repeated here to highlight how much lower the weighting is for L4 Local accesses than L4 Remote and Memory.)
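For reference, the z13 RNI formula as commonly published in IBM CPU MF material is RNI = 2.3 × (0.4 × L3P + 1.6 × L4LP + 3.5 × L4RP + 7.5 × MEMP) / 100, where each term is the percentage of L1 misses sourced from that level. The sketch below evaluates it. Treat the coefficients as an assumption to verify against the first article, and note that the sample percentages are hypothetical rather than measured values.

```python
def rni_z13(l3p, l4lp, l4rp, memp):
    """z13 RNI from the percentage of L1 misses sourced at each nest level.

    Coefficients are assumed from published IBM CPU MF material; verify
    them against the first article in this series before relying on them.
    """
    return 2.3 * (0.4 * l3p + 1.6 * l4lp + 3.5 * l4rp + 7.5 * memp) / 100


# Hypothetical sourcing percentages (the remainder of L1 misses resolve in L2)
before = rni_z13(l3p=22.0, l4lp=5.0, l4rp=1.0, memp=3.0)

# Same workload with 1.5 points of misses moved from memory to L4 local
after = rni_z13(l3p=22.0, l4lp=6.5, l4rp=1.0, memp=1.5)

print(f"RNI before: {before:.3f}, after: {after:.3f}")
print(f"Relative change: {(after - before) / before:.1%}")
```

Because the memory weighting is several times the L4 local weighting, even a small shift of misses from memory to local L4 moves RNI noticeably.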
Figure 9 shows that an additional 1.5% of L1 misses were now sourced from L3 or L4 local node cache (L4LP in green), reducing the frequency of accesses from L3 or L4 remote cache and especially from memory. This resulted in an 11.5% reduction in RNI and a corresponding 6% increase in effective capacity. Note that this change did not require deploying any additional hardware, but instead involved utilizing the existing hardware more effectively.
Maximizing Work Executing on VHs
A second way to reduce RNI is to maximize work executing on VHs. The two variables determining the Vertical CP configuration are (1) LPAR weights and (2) the number of physical CPs. The remainder of this article will cover how LPAR weights can be adjusted to maximize work executing on VHs. The third article of the series will address options and considerations for increasing the number of physical CPs.
There are several ways to maximize work on VHs through setting LPAR weight values. One is to adjust LPAR weights to increase the number of VHs for high CPU LPARs that currently have a significant workload executing on VMs and possibly even VLs. In Figure 10, a very small change on a large LPAR, raising its weight from 70 to 71 percent, increased the number of VHs from seven to eight.
This resulted in a measured decrease in RNI of 2% for a given measurement interval, which correlated to a CPU reduction of 1%. A 1% CPU reduction on a large LPAR can translate into a meaningful reduction in MLC software expense, especially when weighed against the small effort required to identify and implement this type of change. Tuning LPAR weights typically produces single-digit percentage improvements, as in this case, but there can be larger opportunities, as we will now see.
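The 70 to 71 percent example can be worked through as arithmetic. The sketch below computes each LPAR's guaranteed share of physical CPs from its weight and then applies a simplified rule of thumb often used to describe HiperDispatch: a fractional share of 0.5 or more yields one VM on top of the whole CPs as VHs, while a smaller fraction gives up one VH so that two VMs share the remainder. This is not the actual PR/SM algorithm, which has further cases (including VLs and the z13 change noted for small LPARs). The 12 physical CPs, the LPAR names, and the function names are assumptions for illustration.

```python
import math


def guaranteed_share(weights, physical_cps):
    """Each LPAR's share of physical CPs, proportional to its weight."""
    total = sum(weights.values())
    return {name: w * physical_cps / total for name, w in weights.items()}


def estimate_vertical_cps(share):
    """Simplified estimate of (VH count, VM count) for a CP share.

    Rule of thumb only; ignores VLs and machine-specific exceptions.
    """
    whole = math.floor(share)
    frac = share - whole
    if frac >= 0.5:
        return whole, 1                       # whole CPs as VHs, one VM for the rest
    if whole >= 1:
        return whole - 1, 2                   # two VMs share 1 + frac CPs
    return 0, 1                               # share below half a CP


# Hypothetical 12-CP processor; "BIG" is the large LPAR, "REST" is everything else
for big_weight in (70, 71):
    shares = guaranteed_share({"BIG": big_weight, "REST": 100 - big_weight}, 12)
    vh, vm = estimate_vertical_cps(shares["BIG"])
    print(f"weight {big_weight}%: share {shares['BIG']:.2f} CPs, {vh} VH, {vm} VM")
```

Under these assumptions, the one-point weight change moves the share across a half-CP boundary, which is why such a small adjustment can add a VH.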
A second way to increase work executing on VHs involves tailoring LPAR weights to increase the overall number of VHs assigned by PR/SM on a processor. The LPAR weight configuration of 30/30/20/20% in Figure 11 appears routine, but unfortunately, on a z13 it results in zero VHs.
As Figure 12 shows, relatively small LPAR weight changes could increase the number of VHs from zero to two, and increase the amount of work eligible to execute on VHs from 0% up to 33%. On a comparable system in this environment, the RNI for work executing on VHs was 20% lower than work on VMs, so it is likely this change would significantly reduce CPU consumption on this processor.
If the predominant characteristics of workloads on LPARs change significantly between shifts (e.g., online vs. batch), automating LPAR weight changes corresponding to those shifts may be another way to increase the workload executing on VHs. And finally, fewer, larger LPARs may be a configuration option to increase the size of the workload executing on VHs.
Optimizing processor cache can have a particularly big impact on CPU consumption for z13 and z14 processors, which are more sensitive than ever before to cache effectiveness. In the next article in this series, we will explore options and considerations relating to the number of physical CPs that can reduce RNI, CPU consumption, and MLC expense.
Read part 3 here: Optimizing MLC Software Costs with Processor Configurations
[Havekost2017a] Todd Havekost, Impact of Processor Cache Optimization on MLC Software Costs, Enterprise Tech Journal, 2017: Issue 4.
[Sinram2015] Horst Sinram, z/OS Workload Management (WLM) Update for IBM z13, z/OS V2.2 and V2.1, SHARE Session #16818, March 2015.
[Havekost2017b] Todd Havekost, Beyond Capping: Reduce IBM z13 MLC with Processor Cache Optimization, SHARE Session #20127, March 2017.
[Snyder2016] Bradley Snyder, z13 HiperDispatch – New MCL Bundle Changes Vertical CP Assignment for Certain LPAR Configurations, IBM TechDoc 106389, June 2016.
 For background on key metrics and concepts such as Cycles Per Instruction (CPI), Relative Nest Intensity (RNI), HiperDispatch, and vertical CP configurations, see the first article in the series [Havekost2017a].
 LPAR topology data is provided by the SMF Type 99 Subtype 14 record. Unlike some SMF 99 subtypes, which can generate overwhelming volumes, the volume of subtype 14 data is very manageable (one record per logical CP every five minutes), making this another data source that warrants collection and analysis.
 IBM specialists assisting at my former employer identified this opportunity and the workaround to create the desired topology, and measured the increase in effective capacity, which correlated well with estimated CPU savings derived from the RNI metric.
 The specifics of how PR/SM determines Vertical CP assignments based on LPAR weights and the number of physical CPs are beyond the scope of this article [for details see Havekost2017b].
 This is the configuration after an IBM z13 microcode change released June 2016 [see Snyder2016 for details].