The Roots and Evolution of the RMF and SMF for Mainframe Performance Data (Part 1)

By George Dodson

This blog originally appeared as an article in Enterprise Executive.

Computer professionals have been interested in how to make computer applications run faster, and in finding the causes of slow-running applications, for more than 50 years. In the early days, computer performance was in some ways easy to measure because electronic components were soldered in place. To understand what was happening at any point in the circuitry, we simply attached a probe and examined the waveform on an oscilloscope.

Eventually, we were able to measure activity at key points in the computer circuitry to determine things like CPU utilization, channel utilization and input/output response times. However, this method still had many shortcomings. First, the number of probes was very small, usually fewer than 40. Second, this method gave no insight into the operating system functions or application operations that might be causing tremendous overhead. And of course, when integrated circuits were developed, the probe points went away.

In 1966 I joined an IBM team that was focusing on a better way to conduct benchmarks in what was then named an IBM Systems Center. Customers considering computer upgrades would come to our data center to determine how their programs would operate on newly released hardware. But it was simply not possible to host every customer in this way.

Continue reading

“State in Doubt”

By Brett Allison

One of our customers recently came across a problem in their environment that I think warrants some attention. The VMware administrator had gone to the storage team and asked whether they saw any issues in the fabric or the IBM SVC storage environment, because the infamous “state in doubt” message kept popping up in the /var/log/vmkernel log file. The messages were similar to what is shown below:

Continue reading
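The log excerpt itself is in the full post. Purely as an illustration (the log path and the timestamp handling are our assumptions, not details from the customer's environment), a small script along these lines could scan a copy of the vmkernel log and tally how often the message shows up:

```python
from collections import Counter

# Hedged sketch: tally "state in doubt" messages in a copy of the vmkernel log.
# The log path and the assumption that each line begins with a timestamp are
# illustrative; adjust them to match your own environment.
def count_state_in_doubt(path="/var/log/vmkernel"):
    hits = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            if "state in doubt" in line.lower():
                fields = line.split()
                # Use the first token (typically the timestamp) as a coarse bucket.
                bucket = fields[0][:10] if fields else "unknown"
                hits[bucket] += 1
    return hits

if __name__ == "__main__":
    for when, count in sorted(count_state_in_doubt().items()):
        print(f"{when}: {count} 'state in doubt' messages")
```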

Four Steps You Should Take to Identify, Resolve and Prevent IBM SVC Front-End Imbalance

By Brett Allison

Did you know you could be at risk of a performance meltdown while you still have plenty of front-end bandwidth?

An imbalanced front-end can cripple the performance of your IBM SVC system. An imbalanced front-end is another way of saying that too much of the workload is handled by too few ports, which leads to buffer credit shortages, increased latency, and low throughput. It is very easy to create imbalances within an IBM SVC system’s front-end, and it can be fairly difficult to see it happening without the proper tools.

Continue reading
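As a back-of-the-envelope illustration (the port names and throughput figures below are invented, not SVC measurements), one way to quantify imbalance is to compare each port's share of the total front-end workload:

```python
# Hedged sketch: the port names and throughput figures are made up,
# not actual SVC measurements.
port_mbps = {"P1": 520.0, "P2": 480.0, "P3": 35.0, "P4": 25.0}

total = sum(port_mbps.values())
mean = total / len(port_mbps)

for port, mbps in sorted(port_mbps.items()):
    share = 100.0 * mbps / total
    print(f"{port}: {mbps:6.1f} MB/s ({share:4.1f}% of total)")

# Simple imbalance indicator: how much busier the busiest port is than the average.
# Values well above 1.0 suggest too much work is concentrated on too few ports.
imbalance = max(port_mbps.values()) / mean
print(f"Imbalance ratio (max/mean): {imbalance:.2f}")
```

In this made-up example, two of the four ports carry almost all of the traffic, which is exactly the pattern that produces the buffer credit shortages and latency increases described above.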

Modeling – Is it for You?

By Lee LaFrese

In social situations, people sometimes bring up what they do for a living. When I say, “I am a Storage Performance consultant,” I usually get blank stares. When I am asked for more details, I usually reply, “I do a lot of modeling.” This often elicits snickers, which is entirely understandable. Anyone who has met me knows that I don’t have the physique of a model! When I add that it is MATHEMATICAL modeling that I am talking about, it usually clears up the confusion. In fact, folks are typically impressed, and I have to convince them that what I do is not rocket science. Of course, a lot of rocket science is not “rocket science” either, if you use the term as a euphemism for something very complex and challenging to understand. In this article, I will try to help you understand how computer system performance modeling is done, specifically for disk storage systems. Hopefully, you will have a better appreciation of performance modeling after reading this and know where it can be used and what its limitations are.

Continue reading
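To make the idea a bit more concrete (this is the textbook single-server queueing approximation, not necessarily the models Lee's article goes on to describe, and the numbers are made up), here is the classic example of how utilization drives disk response time:

```python
# Minimal single-server queueing sketch: response time rises non-linearly
# with utilization. The service time and I/O rates are made-up numbers.
service_time_ms = 5.0  # average time the device needs per I/O

for iops in (50, 100, 150, 180, 195):
    utilization = iops * service_time_ms / 1000.0        # fraction of time busy
    response_ms = service_time_ms / (1.0 - utilization)  # M/M/1: service + queueing
    print(f"{iops:3d} IO/s -> utilization {utilization:4.0%}, response {response_ms:6.1f} ms")
```

The interesting part is the shape of the curve: at low utilization the response time barely exceeds the service time, while near saturation it grows very quickly, which is why modeling is so useful for predicting where a configuration will start to hurt.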

What HDS VSP and HP XP P9500 Should Be Reporting in RMF/SMF – But Aren’t

By Gilbert Houtekamer, Ph.D.

This is the last blog post in a series of four in which we share our experience with the instrumentation that is available through RMF and SMF for the IBM DS8000, EMC VMAX, and HDS VSP or HP XP P9500 storage arrays. This post is about the Hitachi high-end storage array that is sold by HDS as the VSP and by HP as the XP P9500.

RMF has been developed over the years by IBM, based on IBM storage announcements. Even for the IBM DS8000, not nearly all functions are covered; see the “What IBM DS8000 Should Be Reporting in RMF/SMF – But Isn’t” blog post. For the other vendors it is harder still – they will have to make do with what IBM provides in RMF, or create their own SMF records.

Hitachi has supported the RMF 74.5 cache counters for a long time, and those counters are fully applicable to the Hitachi arrays. For other RMF record types, though, it is not always a perfect match. The Hitachi back-end uses RAID groups that are very similar to IBM’s. This allowed Hitachi to use the RMF 74.8 RAID rank and link records that were designed for the IBM ESS. But for Hitachi arrays with concatenated RAID groups, not all information was properly captured. To interpret data from those arrays, additional external information from configuration files was needed.

With their new Hitachi Dynamic Provisioning (HDP) architecture, the foundation for both Thin Provisioning and automated tiering, Hitachi updated their RMF 74.5 and 74.8 support such that each HDP pool is reflected in the RMF records as if it were an IBM Extent Pool.   This allows you to track the back-end activity on each of the physical drive tiers, just like for IBM.

This does not provide information about the dynamic tiering process itself, however. Just as for the other vendors, there is no information per logical volume on what portion of its data is stored on each drive tier. Nor are there any metrics available about the migration activity between the tiers.
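As a rough sketch of the kind of per-tier roll-up the HDP support described above does enable (the pool names, tier names and I/O rates are hypothetical, not actual RMF 74.5/74.8 fields), the analysis boils down to summing back-end activity per drive tier across the HDP pools:

```python
from collections import defaultdict

# Hedged sketch: pool names, tier names and back-end I/O rates are hypothetical,
# not actual RMF 74.5/74.8 field names or values.
pool_tier_iops = {
    ("POOL01", "Flash"):   1800.0,
    ("POOL01", "SAS 10K"):  950.0,
    ("POOL02", "SAS 10K"): 1200.0,
    ("POOL02", "NL-SAS"):   150.0,
}

per_tier = defaultdict(float)
for (pool, tier), iops in pool_tier_iops.items():
    per_tier[tier] += iops

# Which drive tier carries the most back-end work across all HDP pools?
for tier, iops in sorted(per_tier.items(), key=lambda kv: -kv[1]):
    print(f"{tier:8s}: {iops:7.1f} back-end IO/s")
```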

Overall, we would like to see the following information in the RMF/SMF recording:

  • Configuration data about replication.   Right now, you need to issue console or Business Continuity Manager commands to determine replication status.  Since proper and complete replication is essential for any DR usage, the replication status should be recorded every RMF interval instead.
  • Performance information on Universal Replicator, Hitachi’s implementation of asynchronous mirroring.  Important metrics include the delay time for the asynchronous replication, the amount of write data yet to be copied, and the activity on the journal disks.
  • ShadowImage, FlashCopy and Business Copy activity metrics. These functions provide logical copies that can involve significant back-end activity which is currently not recorded separately.  This activity can easily cause hard-to-identify performance issues, hence it should be reflected in the measurement data.
  • HDP Tiering Policy definitions, tier usage and background migration activity.  From z/OS, you would want visibility into the migration activity, and you’d want to know the policies for a Pool and the actual drive tiers that each volume is using.

Unless IBM is going to provide an RMF framework for these functions, the best approach for Hitachi is to create custom SMF records from the mainframe component that Hitachi already uses to control the mainframe-specific functionality.

It is good to see that Hitachi is working to fit their data into the framework defined by RMF for the IBM DS8000. Yet we would like to see more information from the HDS VSP and HP XP P9500 reflected in the RMF or SMF records.

So when considering your next HDS VSP or HP XP P9500 purchase, also discuss the need to manage it with the tools that you use on the mainframe for this purpose: RMF and SMF.  If your commitment to the vendor is significant, they may be responsive.

Reality Check: Living in a Virtual Storage World

By Lee LaFrese

The last decade has seen a storage virtualization revolution just as profound as what has happened in the server world. In both cases, virtualization enables a logical view and control of the physical infrastructure, with the goal of optimization, simplification and greater utilization of physical resources. These are lofty goals, but there is a fundamental difference between server and storage virtualization. When a virtual server needs compute resources for an application, there is little limitation on which specific resources may be used other than maximum usage caps. With storage resources, once data is placed on a particular storage medium, an application is tied to the portion of the infrastructure that contains that medium, and usage caps don’t apply. Thus, I believe performance management of virtualized storage is intrinsically more difficult and important than managing performance of virtualized servers. In fact, George Trujillo states in a recent entry in his A DBA’s Journey into the Cloud blog that “statistics show over and over that 80% of issues in a virtual infrastructure are due to the design, configuration and management of storage”.

Continue reading

What EMC VMAX Should Be Reporting in RMF/SMF – But Isn’t

By Gilbert Houtekamer, Ph.D.

This is the third in a series of four blogs on the status of RMF as a storage performance monitoring tool. This one is specifically about the EMC VMAX. The previous postings are “What RMF Should Be Telling You About Your Storage – But Isn’t” and “What IBM DS8000 Should Be Reporting in RMF/SMF – But Isn’t.”

RMF has been developed over the years by IBM based on its storage announcements – although even for the IBM DS8000 not nearly all functions are covered; see this blog post. Other vendors will have to work with what IBM provides in RMF, or, like EMC does for some functionality, create their own SMF records.

EMC has supported IBM’s RMF 74.5 cache counters since they were introduced, and they’ve started using the ESS 74.8 records in the past several years to report on FICON host ports and Fibre replication ports.  However, with respect to back-end reporting, it hasn’t been that simple. Since the EMC Symmetrix RAID architecture is fundamentally different from IBM’s, the EMC RAID group statistics cannot be reported on through RMF.

For EMC’s asynchronous replication method SRDF/A, SMF records were defined that, among other things, track cycle time and size.  This is very valuable information for monitoring SRDF/A session load and health.  Since Enginuity version 5876, SRDF/A Write Pacing statistics are written to SMF records as well, allowing users to track potential application impact.   The 5876 release also provided very detailed SMF records for the TimeFinder/Clone Mainframe Snap Facility.
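To show why the cycle-time information is so useful (the “roughly two cycles behind” rule of thumb is our own simplification, and the cycle times below are invented, not real SMF data), observed cycle times can be turned into a rough data-exposure estimate:

```python
# Hedged sketch: estimate SRDF/A data exposure from observed cycle times.
# The "roughly two cycles behind" rule of thumb (one cycle being captured plus
# one in transit) is our simplification; the cycle times below are invented.
cycle_times_s = [30, 32, 45, 120, 38]  # average cycle time per interval, in seconds

for cycle in cycle_times_s:
    estimated_exposure_s = 2 * cycle
    flag = "  <-- worth a closer look" if estimated_exposure_s > 120 else ""
    print(f"cycle time {cycle:4d}s -> estimated exposure ~{estimated_exposure_s:4d}s{flag}")
```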

Still, there are areas where information remains lacking, in particular on back-end drive performance and utilization. Before thin provisioning was introduced, each z/OS volume would be defined from a set of Hyper Volumes on a limited number of physical disks. EMC provided great flexibility with this mapping: you could pick any set of Hyper Volumes that you liked. While conceptually nice, this made it very hard to correlate workload and performance data for logical z/OS volumes with the workload on the physical disk drives. And, since the data on a z/OS volume was spread over a relatively small number of back-end drives, performance issues were quite common. Many customers needed to ask EMC to conduct a study if they suspected such back-end issues – and they still do.
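To illustrate the correlation exercise (the volume names, hyper volume layout and I/O rates below are entirely hypothetical), it essentially comes down to joining host-side workload with a configuration mapping and spreading each volume's I/O over its physical drives:

```python
from collections import defaultdict

# Hedged sketch: volume names, the volume-to-drive mapping and the I/O rates
# are entirely hypothetical. The mapping would come from configuration data,
# not from RMF itself.
volume_iops = {"PROD01": 400.0, "PROD02": 250.0, "TEST01": 50.0}

volume_to_drives = {
    "PROD01": ["D01", "D02", "D03", "D04"],
    "PROD02": ["D03", "D04"],
    "TEST01": ["D05"],
}

drive_iops = defaultdict(float)
for volume, iops in volume_iops.items():
    drives = volume_to_drives[volume]
    for drive in drives:
        # Crude assumption: a volume's I/O is spread evenly over its drives.
        drive_iops[drive] += iops / len(drives)

for drive, iops in sorted(drive_iops.items()):
    print(f"{drive}: {iops:6.1f} estimated back-end IO/s")
```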

With the new thin provisioning and FAST auto-tiering options, the relationship between logical and physical disks has been defined through even more intermediate steps.  While EMC’s FAST implementation using the policy mechanism is very powerful, it may be hard to manage for z/OS users, since no instrumentation is provided on the mainframe.  On the positive side, since data tends to be spread over more disk drives because of the use of virtual pools rather than individual RAID groups, back-end performance issues are less likely than before. Still, more information on back-end activity is needed both to diagnose emerging problems and to make sure no hidden bottlenecks occur.

Information that should make it into RMF or SMF to uncover the hidden internals of the VMAX:

  • Configuration data about SRDF replication.   Right now, users need to issue SRDF commands to determine replication status.  Yet proper and complete replication is essential for any DR usage, so the replication status should be recorded every RMF interval.
  • Data that describes the logical to physical mapping, and physical disk drive utilizations. There is external configuration data available through proprietary EMC tools that can sometimes be used in combination with RMF to compute physical drive activity. This is no substitute for native reporting in RMF or SMF.
  • Snapshot-related backend activity. Snapshots provide immediate logical copies which can generate significant back-end activity that is currently not recorded.  Snapshots are a frequent player in hard-to-identify performance issues.
  • FAST-VP policy definitions, tier usage and background activity. FAST-VP will supposedly always give you great performance, but it cannot do magic: you still need enough spindles and/or Flash drives to handle your workload.  For automatic tiering to work well, history needs to repeat itself, as Lee LaFrese said in his recent blog post, “What Enterprise Storage system vendors won’t tell you about SSDs.”  From z/OS, you want visibility into the migration activity, along with the policies for each Pool and the actual tiers that each volume is using.

It will probably be easier for EMC to create more custom SMF records, as they did for SRDF/A, than it would be to try to get their data into RMF. Such SMF records would be fully under EMC’s control and can be designed to match the VMAX architecture, making it much easier to keep them up to date.

EMC does seem to respond to customer pressure to create reporting in SMF for important areas.  An example of this is the creation of SRDF/A records and the recent write pacing monitor enhancement.

When considering your next EMC VMAX purchase, also consider discussing the ability to manage it with the tools that you use on the mainframe for this purpose: RMF and SMF.   If your company’s order is big enough, EMC might consider adding even more mainframe-specific instrumentation.

What IBM DS8000 Should Be Reporting in RMF/SMF – But Isn’t

By Gilbert Houtekamer, Ph.D.

This is the second in a series of four blogs by Dr. Houtekamer on the status of RMF as a storage performance monitoring tool. This installment is specifically based on experience using the available instrumentation for the IBM DS8000. “What RMF Should Be Telling You About Your Storage – But Isn’t” is the first blog in the series.

While more advanced capabilities are added with every new generation of the DS8000, these also introduce extra levels of ‘virtualization’ between the host and the disk drives. One good example is FlashCopy, which delivers an instantly usable second copy, but also causes a hard-to-predict back-end copy workload. Others are Global Mirror and Easy Tier, which both lag behind when it comes to measurement instrumentation.

Although users value the added functionality, it is wise to be mindful of the inevitable performance impact caused by increased complexity. Ironically, despite the rush to add more functionality, we have not seen a major update to RMF since the ESS was first introduced and the 74.8 records were added to provide back-end RAID group and port statistics.

Continue reading

What RMF Should Be Telling You About Your Storage – But Isn’t

By Gilbert Houtekamer, Ph.D.

With every new generation of storage systems, more advanced capabilities are provided that invariably simplify your life greatly – at least according to the announcements. In practice, however, these valuable new functions typically also introduce a new level of complexity, which tends to make performance less predictable and harder to manage.

Looking back at history, it is important to note that RMF reporting was designed in the early days of CKD (think 3330, 3350) and mainly showed the host perspective (74.1).  With the introduction of cached controllers, cache hit statistics came along that eventually made it into RMF (74.5).  When the IBM ESS was introduced, additional RMF reporting was defined to provide some visibility into the back-end RAID groups and ports (74.8), which is now used by both HDS and IBM.

Continue reading