z/OS Performance Monitors – Why Real-Time is Too Late

By Morgan Oats

Real-time z/OS performance monitors are often advertised as the top tier of performance management. Real-time monitoring means just that: system and storage administrators can continuously view performance data and/or alerts indicating service disruptions as they happen. In theory, this enables administrators to quickly fix the problem. For some companies, service disruptions may not be too serious if they are resolved quickly enough. Even though those disruptions could be costing them a lot more than they think, they believe a real-time monitor is the best they can do to meet their business needs.

For leading companies, optimal z/OS performance is essential for day-to-day operations: banks with billions of transactions per day, global retailers, especially on Black Friday or Cyber Monday, government agencies and insurance companies that need to support millions of customers at any given time, transportation companies with 24/7 online delivery tracking; the list goes on and on. For these organizations and many others, real-time performance information is, in fact, too late. They need information that enables them to prevent disruptions – not simply tell them when something is already broken.

Continue reading

Which Workloads Should I Migrate to the Cloud?

By Brett Allison

By now, we have just about all heard it from our bosses, “Alright folks we need to evaluate our workloads and determine which ones are a good fit for the cloud.” After feeling a tightening in your chest, you remember to breathe and ask yourself, “How the heck do I accomplish this task as I know very little about the cloud and to be honest it seems crazy to move data to the cloud!”

According to this TechTarget article, “A public cloud is one based on the standard cloud computing model, in which a service provider makes resources, such as applications and storage, available to the general public over the internet. Public cloud services may be free or offered on a pay-per-usage model.” Most organizations have private clouds, and some have moved workloads into public clouds. For the purpose of this conversation, I will focus on the public cloud. Continue reading

Is Your Car or Mainframe Better at Warning You?

By Jerry Street

Imagine driving your car when, without warning, all of the dashboard lights come on at the same time. Yellow lights, red lights. Some blinking, while others even sound audible alarms. You would be unable to identify the problem because you’d have too many warnings, too much input, too much display. You’d probably panic!

That’s not likely, but if your car’s warning systems did operate that way, would they make any sense to you? Conversely, if your car didn’t have any dashboard at all, how would you determine whether it was about to have a serious problem, like very low oil pressure or coolant? Could you even operate it safely without an effective dashboard? Even the least expensive cars include sophisticated monitoring and easy interpretation of metrics into good and bad indicators on the dashboard.

You need a similar dashboard for your z/OS mainframe. When any part of the infrastructure starts to be at risk of not performing well, you need to know it, and sooner is better. By being warned of a risk in an infrastructure component’s ability to handle your peak workload, you can avoid the problem before it impacts production users, or fix what is minor before the impact becomes major. The only problem is that the dashboards and reporting you’re using today for your z/OS infrastructure, and most monitoring tools, do not provide this type of early warning.

Continue reading

All I Want for Christmas is…Time

By Jerry Street

With the holiday season upon us, I occasionally think of what might be waiting for me to unwrap. Will it be another gift card? I hope not. Gift cards are someone’s way of saying, “I appreciate you so much that you should get your own present.” There are many things that I would enjoy getting as a present, but the one thing that would actually make my life better would be a couple of extra hours in my day. I need more time! Unfortunately, I can’t get the earth to slow down and make a full revolution in 26 hours instead of 24. So I need tools to save me time within the 24 hours that I’m scripted to have.

As IT performance professionals, we are continually asked to do more. Systems grow more complex, analyses need to be delivered faster, and dollars have to be spent more wisely than ever. When professional demands require more of your time, you can either give up your personal time or let the quality of your work suffer. I don’t want to do either of those things, so I would choose to do my job both faster and better. A tool that helps me accomplish both goals is IntelliMagic Vision. Continue reading

z/OS Petabyte Capacity Enablement

By Dave Heggen

We work with many large z/OS customers and have seen only one requiring more than a petabyte (PB) of primary disk storage in a single sysplex. Additional such z/OS environments may exist, but we’ve not yet seen them (if you are that site, we’d love to hear from you!). The larger environments we see are 400-750 TB per sysplex and growing, so it’s likely those will reach a petabyte requirement soon.

IBM has already stated that the 64K device limitation will not be lifted. Customers requiring more than 64K devices have gotten relief by migrating to larger devices (3390-54 and/or Extended Address Volumes) and by exploiting Multiple Subsystems (MSS) for PAV aliases, Metro Mirror (PPRC) secondary devices, and FlashCopy target devices.
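To see why larger volumes relieve the 64K device limit, a back-of-the-envelope calculation helps. The sketch below uses published 3390 track geometry; the petabyte target and the volume counts are illustrative assumptions, not a sizing recommendation.

```python
# Rough capacity math for reaching 1 PB within the 64K device limit.
# 3390 geometry: 15 tracks per cylinder, ~56,664 usable bytes per track.

TRACK_BYTES = 56_664
TRACKS_PER_CYL = 15

def volume_gb(cylinders: int) -> float:
    """Approximate usable capacity of a 3390 volume in GB."""
    return cylinders * TRACKS_PER_CYL * TRACK_BYTES / 1e9

mod54 = volume_gb(65_520)      # 3390-54: largest non-EAV volume
eav = volume_gb(1_182_006)     # 3390-A EAV at a ~1 TB maximum

petabyte = 1e15
print(f"3390-54 ~ {mod54:.1f} GB -> {petabyte / (mod54 * 1e9):,.0f} volumes per PB")
print(f"EAV     ~ {eav:.0f} GB -> {petabyte / (eav * 1e9):,.0f} volumes per PB")
```

The point of the arithmetic: a petabyte of 3390-54 volumes consumes a large fraction of the 64K device numbers before aliases and copy targets are even counted, while EAV-sized volumes leave ample headroom.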

The purpose of this blog is to discuss the strategies of how to position existing and future technologies to allow for this required growth. Continue reading

Does everybody know what time it is? Tool Time!

By Lee LaFrese

Home Improvement was a popular TV show from the ’90s that lives on forever in re-runs. Usually, one of the funniest segments was the show within a show, “Tool Time”. During Tool Time, Tim Taylor (played by Tim Allen) would demonstrate how to use various power tools, often with disastrous and hilarious results. If things were not working right, he would often exclaim “more power!” as if that would make everything right. Unfortunately, when you are not using the right tool in the right way, more power usually does more harm than good. The expression “when all you have is a hammer, everything looks like a nail” comes to mind.

Some of our prospective customers in the z/OS space use IBM DS8000. They often ask why they would want IntelliMagic Vision if they have IBM Tivoli Storage Productivity Center for Disk (TPC). The answer is very simple – for z/OS environments IntelliMagic Vision is clearly the right tool to use. There is no reason to pull a Tim Taylor and force fit something else for the job. Here are some of the reasons why IntelliMagic Vision is the best choice for this environment.
Continue reading

Obscured by Clouds

By Lee LaFrese

Unless you have your head in the sand, you probably have been bombarded with messages about how all things are moving to the cloud. Google, Amazon, IBM, EMC, SalesForce and numerous others are in a race for supremacy in this now all important market segment. These and other vendors are trumpeting their cloud solutions that are going to make your life easy so you don’t have to worry about things like availability, reliability and performance. But what exactly is the cloud? And can it be trusted with one of your most important corporate assets – your data?

In my opinion, “the cloud” has become one of the most overused buzzwords in modern information technology. According to Wikipedia, the cloud is simply a “large number of computers connected through a real-time communication network such as the Internet.” This does not sound like anything magical to me. In fact, by this definition we have had our head in the cloud for years! But the really sexy part of cloud computing is to allow users to benefit from information technologies without the need for deep knowledge or expertise. The cloud aims to help users focus on their core business instead of being impeded by IT obstacles. Continue reading

IBM TS7700 Replication – Is Your Data Safe? (Part 2 of 2)

By Burt Loper

One of the challenges in IT is getting your data replicated to a remote location for fail-over and data recovery if your main operations center is compromised. It is not sufficient to simply set up replication; you also have to watch closely whether your replication goals are being met at all times.

Part 1 of this blog explored the various TS7700 replication modes. Part 2 explores how IntelliMagic Vision can be used to monitor the health of the TS7700 replication process.

TS7700 Replication Monitoring

The TS7700 keeps track of many performance statistics about its operation. A constant watch of these metrics is needed to make sure that performance and replication goals are being met. IntelliMagic Vision performs fully automated daily interpretation of all relevant performance statistics. It applies built-in intelligence about the hardware and workloads to rate the health of the clusters and flag exceptions in dashboards and charts. The enhanced metrics are put in a database that can also be used for ad-hoc reporting with easy-to-use graphical views. Continue reading
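As a rough illustration of the kind of automated interpretation described above, the sketch below rates a replication backlog metric against thresholds to produce a traffic-light health rating. The metric name, interval values, and thresholds are all hypothetical; they are not actual TS7700 statistics or IntelliMagic Vision logic.

```python
# Sketch: rating a replication metric per interval against thresholds.
# "deferred_copy_backlog_gb" is an invented metric name for illustration.

def rate(value: float, warn: float, crit: float) -> str:
    """Map a backlog metric onto a traffic-light health rating."""
    if value >= crit:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"

intervals = [
    {"time": "09:00", "deferred_copy_backlog_gb": 12.0},
    {"time": "09:15", "deferred_copy_backlog_gb": 95.0},
]

for iv in intervals:
    health = rate(iv["deferred_copy_backlog_gb"], warn=50.0, crit=200.0)
    print(iv["time"], health)   # 09:00 green, 09:15 yellow
```

A real implementation would of course derive the thresholds from knowledge of the hardware and workload rather than fixed constants, which is the point of the built-in intelligence described above.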

Does your Disaster Recovery Plan meet its objectives? Analyzing TS7700 Tape Replication (Part 1 of 2)

By Burt Loper

This blog is the first in a series of two blogs on the topic of Mainframe Virtual Tape Replication.

One of the challenges in IT is getting your data replicated to another location so that you have a recovery capability if your main operations center is compromised. IBM TS7700 Series Virtualization Engines support the copying of your tape data to other locations.

This article explores the various TS7700 replication modes.

TS7700 Terminology

The IBM TS7700 Virtualization Engine is commonly known as a cluster. When you connect two or more clusters together, that is called a grid or composite library. The information here applies to both the TS7740 model (which uses backend tape drives and cartridges to store tape data) as well as the TS7720 model (which uses a large disk cache to store tape data).

In a multi-cluster grid, the clusters are interconnected with each other via a set of 1 Gb or 10 Gb Ethernet links. The TS7700s use TCP/IP communication protocols to communicate with each other and to copy tape data from one cluster to another.

Continue reading

You Can’t Do Performance Analysis with SMI-S and Other Myths

By Brett Allison

Some vendors are perpetuating the myth that SMI-S is not designed for performance management. Recently some of our customers asked a vendor to surface additional performance metrics through SMI-S. They received a response along the lines of: “SMI-S is not supposed to handle performance metrics; it is mainly for management. If you want performance metrics, you should buy our proprietary tool.”

While SMI-S has some limitations, the SMI-S Block Server Performance (BSP) defines a very rich set of storage system components for which metrics can be defined: System, Peer Controller, Front-end Adapter, Front-end Port, Back-end Adapter, Back-end Port, Replication Adapter, Volume, and Disks. The BSP further defines counters for reads, writes, read throughput, write throughput, read response time and write response time for each of the components. Coupled with the comprehensive configuration information available within the standard, the SMI-S standard provides a rich canvas for a client consuming SMI-S data to paint the performance profile of a vendor’s hardware.
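To illustrate how a client can paint that performance profile, the sketch below differences two polls of cumulative BSP counters to derive interval rates and response times. The property names are modeled on the CIM BlockStorageStatisticalData style but simplified, so treat them as assumptions rather than the exact SMI-S schema.

```python
# Sketch: deriving interval performance from SMI-S BSP counters.
# BSP exposes cumulative counters, so a client computes rates by
# differencing two polls taken a known number of seconds apart.

def interval_stats(prev: dict, curr: dict, seconds: float) -> dict:
    d = {k: curr[k] - prev[k] for k in prev}   # counter deltas
    reads = d["ReadIOs"] or 1                  # guard against divide-by-zero
    return {
        "read_iops": d["ReadIOs"] / seconds,
        "read_mb_s": d["KBytesRead"] / 1024 / seconds,
        # the IO time counter delta is accumulated service time in ms
        "read_resp_ms": d["ReadIOTimeCounter"] / reads,
    }

t0 = {"ReadIOs": 10_000, "KBytesRead": 80_000, "ReadIOTimeCounter": 50_000}
t1 = {"ReadIOs": 13_000, "KBytesRead": 176_000, "ReadIOTimeCounter": 62_000}
print(interval_stats(t0, t1, seconds=60.0))
```

The same delta arithmetic applies per component — port, adapter, volume, disk — which is exactly why the BSP counter set is sufficient for performance analysis, not just management.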

Continue reading

Five Activities to Avoid If You Want to Be a Hero

By Brett Allison

Management, please read and reward your true heroes

Often when you do the right things in performance and capacity management, your work goes unnoticed and unappreciated. In fact, if you are consistently proactive with your storage performance and capacity management processes, you will have few opportunities to be in the spotlight. This is because proactive management reduces the number of crises and, consequently, the need for heroic action!

For those who want to live dangerously, consider avoiding the following activities:

Continue reading

Storage Management Initiative (SMI) and the Block Services Performance (BSP) sub-profile: the Good, the Bad, and the Ugly – Part 2 “The Bad”

By Brett Allison

In the first installment of this three-part blog I discussed the reason for SMI-S and the goodness that it has brought to the world. SMI-S, though, like my favorite dark chocolate, is often bittersweet. Clint Eastwood in “The Good, The Bad and The Ugly” is not good unless we have Lee Van Cleef playing the ruthless villain “Angel Eyes” to contrast him with.

While the technology chosen by the SMI program has significant benefits, it also has challenges.  To anyone but a seasoned software engineer the technology is essentially inaccessible.  To be proficient in leveraging SMI-S takes significant time and energy.  So while the initial goal was, at least on the surface, to be inclusive, it was only truly inclusive for those willing to pay the significant price of admission.

Continue reading

nil sub sole novum – Vision 7 Support for IBM APAR OA39993

By Dave Heggen

IBM APAR OA39993, available for the zEC12 processor, introduced what could be considered a new Service Time component for devices: Interrupt Delay Time. This value measures the time from when the I/O completes to when z/OS issues Test SubChannel (TSCH) to view the results.
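As a rough sketch of where this new component sits, the decomposition below sums the familiar device response time components plus Interrupt Delay Time. The millisecond values are made up for illustration; only the decomposition itself follows the standard RMF view of device response time.

```python
# Sketch: z/OS device response time as a sum of its components.
# Values are illustrative, not measurements.

components_ms = {
    "IOSQ": 0.10,             # queued in z/OS before the I/O is started
    "pending": 0.20,          # started, waiting for channel/CU resources
    "connect": 0.50,          # actively transferring on the channel
    "disconnect": 0.30,       # CU working without the channel (e.g. cache miss)
    "interrupt_delay": 0.15,  # I/O complete, waiting for z/OS to issue TSCH
}

response_ms = sum(components_ms.values())
print(f"device response time ~ {response_ms:.2f} ms")
```

Before OA39993, the interrupt delay portion was simply invisible — the "uncaptured I/O time" the next section describes.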

Not new, but previously Uncaptured

I selected “nil sub sole novum” as the title for this blog. It’s a common Latin phrase whose literal translation is ‘nothing new under the sun’, meaning ‘everything has been done before’. This is very true for Interrupt Delay Time: the component is new, but the activity has always been with us. This activity was previously without description; it came from what you could call ‘uncaptured I/O time’. Even my spell checker rebels against the use of the word uncaptured.

Continue reading

What HDS VSP and HP XP P9500 Should Be Reporting in RMF/SMF – But Aren’t

By Gilbert Houtekamer, Ph.D.

This is the last blog post in a series of four, where we share our experience with the instrumentation that is available for the IBM DS8000, EMC VMAX, and HDS VSP or HP XP P9500 storage arrays through RMF and SMF. This post is about the Hitachi high-end storage array that is sold by HDS as the VSP and by HP as the XP P9500.

RMF has been developed over the years by IBM, based on IBM storage announcements. Even for the IBM DS8000, not nearly all functions are covered; see “What IBM DS8000 Should Be Reporting in RMF/SMF – But Isn’t” blog post.  For the other vendors it is harder still –  they will have to make do with what IBM provides in RMF, or create their own SMF records.

Hitachi has supported the RMF 74.5 cache counters for a long time, and those counters are fully applicable to the Hitachi arrays.  For other RMF record types though, it is not always a perfect match.  The Hitachi back-end uses RAID groups that are very similar to IBM’s.  This allowed Hitachi to use the RMF 74.5 RAID Rank and 74.8 Link records that were designed for IBM ESS. But for Hitachi arrays with concatenated RAID groups not all information was properly captured.   To interpret data from those arrays, additional external information from configuration files was needed.

With their new Hitachi Dynamic Provisioning (HDP) architecture, the foundation for both Thin Provisioning and automated tiering, Hitachi updated their RMF 74.5 and 74.8 support such that each HDP pool is reflected in the RMF records as if it were an IBM Extent Pool.   This allows you to track the back-end activity on each of the physical drive tiers, just like for IBM.

This does not provide information about the dynamic tiering process itself, however.    Just like for the other vendors, there is no information per logical volume on what portion of its data is stored on each drive tier. Nor are there any metrics available about the migration activity between the tiers.

Overall, we would like to see the following information in the RMF/SMF recording:

  • Configuration data about replication.   Right now, you need to issue console or Business Continuity Manager commands to determine replication status.  Since proper and complete replication is essential for any DR usage, the replication status should be recorded every RMF interval instead.
  • Performance information on Universal Replicator, Hitachi’s implementation of asynchronous mirroring.  Important metrics include the delay time for the asynchronous replication, the amount of write data yet to be copied, and the activity on the journal disks.
  • ShadowImage, FlashCopy and Business Copy activity metrics. These functions provide logical copies that can involve significant back-end activity which is currently not recorded separately.  This activity can easily cause hard-to-identify performance issues, hence it should be reflected in the measurement data.
  • HDP Tiering Policy definitions, tier usage and background migration activity.  From z/OS, you would want visibility into the migration activity, and you’d want to know the policies for a Pool and the actual drive tiers that each volume is using.

Unless IBM is going to provide an RMF framework for these functions, the best approach for Hitachi is to create custom SMF records from the mainframe component that Hitachi already uses to control the mainframe-specific functionality.

It is good to see that Hitachi works to fit their data in the framework defined by RMF for the IBM DS8000.  Yet we would like to see more information from the HDS VSP and HP XP P9500 reflected in the RMF or SMF records.

So when considering your next HDS VSP or HP XP P9500 purchase, also discuss the need to manage it with the tools that you use on the mainframe for this purpose: RMF and SMF.  If your commitment to the vendor is significant, they may be responsive.

Reality Check: Living in a Virtual Storage World

By Lee LaFrese

The last decade has seen a storage virtualization revolution just as profound as what has happened in the server world.  In both cases virtualization enables logical view and control of physical infrastructure with the goal of optimization, simplification and greater utilization of physical resources.  These are lofty goals, but there is a fundamental difference between server and storage virtualization.  When a virtual server needs compute resources for an application, there is little limitation on which specific resources may be used other than maximum usage caps.  With storage resources, once data is placed on a particular storage media, an application is tied to the portion of the infrastructure that contains that media and usage caps don’t apply.  Thus, I believe performance management of virtualized storage is intrinsically more difficult and important than managing performance of virtualized servers.  In fact, George Trujillo states in a recent entry in his A DBA’s Journey into the Cloud blog that “statistics show over and over that 80% of issues in a virtual infrastructure are due to the design, configuration and management of storage”.

Continue reading

What EMC VMAX Should Be Reporting in RMF/SMF – But Isn’t

By Gilbert Houtekamer, Ph.D.

This is the third in a series of four blogs on the status of RMF as a storage performance monitoring tool. This one is specifically about EMC VMAX. The previous postings are “What RMF Should Be Telling You About Your Storage – But Isn’t” and “What IBM DS8000 Should Be Reporting in RMF/SMF – But Isn’t”.

RMF has been developed over the years by IBM based on its storage announcements – although even for IBM DS8000 not nearly all functions are covered, see this blog post. Other vendors will have to work with what IBM provides in RMF, or, like EMC does for some functionality, create their own SMF records.

EMC has supported IBM’s RMF 74.5 cache counters since they were introduced, and they’ve started using the ESS 74.8 records in the past several years to report on FICON host ports and Fibre replication ports.  However, with respect to back-end reporting, it hasn’t been that simple. Since the EMC Symmetrix RAID architecture is fundamentally different from IBM’s, the EMC RAID group statistics cannot be reported on through RMF.

For EMC’s asynchronous replication method SRDF/A, SMF records were defined that, among other things, track cycle time and size.  This is very valuable information for monitoring SRDF/A session load and health.  Since Enginuity version 5876, SRDF/A Write Pacing statistics are written to SMF records as well, allowing users to track potential application impact.   The 5876 release also provided very detailed SMF records for the TimeFinder/Clone Mainframe Snap Facility.

Still, there are areas where information remains lacking, in particular on back-end drive performance and utilization. Before thin provisioning was introduced, each z/OS volume would be defined from a set of Hyper Volumes on a limited number of physical disks. EMC provided great flexibility with this mapping: you could pick any set of Hyper Volumes that you liked. While conceptually nice, this made it very hard to correlate workload and performance data for logical z/OS volumes to the workload on the physical disk drives. And, since the data on a z/OS volume was spread over a relatively small number of back-end drives, performance issues were quite common. Many customers needed to ask EMC to conduct a study if they suspected such back-end issues – and they still do.

With the new thin provisioning and FAST auto-tiering options, the relationship between logical and physical disks has been defined through even more intermediate steps.  While EMC’s FAST implementation using the policy mechanism is very powerful, it may be hard to manage for z/OS users, since no instrumentation is provided on the mainframe.  On the positive side, since data tends to be spread over more disk drives because of the use of virtual pools rather than individual RAID groups, back-end performance issues are less likely than before. Still, more information on back-end activity is needed both to diagnose emerging problems and to make sure no hidden bottlenecks occur.

Information that should make it into RMF or SMF to uncover the hidden internals of the VMAX:

  • Configuration data about SRDF replication.   Right now, users need to issue SRDF commands to determine replication status.  Yet proper and complete replication is essential for any DR usage, so the replication status should be recorded every RMF interval.
  • Data that describes the logical to physical mapping, and physical disk drive utilizations. There is external configuration data available through proprietary EMC tools that can sometimes be used in combination with RMF to compute physical drive activity. This is no substitute for native reporting in RMF or SMF.
  • Snapshot-related backend activity. Snapshots provide immediate logical copies which can generate significant back-end activity that is currently not recorded.  Snapshots are a frequent player in hard-to-identify performance issues.
  • FAST-VP policy definitions, tier usage and background activity. FAST-VP will supposedly always give you great performance, but it cannot do magic: you still need enough spindles and/or Flash drives to handle your workload.  For automatic tiering to work well, history needs to repeat itself, as Lee LaFrese said in his recent blog post, “What Enterprise Storage system vendors won’t tell you about SSDs.”  From z/OS, you want visibility into the migration activity, along with the policies for each Pool and the actual tiers that each volume is using.

It will probably be easier for EMC to create more custom SMF records, as they did for SRDF/A, than to try to get their data into RMF. Such SMF records would be fully under EMC’s control and could be designed to match the VMAX architecture, making them much easier to keep up-to-date.

EMC does seem to respond to customer pressure to create reporting in SMF for important areas.  An example of this is the creation of SRDF/A records and the recent write pacing monitor enhancement.

When considering your next EMC VMAX purchase, also consider discussing the ability to manage it with the tools that you use on the mainframe for this purpose: RMF and SMF.   If your company’s order is big enough, EMC might consider adding even more mainframe-specific instrumentation.

What IBM DS8000 Should Be Reporting in RMF/SMF – But Isn’t

By Gilbert Houtekamer, Ph.D.

This is the second in a series of four blogs by Dr. Houtekamer on the status of RMF as a storage performance monitoring tool. This installment is specifically based on experience using the available instrumentation for the IBM DS8000. “What RMF Should Be Telling You About Your Storage – But Isn’t” is the first blog in the series.

While more advanced capabilities are added with every new generation of the DS8000, these also introduce extra levels of ‘virtualization’ between the host and the disk drives. One good example is FlashCopy, which delivers an instantly usable second copy, but also causes a hard-to-predict back-end copy workload. Others are Global Mirror and EasyTier, which both lag behind when it comes to measurement instrumentation.

Although users value added functionality, it is wise to be mindful of the inevitable performance impact caused by increased complexity. Ironically, despite the rush to add more functionality, we have not seen a major update to RMF since ESS was first introduced and the 74.8 records were added to provide back-end RAID group and port statistics. Continue reading

What RMF Should Be Telling You About Your Storage – But Isn’t

By Gilbert Houtekamer, Ph.D.

With every new generation of storage systems, more advanced capabilities are provided that invariably simplify your life greatly – at least according to the announcements. In practice, however, these valuable new functions typically also introduce a new level of complexity. This tends to make performance less predictable and harder to manage.

Looking back at history, it is important to note that RMF reporting was designed in the early days of CKD (think 3330, 3350) and mainly showed the host perspective (74.1).  With the introduction of cached controllers, cache hit statistics came along that eventually made it into RMF (74.5).  When the IBM ESS was introduced, additional RMF reporting was defined to provide some visibility into the back-end RAID groups and ports (74.8), which is now used by both HDS and IBM.

Continue reading

Confessions of a Storage Performance Analyst

By Stuart Plotkin

The journey to create best practices for enterprise level storage systems

For many years, I have had the opportunity to be involved in addressing enterprise infrastructure performance challenges. When I mention to friends that I am a storage performance expert, they occasionally respond: “Me too, I have this huge hard drive on my laptop and I make sure that the IO blazes.”

It is then that I smile and calmly explain that, on your laptop, you are the only user. In my world, I have hundreds to thousands of people all wanting to use hard drives at the same time. In these cases, the hard drive performance is mission critical and needs to be in top condition 24/7.

It is then that they start to get the picture.

Continue reading