Less is More – Why 32 HyperPAVs are Better than 128

By Gilbert Houtekamer, Ph.D.

When HyperPAV was announced, the extinction of IOSQ was expected to follow shortly. And indeed, for most customers IOSQ time is more or less an endangered species. Yet in some cases a bit of IOSQ remains, and even queuing on HyperPAV aliases may be observed. The reflexive reaction from a good performance analyst is to interpret the queuing as a shortage that can be addressed by adding more Hypers. But is this really a good idea? Adding aliases will only increase overhead and decrease WLM’s ability to handle important I/Os with priority. Let me explain why.

HyperPAV, like many I/O-related things in z/OS, works on an LCU basis. LCUs are a management concept in z/OS: each LCU can support up to 8 channels for data transfer, and up to 256 device addresses. With HyperPAV, some of the 256 addresses are used for regular volumes (“base addresses”), and some are available as “aliases”. You do not need to use all 256 addresses; it is perfectly valid to define, say, only 64 base addresses and 32 aliases in an LCU.

Unlike SCSI devices, which use tagged command queuing, z/OS I/O devices were historically designed to handle only one I/O operation per device address at a time. HyperPAV circumvents this limitation by assigning alias addresses “on demand” when additional I/O operations arrive for a particular device address. This way, a single logical volume (“volser”) can handle many I/O operations at any point in time, up to the number of available aliases. Thus it seems logical that any sign of IOSQ could be resolved by simply adding aliases. However, this ignores other constraints.
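
To make the “on demand” mechanism concrete, here is a minimal Python sketch of an alias pool for a single LCU. It illustrates the principle only, not z/OS internals; the class and all names are invented for this example.

```python
from collections import deque

class AliasPool:
    """Toy model of HyperPAV alias assignment within a single LCU."""

    def __init__(self, num_aliases=32):
        self.free = deque(range(num_aliases))  # available alias addresses
        self.waiting = deque()                 # I/Os queued in z/OS (IOSQ)

    def start_io(self, io):
        """Start an I/O on a free alias, or queue it in z/OS."""
        if self.free:
            return self.free.popleft()   # alias address the I/O runs on
        self.waiting.append(io)          # all aliases busy: IOSQ time
        return None

    def complete_io(self, alias):
        """Free an alias: hand it to a queued I/O, or return it to the pool."""
        if self.waiting:
            return self.waiting.popleft(), alias  # queued I/O starts now
        self.free.append(alias)
        return None

pool = AliasPool(num_aliases=2)
pool.start_io("io-1")  # -> alias 0
pool.start_io("io-2")  # -> alias 1
pool.start_io("io-3")  # -> None: no alias free, this is IOSQ time
```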

The total number of I/O operations that can be handled concurrently by an LCU is constrained by the number of devices and the number of FICON channels. Devices are logical resources whose work is spread over a great many physical back-end drives, but FICON channels are real resources, using real host interface boards in the storage systems. Both the channels and the interface boards have limited processing and data transfer capability.

When 32 aliases and (say) 16 base addresses are used, 48 concurrent I/O operations are supported for a particular LCU, or 6 operations per FICON channel (48 concurrent I/Os divided by 8 FICON channels). Since most operations are typically handled at cache speed (read hits, sequential reads, writes), there is really no point in starting even more operations at the same time by increasing the number of aliases. You will only create internal queuing inside the storage system, resulting in less efficient operation and higher pending, connect and disconnect times. You may reduce the IOSQ time, but you pay for it elsewhere.
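
The arithmetic behind these numbers fits in a few lines. A back-of-the-envelope sketch, using the example figures from the paragraph above:

```python
base_addresses = 16   # regular volumes in the LCU
aliases = 32          # HyperPAV aliases
ficon_channels = 8    # maximum channels per LCU

max_concurrent_ios = base_addresses + aliases           # 48
ios_per_channel = max_concurrent_ios / ficon_channels   # 6.0

print(f"{max_concurrent_ios} concurrent I/Os possible, "
      f"{ios_per_channel:.0f} per FICON channel")
```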

Another commonly heard argument is that more aliases (than 32, or any other number) are needed because RMF reports that sometimes all aliases are in use, regardless of whether there is visible IOSQ time. In many cases that we have investigated, these ‘all hypers used’ conditions are caused by DB2 buffer flushes, where DB2 flushes very many buffers at the expiration of an interval. DB2 queues up so many I/O requests that all aliases are exhausted, connect time explodes, and it takes the storage system quite a while to recover. ‘A while’ may be only 0.1 second, but during that 0.1 second no other work can be started, since all aliases are claimed by this essentially asynchronous DB2 work. When 32 aliases are defined, DB2 will not be able to claim more than 32, and any excess I/Os are queued. That may seem bad, but it means that when more important work comes in, the Workload Manager can still give those I/Os priority! And the DB2 buffer flush I/Os are asynchronous to begin with, so they should not be allowed to monopolize a storage system.
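
The benefit of the cap can be illustrated with a toy priority queue. This is not how WLM actually schedules I/O; it only shows the principle that I/Os queued in z/OS remain subject to priority, while I/Os already holding an alias do not:

```python
import heapq

ALIASES = 32  # capped alias pool

# 100 asynchronous DB2 buffer-flush I/Os arrive at once (low priority 5).
flush_ios = [(5, f"db2-flush-{i:03d}") for i in range(100)]

running = flush_ios[:ALIASES]   # these 32 claim every alias
waiting = flush_ios[ALIASES:]   # the other 68 queue in z/OS
heapq.heapify(waiting)

# An important synchronous transaction I/O arrives while all aliases
# are busy (high priority 1)...
heapq.heappush(waiting, (1, "online-txn-read"))

# ...and it is first in line the moment an alias frees up.
print(len(running), "I/Os hold aliases;", heapq.heappop(waiting))
# 32 I/Os hold aliases; (1, 'online-txn-read')
```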

Having fewer aliases means that your storage system may still be driven into saturation by short spikes, but z/OS WLM will be able to give priority to new I/Os, so in general it is harder for a single application to dominate an LCU.

Our recommendation is that the number of aliases be based on your assessment of how many I/Os you will be able to process concurrently on a channel set. When making this estimate, please consider that multiple LCUs tend to share the same sets of ports, further reducing the maximum throughput that is possible. It is very unlikely that this analysis will give a number higher than 32.
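
As an illustration of that assessment, the sketch below turns the recommendation into a small formula. The per-channel concurrency and the number of LCUs sharing the ports are inputs you must estimate yourself; the function is just one plausible way to combine them, not a formula from any manual.

```python
def suggested_aliases(ios_per_channel, channels=8, lcus_sharing_ports=4,
                      base_addresses=16):
    """Rough alias estimate from channel-set concurrency.

    ios_per_channel:    concurrent I/Os you judge one FICON channel can
                        usefully process (your own assessment, e.g. 6).
    lcus_sharing_ports: LCUs sharing the same set of host ports.
    """
    # Useful concurrency for the whole channel set, divided over the
    # LCUs sharing it; base addresses already provide part of it.
    per_lcu = ios_per_channel * channels / lcus_sharing_ports
    return max(0, round(per_lcu - base_addresses))

# 6 I/Os per channel, 8 channels, 2 LCUs sharing the ports:
print(suggested_aliases(6, lcus_sharing_ports=2))  # -> 8, well under 32
```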

There is one exception to this rule. When connect time is only a small portion of the total service time, it may make sense to increase the number of aliases. This may happen, for example, because of high disconnect time for read misses, or because of I/O to remote sites where pending time is elevated due to distance. In these rare instances it does make sense to try to get more concurrent I/Os started on the FICON links by using more aliases.
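
A quick way to check for this exception from RMF numbers (a sketch; it assumes the four usual RMF response-time components, all in milliseconds):

```python
def connect_fraction(iosq, pend, disc, conn):
    """Connect time as a fraction of total response time (ms)."""
    return conn / (iosq + pend + disc + conn)

# Numbers like those in the comment thread below: pending low,
# disconnect dominant, connect small.
frac = connect_fraction(iosq=0.6, pend=0.11, disc=4.0, conn=0.4)
print(f"connect is {frac:.0%} of response time")
# ~8%: connect is a small portion, so more aliases might pay off here
```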

8 thoughts on “Less is More – Why 32 HyperPAVs are Better than 128”

  1. Fabio says:

    Hi Gilbert

    I was reading your article again. I am curious about your statement that z/OS devices do not use tagged command queuing.

    Nowadays I suppose both mainframe and Open Systems I/O take advantage of that feature.

    I am interested in knowing whether an operating system like Windows or Linux could also start multiple I/Os to one single LUN or file system. How is the serialization handled in that case? Clustering?

    Tks

    Until now I thought only z/OS systems had that ability.

  2. Hi Gilbert,

    Congratulations on your article.

    You said: “Our recommendation is that the number of aliases be based on your assessment of how many I/Os you will be able to process concurrently on a channel set. When making this estimate, please consider that multiple LCUs tend to share the same sets of ports, further reducing the maximum throughput that is possible. It is very unlikely that this analysis will give a number higher than 32.”

    Do you think this is still true when we are talking about EAV 2?

    We are starting to configure some LCUs with only EAV 2 volumes. So imagine one specific LCU with 16 EAV 2 volumes (16 TB) and 32 HyperPAV aliases. In this case we have a ratio of 2 HyperPAVs for each EAV. That might be too few to bear all the I/O to those huge volumes.

    Regards,

    Fabio

    1. Gilbert says:

      Hi Fabio,

      Thank you. Yes, the reasoning remains the same. The number of I/Os that you can handle at the same time is going to be limited by the number of channel paths. With EAV 2 it may mean that you can combine fewer LCUs on the same number of channels.

      If you feel that in your current setup you really do need all 32 HyperPAVs, you could use 64 if you double the capacity of each volume. That would be very safe. Keep in mind, though, that this will double the workload on the set of channels.

      Gilbert

  3. Gilbert says:

    Note that there are other situations where HyperPAV aliases do not help. One example is when multiple I/Os try to update the same area (extent) on a volume. If this is the case, only one I/O at a time will get access and the other I/Os will simply have to wait. Another situation where HyperPAV does not help is when there is a hardware reserve: then additional I/Os cannot be started, not even on an alias.

    1. Anne Adams says:

      What would you consider a “small” portion of the total service time? We’re seeing connect times of about 8%.

      1. Gilbert says:

        Anne, when the connect time is just 8% of the total service time, that is small indeed. What are the values that you see for IOSQ/Pending/Disconnect/Connect?

        1. Anne Adams says:

          Pending is fairly consistent at 0.11 ms, Disconnect ranges between 1 and 8 ms, and IOSQ is about 0.6 ms.

          1. Gilbert says:

            Anne, given the IOSQ time I take it you are not yet using HyperPAV?
            When these numbers are from a Dynamic PAV environment, you will do just fine with 32 aliases.
