Credit Card Transaction Timeouts – IOSQ Analysis

By Joe Hyde

Black Friday is one of the busiest transaction days of the year, and it often seems like an easy payday for most participating companies. But have you ever wondered what performance preparations must be made to accommodate the overly inflated volume of credit card transactions?

A large global bank was struggling because the latest version of its credit card swipe application was failing high-volume load testing. In preparation for Black Friday they needed the application to handle a much higher number of credit card swipes, but periodically their credit card transactions were timing out.

When we became involved they had already spent weeks and thousands of staff-hours on the issue and had incurred significant financial penalties because of the delays. For the past two weeks they had held day-long conference calls with over 100 people on the line (often forcing some off so others could join), all pointing fingers at one another: the performance team, application team, storage team, and the vendor each blamed the others for the timeouts.

You see, the delays had a significant revenue impact on their business: any credit card approval that timed out had to be sent over a competitor’s exchange, incurring significant fees. After two weeks of conference calls failed to determine the root cause of the problem, they called us in. We took a deep dive into some of the key storage metrics and, within a few days of research and additional data acquisition, were able to provide the key insight that determined the root cause of the timeouts.

Parallel Access Volume Evolution

The cause of the delays was queuing in the mainframe I/O subsystem, as evidenced by increases in IOSQ time. The I/O architecture has evolved over the past twenty years to minimize this wait time: I/O concurrency at the volume level has been enhanced through a series of features built on parallel access volumes (PAVs).

The evolution of PAVs included the following steps:

  • It began with static PAVs: a predefined number of aliases permanently assigned to each volume.
  • This was quickly followed by dynamic PAVs, whose aliases could be moved from one volume to another based on load.
  • Next came HyperPAVs: a pool of aliases shared among the volumes within a logical control unit and assigned “as needed”.
  • And now we have SuperPAVs, where aliases are shared among volumes across sets of logical control units (called alias management groups).
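Conceptually, PAV aliases act like extra servers for a single volume’s request queue: an I/O that arrives while every exposure for the volume is busy waits in the host, and that wait is reported as IOSQ time. The toy Python simulation below is my own illustrative model (the function and numbers are invented for the sketch, not any z/OS component); it shows how IOSQ disappears once the alias count covers the peak concurrency of the arriving I/Os.

```python
import heapq

def simulate_volume(arrivals, service_time, num_aliases):
    """Toy model of one volume served by a pool of PAV aliases.

    An I/O that arrives while every exposure is busy waits in the
    host; that wait is analogous to what z/OS reports as IOSQ time.
    Returns the average wait per I/O, in the arrival-time units (ms).
    """
    free_at = [0.0] * num_aliases    # when each alias next becomes free
    heapq.heapify(free_at)
    total_wait = 0.0
    for t in arrivals:
        earliest = heapq.heappop(free_at)
        start = max(t, earliest)     # queue if no alias is free
        total_wait += start - t
        heapq.heappush(free_at, start + service_time)
    return total_wait / len(arrivals)

# 10,000 I/Os/sec (0.1 ms apart) with 0.5 ms service keeps ~5 operations
# in flight: one alias queues badly, eight show no queuing at all.
fast = [i * 0.1 for i in range(1000)]
print(simulate_volume(fast, 0.5, 1))   # large average wait, queue grows
print(simulate_volume(fast, 0.5, 8))   # → 0.0
```

The takeaway matches the evolution above: each PAV generation makes the pool of “servers” larger or more flexibly shared, driving host-side queuing toward zero, provided the device itself can absorb the concurrency.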

More Aliases are Not Always the Answer

Even with the architectural improvements to reduce IOSQ time, delays can still occur. Below is an IntelliMagic Vision stacked bar chart of I/O response time components for a DB2 storage group.

IO Response Time - DB2 Storage Group

The chart covers a period of four days, and the majority of the I/O response time is IOSQ time. SuperPAV could not be deployed in this environment, so the configuration was changed to increase the number of HyperPAV aliases from 32 to 96 in an attempt to alleviate the high I/O response time. Unfortunately, this increase had almost no effect on the I/O delay.

The maximum I/O rate during this time was roughly 25,000 I/Os per second, and the total throughput was well under 1 gigabyte per second. The disk storage system was not, on average, under duress during this time, yet the IOSQ delay and poor performance persisted.

Using GTF Trace for Finer Granularity

At that point a more granular approach to the data analysis was needed. Instead of looking at averages over the 15-minute RMF interval, a GTF I/O Summary Trace was collected so we could examine the behavior of the I/Os at event-level granularity. (For a more in-depth discussion of GTF analysis, see my previous blog, IBM z/OS’s Microscope – GTF.)

The GTF trace analysis showed that the I/O arrival pattern to the DB2 volumes was quite “bursty”. Although read I/Os arrived at a fairly steady rate throughout the trace period, write I/Os arrived in “batches” separated by long periods of little write activity. This bursty write behavior swamped the physical (and logical) resources of the disk storage system.

We consulted with the vendor (IBM, for the DB2 software), who suggested a change to the castout behavior of the DB2 group buffer pool to spread the write activity out over time. Below is the before-and-after effect of this change.



Blue shows the read I/Os executed in each 0.1-second interval, and red shows the writes executed in each 0.1-second interval.
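A small queueing sketch makes the intuition behind the fix concrete. In the toy model below (my own illustration, with invented numbers and a deliberately simplistic constant service time; in the real incident the burst also inflated device service times, which is why extra aliases alone gave little relief), the same 1,000 writes queue badly when they arrive in one batch no matter how many aliases are configured, but queue not at all when spread over time:

```python
import heapq

def avg_queue_wait(arrivals, service_time, slots):
    """Average time an I/O waits for one of `slots` concurrent service
    slots (a stand-in for PAV aliases). Times are in milliseconds."""
    free_at = [0.0] * slots          # when each slot next becomes free
    heapq.heapify(free_at)
    total_wait = 0.0
    for t in arrivals:
        earliest = heapq.heappop(free_at)
        start = max(t, earliest)     # queue if every slot is busy
        total_wait += start - t
        heapq.heappush(free_at, start + service_time)
    return total_wait / len(arrivals)

SERVICE = 0.5                             # ms per write (assumed constant)
burst  = [0.0] * 1000                     # 1000 writes arriving at once
spread = [i * 1.0 for i in range(1000)]   # same 1000 writes, 1 ms apart

for aliases in (32, 96):
    print(f"{aliases} aliases: burst wait "
          f"{avg_queue_wait(burst, SERVICE, aliases):.2f} ms, "
          f"spread wait {avg_queue_wait(spread, SERVICE, aliases):.2f} ms")
```

Tripling the aliases shrinks the burst’s queue wait but cannot eliminate it, while spreading the identical write volume over time removes the queuing entirely. That is exactly what the castout change accomplished.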

Credit Card Transaction Timeout Resolved

After the DB2 group buffer pool change, the IOSQ time was drastically reduced, resulting in almost no credit card transaction timeouts. This saved the business millions of dollars, and the hundred people jumping on conference calls could finally get back to work.

IntelliMagic Vision was used to identify the increase in IOSQ time that correlated with the credit card transaction timeouts, and GTF I/O Summary Traces then helped identify the root cause of the queuing and the subsequent timeouts.

Feel free to contact us with any questions or to get a more comprehensive view of the interaction of your applications and storage.



5 thoughts on “Credit Card Transaction Timeouts – IOSQ Analysis”

  1. Jack Hyde says:

    I was involved in a similar situation for a popular Midwest-based credit card client. I used RMFMONIII to find the IOSQ delay but the evidence was not as accurate as what GTF can provide. Nice charts. Thanks.

    1. Joe Hyde says:

      Thanks, Jack. RMFMONIII is used for delay reporting, but for this issue, if GTF trace analysis is not an option, I’d use RMFMONII to track a subset of DB2 volumes with 1-second intervals in delta mode and cut these records to SMF for later reporting and analysis. The second-by-second reporting would give a good idea of the batched arrival of the I/Os DB2 was generating.

  2. Leif says:

    There is a fairly new function called DB2 castout acceleration – a combination of Media Manager code and IBM DS8880 microcode. I wonder if this function was active at the time the problems occurred – and whether it would have provided a solution?

    1. Joe Hyde says:

      Leif, DB2 Castout Accelerator was not available. I don’t have much room for a full reply, so I’ll consider writing another blog on the DB2 Castout Accelerator. I don’t believe having DB2 Castout Accelerator active for this workload would have made much difference: the number of distinct domains per zHPF I/O for the castouts was low, so the benefit would be modest, at best.

  3. George Dodson says:

    Joe, I can’t count all the times in my career I’ve run into situations where decisions were made without adequate information to choose between alternatives, and IT infrastructure or program operations ended up set up poorly as a result. Your situation here points out that many individuals tried to fix the “problem” without having any true idea of its root cause. Thanks for describing this situation. No matter how large IT enterprises become, having the proper level of information is absolutely critical. Having the wrong level of data can easily hide the true problem.
