Breaking Through Hard-To-Diagnose SVC Bottlenecks

By Brett Allison

Imbalances in an SVC environment can occur in several areas. When significant, these imbalances cause bottlenecks that degrade optimization and performance. Using IntelliMagic Vision, it is possible to quickly discover, diagnose and resolve even the most difficult-to-pinpoint SVC problems.

IntelliMagic Vision embeds storage intelligence drawn from our many years of developing modeling tools, giving end users an easy-to-use interface for quickly identifying performance issues. IntelliMagic also offers a cloud services solution where users can send data for expert analysis, and provides coverage for IBM SVC, other IBM storage controllers, EMC and other hardware platforms.

Using IntelliMagic Vision, here are just some of the common imbalances that can be quickly identified and resolved:

  • Constrained front-end ports
  • Constrained managed storage controller front-end ports
  • Over-utilized back-end HDDs
  • Cache contention
  • Host multi-pathing issues
  • Fabric/ISL contention

What follows are several real-world cases in which serious bottlenecks occurred. Using IntelliMagic Vision, we helped end users head off trouble and quickly resolve each problem.

Diagnosing high front-end write response times

An end user had been reporting performance degradation on a number of hosts with storage in the SVC01 environment. SLAs were in jeopardy and an adverse business impact was likely. When we investigated using IntelliMagic Vision, we discovered that these hosts were experiencing poor I/O write service times. All write I/Os to SVC01_N1 and SVC01_N2 (see chart) showed poor response times compared to the other nodes in SVC01. CPU utilization on SVC01_N1 was also consistently around 25%, very different from the lower utilization levels and wider variation seen on the other nodes.

Because each SVC node contains a single quad-core processor, and the node utilization was pegged at about 25%, it was safe to assume that one of the four cores was fully consumed on SVC01_N1.
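The arithmetic behind that inference can be sketched in a few lines. Node-level CPU utilization is an average across cores, so a single saturated core on a quad-core node surfaces as roughly 25% aggregate utilization (the values below are illustrative, not measured data from this case):

```python
def aggregate_utilization(per_core_busy_pct):
    """Average per-core busy percentages into the node-level figure."""
    return sum(per_core_busy_pct) / len(per_core_busy_pct)

# One core fully consumed, three nearly idle: aggregate lands near 25%.
stuck = aggregate_utilization([100.0, 1.0, 2.0, 1.0])

# A genuinely busy node can show the same aggregate with a very
# different per-core picture:
even = aggregate_utilization([26.0, 26.0, 26.0, 26.0])

print(f"stuck core: {stuck:.0f}%  evenly busy: {even:.0f}%")
```

A flat, pegged ~25% with little variation is the telltale: real multi-threaded load tends to move around, while a stuck process pins exactly one core.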

In this case, we simply drilled down on the write response times by node to the volume level to quickly see which volumes were affected. By identifying the volumes we were able to determine additional characteristics, such as the host name, size of volume, storage pools, etc., to better understand the impact of the performance degradation.
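The drill-down itself, from node-level write response time to the contributing volumes, amounts to a group-and-filter over interval samples. A minimal sketch, with hypothetical field names and values that are not the IntelliMagic Vision or SVC schema:

```python
from collections import defaultdict

# Hypothetical per-interval samples: (node, volume, write_resp_ms)
samples = [
    ("SVC01_N1", "vol_db01", 38.0), ("SVC01_N1", "vol_db02", 5.0),
    ("SVC01_N2", "vol_app01", 41.0), ("SVC01_N3", "vol_web01", 2.5),
]

def volumes_over_threshold(samples, threshold_ms):
    """Group write response times by node, then keep volumes above threshold."""
    by_node = defaultdict(list)
    for node, volume, resp in samples:
        by_node[node].append((volume, resp))
    return {
        node: [(v, r) for v, r in vols if r > threshold_ms]
        for node, vols in by_node.items()
    }

hot = volumes_over_threshold(samples, threshold_ms=20.0)
# Here N1 and N2 each surface one affected volume; N3 surfaces none.
```

Once the affected volumes are isolated this way, attributes such as host name, volume size and storage pool can be joined in to scope the business impact.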

We concluded that node write response times were high on both node 1 and node 2, which were part of the same I/O group. We also identified an anomaly in node utilization: one core of the quad-core processor on N1 was consumed 100% of the time. Further research identified a related problem whose resolution was to restart the SSH process, restart the node (which restarts the SSH process as a side effect), or upgrade the firmware to version 6.2.0.4 or later.

Diagnosing high read and write response times on the front end

In this case, an end user running the current 5.1 SVC firmware had been reporting performance degradation on a number of hosts with storage in the SVC03 environment. Using IntelliMagic Vision, our objective was to identify the hosts most impacted and determine any related issues. We first looked at storage pool read response time. At the SVC storage pool level, average read response times were extremely high, particularly for rnk0401 (see chart), which spent a significant amount of time above the exception threshold.

In addition to the front-end read response time for rnk0401, we also saw extremely high back-end read queue times for rnk0401. This indicated that the back-end storage controllers servicing the requests to rnk0401 were overcommitted. Drilling down further to the back-end storage controller servicing those requests, we looked at IBM-000 and found significant saturation on ports lnk-0020 and lnk-0023, as evidenced by extremely high front-end read response times of 100 ms or greater.
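Why does an overcommitted port produce 100 ms response times rather than a gentle slowdown? A textbook single-server queueing approximation (M/M/1, not the tool's internal model, and with an assumed 5 ms service time purely for illustration) shows how response time climbs steeply as utilization nears saturation:

```python
def mm1_response_time(service_ms, utilization):
    """Single-server (M/M/1) response time: R = S / (1 - U).
    A textbook approximation, not IntelliMagic's internal model."""
    if utilization >= 1.0:
        return float("inf")  # beyond saturation the queue grows without bound
    return service_ms / (1.0 - utilization)

# With an assumed 5 ms service time, response stays modest at moderate
# load, then explodes toward the 100 ms range as the port saturates:
for u in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"U={u:.2f}  R={mm1_response_time(5.0, u):6.1f} ms")
```

The nonlinearity is the point: a port at 95% utilization is not "5% slower" than one at 90%; its queue time roughly doubles, which is why saturation shows up as queue time long before throughput flattens.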

The primary reason for this high response time was overcommitment of resources on the host adapter HA-0027, which served these two links. In the chart, the green dot marks the average, the green rectangle the standard deviation, and the yellow rectangle the minimum and maximum values. On the far-left chart, you can see that HA-0027 had the highest average, standard deviation and minimum/maximum throughput of all the host adapters.

In addition, the back-end controller serviced three storage pools: rnk-0400, rnk-0401 and rnk-0402. A total of 208 mdisks across these three storage pools were assigned to 16 ports on the storage controller, spread across eight host adapters with two ports each. Recall that lnk-0020 and lnk-0023 were on the same host adapter.

While each of the 16 ports had 13 mdisks assigned to it, the workloads differed considerably between the storage pools, so the load on individual mdisks varied greatly, causing the imbalance. In this case, rnk-0401 and rnk-0402, by far the busiest storage pools, had a significant number of mdisks assigned to lnk-0020 and lnk-0023, resulting in the throughput imbalance and high response time.
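The imbalance is easy to reproduce in miniature: an even mdisk count per port does not imply even throughput when per-mdisk workloads differ. The per-mdisk rates below are illustrative figures, not the measured values from this case:

```python
# 208 mdisks spread evenly over 16 ports is exactly 13 mdisks per port...
assert 208 // 16 == 13

# ...but throughput balance depends on per-mdisk load, not mdisk count.
# Suppose mdisks from the busy pools (rnk-0401/rnk-0402) push 40 MB/s each
# while the quiet pool's mdisks push 2 MB/s each (illustrative numbers):
port_hot = 13 * [40.0]    # a port holding mostly busy-pool mdisks
port_cool = 13 * [2.0]    # a port holding mostly quiet-pool mdisks

print(sum(port_hot), "vs", sum(port_cool), "MB/s")  # 520.0 vs 26.0: a 20x skew
```

Counting mdisks per port, the layout looks perfectly balanced; weighting each mdisk by its actual workload exposes the skew that saturated lnk-0020 and lnk-0023.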

We learned that the zoning by which back-end mdisks were assigned to back-end ports could be greatly improved by moving to SVC firmware 6.3 or later. This distributes the workload across all back-end storage ports for all mdisks, reducing the congestion and imbalances experienced in this particular case.

For a more in-depth discussion of these cases and how IntelliMagic Vision can help diagnose and resolve real-world performance issues, please review our recorded SVC series.
