Four Steps You Should Take to Identify, Resolve and Prevent IBM SVC Front-End Imbalance

By Brett Allison

Did you know you could be at risk of a performance meltdown while you still have plenty of front-end bandwidth?

An imbalanced front-end can cripple the performance of your IBM SVC system. An imbalanced front-end simply means that too much of the workload is handled by too few ports, which leads to buffer credit shortages, increased latency, and reduced throughput. It is very easy to create imbalances within an IBM SVC system’s front-end, and it can be fairly difficult to see them happening without the proper tools. To be fair, this also happens on other vendors’ hardware, but that is a topic for another day.

The IBM SVC virtualizes heterogeneous back-end block storage. It scales from one to four I/O groups, each supporting up to 2048 volumes, and each node contains between four and eight ports. Best practice dictates that each host use two Host Bus Adapters (HBAs) for failover. Each of the host’s HBAs should be zoned to each node within an I/O group, across at least two paths per node, on alternating fabrics, as shown in Figure 1. This provides redundancy across fabric links, nodes, and node ports.

Figure 1:

Within an I/O group, each host’s volumes are assigned a preferred node in round-robin fashion for load balancing. Not all host multi-pathing software honors the preferred node; this is discussed in more detail below. The purpose of this “spreading of the volumes” is to provide a rough mechanism for balancing the load across the I/O group’s nodes and each node’s ports.
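The round-robin spreading described above can be sketched in a few lines of Python. This is purely illustrative; the volume and node names are made up, and SVC performs this assignment internally rather than exposing it as an API:

```python
def assign_preferred_nodes(volumes, nodes):
    """Spread volumes across an I/O group's nodes in round-robin order.

    Illustrative sketch only: SVC assigns preferred nodes internally;
    this just shows the spreading scheme.
    """
    return {vol: nodes[i % len(nodes)] for i, vol in enumerate(volumes)}


# Hypothetical volumes on a two-node I/O group.
mapping = assign_preferred_nodes(
    ["vdisk0", "vdisk1", "vdisk2", "vdisk3"],
    ["SVC01_N3", "SVC01_N4"],
)
# Each node ends up as the preferred node for half of the volumes.
```

With equal per-volume load this spreads work evenly; as Step 2 explains, uneven per-volume load is exactly what undermines it.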

While this is not groundbreaking news, I continue to see very intelligent people running production environments with significant imbalances on the front-end and limited visibility into the state of the imbalance, the reasons why it got there, the best corrective actions, and the processes for avoiding it in the future.

Follow the steps below and see how IntelliMagic Vision can identify the issues and help you avoid them in the future.

Step 1: Identify the front-end imbalance.

With IntelliMagic Vision we use balance charts to get visibility into I/O Group, Node, and Port imbalances as shown in Figure 2:

Figure 2 shows an imbalance between the I/O groups: IO_GRP1 carries the majority of the workload. Once you have identified such an imbalance at the I/O group level, you can drill down and see whether there is an imbalance at the node level, as shown in Figure 3:

SVC01_N4 has a slightly higher workload than SVC01_N3 but this slight imbalance may still be acceptable. Still, it is wise to drill down even further to see if the port workloads are properly balanced, as shown in Figure 4:

Port SVC01_N4-2 has an average of 276 MB/sec while SVC01_N4-3 averages only 143 MB/sec. Clearly there is an imbalance in the way the workload is distributed. Why is this a problem? Because the most constrained component, in this case port SVC01_N4-2, caps the amount of work the entire cluster can do.
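A quick way to quantify this kind of port imbalance is to compare each port’s throughput against the average across its peers. A minimal sketch using the numbers above; the metric is my own illustration, not an IntelliMagic Vision formula:

```python
def busiest_port_ratio(port_mb_per_sec):
    """Return (busiest port, its load divided by the average port load).

    A ratio near 1.0 means an even spread; higher values mean one port
    carries a disproportionate share of the workload.
    """
    avg = sum(port_mb_per_sec.values()) / len(port_mb_per_sec)
    busiest = max(port_mb_per_sec, key=port_mb_per_sec.get)
    return busiest, port_mb_per_sec[busiest] / avg


# The two node ports from Figure 4.
port, ratio = busiest_port_ratio({"SVC01_N4-2": 276.0, "SVC01_N4-3": 143.0})
# SVC01_N4-2 runs at roughly 1.3x the average of the two ports.
```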

Step 2: Identify how the imbalance occurred.

An imbalance can happen for several reasons:

  • A host is honoring the preferred node setting, but the host’s workload is imbalanced across its volumes. This is typical in a database environment where some LUNs are dedicated to highly active data tables, others to temp space, and yet others to logs. The result is a workload that is not evenly balanced across the LUNs, weakening the effectiveness of the round-robin algorithm.
  • Improper zoning can result in all requests from one host coming through a single node within an I/O group.
  • Improperly configured host multi-pathing software can result in a situation where all access requests occur on a single path.

Step 3: Identify the best corrective action.

The best action will follow logically from the root cause identified in Step 2:

  • If this is an imbalance due to workload differences, modify the preferred node setting for individual volumes to redistribute the workload across the nodes more evenly. This is a non-disruptive change.
  • If the imbalance is due to improper zoning, fix the zoning to spread the I/O workload across both nodes within an I/O group such that the overall load is balanced.
  • If the imbalance is due to improperly configured multi-pathing software, configure the multi-pathing software to honor the preferred node setting, which gives the admin control over the balancing. In the case of VMware, use round-robin, which distributes the I/Os across the available paths. A round-robin access scheme may be less desirable than using a preferred path to each volume, but it is far better than the serious performance issues caused by an imbalanced front-end.
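The first corrective action, redistributing preferred nodes according to actual volume load rather than volume count, can be sketched as a simple greedy assignment. The volume names and load figures here are hypothetical, and the actual change would be made through the SVC management interface:

```python
def plan_preferred_nodes(volume_loads, nodes):
    """Greedy rebalance sketch: walk volumes heaviest-first, assigning
    each to the node currently carrying the least total load.

    Returns the proposed volume-to-node plan and each node's load.
    """
    node_load = {n: 0.0 for n in nodes}
    plan = {}
    for vol, load in sorted(volume_loads.items(), key=lambda kv: -kv[1]):
        target = min(node_load, key=node_load.get)
        plan[vol] = target
        node_load[target] += load
    return plan, node_load


# Hypothetical database host: one hot data LUN dominates the workload.
plan, node_load = plan_preferred_nodes(
    {"db_data": 200.0, "db_log": 50.0, "tmp": 40.0, "app": 30.0},
    ["SVC01_N3", "SVC01_N4"],
)
# The hot LUN lands alone on one node; the lighter LUNs share the other.
```

Greedy assignment will not always find the perfect split, but it captures the intent of the non-disruptive preferred-node change: balance by measured load, not by volume count.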

Step 4: Be proactive and avoid future imbalances.

  • Review balance charts on at least a weekly basis in order to have a firm grasp on your current workload balance.
  • When adding new hosts to an environment follow the same process each time:
    1. Review performance loads across the I/O groups
    2. Assign new hosts to the least utilized I/O group
    3. Assess new hosts as part of recurring baseline review in order to ensure continued balance in the environment.
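That recurring review lends itself to a simple automated check. A sketch, assuming you can export per-I/O-group throughput from your monitoring tool; the 20% tolerance is an arbitrary example, not a recommended value:

```python
def flag_imbalanced_groups(group_mb_per_sec, tolerance=0.2):
    """Flag I/O groups whose workload deviates from an even split by
    more than `tolerance`, expressed as a fraction of the fair share.
    """
    fair = sum(group_mb_per_sec.values()) / len(group_mb_per_sec)
    return sorted(
        g for g, mb in group_mb_per_sec.items()
        if abs(mb - fair) / fair > tolerance
    )


# A weekly snapshot like Figure 2: IO_GRP1 carries most of the load.
hot = flag_imbalanced_groups({"IO_GRP0": 120.0, "IO_GRP1": 480.0})
# Both groups deviate well beyond 20% of the fair share, so both are flagged.
```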

In addition to the steps above, recent enhancements in both hardware and software allow for the SVC ports to be dedicated to specific uses such as:

  • Host to SVC traffic
  • SVC to back-end storage traffic
  • Replication traffic

This provides you with further options when distributing workloads across the available resources. In general, we advise dedicating replication traffic to specific ports. I discuss how to choose the best replication technology in this blog.

In conclusion: while imbalances are common, balancing the front-end is important and not excessively difficult.

6 thoughts on “Four Steps You Should Take to Identify, Resolve and Prevent IBM SVC Front-End Imbalance”

  1. Maik says:

    Brett, I agree with you, but only in a non-stretched SVC environment. If you use a stretched configuration with metro distance between the two nodes in an I/O group, it depends on how you set up your vdisks and hosts across the sites. If you have more hosts in one site and use non-mirrored vdisks, it can be better to have an unbalanced I/O group because of the additional latency you incur going over the distance. The rules for the ports on one node are the same as in your example. With mirrored vdisks it is also good to look at where the host is located, to avoid crossing the distance twice. I opened an RFE for host site awareness in SVC so that SVC can self-optimize this, but I have no answer from IBM yet.

    1. Maik-

      Thanks for your insight. I did not consider stretched cluster configurations in my blog. I like your idea of having some information on the host location in the configuration data. While imperfect, one solution would be to use a naming convention that identifies the location of the host. How do you manage the host location identification currently?

      1. Maik says:

        Hi Brett, we use a location tag in host names as a general rule. All our host names are coded with OS, location, type (db, file, app, bkp, …), and so on.

        1. Sounds like a good work-around. SVC would have to be pretty smart to know the location of the server physically. What were your thoughts on how SVC should identify the location automatically?

  2. Kamesh says:

    Brett, I totally agree with you and have seen this across many SVC environments. However, you need an intelligent storage architect/SME to lay out the baseline in order to avoid front-end port imbalance. That said, I recommend you consider 8 ports per node (DH8) instead of 4 ports, as those are old, and perhaps explain about isolating ports for replication.

    1. Kamesh-

      Thanks for responding. I hope all is well with you. Even with the most intelligent SME the front-end can become imbalanced and that is why it is so important to continue to review the balance on a regular basis as part of the ongoing management of the environment.

      As you know the DH8 offers not only 8 but 12 ports and when you have replication I agree that you should specify purpose built ports (4 for Front-end, 4 for Back-end, 4 for Replication) when you can. This will significantly reduce SVC port contention.
