In Part 1 of this blog series we talked about how to select the SVC/V7000 replication technology that matches your business requirements or, more likely, your budget.
Now we need to think about how you can monitor and diagnose SVC/V7000 performance issues that may be caused by replication. I run into SVC/V7000 replication issues quite frequently, and have found that not all monitoring and diagnostic tools provide a comprehensive picture of SVC/V7000 replication. Further complicating matters, the technology you have selected influences both what you should expect and how you approach problem determination.
If you recall, in Part 1 of this blog I discussed several types of copy services available within IBM SVC/V7000:
- Metro Mirror (MM) is synchronous replication for metropolitan distances. It ensures that writes to both the primary and secondary disks are committed before the host write is acknowledged as complete.
- Global Mirror (GM) uses asynchronous copy services and is better for low bandwidth situations.
- Global Mirror with Change Volumes (GMwCV) is essentially a continuous FlashCopy that asynchronously updates a remote copy. It completely isolates the primary from WAN issues, but it takes up significant disk capacity and cache resources locally and leaves you with a remote copy that is significantly out of sync with your local copy.
- Stretched Cluster Volume Mirroring: This could also be considered a replication option within SVC/V7000 families. In this case, an SVC/V7000 cluster has nodes located at two locations allowing real-time copies to be spread across two locations.
As with performance analysis on any technology, response times can provide a quick way to understand the health of replication, provided they are judged against reasonable expectations.
One challenge is that the response times vary greatly depending on available and required bandwidth, geography, number of hops, and the distance. So without further interpretation, it is hard to see at first glance if a certain response time is normal, or a sign that there is a problem.
The following table illustrates the best case expected network latency times for a single I/O operation across different distances:
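As a rough sketch of how a best-case figure can be derived from distance alone, the snippet below computes propagation delay only. It assumes light travels through optical fiber at roughly 200,000 km/s (about 5 microseconds per kilometre one way). The function name and the `round_trips` parameter are illustrative; real links add equipment, routing, and protocol overhead on top of this lower bound.

```python
# Best-case network latency from distance alone (propagation delay only).
# Assumption: light in optical fiber covers roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def best_case_round_trip_ms(distance_km: float, round_trips: int = 1) -> float:
    """Lower bound on the latency that distance adds to one I/O operation.

    round_trips is a hypothetical knob for protocols that need more than
    one exchange per write; it defaults to a single round trip.
    """
    one_way_ms = distance_km / FIBER_KM_PER_MS
    return 2 * one_way_ms * round_trips


for km in (10, 100, 1000, 5000):
    print(f"{km:>5} km: {best_case_round_trip_ms(km):6.2f} ms")
```

A measured replication response time far above this floor for your distance is not proof of a problem by itself, but it is a reasonable prompt to look at bandwidth and congestion.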
There are several ways to track the health of your replication environment:
1) Proactively analyze your SVC statistics, including:
- PPRC Send and Received Tracks/sec and PPRC Send/Received Response time (valid for MM and GM, though not available on GM with change volumes). Establish a baseline of what you may expect under normal healthy circumstances. With IntelliMagic Vision, we can set meaningful thresholds on the PPRC Send and PPRC Received response times that are based on your situation. The expected values depend on the distance and available bandwidth. Figure 1 demonstrates a chart that shows and interprets the Response time for Replication Writes.
You can see from the red border on this chart that IntelliMagic Vision detected that the replication response times were much higher than they should be, indicating a problem that needs to be investigated. IntelliMagic Vision allows you to drill down to the storage pools and volumes to see which volumes are actively transmitting data.
- SVC Port Throughput (Send/Received): Monitor the port traffic dedicated to replication. This data is available at the port level of the SVC/V7000, and it is easiest to monitor if you have dedicated ports for replication. If these metrics show that throughput is low while response time is high, this is likely an indication of a problem with the WAN. Figure 2 demonstrates the total throughput for an SVC. With IntelliMagic Vision, you can drill down to the I/O Groups, Nodes or ports.
- SVC Port: Zero Buffer to Buffer Credit % is often an indication of WAN congestion. It is not uncommon to see some level of shortages as the SVC is usually capable of sending data faster than the link can support. Figure 3 demonstrates the average % of zero buffer to buffer credits.
- Port to Local Node Response time: If an increase in this metric correlates with an increase in zero buffer to buffer credit %, then congestion on the WAN may be impacting the front-end write response time.
- SVC Front-end Write Response time (Volume, storage pool, node): The front-end write response times should not be high for GM or GM with change volumes. If the front-end write response times are high for source volumes in a GM relationship, it is highly likely that significant WAN congestion exists or there is a performance issue within the downstream SVC/V7000.
2) Monitor the status of replication sessions with native SVC commands that inform you of the status of a volume that is in a replication session (See the IBM Redbook IBM System Storage SAN Volume Controller and Storwize V7000 Replication Family Services for more information).
3) Review the SVC error logs for the following errors:
- 1720 – In a Metro Mirror or Global Mirror operation, the relationship has stopped and lost synchronization, for a reason other than a persistent I/O error. You need to check the fabric logs and health of target cluster/nodes.
- 1920 – This is the most common indication of WAN bandwidth issues. It usually follows a mirror relationship being taken offline because link issues persisted beyond the link tolerance of 300 seconds.
- For GM with Change Volumes, consider adjusting the Cycling period. The default is 300 seconds. This is the amount of time that is allowed to copy all the changed grains to the secondary site before the next replication can start. If the changed grains from the first cycle have not finished copying, then the next cycle will not start. The shorter the cycle period, the less opportunity there is for peak write I/O smoothing, and the more bandwidth you will need. We have seen modest improvements in front-end write response times for bandwidth constrained environments by setting this number higher, but this does have the side effect of increasing the RPO/RTO.
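The trade-off between cycle period and bandwidth can be illustrated with a small sketch. It assumes a hypothetical list of per-second write rates and treats the link requirement as the average rate of the busiest cycle. This is a simplification for illustration, not how the SVC sizes anything, and it ignores any saving from repeated writes to the same grain within a cycle.

```python
from typing import List


def required_mb_per_sec(write_mb_per_sec: List[float], cycle_seconds: int) -> float:
    """Average write rate of the busiest cycle, in MB/s.

    write_mb_per_sec holds one sample per second (hypothetical input).
    Each cycle's changes are assumed to need shipping within one cycle
    period, so the link must sustain the busiest cycle's average rate.
    """
    per_cycle = []
    for start in range(0, len(write_mb_per_sec), cycle_seconds):
        window = write_mb_per_sec[start:start + cycle_seconds]
        # Divide by the full period even for a short final window.
        per_cycle.append(sum(window) / cycle_seconds)
    return max(per_cycle)


# A 60-second burst at 100 MB/s followed by 9 quiet minutes at 5 MB/s.
workload = [100] * 60 + [5] * 540

for period in (60, 300, 600):
    print(f"cycle {period:>3}s needs {required_mb_per_sec(workload, period):.1f} MB/s")
```

With this made-up workload, the longer periods spread the burst over more time and need far less bandwidth, at the cost of a remote copy that lags further behind.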
A final tip for troubleshooting in a GM/MM situation: if you suspect that replication is the culprit for front-end write response time issues, but proof is hard to find, then you could suspend the mirroring and see if the front-end write response times return to normal.
- If they do return to normal after suspending GM/MM, then the issue is related to GM/MM. Investigate the performance of the replication using the monitoring tips above.
- If the front-end write response times remain poor after suspending the GM/MM relationships, then the issue is not related to the GM/MM environment. You will need to look elsewhere.
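The checks described in this post, including the suspension test, can be sketched as a small rule set. Everything here is hypothetical: the field names, units, and thresholds are placeholders you would replace with values from your own healthy baseline, and the rules only restate the heuristics above rather than any product's logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One interval of statistics; all fields and units are hypothetical."""
    repl_response_ms: float       # PPRC send response time
    port_throughput_mb_s: float   # replication port send throughput
    zero_bb_credit_pct: float     # zero buffer-to-buffer credit %
    frontend_write_ms: float      # front-end write response time
    # Front-end write response time measured while GM/MM was suspended.
    frontend_write_ms_suspended: Optional[float] = None


@dataclass
class Thresholds:
    """Site-specific limits derived from a healthy baseline (assumed)."""
    repl_response_ms: float
    low_throughput_mb_s: float
    zero_bb_credit_pct: float
    frontend_write_ms: float


def triage(s: Sample, t: Thresholds) -> List[str]:
    """Return plain-language findings for one sample."""
    findings = []
    slow_repl = s.repl_response_ms > t.repl_response_ms
    if slow_repl and s.port_throughput_mb_s < t.low_throughput_mb_s:
        findings.append("high replication response time with low throughput: suspect the WAN")
    if s.zero_bb_credit_pct > t.zero_bb_credit_pct:
        findings.append("elevated zero buffer-to-buffer credits: possible WAN congestion")

    slow_frontend = s.frontend_write_ms > t.frontend_write_ms
    if slow_frontend and s.frontend_write_ms_suspended is not None:
        if s.frontend_write_ms_suspended <= t.frontend_write_ms:
            findings.append("writes recover when mirroring is suspended: replication related")
        else:
            findings.append("writes stay slow with mirroring suspended: look elsewhere")
    elif slow_frontend:
        findings.append("slow front-end writes: consider suspending mirroring to confirm")
    return findings


limits = Thresholds(repl_response_ms=20, low_throughput_mb_s=50,
                    zero_bb_credit_pct=10, frontend_write_ms=5)
sample = Sample(repl_response_ms=80, port_throughput_mb_s=12,
                zero_bb_credit_pct=25, frontend_write_ms=15,
                frontend_write_ms_suspended=2)
for finding in triage(sample, limits):
    print(finding)
```

A single sample is only a hint; in practice you would look for these patterns to persist across intervals before acting on them.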
What is your plan for protecting the availability of your SVC/V7000 replication services?