When I first started working in the storage industry, the only way to back up data was to tape. In the days of $10,000 per gigabyte of disk, no sane person would have proposed disk backup. The best practice for Disaster Recovery (DR) in those days was to create a nightly tape backup and then use PTAM (pick-up truck access method) to store it offsite. Applications were unavailable during the dump, and the odds of successfully restarting after a disaster were typically not in your favor. DR testing was piecemeal at best and ignored at worst. Statistics from that era suggest that many enterprises that experienced a major data loss due to a disaster simply went out of business.
Today, it is a different world. Cheaper disk, combined with the realization that most businesses need continuous availability, has led to replication schemes designed to avoid both data loss and downtime in the event of an unexpected outage. Point-in-time copy is used to make local disk clones that support functions such as data mining, business intelligence, and rigorous DR testing. Straightforward backups are still done, but they now often use “tapeless tape” systems that rely on spinning disk instead of magnetic tape. The net result is that instead of two copies of data (one on disk, one on tape), many enterprises now have more copies of their data than they can keep track of. Indeed, this proliferation of copies has been a major driver of the ongoing explosion in storage capacity. Although there are good reasons for all of these copies, it seems that our data centers are under siege by disk clones!
Naturally, none of these extra copies is free. Besides the obvious cost of the storage capacity, there is a performance cost. For example, let’s consider the form of remote replication called synchronous mirroring. With synchronous mirroring, a write I/O is not complete until both the primary storage system and the remote storage system have acknowledged that it has been written. Until someone figures out how to repeal the laws of physics and exceed the speed of light, synchronous mirroring carries a significant performance penalty that grows with distance. Light needs roughly 5 microseconds to travel a kilometer through glass, and synchronous replication requires at least one full round trip of communication. At 100 km, that is 500 microseconds each way, so the response time penalty for synchronous replication is no less than a millisecond even before accounting for any protocol overhead. Furthermore, you need enough inter-site bandwidth to handle the peak write throughput on the primary. If the bandwidth is insufficient, you may either drop into a semi-synchronous state or lose your mirroring session completely. Since you are paying a lot of money for all of the infrastructure to maintain replication, losing your session is hardly desirable. And even if you have enough bandwidth, you may see a performance impact if there is congestion on the links. Thus, introducing synchronous remote replication can make storage performance management quite a bit trickier.
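To make the distance arithmetic concrete, here is a minimal Python sketch of the lower bound described above. The function name, the `round_trips` parameter, and the sample distances are illustrative assumptions of mine, not part of any vendor tool. The 5 microseconds per kilometer figure is the approximation used in the text, and real links add protocol and equipment overhead on top of this floor.

```python
# Approximate one-way propagation delay of light in glass, as quoted in the text.
US_PER_KM_IN_FIBER = 5.0


def min_sync_penalty_ms(distance_km, round_trips=1, us_per_km=US_PER_KM_IN_FIBER):
    """Return the minimum added write response time in milliseconds.

    This is a propagation-only floor: it ignores protocol overhead,
    switching delay, and link congestion (hypothetical helper, for illustration).
    """
    if distance_km < 0 or round_trips < 1:
        raise ValueError("distance must be >= 0 and round_trips must be >= 1")
    one_way_us = distance_km * us_per_km
    # Each round trip crosses the link twice: out to the remote site and back.
    return 2 * round_trips * one_way_us / 1000.0


if __name__ == "__main__":
    for km in (10, 100, 1000):
        print(f"{km:>5} km: at least {min_sync_penalty_ms(km):.2f} ms added per write")
```

Passing a higher `round_trips` value shows how quickly the floor rises if a replication protocol needs more than one exchange per write, which is one reason the thousand-kilometer case is usually handled asynchronously.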
When synchronous remote replication first came out, a lot of customers were afraid of the performance ramifications and strict bandwidth requirements. There was also a consensus that true DR required the recovery site to be far enough away that a regional disaster bringing down the primary would not affect it, so a solution was needed that supported large distances. This was especially important for very long distance replication (thousands of kilometers), where the latency would severely impact I/O performance and the cost of bandwidth was onerous. Such distances require asynchronous replication, which completes the write without waiting for it to be mirrored to the remote site. The first asynchronous solution I was involved with was IBM Extended Remote Copy (XRC). This was a mainframe-only solution that used a z/OS server as the data mover. One of the initial design principles of XRC was to minimize how far the remote data would fall behind the primary. In fact, XRC would actually slow down your applications to keep the remote data from lagging. Needless to say, some XRC customers did not like this! Later, IBM gave customers a bit more control over this behavior by introducing volume pacing. XRC is still used by many large enterprises. It is even more complex than synchronous remote replication and has many performance aspects that need careful monitoring.
Some time after introducing XRC, and after listening to its customers, IBM came up with another asynchronous replication scheme called IBM Global Mirror. Global Mirror is strictly a peer-to-peer replication method, so you no longer need a z/OS host to move data, which saves considerable expense. The design principle of Global Mirror was the opposite of XRC: protect performance at all costs, even if it means the currency of the remote data falls far behind. Of course, some customers complained about this. It is impossible to please everybody! IBM made numerous tweaks to the original design to avoid serious currency issues. Even though Global Mirror sounds simple, there are a multitude of additional metrics that may affect performance and should be monitored.
There is no standardization across storage vendors, so EMC and Hitachi came up with their own replication methods in parallel with IBM. EMC called its suite of replication technologies the Symmetrix Remote Data Facility (SRDF). Although the ultimate functionality of the various SRDF methods is pretty much identical to that of IBM’s replication methods, the processes are different and there is a whole other set of metrics that needs to be monitored. Hitachi mostly licensed IBM remote replication technologies. However, it also introduced Hitachi Universal Replicator. This is almost like XRC running under the covers on the HDS storage system, thus avoiding the need for a z/OS host.
The bottom line is that having so many copies of data, and so many methods for creating them, adds a lot of complexity to storage performance management. Many shops are running remote replication, point-in-time copies, and backups all at the same time. How do you stop the attack of the disk clones on your sanity? Your investment in remote replication is designed to protect availability, but how do you manage the risk that replication performance issues will render your infrastructure effectively unavailable? Our software, IntelliMagic Vision, helps manage remote replication performance and the associated risks to availability. IntelliMagic Vision alerts you when you have bandwidth issues, insufficient replication ports, or overutilized host adapters. You can also use it to investigate issues and to track how far behind your Global Mirror or SRDF/A sessions are getting. With IntelliMagic Vision, the disk clones will be your ally, not your enemy.