Performance Virtual Reality – Seeking the Truth in Storage Benchmarks

By Lee LaFrese

 

Performance analysts like myself have a love/hate relationship with benchmarks. On the one hand, benchmarks are perceived as a great way to quantify the ‘feeds and speeds’ of storage hardware. On the other hand, it is very difficult for benchmarks to be truly representative of how real applications work. Thus, I consider benchmarks a form of ‘virtual reality’: like virtual reality, benchmarks may seem very realistic, but they can deceive you. I’ve written this article to expand your knowledge of how benchmarks work so that you stay rooted in the real world.

Figure 2 – The Four Corners of Storage Benchmarking


Back in the dark ages (OK, it was the 1970s), storage performance was very simple: every I/O operation was synchronously read from or written to a spinning disk drive. It was straightforward to predict performance from easily measured drive metrics such as average seek time, average rotational latency and overhead. Benchmarking was done mostly to validate that the drives were performing as designed. Over time, storage system design has become increasingly advanced and, with that, more complex. Although it is possible to build sophisticated models of storage system performance, which we do with our IntelliMagic Direction solution, measurements are still needed to validate those models. Thus, storage performance benchmarking has grown in importance.
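To make the ‘simple’ era concrete, here is a minimal sketch of that kind of prediction: service time as the sum of average seek, average rotational latency and overhead. The drive figures are hypothetical, chosen only for illustration, and data transfer time is left out because the text above does not list it.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def service_time_ms(avg_seek_ms: float, rpm: float, overhead_ms: float) -> float:
    """Per-I/O service time of a single synchronous disk access, in ms."""
    return avg_seek_ms + avg_rotational_latency_ms(rpm) + overhead_ms


def max_iops(service_ms: float) -> float:
    """I/Os per second one drive can sustain if it is never idle."""
    return 1000.0 / service_ms


# Hypothetical drive figures (assumptions, not taken from any datasheet).
svc = service_time_ms(avg_seek_ms=25.0, rpm=3600, overhead_ms=1.0)
print(f"service time: {svc:.2f} ms, about {max_iops(svc):.0f} I/Os per second")
```

With no cache and no parallelism in the path, a benchmark on such a drive only had to confirm numbers like these.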

Besides validating models, there are three other reasons to conduct performance benchmarks:

  • To verify that the storage system is working as designed. This includes both the hardware design and the algorithms embedded in the firmware.
  • For performance acceptance testing. This is sometimes included as part of a purchase agreement.
  • To provide marketing collateral. Storage vendors often use benchmarks to try to prove ‘my storage system is faster than yours’. Unfortunately, this tends to be the shady underbelly of benchmarking.

Before we discuss their potential for misuse, we should first review how storage benchmarks work. Simple random benchmarks are sometimes called ‘four corners’ tests; they measure how a storage system behaves when a single consistent workload is applied, and they comprise read hit, read miss, write hit (no de-stage) and write miss workloads. Typically, the transfer size for these tests is small – 4 KB to 8 KB per I/O operation. Initially, the test is run at a low I/O load, and then the rate is ramped up by reducing the inter-arrival time or increasing the number of threads. The response time is plotted against the I/O rate, producing a knee-shaped curve as a bottleneck is approached.

Figure 1 – Knee Shaped Curve
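The knee shape can be reproduced with a very simple queueing model. The sketch below assumes a single bottleneck that behaves like an M/M/1 queue, which is my simplification for illustration and not how any particular storage system is modelled. The service time is a hypothetical value.

```python
def response_time_ms(io_rate: float, service_ms: float) -> float:
    """M/M/1 response time in ms for an arrival rate in I/Os per second."""
    utilization = io_rate * service_ms / 1000.0
    if utilization >= 1.0:
        raise ValueError("offered load exceeds the capacity of the bottleneck")
    return service_ms / (1.0 - utilization)


# Hypothetical bottleneck: 0.5 ms per I/O, so it saturates at 2000 IOPS.
SERVICE_MS = 0.5
for rate in (200, 1000, 1600, 1800, 1900, 1980):
    print(f"{rate:5d} IOPS -> {response_time_ms(rate, SERVICE_MS):6.2f} ms")
```

Response time barely moves at low load and then climbs steeply as the I/O rate nears the saturation point, which is the knee a ramped benchmark is looking for.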

It is possible to create representative synthetic workloads by using a combination of these four corner tests. For example, a 70/30/50 (70% read, 30% write, 50% cache hit) workload is commonly used in open systems environments as being representative of online transaction processing applications.
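As a rough sketch of how such a mix breaks down, the code below splits a 70/30/50 workload into its four corners and blends per-corner response times. It assumes the 50% hit ratio applies equally to reads and writes, and the per-corner response times are hypothetical values, not measurements. A weighted average like this is only reasonable at light load, before queueing effects dominate.

```python
def four_corner_mix(read_fraction: float, hit_fraction: float) -> dict:
    """Fraction of all I/Os landing in each corner (same hit ratio assumed
    for reads and writes)."""
    write_fraction = 1.0 - read_fraction
    miss_fraction = 1.0 - hit_fraction
    return {
        "read_hit": read_fraction * hit_fraction,
        "read_miss": read_fraction * miss_fraction,
        "write_hit": write_fraction * hit_fraction,
        "write_miss": write_fraction * miss_fraction,
    }


def blended_response_ms(mix: dict, corner_ms: dict) -> float:
    """Weighted average response time across the four corners."""
    return sum(mix[corner] * corner_ms[corner] for corner in mix)


mix = four_corner_mix(read_fraction=0.70, hit_fraction=0.50)

# Hypothetical light-load response times per corner, in milliseconds.
corner_ms = {"read_hit": 0.2, "read_miss": 6.0, "write_hit": 0.3, "write_miss": 8.0}

for corner, share in mix.items():
    print(f"{corner:10s} {share:.0%}")
print(f"blended response time: {blended_response_ms(mix, corner_ms):.3f} ms")
```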

Sequential benchmarks are another way to measure storage hardware capabilities. Here data is either read or written in a sequential manner. Modern storage systems will detect the sequential pattern and either pre-stage reads or gather writes to de-stage in batches. The I/O size per operation tends to be larger – normally up to 256 KB for open systems or a full cylinder for z/OS. Multiple threads are used to scale the benchmark to higher throughput levels. In this case, the result is often viewed as a plot of throughput versus the number of threads. Eventually, the curve flattens out: added threads will not increase throughput and in some cases may even decrease it.

Figure 3 – Throughput Curve
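The flattening of the throughput curve can be sketched with a deliberately simple model: each thread contributes a fixed rate until a shared ceiling is reached. Both numbers below are hypothetical, and the model ignores the decline that some systems show beyond saturation.

```python
import math


def throughput_mb_s(threads: int, per_thread_mb_s: float, ceiling_mb_s: float) -> float:
    """Aggregate throughput: linear in thread count until the ceiling is hit."""
    return min(threads * per_thread_mb_s, ceiling_mb_s)


def saturation_threads(per_thread_mb_s: float, ceiling_mb_s: float) -> int:
    """Smallest thread count at which the ceiling is reached."""
    return math.ceil(ceiling_mb_s / per_thread_mb_s)


# Hypothetical values: 150 MB/s per thread, 1000 MB/s system ceiling.
PER_THREAD, CEILING = 150.0, 1000.0
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} threads -> {throughput_mb_s(n, PER_THREAD, CEILING):6.0f} MB/s")
print("curve flattens at", saturation_threads(PER_THREAD, CEILING), "threads")
```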

Although these types of tests are good for engineering purposes, their utility for marketing is questionable. Real-world workloads are typically a mixture of reads and writes, random and sequential access, and many I/O sizes. Also, arrival patterns range from ‘uniform’ to highly ‘bursty’. These factors must be taken into account to design a truly realistic benchmark.

The most common benchmark quoted for marketing purposes is ‘max IOPS’. It is not uncommon to hear vendors quote millions of I/Os per second for their storage systems. Unfortunately, this is typically based on a 100% read hit workload with a small transfer size, sometimes as small as a single 512-byte sector. Since no real workloads behave like this, and there is no industry standard for how ‘max IOPS’ is measured, it is a meaningless metric.
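A quick bit of arithmetic shows why the transfer size matters so much when reading a ‘max IOPS’ claim. The sketch below converts an I/O rate into the bandwidth it implies. The one-million-IOPS figure is an illustrative assumption, not a vendor result.

```python
def bandwidth_mb_s(iops: float, transfer_bytes: int) -> float:
    """Data rate in decimal megabytes per second implied by an I/O rate."""
    return iops * transfer_bytes / 1_000_000


# Illustrative headline number, assumed for the sake of the example.
HEADLINE_IOPS = 1_000_000
for size in (512, 4096, 8192):
    rate = bandwidth_mb_s(HEADLINE_IOPS, size)
    print(f"{HEADLINE_IOPS:,} IOPS at {size:5d} bytes -> {rate:8.0f} MB/s")
```

The same headline I/O rate moves sixteen times less data at 512 bytes than at 8 KB, so an IOPS figure quoted without its transfer size and hit ratio says very little.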

Another number that storage vendors throw around is ‘max throughput’. When looking at this metric, it is important to know if the measurement was done from disk or cache, and if it was all reads, all writes, or a mixture of both. Again, measurements from the cache are not very useful. A truly useful throughput benchmark will execute a mix of large block sequential reads and writes from disk much like real applications would.

Another trick storage vendors may use is related to solid state or flash devices. Since these devices often function as a log-structured array under the covers, their performance may suffer once all of the capacity in the device has been written several times. That is why any benchmark of solid state or flash should be run only after a period of ‘conditioning’. This is also relevant to performance acceptance testing. Without conditioning, results may be overly optimistic.
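For planning purposes, the time a conditioning run takes can be estimated from the capacity, the number of full overwrites and the sustained write rate. The sketch below uses hypothetical values for all three. How many passes are enough depends on the device, so the pass count here is an assumption, not a recommendation.

```python
def conditioning_hours(capacity_gb: float, passes: int, write_mb_s: float) -> float:
    """Hours needed to overwrite the full capacity `passes` times."""
    total_mb = capacity_gb * 1000.0 * passes  # decimal GB to MB
    return total_mb / write_mb_s / 3600.0


# Hypothetical device: 800 GB, three full overwrites, 400 MB/s sustained writes.
hours = conditioning_hours(capacity_gb=800, passes=3, write_mb_s=400)
print(f"conditioning takes roughly {hours:.1f} hours")
```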

So what should you do? How do you objectively compare storage platforms on a performance basis? In my opinion, the best choice for this is the Storage Performance Council (SPC). The SPC is a vendor-neutral standards body focused on the storage industry. Its two best-known benchmarks are called SPC-1 and SPC-2. These benchmarks were designed by a panel of storage experts and try to avoid some of the pitfalls previously discussed. The SPC-1 benchmark is designed to be representative of typical I/O-intensive workloads such as transaction processing and database transactions. The SPC-2 benchmark is designed to be representative of sequentially oriented applications such as large DB queries, large file processing and video on demand.

Although the SPC has brought some structure to storage benchmarking, it is not without perils. First, not all storage vendors participate equally in providing audited SPC results. Another issue is that the configurations tested vary wildly, and it is very difficult to do a true ‘apples to apples’ comparison of two storage systems based solely on the SPC benchmarks.

Should you ignore benchmarks altogether? No, they are still useful for providing insight into storage capabilities. The important thing is to understand how the benchmark is constructed as well as the configuration of the storage measured. When possible, give more weight to audited benchmarks such as those from the SPC. And finally, meaningless benchmarks such as ‘max IOPS’ should be taken with a grain of salt.

While this article lays down some storage benchmark basics, we understand that it is often useful to engage an expert to help evaluate the veracity of vendor claims. Should you need a second opinion on the storage virtual reality you are facing, IntelliMagic provides modelling services to help you understand how a particular storage configuration would behave for your specific workload. It is much better to model your specific workloads on various platforms than to rely on comparing benchmarks. For more information on IntelliMagic Direction and modelling services, send an email to sales@intellimagic.com.