Storage Management Initiative (SMI) and the Block Services Performance (BSP) sub-profile: the Good, the Bad, and the Ugly – Part 3 “The Ugly”

By Brett Allison

In the first part of this three-part blog I introduced the SMI project and provided some reasons why it is a good thing. In the second installment I discussed some of the challenges with the SMI project, and in this last installment I will discuss the ugly parts. I will focus on the Block Services Performance (BSP) Sub-profile as that is our domain area.

The BSP Sub-profile is the component of the SMI specification that describes the various physical and logical components and their associated attributes and metrics. The BSP is well designed but, as mentioned in Part 2 “The Bad”, the conformance standard for the BSP is set so low that vendors can provide an implementation that both conforms to the standard and is mostly useless at the same time.

Like the character Tuco, this type of implementation and the resulting data is just plain ugly.

I’m going to provide a couple of examples of the types of useless but conforming implementations:

Example 1: Insufficient metrics to conduct performance analysis. Some vendors have chosen to conceal response times that are available natively through CLI or APIs. This leads to the following ElementType6 (Front-end Port Data) example:

Description | ElementName | ElementType | KBytesTransferred | SampleInterval | TotalIOs
$VENDOR_BlockStatisticalDataFCPort | $Vendor_BlockStatisticalDataFCPort | 6 | 22,423 | 00000000000300.000000:000 | 723

Apart from the statistic time, which is removed for brevity, the table contains only two metrics: KBytesTransferred and TotalIOs. So a vendor that is not inspired to do much with regard to performance is compliant by supplying only a single ElementType, in this case the front-end port, and just two metrics. Additionally, the conformance test does not check whether the counters are actually incrementing, because it has no awareness of workload. In summary, you can conform to the standard and provide absolutely no useful information. By contrast, the vendor’s native interface for the example above lets you gather response times for each port. Port response times are a true indicator of port congestion and overall utilization. Without response times, performance analysis of ports is a whole lot of guesswork.
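Even the little that this sample provides takes some care to interpret. The sketch below assumes the SampleInterval field follows the CIM interval layout (eight digits of days, then hours, minutes, seconds, a six-digit microsecond fraction, and a trailing ":000"). Under that assumption the value in the table reads as 3 minutes, not 300 seconds. The function names are hypothetical, and `per_second` expects the difference between two consecutive counter samples, not a raw counter value.

```python
from datetime import timedelta


def parse_cim_interval(value: str) -> timedelta:
    """Parse an interval string such as '00000000000300.000000:000'.

    Assumed layout: ddddddddhhmmss.mmmmmm:000
    """
    if len(value) != 25 or value[14] != "." or not value.endswith(":000"):
        raise ValueError(f"not a CIM interval: {value!r}")
    return timedelta(
        days=int(value[0:8]),
        hours=int(value[8:10]),
        minutes=int(value[10:12]),
        seconds=int(value[12:14]),
        microseconds=int(value[15:21]),
    )


def per_second(counter_delta: int, interval: timedelta) -> float:
    """Average rate for a counter delta observed over one sample interval."""
    return counter_delta / interval.total_seconds()


interval = parse_cim_interval("00000000000300.000000:000")
print(interval.total_seconds())  # 180.0
```

With only KBytesTransferred and TotalIOs available, throughput and I/O rate are about all that can be derived this way. No time counter is exposed, so an average response time cannot be computed.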

Example 2: A good implementation of ElementType6. Unlike the first example, which showed a conformant but poorly implemented ElementType6, the following table shows an incredible range of per-storage-port statistics, from read and write response times to remote connection response times to fabric errors. This is a good example of the type of information that is useful and exactly what performance analysts need.

InstanceID SCSIReadTimeAccumulated
ElementType SCSIWriteTimeOnChannel
StatisticTime MaxKB
TotalIOs MaxMS
AbbreviatedID MaxCount
SCSIReadOperations FCLinkFailureErrorCount
SCSIWriteOperations FCLossSynErrorCount
PPRCReceivedOperations FCLossSignalErrorCount
PPRCSendOperations FCPrimitiveSequenceErrorCount
ECKDReadOperations FCInvalidTransmissionWordCount
ECKDWriteOperations FCCRCErrorCount
SCSIBytesRead FCLRSentCount
SCSIBytesWritten FCLRReceivedCount
PPRCBytesReceived FCIllegalFrameCount
PPRCBytesSend FCOutOfOrderDataCount
ECKDBytesRead FCOutOfOrderACKCount
ECKDBytesWritten FCDupFrameCount
ECKDAccumulatedReadTime FCInvalidRelativeOffsetCount
ECKDAccumulatedWriteTime FCSequenceTimeoutCount
PPRCReceivedTimeAccumulated FCBitErrorRateCount
PPRCSendTimeAccumulated

Example 3: Ugly data. Ugly data can mean a lot of things, but what I mean is that in fifteen years of looking at performance data from all kinds of sources (application, OS, middleware, DB) I have never seen so many problems with data: missing data, counters that don’t increment, incorrect column headers, and counters that wrap.
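Two of these problems, counters that wrap and counters that do not increment, can at least be detected when computing deltas. The following is a minimal sketch and not any vendor's or product's method. The function names are hypothetical and the counter width is an assumption that has to be confirmed per array.

```python
def counter_delta(previous: int, current: int, bits: int = 64) -> int:
    """Delta between two samples of an increasing counter that may wrap.

    `bits` is the assumed counter width; a single wrap between samples
    is handled by taking the difference modulo 2**bits.
    """
    return (current - previous) % (1 << bits)


def is_stalled(previous: int, current: int) -> bool:
    """True when a counter did not move between two samples."""
    return current == previous


# A hypothetical 32-bit counter that wrapped between samples.
print(counter_delta(2**32 - 10, 5, bits=32))  # 15
```

Note that modular arithmetic cannot tell a wrap from a counter reset (for example after a controller restart), and a stalled counter is only suspicious if the component is known to be carrying workload.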

The ElementType10 (Disk) data below shows two samples (Time 1 and Time 2) for a single disk. To calculate how much activity occurred, you need to determine the delta between the values at Time 2 and the values at Time 1. Intuitively, the sum of the KBRead delta (1,256) and the KBWrite delta (258) should equal the measured total delta of 3,057. In this case the calculated sum is 1,514, which does not match.

Time | KBRead | KBWrite | KB Total (Measured)
Time 1 | 14,715,076 | 2,316,916 | 33,662,806
Time 2 | 14,716,332 | 2,317,174 | 33,665,863
Delta (T2-T1) | 1,256 | 258 | 3,057

Calculated Total (Read+Write) = 1,256 + 258 = 1,514
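The same check can be scripted so that it runs against every interval. This sketch uses the values from the table; the dictionary key "KBTotal" is a shorthand introduced here for the "KB Total (Measured)" column.

```python
# Cumulative counters for one disk at two sample times (values from the table).
t1 = {"KBRead": 14_715_076, "KBWrite": 2_316_916, "KBTotal": 33_662_806}
t2 = {"KBRead": 14_716_332, "KBWrite": 2_317_174, "KBTotal": 33_665_863}

# Activity during the interval is the difference between the two samples.
delta = {name: t2[name] - t1[name] for name in t1}

# Reads plus writes should account for the measured total.
calculated_total = delta["KBRead"] + delta["KBWrite"]
discrepancy = delta["KBTotal"] - calculated_total

print(delta)             # {'KBRead': 1256, 'KBWrite': 258, 'KBTotal': 3057}
print(calculated_total)  # 1514
print(discrepancy)       # 1543
```

A non-zero discrepancy does not say which counter is wrong, only that the three cannot all mean what their headers suggest.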

Sometimes the vendors can help explain how to interpret the data. The good side for us is that we can add more interpretation in our product, sanitizing the numbers and providing more meaningful results.

These are just a couple of the examples of the ugly aspects that result from minimal conformance requirements and poor implementations. For these ugly results to be eliminated from SMI-S all the major storage vendors need to:

1) Recognize that their native interfaces are not always the primary means of platform management. Do not provide only a small subset of the data from the native interfaces; include the same counters in both the native and SMI-S interfaces. Or, ideally, abandon the native interfaces as the primary means for platform management and instead adapt their tools to leverage SMI-S for platform management and data collection (kudos to those that already have).

2) Take pride in the performance of the hardware and the associated measurements. Invest enough time and energy in this space to provide high quality support for the BSP above and beyond the minimum requirements. This will show the masses that they have high quality products and they are not afraid to enable accurate measurement via an open standard. Design the metrics to allow problem diagnosis, rather than to hide problems.

3) Embrace changes to the BSP that put more teeth in the standard and increase the requirements for compliance such that the data becomes more broadly usable.

For more information on the standard see: http://snia.org/forums/smi

2 thoughts on “Storage Management Initiative (SMI) and the Block Services Performance (BSP) sub-profile: the Good, the Bad, and the Ugly – Part 3 “The Ugly””

  1. Anonymous says:

    Yep, and unfortunately, vendors treat SMI-S as a cost center, not a profit center, especially at startups. The situation is not gonna change until Microsoft or some other big client decides they need perf data, but if they do that, then what happens to Intellimagic?

    1. Brett Allison says:

      If Microsoft decided they needed performance data maybe they would create a tool or maybe they would just enforce existing standards. Who knows. In either case IntelliMagic would still be there. Either we would fill the niche nicely with a product designed to the standards or we would provide an alternative to some corporate tool (hmm, sounds familiar).
