At some point or another, we have probably all experienced noisy neighbors, either at home, at work, or at school. There are just some people who don’t seem to understand the negative effect their loudness has on everyone around them.
Our storage environments also have these “noisy neighbors” whose presence or actions disrupt the performance of the rest of the storage environment. In this case, we take a look at an SVC all-flash storage pool called EP-FLASH_3, where just a few bad LUNs have a profound effect on the I/O experience of the entire IBM Spectrum Virtualize (SVC) environment.
Clear Dashboards Identify SVC Front-End Issues
In the SVC Front-End Dashboard below, there are significant issues on all of these systems, as indicated by the big red dots. SVC001 in the second row has large red dots on Front-End Response Time, Front-End Read Response Time, Front-End Write Response Time, and FW Bypass. We begin the investigation here, where the most significant issues appear to be.
Drilling down to the SVC Storage Pool Front-End Dashboard, we see that several of the storage pools show risks. We want to look specifically at EP-FLASH_3, the pool that resides entirely on an IBM Flash System.
By clicking on any of the dots in the dashboard on the line labeled EP-FLASH_3 in the figure above, we drill down to all the underlying mini line-charts that show the detailed data represented by the dots. The mini charts are shown below.
These charts use the same rating system for the key risk indicators as the previous dashboards: green is good, and red is bad. The overall Front-End Response Time, Front-End Read Response Time, Front-End Write Response Time, and FW Bypass I/Os are all red, indicating exceptions. We ignore the FW Bypass I/Os because the write cache for this pool was disabled during this period. The remaining metrics indicate that the SVC is not providing good performance for the EP-FLASH_3 storage pool.
Identifying Root Cause for Flash System Performance Issues
The next thing we need to check is the back-end performance, that is, the response time the SVC sees from the Flash System. In the chart below, we see that the SVC is getting great performance from the Flash System: less than 1.0 ms. Therefore, we can safely conclude that the performance issues are not with the Flash System, but somewhere else within the SVC. The next step is to determine which volumes are driving the workload on the pool EP-FLASH_3.
The next chart shows the total throughput of the top 30 SVC volumes in the EP-FLASH_3 storage pool. What stands out is the steady-state throughput of the 4 volumes (red, blue, yellow, and green) at the bottom of the chart. It would be very strange for a production application to pump out a steady stream of throughput like that for what turns out to be the last 3 months. The chart to the right shows that these volumes have names ending in 125, 57, 124, and 202.
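The visual check described above, looking for volumes whose throughput is suspiciously flat, can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the thresholds, and the sample data are all hypothetical, and it is not how IntelliMagic Vision rates volumes. It flags a volume when the coefficient of variation (standard deviation divided by mean) of its throughput samples is very low.

```python
from statistics import mean, pstdev


def steady_state_volumes(throughput_by_volume, max_cv=0.05, min_mean=1.0):
    """Return the names of volumes whose throughput barely varies over time.

    max_cv and min_mean are hypothetical thresholds chosen for illustration.
    """
    flagged = []
    for name, samples in throughput_by_volume.items():
        avg = mean(samples)
        if avg < min_mean:
            # Idle volumes are flat too, but they are not noisy neighbors.
            continue
        cv = pstdev(samples) / avg  # coefficient of variation
        if cv <= max_cv:
            flagged.append(name)
    return sorted(flagged)


# Made-up throughput samples in MB/s, not data from the charts.
samples = {
    "vol_a": [200, 201, 199, 200, 200, 201],  # flat and busy
    "vol_b": [20, 180, 65, 310, 12, 95],      # bursty, like normal production
    "vol_c": [0, 0, 0, 0, 0, 0],              # idle
}

flagged = steady_state_volumes(samples)
print(flagged)
```

With this sample data, only vol_a is flagged. A real production workload tends to look like vol_b, with peaks and quiet periods, which is why a flat line sustained for months deserves a second look.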
Now we want to know which host systems these volumes are associated with. Looking up the properties of these suspicious volumes is easy. As you can see (with good eyes), the volumes are mapped to hosts with the prefix ABAAIX.
Working with the AIX administrator, we determined that these volumes support an Oracle database. We then worked with the Oracle DBA to determine the purpose and activity of these volumes. It turns out the DBAs were running queries against Oracle tables on these 4 volumes to measure query performance against tables without indexes. Apparently, they left the queries running and forgot about them… on a production system. They were glad to turn the queries off.
Comparing Final Results
After a week, we can see the results of our efforts. The next chart shows the volume throughput for the EP-FLASH_3 storage pool on 9/28 after the issue was resolved. The workload profile has changed dramatically – it looks like a normal production workload, well-balanced across many volumes.
But the big question is, did the reduction in throughput improve the front-end response times?
On 9/23, the storage pool EP-FLASH_3 had a front-end read response time rating of 0.39. In this chart, we can see how the front-end read response time rating compares with five days earlier. The chart shows that the front-end response time improved by an average of 0.98 ms, or 32.8%! Additionally, the risk improved from a warning (red border, rating 0.39) to no-risk (green border, rating 0.0).
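As a quick sanity check on those numbers, the arithmetic can be worked backwards. The article reports only the improvement (0.98 ms, 32.8%), not the absolute response times, so the before and after values computed below are implied by those two figures rather than read from the chart. A minimal Python sketch:

```python
def percent_improvement(before_ms, after_ms):
    """Percentage reduction in response time relative to the starting value."""
    return (before_ms - after_ms) / before_ms * 100.0


improvement_ms = 0.98    # average improvement reported above
improvement_pct = 32.8   # relative improvement reported above

# Implied values (assumption: both figures refer to the same average).
implied_before_ms = improvement_ms / (improvement_pct / 100.0)
implied_after_ms = implied_before_ms - improvement_ms

print(round(implied_before_ms, 2), round(implied_after_ms, 2))
print(round(percent_improvement(implied_before_ms, implied_after_ms), 1))
```

Under that assumption, the average front-end response time dropped from roughly 2.99 ms to roughly 2.01 ms.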
Using IntelliMagic Vision SVC Dashboards, we were able to proactively identify these SVC “noisy neighbors”. Once addressed, there were significant improvements in the front-end I/O response time for all the good folks in the neighborhood.