The journey to create best practices for enterprise level storage systems
For many years, I have had the opportunity to be involved in addressing enterprise infrastructure performance challenges. When I mention to friends that I am a storage performance expert, they occasionally respond: “Me too, I have this huge hard drive on my laptop and I make sure that the IO blazes.”
It is then that I smile and calmly explain that, on your laptop, you are the only user. In my world, I have hundreds to thousands of people all wanting to use hard drives at the same time. In these cases, the hard drive performance is mission critical and needs to be in top condition 24/7.
It is then that they start to get the picture.
To maximize positive results, I cannot over-emphasize the need for proactive performance engineering for enterprise storage systems. In the first phase of a company’s use of a shared storage infrastructure, users and applications are added and the technology typically hums along incredibly fast. It may take a while for a company to begin to fill up a storage system, but once it does, it is too late to avoid performance issues.
When storage capacity is reaching its limit, slow response times and related issues become the norm. The best way to maximize the efficiency of the shared resources making up a Storage Area Network (SAN) is to begin performance engineering from the get-go. Users of distributed and open systems are used to sharing servers, networks and devices, but proactively performing performance engineering is often another matter. This is illogical, at best, because the performance engineering effort required for a storage device with hundreds of production applications is comparable to that required on a mainframe. Workload balancing and configuration tuning are critical to keep the system running efficiently.
When I was asked to take on a project to establish performance best practices for a set of robust enterprise level storage systems, I knew that the undertaking would require me to:
- Identify key performance metrics
- Trend the data across all storage systems
- Create metric thresholds
- Set up pro-active automated notifications
What I didn’t realize, at least at the time, was that these steps would put me on the critical path to establishing industry best practices. I also didn’t know that the task would be close to impossible, given the state of the current storage vendor performance management offerings. I also did not know that IntelliMagic Vision offered all of this and more in a user-friendly, heterogeneous package. Had I known the strengths and flexibility of the IntelliMagic software, I could have saved considerable time and money. And, I would have been able to deliver to the application teams a level of visibility and productivity that would have greatly increased the value and performance of their mission critical applications.
Identify key performance metrics
As I began my journey, I realized that dozens of metrics were available to me. Some even came labeled as “critical.” However, I also discovered, as problems emerged, that these supposedly “critical” metrics would not predict all of the problems we were seeing. I realized I needed to develop my own short list of key metrics. For example, storage systems defer actual writes to disk to save on IO time, temporarily holding the data in cache. However, when the cache becomes too full, the pending writes must be flushed from cache to disk, which can slow read IO. Hence, the best metric for this condition needs to be selected and monitored.
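The write-cache example can be made concrete with a small sketch. It assumes the storage system reports, per sample, the total number of cache slots and the number holding writes not yet flushed to disk. The field names (`cache_slots_total`, `cache_slots_write_pending`) and the limit passed to `samples_over_limit` are hypothetical, not taken from any vendor's tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheSample:
    """One cache measurement; field names are hypothetical."""
    timestamp: str
    cache_slots_total: int
    cache_slots_write_pending: int


def write_pending_pct(sample: CacheSample) -> float:
    """Share of cache occupied by writes waiting to be flushed, in percent."""
    if sample.cache_slots_total <= 0:
        raise ValueError("cache_slots_total must be positive")
    return 100.0 * sample.cache_slots_write_pending / sample.cache_slots_total


def samples_over_limit(samples: List[CacheSample], limit_pct: float) -> List[CacheSample]:
    """Return the samples whose write-pending share is at or above limit_pct."""
    return [s for s in samples if write_pending_pct(s) >= limit_pct]
```

A sustained run of samples over the chosen limit, rather than a single spike, is the kind of signal this metric is meant to surface; the right limit depends on the array and has to be found by observation.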
Trend the data across all storage systems
Once I had identified the key metrics I felt were most meaningful, my next challenge was to collect, store, trend and analyze them – especially across multiple storage systems and locations. Luckily, I was able to utilize several off-shore, low-cost resources and put them to work on an intensely manual effort. Unfortunately, there was no API available with which to design and implement an automated solution.
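Once the data has been collected, by hand or otherwise, the trending step itself is simple arithmetic. The sketch below assumes one average value per day for a single metric on each storage system, and fits a least-squares line to get the change per day. The system names and the input layout are hypothetical; this is an illustration of the idea, not the process used in the project.

```python
from typing import Dict, List


def daily_trend(values: List[float]) -> float:
    """Least-squares slope of equally spaced samples (change per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to compute a trend")
    mean_x = (n - 1) / 2.0          # mean of the indices 0, 1, ..., n-1
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def trends_by_system(history: Dict[str, List[float]]) -> Dict[str, float]:
    """Compute the trend for each storage system's daily series."""
    return {name: daily_trend(series) for name, series in history.items()}


def fastest_growing(history: Dict[str, List[float]]) -> str:
    """Name of the system whose metric is rising fastest."""
    trends = trends_by_system(history)
    return max(trends, key=trends.get)
```

Ranking systems by slope rather than by current value draws attention to the array that will reach its limit first, not just the one that is busiest today.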
Create metric thresholds
Defining thresholds required answers to a series of questions. What level of metric utilization would be considered good? What level would be bad? When would we be hitting the knee in the curve? Getting to the answers led to research and testing, as well as seeking out the advice of many peers. Progress was very slow.
While attending a storage vendor conference, one very senior member of a storage company suggested running all metrics at 50% to ensure top performance. However, to me, that was comparable to telling the Euro Railways to always run with only half the passengers it could accommodate to guarantee that they would get to where they needed to be on time. While they might get there fast, the company also could quickly go out of business.
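The "knee in the curve" and the 50% suggestion can both be illustrated with a textbook single-server queueing model, in which response time equals service time divided by (1 - utilization). This model is an assumption brought in for illustration; real storage components have multiple queues and caches and will not follow it exactly, so the numbers show the shape of the curve rather than usable thresholds.

```python
def response_time_multiplier(utilization: float) -> float:
    """Response time as a multiple of service time, for a single-server
    queue (assumed model): R / S = 1 / (1 - U)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


def utilization_for_multiplier(max_multiplier: float) -> float:
    """Highest utilization that keeps response time within max_multiplier
    times the service time, under the same assumed model."""
    if max_multiplier < 1.0:
        raise ValueError("max_multiplier must be at least 1")
    return 1.0 - 1.0 / max_multiplier


if __name__ == "__main__":
    for u in (0.5, 0.7, 0.8, 0.9):
        print(f"utilization {u:.0%}: response time x{response_time_multiplier(u):.1f}")
```

Under this model, running at 50% doubles response time relative to an idle resource, while the steep part of the curve only begins well above that. That is the trade-off behind the railway comparison: a blanket 50% rule is safe but leaves a great deal of paid-for capacity unused.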
Set up pro-active automated notifications
While we had expended a lot of resources to develop our own key metrics and manually collect the data, we found that threshold data simply wasn’t available. Upon hitting this wall, we were not in a position to set up pro-active notifications. For example, we wanted to be notified when a front-end storage system port became flooded with IO from one host, impacting all other users of the shared resource. With proper pro-active notification, this condition could be detected and the workload re-balanced.
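The port-flooding case can be sketched as a simple check, assuming per-host IOPS figures are available for each front-end port. The function name, the parameters and the default values (`busy_fraction`, `share_limit`) are hypothetical placeholders; as noted above, finding defensible threshold values was the hard part.

```python
from typing import Dict, Optional


def dominant_host(
    iops_by_host: Dict[str, float],
    port_iops_limit: float,
    busy_fraction: float = 0.8,   # hypothetical: port counts as busy above 80% of its limit
    share_limit: float = 0.6,     # hypothetical: one host above 60% of port IO is "flooding"
) -> Optional[str]:
    """Return the host flooding a busy port, or None if there is no such host."""
    total = sum(iops_by_host.values())
    if not iops_by_host or total <= 0:
        return None
    if total < busy_fraction * port_iops_limit:
        return None  # port has headroom, so no notification is needed
    host, iops = max(iops_by_host.items(), key=lambda kv: kv[1])
    return host if iops / total >= share_limit else None
```

For example, `dominant_host({"hostA": 9000, "hostB": 500, "hostC": 500}, port_iops_limit=10000)` returns `"hostA"`, which a scheduler could turn into a notification so the workload can be re-balanced before other users of the port are affected.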
But, despite the limitations, the project ultimately was considered a success, primarily because we were able to achieve our goal to develop best practices. However, if I’d had the IntelliMagic software available to me, I could have saved my company considerable time and expense, at a significantly higher level of competency. Out of the box, IntelliMagic Vision would have provided us the key metrics we needed, as well as proactive dashboards, dynamic thresholds, automated notifications and more.
As storage environments continue to grow and become more complex, maintaining stable, high-performing systems is a business imperative. The years of solid expert knowledge and experience that have been incorporated into IntelliMagic’s software solutions will keep you ahead of the game and arm you with the confidence to run your competitive, business critical applications on your storage infrastructure.
About the author:
Stuart Plotkin is an EMC storage performance specialist, supporting VMAX, DMX, VNX, FAST VP, Unisphere and ECC. His 20 years of award-winning performance engineering experience focuses on storage, server, database and applications areas. Through an emphasis on proactive performance methodologies and customer service, he has enabled his clients to achieve top stability and reliability levels. By leveraging performance engineering to achieve high efficiency rates and advanced resource optimization, he has consistently demonstrated innovative ways to meet new technology challenges to deliver maximum value.