Disk Performance Problem Isolation Techniques: Simple Tools
Introduction: “I told you SQL Server couldn’t scale…”
The usual conversation goes like this… “I told you SQL Server couldn’t scale…” and so the finger-pointing begins. As a student of philosophy in college, I learned about rhetoric and baiting techniques. As a DBA, I had to learn to fight back and not be on the defensive against such attacks. From personal experience, I have managed many multi-terabyte, mission-critical databases with tens of thousands of users… so I knew these statements were spurious.
To me, the issue wasn’t whether the database could scale, but whether the customer had designed a scalable architecture that could handle the load and meet the business requirements.
In a database server running an enterprise application, there are many moving parts: server, network, OS, app servers, web servers, batch servers, software, HBAs, fabric switches, application customization, etc. All can lead to issues. Regrettably, it’s usually the database that gets the blame.
This note will outline tools and techniques to eliminate the database software as the bottleneck and help the team start looking for the real issue, which for database scalability is usually the disk subsystem.
Tools & Usage
The first objective is to test outside the SQL Server environment. The reason for this is to eliminate the database as an issue. If you get poor performance outside of SQL Server, then it’s not the database, it’s something else… usually the disk configuration.
The easiest test to run is a file copy. Simply find a large file, for example, 2G in size and then copy it from one disk to another. You can create this test file by making a test database with a data file of 2G. Detach the database files using sp_detach_db. Then xcopy the files from your drive F: to drive G:.
CREATE DATABASE test
ON
( NAME = test_dat,
  FILENAME = 'F:\temp\test_dat.mdf',
  SIZE = 2000MB )
LOG ON
( NAME = test_log,
  FILENAME = 'F:\temp\test_log.ldf',
  SIZE = 1MB )
Another way to create a large file is to use the Windows NT Resource Kit utility "creatfil.exe", which can create a large, fixed-size file.
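As a rough sketch of what creatfil.exe does, the following Python creates a fixed-size file by seeking to the desired length and writing a single byte. Note that on some filesystems this can produce a sparse or lazily zero-filled file, so for a throughput test you may prefer to write real data; the path and size here are placeholders.

```python
import os

def create_test_file(path, size_bytes):
    """Create a file of exactly size_bytes by seeking to the last
    offset and writing one byte; the OS fills the rest with zeros."""
    with open(path, "wb") as f:
        f.seek(size_bytes - 1)
        f.write(b"\0")
    return os.path.getsize(path)

# Example: a 2 GB test file would be size_bytes = 2 * 1024**3
```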
Then go to a DOS command prompt and run the following batch file:
xcopy F:\temp\test_dat.mdf G:\temp\test_dat.mdf
At this point there are a few things to remember:
- Make sure you are copying data to physically different drives. Remember that different LUNs can still share the same physical drives and controller.
- Run a PERFMON trace and look at the logical disk transfer rates, as outlined in the previous note on latency. If you are seeing times greater than 10 ms, you probably have a configuration issue with your SAN. You may need more spindles.
- Look at how many MB/sec you are pushing through. Take the size of the data file (e.g., 100 MB) and divide it by the number of seconds the copy took. To get an idea of throughput, the following table gives a broad idea to compare your results with:
Lastly, run the same test on another server that is running “well”. Compare the results.
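The timed-copy arithmetic above can be sketched in Python (the paths are placeholders; on a real test, copy between physically different drives using a file larger than the SAN cache):

```python
import os
import shutil
import time

def copy_throughput_mb_per_sec(src, dst):
    """Copy src to dst and return throughput in MB/sec:
    file size in MB divided by elapsed copy time in seconds."""
    size_mb = os.path.getsize(src) / (1024 * 1024)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    return size_mb / elapsed
```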
SQLIOSTRESS is a free tool from Microsoft that creates a disk load simulating SQL Server activity. Please note, you do not have to have SQL Server installed on the server to run it. Hence, it makes a good tool to run before creating your database and installing a lot of software that you might have to redo if you need to flatten and rebuild your SAN to get adequate numbers.
SQLIOSTRESS can be obtained at this location:
The tool will give the DBA plenty of information. What I like to do is run the tool and then monitor the IO patterns with PERFMON.
IOMETER is another tool that creates a large workload on the disk subsystem. The tool is not database specific and allows the user to configure the type of workload and IO patterns to run against the disk. This tool is also free. It can be downloaded from:
Again, use PERFMON to watch while the tool is running, and look at the output reports from IOMETER.
Lastly, another Microsoft tool that can be downloaded is SQLIO. The tool is located at:
This tool will generate data and give an output report. A sample run on a small file from my laptop looks like this:
C:\Program Files\SQLIO>sqlio -kR -s360 -fsequential -o8 -b64 -LS -dE -Fparameter.txt
using system counter for latency timings, -1302407296 counts per second
parameter file used: parameter.txt
file C:\TestFile\TestFile.dat with 16 threads (0-15) using mask 0x0 (0)
16 threads reading for 360 secs from file C:\testdir\test.dat
using 64KB sequential IOs
enabling multiple I/Os per thread with 8 outstanding
using specified size: 100 MB for file: C:\testdir\test.dat
Establish a good baseline by running some of the above tools and processes. Then compare the numbers to other “good” systems you have running and to other published data. Work with your drive vendor if the throughput numbers are not what you would expect. Keep in mind, you get what you pay for. If you spend little, don’t expect 500 MB/sec throughput!
After running the above types of tests, I’ll often run a BACKUP to a NUL: device and then compare a database operation to a non-database one. For example, what is the MB/sec throughput of a BACKUP to NUL: versus an XCOPY? If they are the same, then you know the database is not the issue and the disk is probably the bottleneck.
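If you want to make that comparison explicit, a minimal sketch follows. The 15% tolerance is my own assumption for what counts as “the same,” not a standard threshold:

```python
def database_is_suspect(backup_mb_per_sec, xcopy_mb_per_sec, tolerance=0.15):
    """Return True if BACKUP-to-NUL throughput falls more than
    `tolerance` (as a fraction) below raw XCOPY throughput -- i.e.
    the database path, not the disk, deserves a closer look."""
    return backup_mb_per_sec < xcopy_mb_per_sec * (1 - tolerance)
```

For example, 50 MB/sec from BACKUP against 100 MB/sec from XCOPY would flag the database layer, while 95 against 100 would point back at the disk.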
The key idea in problem isolation and resolution is to figure out which part of the equation is causing the problem. To do that, you need to break up the variables and test them independently. The above tools give you mechanisms with which to do this.
IMPORTANT NOTE: When running these tests, make sure that the data used is larger than the SAN cache. Otherwise the SAN cache will buffer the file and distort the results. For example, if the SAN cache is 1 GB, then make the file used for XCOPY or SQLIO 2 GB in size.
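The sizing rule above can be expressed as a quick sanity check. The 2x factor mirrors the 1 GB cache / 2 GB file example; treat it as a minimum rule of thumb, not a guarantee:

```python
def file_defeats_cache(file_size_gb, san_cache_gb, factor=2.0):
    """Return True if the test file is at least `factor` times the
    SAN cache size, so the cache cannot absorb the whole workload."""
    return file_size_gb >= san_cache_gb * factor
```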