Friday, December 6, 2013

Quickly Identify an IO bottleneck Using Oracle’s AWR Reports

When working with customers to identify if they have an IO bottleneck and if flash storage can help them I use Oracle’s AWR reports. AWR reports have come along way since they were first introduced back in 10g and they can now help you identify many performance problems very quickly. Firstly lets start with a short description of an what an IO bottleneck is:
A performance problem being caused by the storage infrastructure not able to keep up with the demands for data reads or writes made by the database. Caused by restrictions in the hardware or configuration of the storage tier.
An IO bottleneck can have a nasty impact on your applications and really effect end user experience. So it is very important to be able to quick identify if an IO bottleneck exists before diving deeper to investigate the problem.
Lets have a look at how within a few minutes you can quickly identify a potential IO bottleneck. I want to start by setting some ground rules though:
1. Never create AWR reports greater than 60 minutes in duration. Any longer can dilute statistics and hide the real problem
2. Analyse more than one AWR report to give you an idea of the trend in performance
3. Always analyse the statistics during the time the problem occurs. This might seem obviously but it is amazing how often I see reports generated during quiet periods
You now have an AWR report/s, It doesn’t matter which format (text or html) you used for the output as it all looks the same. Below I will show you the basic sections you need to check to quickly identify an IO bottleneck. I have named each section after the part of the AWR report you need to look in to make it easier to find.
IO Throughput

Ok so this one isn’t a section name but the first step is to identify if your database is actually doing any IO during the AWR report period. One place you can for this is Load Profile Sectionwhich has two rows: Physical Reads and Physical Writes. These values give you the number of blocks read or written during the time period; this is individual block counts. I always prefer to use the Per Second row for a good idea of how much IO is going on. If you see low numbers like 10 or 30 then there isnt much happening however values like 1000 or 2000 or 30000 show some level of IO. The higher the value the more IO your database is doing.

Another section you can check this is in the “Instance Activity Stats” section. Look for the value statistic names:
“physical read total I/O requests” – read IOPs
“physical write total I/O requests” – write IOPs
“physical read total bytes” – read throughput
“physical write total bytes” – write throughput
Then look in the “per Second” column for the value. For the throughput statistics it is easier to convert the value into mbs to make it easier to read so bytes/1024/1024. Again low values like 10mb/s or 30mb/s shows not much IO but numbers like 150mb/s or 1gb/s show the database is indeed doing a lot of IO.
Foreground Wait Class Section
This section is you a small break down of each wait class category and some basic statistics such as avg. wait time and importantly the % of time this wait had for db time. In 11g the % is worked out for you but in 10g you will have to add up the number of waits and then divide each section by that total number to get a %. Here is a screenshot from an 11g report:
Untitled
To identify how much of an impact IO wait having you need to look for the “User I/O” and “System I/O” wait classes; with “User I/O” being the most important. This class is the actual application I/O where “System I/O” is generate by Oracle at the backend i.e. the system processes. Then look at the “%DB Time” column to see how much of the db time was spent waiting on IO. If you have an IO problem then at least the “User I/O” will have a high % value and be right at the top. You should also look at the “Avg wait (ms)” column which gives you the average latency regardless of the type of IO. If you have an IO problem then this value will be high. So “User I/O” high up the list with a large % and latency value.
Top 5 Time Forground Events/Top 5 Timed Events
The latter name is for 10g databases. This section gives you a break down of waits showing the top 5 offenders. To identify an IO bottleneck you need to look for IO waits high up the list with high latency values in the “Avg Wait (ms)” column. THis column is the most important one to look at and shows you the average latency or average time it takes for each IO call to complete. Below is an example of an IO constrained database:
Untitled
In this you can see that random IO (db file sequential read) is top of the list with an average latency of 5ms (5ms is ok for a disk array but considered IO constrained when the application is experiencing problems during high load). The higher the “Avg Wait (ms)” number is the worse your problem is. A well tuned disk array will give you around 3ms latency so anything more then that could be a problem but it is all relative to what is happening in your application at the time. In this case the customer was experiencing application problems at 5ms which meant query response times are too slow.

The basic IO waits that you need to look out for are:
db file sequential read – Random I/O reading usually a single row into SGA
db file scattered Read – Sequential I/O doing a full scan into SGA
direct path read – Sequential I/O doing a full table scan directly into the PGA
direct path read temp – Occurs when sorting large amounts of data on disk
direct path write temp – Occurs when sorting large amounts of data on disk
read by other session – sessions waiting for another session to read a block from disk
Solution
So you have identified that the database is doing a good level of I/O, that “User I/O” is at the top of the foreground wait class section with a high average latency and the top 5 wait events section is showing IO wait events at the top of the list again with high average latency values. Remember that this was a guide on how to quickly identify a potential IO bottleneck. You must now look into the problem deeper; check with the storage team how that layer is performing, check if the server HBAs or network swtiches are not the bottleneck. Maybe you need to tune the database by adding some indexes, partitioning tables or tuning the SQL queries. Sometimes you can fix all other bottlenecks related to IO but if your database is on traditional disk storage or SSDs then you are running on legacy slow storage which is going to limit you regardless of what tuning you carry out in the DB. In this case you should look into purpose built flash storage arrays or flash based PCIe cards which can reduce your IO latency by up to 15x out-of-the-box.

1 comment: