Diagnosing Oracle Performance Using Trace-Based Analysis
Practical use of Oracle 10046 trace data, following Cary Millsap’s Method-R approach, to diagnose performance issues that are not visible through aggregated statistics — including cases where response time varies unpredictably between fast and slow executions.
Performance issues in Oracle systems are often approached through aggregated statistics — AWR reports, Statspack, or system views summarising activity over time. These can be useful, but they tend to obscure the behaviour of individual calls, particularly when response times vary.
A more precise approach is to measure the response time of each database call directly. This is the basis of the Method-R methodology described by Cary Millsap in Optimizing Oracle Performance, where response time is broken down into CPU time and wait time, allowing performance to be analysed in terms of where time is actually spent.
The focus of this approach is response time — the time experienced by the end user. From a business perspective, this is the only metric that really matters.
Internal indicators such as latch contention, buffer cache ratios, or other database metrics may suggest that a system is functioning correctly, but they do not reflect what the user actually experiences. A report that previously ran in 30 seconds but now takes 10 minutes, or a screen that takes 45 seconds to respond, is a problem regardless of what internal metrics indicate.
At a practical level, response time can be understood as the sum of the time spent on CPU and the time spent waiting for resources. This is the core idea behind the Method-R approach.
In real systems, this is observed as a sequence of database calls, each of which consumes CPU time and may incur wait events. The total response time experienced by the user is simply the accumulation of these calls and their associated waits, plus any time that is measured but not attributed to a specific call or wait event — commonly referred to as “unaccounted-for” time.
Resource Profile
================
Wall clock elapsed time: 10.7 seconds
Duration Pct # Calls Action
----------------------------------------------------------------------------------------------------
6.844849s 64.1% 53 db file scattered read
2.556234s 23.9% 430 SQL*Net message from client
0.774883s 7.3% 140 CPU-service: FETCH
0.151241s 1.4% 12 log file sync
0.144281s 1.4% unaccounted-for
0.136979s 1.3% 374 CPU-service: EXEC
0.047286s 0.4% 190 SQL*Net break/reset to client
0.019994s 0.2% 571 CPU-service: PARSE
0.000954s 0.0% 3 db file sequential read
0.000691s 0.0% 44 SQL*Net more data to client
0.000398s 0.0% 430 SQL*Net message to client
0.000125s 0.0% 3 Disk file operations I/O
0.000066s 0.0% 9 asynch descriptor resize
----------------------------------------------------------------------------------------------------
10.677981s 100.0% Total response time
Illustrative response-time profile derived from Oracle 10048 trace data.
Trace-based analysis addresses this directly by measuring the individual database calls and wait events that together determine the response time experienced by a user.
To apply this in practice, Oracle 10046 trace files can be collected for a session or process. These trace files record each database call along with its associated waits and timings. However, the raw trace output is not straightforward to interpret, particularly for complex workloads.
The key is to capture data for the duration of the problem task — from the moment it starts to when it finishes. This produces a trace file containing exactly what happened during the execution in question.
This is crucial. If too much activity is captured, the problem can be obscured by unrelated work. If too little is captured, important events may be missed.
This contrasts with aggregated approaches such as Statspack or AWR, where data is collected over a broader time window. While useful for identifying general trends, aggregation can obscure the behaviour of individual tasks, particularly when the issue is intermittent or affects only a small number of executions.
In practice, trace-based analysis is now my first approach when investigating performance issues. Measuring individual calls directly provides a level of clarity that aggregated metrics rarely achieve.
To make this analysis practical, I developed a set of small Perl utilities:
- tcprof: parses 10046 trace files, reconstructs execution flow, and attributes wait events to the correct database calls, producing a resource profile of response time
- tcskew: analyses the distribution of wait events, grouping them into elapsed-time buckets to identify skew and outliers
These tools were developed to support trace-based analysis where standard tools such as tkprof were not sufficient for detailed investigation.
In many cases, trace analysis confirms a familiar pattern: poorly written SQL dominates response time, often through excessive logical I/O or inefficient access paths. In these situations, the trace provides clear, actionable evidence of where time is being spent.
More interesting are cases where performance is inconsistent.
In one production system, a task reported by users as “sometimes slow” typically completed in under a second, but would occasionally take between six and nine seconds. The task was repeated continuously over long periods, making the variation in response time clearly noticeable.
To investigate, trace data was collected on a small number of executions until a slow instance of the task was captured. This was then analysed using the trace profiling tools.
The cause of the delay was immediately visible in the trace output. A small number of calls were responsible for the extended response time, with significantly longer wait events than those seen in normal executions.
By examining the underlying raw trace file, it was possible to identify the specific Oracle database file involved in the waits. This provided a clear indication of where the bottleneck resided and allowed the issue to be traced through to the storage layer, where it was investigated further with system administrators.
Problems of this kind are difficult to diagnose without capturing the problem task at the moment the delay occurs. By measuring individual calls and examining their associated waits, it becomes possible to move quickly from observed behaviour to a concrete point of investigation.
Trace-based analysis remains one of the most reliable ways to understand where time is actually spent in a database system. When applied carefully, it provides a level of clarity that aggregated metrics cannot.