Kevin Kempf's Blog

March 13, 2009

11i Fixed Object Statistics

Filed under: 11g, Oracle — Tags: — kkempf @ 2:14 pm

Oracle 10g and 11g Fixed Object Statistics

As an Oracle Applications 11i DBA, I recently encountered something which cost me a day of my life this week. This may not be terribly unusual in and of itself (it happens 5 days a week, right?), but the fact that the symptoms of the issue are virtually undocumented and (of course) didn’t arise in any of my Test, Development, or Training environments was maddening. In short, it was caused by bad/stale fixed object statistics.

Lets talk about statistics for a minute, at least as far as they pertain to 11i. I consider myself pretty well acquainted with the vagaries of statistics with the ERP. Before this issue arose, I ran 4 different stats jobs.

* A weekly run of the Gather Schema Stats concurrent request in 11i by sysadmin, which covers my apps schemas (Schema Objects)

* A monthly run of dbms_stats.gather_dictionary_stats, which keeps the data dictionary in line. (Data Dictionary)

* A weekly run against my Discoverer EUL and custom ERP schema, which are effectively missed by Gather Schema Stats (Schema Objects)

* A monthly run of dbms_stats.gather_system_stats, which tweaks values related to CPU & disk, and directly affects how the optimizer determines disk and CPU costs (System Performance)

After this issue, I’ll add another run, on demand: dbms_stats.gather_fixed_object_stats. This job (per Note 465787.1) fills the last category of statistics: Fixed Objects. It should be run “during normal system load” and not re-gathered until “a major new upgrade requires so, or the workload changes”. Whatever that means.

I’m getting ahead of myself, but I figured I’d present the statistics environment before I walked through the issue. I’ll continue with a brief description of the environment, and what happened to trigger the issue. I will say that there was considerably more patching and maintenance involved than what I describe, but in the interest of brevity, I’m only going to bother describing what was relevant, in retrospect.

We run the ERP in an all-Linux environment. The backend (database) runs on 64-bit Red Hat 5, and I was in the process of upgrading from 10.2.0.3 to 10.2.0.4 (what I’d consider to be a minor skip). The front end runs on 32-bit Red Hat 4, and is version 11.5.10.2. I can’t help but think this is likely one of the most common Oracle ERP deployment scenarios out there, if you disregard the host OS. When upgrading the database, all was proceeding normally, and I’d just run catupgrd.sql and utlrp.sql. Then I tried to start the 11i concurrent managers, and things got nutty. We run 16 standard managers, and a host of other managers, and they appeared to be starting up… when the backend load started spiking into the 30-70 range, and the front end showed hundreds of FNDLIBR processes when viewed through the linux ps command.

This is easy to explain and diagnose in retrospect, but what was happening is that the 11i internal manager (GSM) was running a query (select count(*) from v$thread) to determine if a manager was “alive”. This query was, simply, taking too long to return a result (7-9 seconds, best case), so the manager assumed the process was dead, and spawned another one. This explains the load on my backend (dozens, up to hundreds of connections running select count(*) from v$thread), and my front end (16 FNDLIBR processes, started every 10-30 seconds). Now, try to find any documentation on select count(*) from v$thread. Metalink. Google. Next to nothing. After some time and

analysis, I found that when I ran this query against my (reverted) 10.2.0.3 database, the execution and plan looked like this:

SQL> select count(*) from v$thread;

COUNT(*)

———-

1

Elapsed: 00:00:00.04

-----------------------------------------------------------------------------------------------
| Id  | Operation                  | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                  |     1 |    18 |     0   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE            |                  |     1 |    18 |            |          |
|   2 |   NESTED LOOPS             |                  |     1 |    18 |     0   (0)| 00:00:01 |
|   3 |    NESTED LOOPS            |                  |     1 |    12 |     0   (0)| 00:00:01 |
|*  4 |     FIXED TABLE FULL       | X$KCCRT          |     1 |     9 |     0   (0)| 00:00:01 |
|*  5 |     FIXED TABLE FIXED INDEX| X$KCCCP (ind:1)  |     1 |     3 |     0   (0)| 00:00:01 |
|*  6 |    FIXED TABLE FIXED INDEX | X$KCCTIR (ind:1) |     1 |     6 |     0   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Yet when I ran it against my (upgraded) 10.2.0.4 database, the execution and plan looked like this:

SQL> select count(*) from v$thread;

COUNT(*)

———-

1

Elapsed: 00:00:09.40

---------------------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |     1 |    18 |     0   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE           |                 |     1 |    18 |            |          |
|   2 |   NESTED LOOPS            |                 |     1 |    18 |     0   (0)| 00:00:01 |
|   3 |    MERGE JOIN CARTESIAN   |                 |     1 |     9 |     0   (0)| 00:00:01 |
|*  4 |     FIXED TABLE FULL      | X$KCCTIR        |     1 |     6 |     0   (0)| 00:00:01 |
|   5 |     BUFFER SORT           |                 |     1 |     3 |     0   (0)| 00:00:01 |
|   6 |      FIXED TABLE FULL     | X$KCCCP         |     1 |     3 |     0   (0)| 00:00:01 |
|*  7 |    FIXED TABLE FIXED INDEX| X$KCCRT (ind:1) |     1 |     9 |     0   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Yep, that’s 200+ times slower after upgrading! Note, however, the optimizer thinks that the Time and CPU cost are the same for both plans (.01 and 0, respectively). Why is the optimizer now throwing a cartesian merge join into the plan for this query? The short answer: the subtle changes to the optimizer were enough to dramatically change the execution plan for this SQL from 10.2.0.3 to 10.2.0.4, because of bad/stale fixed object statistics.

Here’s where it fell apart: the keywords I was searching on were along the lines of “V$THREAD performance” or “X$KCCRT slow read” or “infinite FNDLIBR processes spawned”. I won’t even tell you how effective my SEV1 SR with Oracle was. Let it suffice to say that I had to revert my environment to pre-10.2.0.4 and write off a day of work in a maintenance window (there was a ton of other work done during that day, but it’s not relevant to the issue).

In short, I’d recommend taking a look at Note 465787.1 regarding statistics. This procedure was new to me, and came with 10g. While knowing when to run it may be more than unclear based on the note, it’s a good thing to keep in the back of your mind when something really, thoroughly inexplicable is happening at a execution vs. explain plan level for X$ objects.

Advertisements

Blog at WordPress.com.