Kevin Kempf's Blog

March 23, 2009

11i Advanced Compression

Filed under: Oracle — kkempf @ 10:35 am

I’m pretty pleased with the results of my first foray into advanced compression with 11i.  It was so successful (and expensive) that I’m regression testing it now and hope to go live within a month.  The short of it is that it reduced my apps_ts_data from 77G to 20G, apps_ts_idx from 60G to 37G, and saved some room in apps_ts_archive, apps_ts_interface, and system as well.  Granted, some of these gains are simply a result of the alter table … move reorg, but it’s still a healthy savings.  Multiply it by all my environments, and we’re easily over the half-terabyte range in savings.  No metrics on performance yet, but I’ll post more statistics and observations as I get them.  The one bugger is that while this reduces my used space, I’ve had little success in reclaiming allocated space.  So I guess I simply won’t need to add datafiles to existing tablespaces for a long, long time…
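For what it’s worth, the reason reclaiming allocated space is so hard is that a datafile can only be shrunk back to its last allocated extent; try to resize below that high-water mark and you get ORA-03297. A rough sketch of what I mean (the datafile path and target size are made up):

-- allocated vs. used space, per tablespace
select t.tablespace_name,
       round(a.bytes/1024/1024/1024, 1) allocated_gb,
       round(u.bytes/1024/1024/1024, 1) used_gb
  from dba_tablespaces t,
       (select tablespace_name, sum(bytes) bytes from dba_data_files group by tablespace_name) a,
       (select tablespace_name, sum(bytes) bytes from dba_segments group by tablespace_name) u
 where t.tablespace_name = a.tablespace_name
   and t.tablespace_name = u.tablespace_name
 order by 1;

-- a resize only succeeds down to the highest allocated extent in the file
alter database datafile '/u01/oradata/PROD/apps_ts_data01.dbf' resize 20g;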

Things to know about compression:

  1. alter table x compress for all operations only affects new inserts/updates; existing rows stay uncompressed until the table is rebuilt
  2. alter table x move reorganizes the table so the existing rows end up compressed as well
  3. alter tablespaces so that future upgrades/table adds pick up compression: alter tablespace ts1 default compress; (the full statements are spelled out below)
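Spelled out, with a made-up table name and the FOR ALL OPERATIONS flavor that the Advanced Compression option provides, the statements look something like this:

-- compress future inserts/updates only; existing rows are left alone
alter table xx.some_big_table compress for all operations;

-- rebuild the table so the existing rows get compressed too
alter table xx.some_big_table move compress for all operations;

-- the move marks the table's indexes UNUSABLE, so rebuild them afterwards
alter index xx.some_big_table_n1 rebuild;

-- tablespace default, so tables created by future patches/upgrades come in compressed
alter tablespace apps_ts_data default compress for all operations;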

Rough outline of my compression strategy:

  1. Determine what’s in the buffer cache, and use SQL to write SQL to compress objects > 60% in memory as a starting point (see the sketch after this list)
  2. gather “before” statistics
  3. compress existing object
    1. set timing on
    2. alter table schema.name compress for all operations move;
  4. gather “after” statistics
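Step 1 is just SQL writing SQL; a minimal sketch of the idea (the joins and the 60% test are roughly how I’d approach it, adjust to taste):

-- generate compress DDL for tables with > 60% of their blocks in the buffer cache
select 'alter table '||o.owner||'.'||o.object_name||
       ' move compress for all operations;' ddl
  from dba_objects o,
       dba_segments s,
       (select objd, count(*) cached_blocks from v$bh group by objd) b
 where o.data_object_id = b.objd
   and s.owner          = o.owner
   and s.segment_name   = o.object_name
   and s.segment_type   = 'TABLE'
   and o.object_type    = 'TABLE'
   and b.cached_blocks / s.blocks > 0.6;

For the before/after numbers, a segment-size snapshot is enough:

select segment_name, round(sum(bytes)/1024/1024, 1) mb
  from dba_segments
 where tablespace_name in ('APPS_TS_DATA', 'APPS_TS_IDX')
 group by segment_name
 order by 2 desc;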

March 13, 2009

11i Fixed Object Statistics

Filed under: 11g, Oracle — kkempf @ 2:14 pm

Oracle 10g and 11g Fixed Object Statistics

As an Oracle Applications 11i DBA, I recently encountered something which cost me a day of my life this week. This may not be terribly unusual in and of itself (it happens 5 days a week, right?), but the fact that the symptoms of the issue are virtually undocumented and (of course) didn’t arise in any of my Test, Development, or Training environments was maddening. In short, it was caused by bad/stale fixed object statistics.

Let’s talk about statistics for a minute, at least as far as they pertain to 11i. I consider myself pretty well acquainted with the vagaries of statistics with the ERP. Before this issue arose, I ran 4 different stats jobs.

* A weekly run of the Gather Schema Stats concurrent request in 11i by sysadmin, which covers my apps schemas (Schema Objects)

* A monthly run of dbms_stats.gather_dictionary_stats, which keeps the data dictionary in line. (Data Dictionary)

* A weekly run against my Discoverer EUL and custom ERP schema, which are effectively missed by Gather Schema Stats (Schema Objects)

* A monthly run of dbms_stats.gather_system_stats, which tweaks values related to CPU & disk, and directly affects how the optimizer determines disk and CPU costs (System Performance)

After this issue, I’ll add another run, on demand: dbms_stats.gather_fixed_objects_stats. This job (per Note 465787.1) fills the last category of statistics: Fixed Objects. It should be run “during normal system load” and not re-gathered until “a major new upgrade requires so, or the workload changes”. Whatever that means.
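For reference, everything outside the Gather Schema Stats concurrent request boils down to dbms_stats calls along these lines (the schema name and the system-stats window are placeholders, not necessarily what I run):

-- data dictionary objects (SYS, SYSTEM, ...)
exec dbms_stats.gather_dictionary_stats;

-- schemas that Gather Schema Stats misses (Discoverer EUL, custom ERP schema)
exec dbms_stats.gather_schema_stats(ownname => 'EUL_US', cascade => true);

-- CPU / IO system statistics over a representative one-hour window
exec dbms_stats.gather_system_stats(gathering_mode => 'INTERVAL', interval => 60);

-- fixed object (X$) statistics, gathered under normal system load
exec dbms_stats.gather_fixed_objects_stats;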

I’m getting ahead of myself, but I figured I’d present the statistics environment before I walked through the issue. I’ll continue with a brief description of the environment, and what happened to trigger the issue. I will say that there was considerably more patching and maintenance involved than what I describe, but in the interest of brevity, I’m only going to bother describing what was relevant, in retrospect.

We run the ERP in an all-Linux environment. The backend (database) runs on 64-bit Red Hat 5, and I was in the process of upgrading from 10.2.0.3 to 10.2.0.4 (what I’d consider to be a minor skip). The front end runs on 32-bit Red Hat 4, and is version 11.5.10.2. I can’t help but think this is likely one of the most common Oracle ERP deployment scenarios out there, if you disregard the host OS. When upgrading the database, all was proceeding normally, and I’d just run catupgrd.sql and utlrp.sql. Then I tried to start the 11i concurrent managers, and things got nutty. We run 16 standard managers, and a host of other managers, and they appeared to be starting up… when the backend load started spiking into the 30-70 range, and the front end showed hundreds of FNDLIBR processes when viewed through the Linux ps command.

This is easy to explain and diagnose in retrospect, but what was happening is that the 11i internal manager (GSM) was running a query (select count(*) from v$thread) to determine if a manager was “alive”. This query was, simply, taking too long to return a result (7-9 seconds, best case), so the manager assumed the process was dead, and spawned another one. This explains the load on my backend (dozens, up to hundreds of connections running select count(*) from v$thread), and my front end (16 FNDLIBR processes, started every 10-30 seconds). Now, try to find any documentation on select count(*) from v$thread. Metalink. Google. Next to nothing. After some time and analysis, I found that when I ran this query against my (reverted) 10.2.0.3 database, the execution and plan looked like this:

SQL> select count(*) from v$thread;

  COUNT(*)
----------
         1

Elapsed: 00:00:00.04

-----------------------------------------------------------------------------------------------
| Id  | Operation                  | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                  |     1 |    18 |     0   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE            |                  |     1 |    18 |            |          |
|   2 |   NESTED LOOPS             |                  |     1 |    18 |     0   (0)| 00:00:01 |
|   3 |    NESTED LOOPS            |                  |     1 |    12 |     0   (0)| 00:00:01 |
|*  4 |     FIXED TABLE FULL       | X$KCCRT          |     1 |     9 |     0   (0)| 00:00:01 |
|*  5 |     FIXED TABLE FIXED INDEX| X$KCCCP (ind:1)  |     1 |     3 |     0   (0)| 00:00:01 |
|*  6 |    FIXED TABLE FIXED INDEX | X$KCCTIR (ind:1) |     1 |     6 |     0   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Yet when I ran it against my (upgraded) 10.2.0.4 database, the execution and plan looked like this:

SQL> select count(*) from v$thread;

  COUNT(*)
----------
         1

Elapsed: 00:00:09.40

---------------------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |     1 |    18 |     0   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE           |                 |     1 |    18 |            |          |
|   2 |   NESTED LOOPS            |                 |     1 |    18 |     0   (0)| 00:00:01 |
|   3 |    MERGE JOIN CARTESIAN   |                 |     1 |     9 |     0   (0)| 00:00:01 |
|*  4 |     FIXED TABLE FULL      | X$KCCTIR        |     1 |     6 |     0   (0)| 00:00:01 |
|   5 |     BUFFER SORT           |                 |     1 |     3 |     0   (0)| 00:00:01 |
|   6 |      FIXED TABLE FULL     | X$KCCCP         |     1 |     3 |     0   (0)| 00:00:01 |
|*  7 |    FIXED TABLE FIXED INDEX| X$KCCRT (ind:1) |     1 |     9 |     0   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Yep, that’s 200+ times slower after upgrading! Note, however, that the optimizer thinks the Time and Cost are the same for both plans (00:00:01 and 0 (0%), respectively). Why is the optimizer now throwing a Cartesian merge join into the plan for this query? The short answer: the subtle changes to the optimizer from 10.2.0.3 to 10.2.0.4 were enough to dramatically change the execution plan for this SQL, because of bad/stale fixed object statistics.
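If you suspect the same thing, a quick sanity check is whether fixed object statistics have ever been gathered for the X$ tables in the plan:

select owner, table_name, last_analyzed
  from dba_tab_statistics
 where object_type = 'FIXED TABLE'
   and table_name in ('X$KCCRT', 'X$KCCCP', 'X$KCCTIR');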

Here’s where it fell apart: the keywords I was searching on were along the lines of “V$THREAD performance” or “X$KCCRT slow read” or “infinite FNDLIBR processes spawned”. I won’t even tell you how effective my SEV1 SR with Oracle was. Let it suffice to say that I had to revert my environment to pre-10.2.0.4 and write off a day of work in a maintenance window (there was a ton of other work done during that day, but it’s not relevant to the issue).

In short, I’d recommend taking a look at Note 465787.1 regarding statistics. This procedure was new to me, and came with 10g. While the note leaves it less than clear exactly when to run it, it’s a good thing to keep in the back of your mind when something really, thoroughly inexplicable is happening at an execution vs. explain plan level for X$ objects.
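The fix the note implies is anticlimactic once you know what you’re looking at: gather the fixed object statistics during normal system load, then confirm the Cartesian join is gone from the plan. Something along these lines:

exec dbms_stats.gather_fixed_objects_stats;

explain plan for select count(*) from v$thread;
select * from table(dbms_xplan.display);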
