Kevin Kempf's Blog

June 18, 2009

11g Dataguard/Advanced Compression Bug

Filed under: 11g, Bugs, Oracle — kkempf @ 3:50 pm

Ah, it was inevitable. I spoke too kindly of RDBMS 11g, and now I'm stuck waiting on Oracle Development to fix a major problem. The gist of it is that after I began compressing tables with 11g Advanced Compression, the Dataguard standby instance would crash.
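For context, the compression in question is 11gR1 OLTP compression; what I ran per table was along these lines (shown against one of the affected tables; treat it as a sketch, since your tables and options will differ):

```sql
-- Compress future DML (11.1 "Advanced Compression" syntax;
-- 11.2 later renamed this COMPRESS FOR OLTP):
ALTER TABLE bom.cst_item_cost_details COMPRESS FOR ALL OPERATIONS;

-- Rewrite the existing rows in compressed form; note that a MOVE
-- leaves the table's indexes unusable until they're rebuilt:
ALTER TABLE bom.cst_item_cost_details MOVE COMPRESS FOR ALL OPERATIONS;
```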

First, I saw odd ORA-07445 [__INTEL_NEW_MEMCPY()+44] SIGSEGV and ORA-00600 errors in my standby alert log, and eventually the instance crashed. After much research, I enabled the init parameter db_block_checking (highly recommended!), and the error became much clearer: instead of silently writing garbage, the apply process now failed the block check:

Errors in file /u01/appprod/oracle/proddb/11.1.0/log/diag/rdbms/proddg/PROD/trace/PROD_pr0e_226
ORA-10562: Error occurred while applying redo to data block (file# 19, block# 445469)
ORA-10564: tablespace APPS_TS_TX_DATA
ORA-01110: data file 19: ‘/u05/appprod/proddata/apps_ts_tx_data07.dbf’
ORA-10561: block type ‘TRANSACTION MANAGED DATA BLOCK’, data object# 1186389
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [kddummy_blkchk], [19], [445469], [6110], [], [], [], [], [],
[], [], []

Datafile 19, block 445469 tied back to the APPS_TS_TX_DATA tablespace and the object was BOM.CST_ITEM_COST_DETAILS. Yep, it was one of the handful of tables which I’d compressed so far.
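For anyone chasing a similar error, a lookup along these lines does the mapping; the literals come straight from the ORA-00600 and ORA-10561 above:

```sql
-- Which segment owns file 19, block 445469?
SELECT owner, segment_name, segment_type
  FROM dba_extents
 WHERE file_id = 19
   AND 445469 BETWEEN block_id AND block_id + blocks - 1;

-- Cross-check via the data object# from the ORA-10561:
SELECT owner, object_name, subobject_name
  FROM dba_objects
 WHERE data_object_id = 1186389;
```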

The analyst tied it to bug 8277580 and tells me there are unpublished parts of the bug which mention compression:
Bug 8277580 ORA-7445: [__INTEL_NEW_MEMCPY()+44]
O/S: 226 Linux x86-64

What scares me is that the bug was opened 4 months ago. If they drag their feet too long, I guess I uncompress everything, and they send me a refund for Advanced Compression and the year of support I already paid, right?

One note about db_block_checking: it defaults to “FALSE”, and the docs say it incurs a 1-10% overhead. With the alternative prospect of silently corrupting my standby (gah!), I can’t help but think this is a no-brainer to activate. With db_block_checking enabled, the behavior was what I would consider appropriate: the standby’s managed recovery process stops, and the RDBMS stays up. If you restart managed recovery, it dies on the exact same log with the exact same error.
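For what it’s worth, enabling it is a one-liner. Per the 11g docs the values are OFF, LOW, MEDIUM and FULL (FULL also checks index blocks); FULL is what I’d run, but treat the choice as workload-dependent:

```sql
-- Verify data blocks as they're modified; an invalid change raises
-- an error instead of being written to disk.
ALTER SYSTEM SET db_block_checking = FULL SCOPE=BOTH;
```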

One follow-up thought: at this point, I have to ask myself, do I have a recoverable database? In other words, my assumption so far is that Dataguard and the log-ship process are somehow corrupting logs due to Advanced Compression. Interesting trivia note: if you MD5 checksum the same archivelog on the primary and the standby, it will NOT match! I only thought to try this under 11g… does anyone know if they match under 10g? Is this my core problem?
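The comparison itself is trivial; here’s a sketch of the check as a small shell function (on the real systems you’d point it at the same archivelog sequence in the primary’s and standby’s archive destinations):

```shell
#!/bin/sh
# Compare the MD5 of two copies of (what should be) the same archivelog.
# Usage: md5_compare /path/to/primary_copy.arc /path/to/standby_copy.arc
md5_compare() {
    a=$(md5sum "$1" | cut -d' ' -f1)
    b=$(md5sum "$2" | cut -d' ' -f1)
    if [ "$a" = "$b" ]; then
        echo "MATCH"
    else
        echo "DIFFER"
    fi
}
```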

I will have to prove this by restoring a backup of my database and recovering until cancel to roll through a few hundred logs. If, on the other hand, the local archivelogs are being written “wrong”… I’m fooked.
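The test I have in mind is the classic manual roll-forward; this is only a sketch, since the exact steps depend on how the backup was taken (the USING clause assumes a restored backup controlfile):

```sql
-- After restoring the datafiles (and a backup controlfile),
-- feed it archivelogs one at a time as prompted:
RECOVER DATABASE UNTIL CANCEL USING BACKUP CONTROLFILE;
-- ...roll through a few hundred logs, type CANCEL, then:
ALTER DATABASE OPEN RESETLOGS;
```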

*edit* I’m not getting far with Oracle on this bug. It turns out, when I look at the bug status codes for this issue, it’s essentially unchanged since it was opened. I sent this email to my sales rep in hopes of “lighting a fire,” as we used to say in the Army:

After further working with support, I’m extremely unhappy with this situation. After we spoke last week, I’d mentioned that this bug had existed since late February 2009. Since then, there have been no significant updates to the bug, and when I looked up the bug status code, I was even more dismayed. According to Doc 16660.1, this bug has been in 1 of 2 statuses since inception:

10: Used to file a bug without all the needed information (for instance, trace files from the customer), to inform Development that a bug will soon become Status 11.

16: Used by BDE to pre-sort well-formed bugs before development starts fixing.

This isn’t exactly encouraging! The way I read this is that for 4 months, development has either been trying to gather needed information or pre-sorting the bug before actually working on it. This means that our (certified) configuration has been broken since we bought it.

The only option the analyst gave me was to uncompress my tables, and I’m afraid I may have to go this route, since I have absolutely no feel from support for whether this bug is even being worked or when it may be fixed. While I agree this is likely to fix the Dataguard problem, does it come with a refund for Advanced Compression?
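Backing out compression would look something like the following, repeated per table (the index name is hypothetical; a MOVE leaves every index on the table unusable, so each one needs a rebuild):

```sql
-- Rewrite the table uncompressed and clear the compression attribute:
ALTER TABLE bom.cst_item_cost_details MOVE NOCOMPRESS;

-- Rebuild each index invalidated by the MOVE:
ALTER INDEX bom.cst_item_cost_details_n1 REBUILD;  -- hypothetical name
```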

*edit* It’s been one month since support positively identified my bug as existing bug 8277580. While there is still no resolution, I am being asked for a bunch of trace files and logs this week, which hopefully at least means someone is looking at it. While the failures don’t always tie back to the same object or block, the Dataguard apply always dies with an ORA-00600 and a block reference to a compressed table. Meanwhile, our admin is keeping a live snapshot at our standby site via the SAN, meaning that at worst, if our primary failed, we could start up the database on the remote end without losing much (if any) information. It requires a huge amount of storage and bandwidth, however, which Dataguard does not.

*edit* Now support is asking about the possibility of uploading/providing all the components necessary to reproduce the bug on their end. Odd, since they acknowledge the bug, but I’ll be curious to see what comes of this. I can’t very well upload the entire database to them (at least not easily), and even sending them one offending datafile with the logs to reproduce the issue seems difficult. I keep my “data” and “index” datafiles at 8GB apiece and currently have about 9 of them (last I looked), so at a minimum we’re probably talking about 8-10GB to reproduce this at their end.
