Kevin Kempf's Blog

June 12, 2009

Logical Database Corruption in 11i


It's a long one… but hopefully you'll find it amusing, if not helpful…

I rather naively thought that since I used RMAN for backups, I was safe from database corruption issues. Or, to be more precise, if I hit an issue, I’d know about it because my backup would alert me, and I could perform an RMAN block level restore. Wrong. RMAN covers you for physical corruption, not logical. To check for logical corruption, you need to use the backup validate check logical (database||datafile X) syntax. So there’s my first lesson, and as a result, I now have an Enterprise Manager job which runs this command once a week so I’m not blindsided next time.

What is more interesting is the solution. Logical corruption is, in essence, a block whose contents don't agree with the block's own bookkeeping (the header and footer don't match), so the data inside is all suspect. It's not that it can't be read, it's just "wrong". From what I read, it seems power issues to the SAN are the most likely culprit, though it's hard to rule anything out. Regardless, it happened silently, and long enough ago that I didn't have backups available to roll through to fix it with RMAN (due to my RMAN retention policy). This raises an interesting question: how long do you keep an RMAN backup? Really, if you're reading this I'd be interested to know, as it's always a balancing act between space and recoverability. I have my retention policy set to 31 days, after which I figure we're more in a "we need to fix it in place" situation than a "roll back and roll forward" situation.
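For what it's worth, the retention window itself is a one-line RMAN setting, so it's cheap to revisit if you ever change your mind about that balance:

RMAN> configure retention policy to recovery window of 31 days;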

So from here I'll walk you through my solution, though I should add that support was utterly useless. I earnestly believed this had to be something they would be good at: a textbook case where core database support would have me patched up and un-corrupted (is that a word?) in no time. Here's my solution, after playing with it for 4 days to better understand how logical corruption behaves.

1. RMAN told me I had problems

rman target / nocatalog
Recovery Manager: Release 11.1.0.7.0 – Production on Fri Jun 12 10:32:35 2009
Copyright (c) 1982, 2007, Oracle. All rights reserved.
connected to target database: PROD (DBID=4098162468)
using target database control file instead of recovery catalog
RMAN> backup validate check logical datafile 6;
Starting backup at 12-JUN-09
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: SID=935 device type=SBT_TAPE
channel ORA_SBT_TAPE_1: NetVault Oracle APM v.5.5.11
channel ORA_SBT_TAPE_1: starting full datafile backup set
channel ORA_SBT_TAPE_1: specifying datafile(s) in backup set
input datafile file number=00006 name=/u05/appprod/proddata/ctxd01.dbf
channel ORA_SBT_TAPE_1: backup set complete, elapsed time: 00:00:15
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- --------
6    FAILED 0              23606        65536           5974945621830
  File Name: /u05/appprod/proddata/ctxd01.dbf
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              40343
  Index      23             1410
  Other      0              177
validate found one or more corrupt blocks
See trace file /u01/appprod/oracle/proddb/11.1.0/log/diag/rdbms/prod/PROD/trace/PROD_ora_20982.trc for details
Finished backup at 12-JUN-09

At this point, EM sends an alert if you have default thresholds set for metrics; basically, if there are any rows in v$database_block_corruption it notifies you (or, if you don't have email preferences set, it at least marks it as a critical error):

Number of corrupt data blocks is 27.:Jun 12, 2009 3:32:32 PM EDT

This is a really gut-sinking, oh sh*t moment when it happens on your production system. At least it was for me.

2. Verified dbv agreed

dbv file=ctxd01.dbf

Page 17090 failed with check code 6401
Block Checking: DBA = 25182923, Block Type = KTB-managed data block
**** row 0: row length 23 past end of block
**** row 0: row skipped so other stats may be wrong
---- end index block validation
Page 17099 failed with check code 6401

DBVERIFY – Verification complete

Total Pages Examined : 131072
Total Pages Processed (Data) : 40018
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 1410
Total Pages Failing (Index): 23
Total Pages Processed (Other): 177
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 89467
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Total Pages Encrypted : 0
Highest block SCN : 627787749 (1391.627787749)

3. Determined affected blocks

select * from v$database_block_corruption;
     FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTION_TYPE
---------- ---------- ---------- ------------------ ---------------
         6      17099          1         5.9712E+12 CORRUPT
         6      17089          2         5.9712E+12 CORRUPT
         6      17084          1         5.9712E+12 CORRUPT
         6      17082          1         5.9712E+12 CORRUPT
         6      17068          1         5.9712E+12 CORRUPT
         6      17048          1         5.9712E+12 CORRUPT
         6      17035          1         5.9712E+12 CORRUPT
         6      17032          1         5.9735E+12 CORRUPT
         6       8009          1         5.9712E+12 CORRUPT
         6       8004          1         5.9712E+12 CORRUPT
         6       7958          1         5.9712E+12 CORRUPT
         6       3924          1         5.9712E+12 CORRUPT
         6       3913          1         5.9712E+12 CORRUPT
         6        592          1         5.9712E+12 CORRUPT
         6        590          1         5.9712E+12 CORRUPT
         6        423          1         5.9712E+12 CORRUPT
         6        417          1         5.9712E+12 CORRUPT
         6        337          3         5.9712E+12 CORRUPT
         6        297          1         5.9712E+12 CORRUPT
         6        235          1         5.9712E+12 CORRUPT

4. Determined what segment(s) were affected

select tablespace_name, segment_type, owner, segment_name
from dba_extents where file_id = 6 and 297 between block_id and block_id + blocks -1;
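Rather than plugging in each block number by hand, you can join v$database_block_corruption straight to dba_extents; this is just a sketch of the same lookup, checking the first block of each corrupt range:

select e.owner, e.segment_name, e.segment_type, c.file#, c.block#
from dba_extents e, v$database_block_corruption c
where e.file_id = c.file#
and c.block# between e.block_id and e.block_id + e.blocks - 1;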

All the corrupted blocks were in the ctxsys.dr$pending table. Specifically, this is an index-organized table (IOT), and the corruption was in its primary key index (CTXSYS.SYS_IOT_TOP_218818). DOH! You can't disable PK constraints on IOTs (ORA-25188: cannot drop/disable/defer the primary key constraint for index-organized tables or sorted hash cluster), nor can you do an alter index rebuild (ORA-28650: Primary index on an IOT cannot be rebuilt).
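For the record, the dead ends looked roughly like this (the exact statements are my reconstruction; the errors are the ones quoted above):

alter table ctxsys.dr$pending disable primary key;
-- fails with ORA-25188

alter index ctxsys.SYS_IOT_TOP_218818 rebuild;
-- fails with ORA-28650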

CTXSYS is the Oracle Text schema owner; the dr$pending table is essentially a queue table, a "to do" list of pending Oracle Text index updates. Why Oracle doesn't just update the index directly is beyond me, but lucky for me, this table already feels very non-critical. Back in the day, Oracle had you install Oracle Text (interMedia Text, ConText; I think these are all name changes of the same product) in its own tablespace. In my case it's called CTXD.

(Side note, short rant. While I waited for support to not help me for days, I began to explore the possibility of re-installing Oracle Text via a Metalink note. Holy Cats, Batman! This is the worst-documented process I've ever seen from Oracle. I was following Metalink Docs 579601.1 and 312640.1, the latter of which is not just hard to follow but outright wrong in so many places that it literally took me 2 days to re-create a dozen text indexes in a test environment. If you ever get to the point where you are forced to re-install Oracle Text with 11i, quit your job. Or do an RMAN drop database including backups noprompt. Because trust me, you will go mad trying to follow their document.)

5. What is DR$PENDING?

Since an IOT PK index is virtually indestructible and un-alterable, and re-creating Oracle Text was not an option I wanted to pursue, I began trying to understand what was actually in the table ctxsys.dr$pending (which held about 115k rows). When I realized it was a queue of updates to Oracle Text indexes, I figured there had to be a way to empty it. I found some relevant help on Metalink, and by running:

select u.username , i.idx_name from dr$index i, dba_users u where u.user_id=i.idx_owner# and idx_id in (select pnd_cid from dr$pending);

I was able to see what was in the table:

USERNAME                       IDX_NAME
------------------------------ ------------------------------
APPLSYS                        FND_LOBS_CTX
JTF                            JTF_NOTES_TL_C1

With this information, I could run

exec ctx_ddl.sync_index('JTF.JTF_NOTES_TL_C1');
exec ctx_ddl.sync_index('APPLSYS.FND_LOBS_CTX');

and after it churned a while and beat up the SAN pretty good, I had an empty IOT:

select count (*) from dr$pending;
  COUNT(*)
----------
         0

6. Recreate the table

After confirming the table was empty, I watched it a while to see how quickly and how often it refilled. It didn't seem to be an issue (it can go minutes or more without new updates). So I created a script to drop dr$pending and re-create it from the DDL in the data dictionary, this time putting it in the SYSAUX tablespace.
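If you go the same route, dbms_metadata is the easy way to pull the current DDL so you can edit the tablespace clause before dropping and re-creating the table; something like this in SQL*Plus:

set long 100000
set pagesize 0
select dbms_metadata.get_ddl('TABLE','DR$PENDING','CTXSYS') from dual;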

7. Still corrupt? How can that be?

I thought I had a win, but when I ran RMAN backup validate check logical datafile 6 again, it still showed the same corruption as before. Not understanding how this could be possible, I re-ran the SQL in step 4 above to determine which segments were affected. It came back with nothing! So I had corrupted blocks in my tablespace, but none of them were tied to any segment (object). Now we're getting somewhere…
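If you want to convince yourself that orphaned corrupt blocks like these really do sit in free space, you can join them against dba_free_space; a quick sanity check along these lines:

select c.file#, c.block#, c.blocks
from v$database_block_corruption c, dba_free_space f
where f.file_id = c.file#
and c.block# between f.block_id and f.block_id + f.blocks - 1;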

8. Fill ‘er up

I was pretty pleased with my progress, and was thinking along the lines of creating a pl/sql cursor loop to generate SQL to move all the objects in the CTXD tablespace to SYSAUX. Then I could drop the CTXD tablespace and, presumably, the corruption along with it. Then I saw there were LOB segments and LOB indexes in there, and I knew those weren't going to be easy to move. Then I was talking to a colleague, and he said, "why not just fill up the tablespace with garbage and, in doing so, over-write the corrupt blocks?" Turns out, this is amazingly effective!

a. create table ctxsys.fcr as select * from apps.fnd_concurrent_requests;

b. insert into ctxsys.fcr select * from apps.fnd_concurrent_requests where rownum < 10000;

c. repeat b until the tablespace is full, decrementing the 10000 as required to keep it going (a rough loop to automate this is sketched after this list)

d. drop table ctxsys.fcr;
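Steps b and c can be automated with a quick and dirty PL/SQL block; this is just an untested sketch (it assumes the ctxsys.fcr table from step a already exists, and it halves the batch size whenever an insert fails on space with ORA-01653):

declare
  batch pls_integer := 10000;
begin
  while batch >= 1 loop
    begin
      insert into ctxsys.fcr
        select * from apps.fnd_concurrent_requests where rownum <= batch;
      commit;
    exception
      when others then
        if sqlcode = -1653 then        -- ORA-01653: unable to extend table
          batch := trunc(batch / 2);   -- out of space at this size, try smaller inserts
        else
          raise;
        end if;
    end;
  end loop;
end;
/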

Perhaps Oracle has a procedure or tool to do this kind of thing in a more elegant way (truncate?), but I was 100% sure that if I asked my analyst this, I wouldn't get anywhere. It makes sense that there could be a procedure that says, in effect: if you find a logically corrupt block that isn't associated with any object, re-write it to look like any other unused block in the tablespace.

9. Lessons Learned

a. If you’re not running RMAN validate with check logical… do it! This is the simple script I now run weekly against all my production databases:

run {
allocate channel d1 type disk;
backup validate check logical database plus archivelog;
release channel d1;
}

b. If you are using Oracle Text with 11i, and your dr$pending table has a zillion rows in it, you can run ctx_ddl.sync_index to process those rows. I will probably create a pl/sql job which cursor loops through the schema and index names in the table (#5, above), then runs ctx_ddl.sync_index against that cursor result; a rough sketch is below. Seems like once a week would be sufficient, though admittedly, I will need to better understand why I had rows in there that weren't being processed. Based on note 382809.1, I believe this is because concurrent requests aren't scheduled to do this automatically.
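Something along these lines is what I have in mind; an untested sketch, assuming it runs as a user who can see the CTXSYS dictionary tables and execute CTX_DDL:

begin
  for r in (select u.username, i.idx_name
              from ctxsys.dr$index i, dba_users u
             where u.user_id = i.idx_owner#
               and i.idx_id in (select pnd_cid from ctxsys.dr$pending)) loop
    -- sync each text index that has pending updates queued in dr$pending
    ctx_ddl.sync_index(r.username || '.' || r.idx_name);
  end loop;
end;
/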
