Kevin Kempf's Blog

July 9, 2009

What’s your DBID?

Filed under: 11g, Oracle, RMAN, Utilities — kkempf @ 3:11 pm

I started reviewing recovery scenarios and realized that if I ever had to do a complete database loss recovery without my recovery catalog, determining my DBID would be problematic. So I did a little shell scripting and came up with a simple solution at the OS level. This is written for bash, on Linux, and assumes you can receive email from the server. A simple way to confirm this is to type echo "Test" | mail -s "Email Test" dba@yourcompany.com and see if it gets through. If it doesn’t, check whether sendmail is running (# service sendmail status), or consult your system admins. Anyway, here’s the script:

cat incarnation.sh

#!/bin/bash
# source your ORACLE_HOME that rman runs out of, however you do that
. /u01/appprod/oracle/proddb/11.1.0/PROD_myserver.env
RCATUSER=<your recovery catalog user>
RCATPASS=<your recovery catalog user password>
DBA=<your email>
# really only necessary if you're not running this script out of the recovery catalog home
REPSID=<SID of your recovery catalog>

echo "list incarnation;" | rman catalog ${RCATUSER}/${RCATPASS}@${REPSID} | mail -s "RMAN DBIDs" $DBA

Now you can cron it up (crontab -e) and in the case below you’ll be updated every weekday @ 8am:
# email me the DBIDs every weekday
0 8 * * 1-5 /scratch/oracle/dba/scripts/incarnation.sh > /dev/null

Your email will look something like this:

Recovery Manager: Release 11.1.0.7.0 - Production on Thu Jul 9 15:39:21 2009
Copyright (c) 1982, 2007, Oracle.  All rights reserved.
connected to recovery catalog database
RMAN>
List of Database Incarnations
DB Key  Inc Key DB Name  DB ID            STATUS  Reset SCN  Reset Time
------- ------- -------- ---------------- ------- ---------- ----------
23832   23839   ADV_TRN  292427552        PARENT  1          05-FEB-06
23832   23833   ADV_TRN  292427552        CURRENT 440636     02-MAY-07
46144   46151   ADV_TEST 491394006        PARENT  1          05-FEB-06
46144   46145   ADV_TEST 491394006        CURRENT 440636     11-JUL-07
46214   46221   AAD_TEST 914518601        PARENT  1          05-FEB-06
46214   46215   AAD_TEST 914518601        CURRENT 440636     12-JUL-07
...

There may be a more elegant way to do this, but I couldn’t find one. I would have preferred this to be an EM job, except there is no option under RMAN scripts to connect to the recovery catalog only (you must select at least 1 target).
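
If you’d rather have the DBIDs as a plain query result instead of raw RMAN output, the recovery catalog exposes the same information through the rc_database view. Here’s a minimal sketch in the same style as incarnation.sh; the env file path, credentials, and script name are placeholders you’d adjust for your site:

#!/bin/bash
# same assumptions as incarnation.sh: env file, catalog credentials, working mail
. /u01/appprod/oracle/proddb/11.1.0/PROD_myserver.env
RCATUSER=<your recovery catalog user>
RCATPASS=<your recovery catalog user password>
REPSID=<SID of your recovery catalog>
DBA=<your email>

sqlplus -S ${RCATUSER}/${RCATPASS}@${REPSID} <<'EOF' | mail -s "Recovery catalog DBIDs" $DBA
set pagesize 100 linesize 120
column name format a10
select name, dbid, resetlogs_time from rc_database order by name;
EOF

Same idea, just easier to eyeball than the incarnation listing if all you want is the DBID per database.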

June 30, 2009

Don’t Delete those Standby Controlfiles!

Filed under: Oracle, RMAN — kkempf @ 11:55 am

As I stated earlier, I started using RMAN to run disk-based backups on my ERP systems. I hit this warning at the end of my production backups:

RMAN> delete noprompt obsolete device type disk;
RMAN retention policy will be applied to the command
RMAN retention policy is set to recovery window of 31 days
Deleting the following obsolete backups and copies:
Type                 Key    Completion Time    Filename/Handle
-------------------- ------ ------------------ --------------------
Control File Copy    3      22-MAY-09          /tmp/cntrl01.dbf
Control File Copy    4      26-MAY-09          /scratch/oracle/dba/sbcnt.dbf

RMAN-06207: WARNING: 2 objects could not be deleted for DISK channel(s) due
RMAN-06208: to mismatched status. Use CROSSCHECK command to fix status
RMAN-06210: List of Mismatched objects
RMAN-06211: ==========================
RMAN-06212: Object Type Filename/Handle
RMAN-06213: --------------- ---------------------------------------------------
RMAN-06214: Datafile Copy /tmp/cntrl01.dbf
RMAN-06214: Datafile Copy /scratch/oracle/dba/sbcnt.dbf

I’d created these standby control files in a “junk” directory because I was rebuilding my dataguard environment in an attempt to reproduce the Advanced Compression/Dataguard bug I am hitting. Never thought that deleting the files when I was done with them would cause me grief later. Well, turns out RMAN policy would have just cleaned them up for me… if I’d left them there…

Typical Oracle: it tells you to run crosscheck, which in reality just spits out the error shown above. What you really need to do is get rid of the references to these files in the repository/controlfile. It took me a few tries, but the syntax that fixed it is below:

Recovery Manager: Release 11.1.0.7.0 - Production on Tue Jun 30 10:53:05 2009

Copyright (c) 1982, 2007, Oracle. All rights reserved.

connected to target database: PROD (DBID=4098162468)
connected to recovery catalog database

RMAN> delete force noprompt obsolete device type disk;

RMAN retention policy will be applied to the command
RMAN retention policy is set to recovery window of 31 days
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1029 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=870 device type=DISK
Deleting the following obsolete backups and copies:
Type                 Key     Completion Time    Filename/Handle
-------------------- ------- ------------------ --------------------
Control File Copy    1736660 22-MAY-09          /tmp/cntrl01.dbf
Control File Copy    1754731 26-MAY-09          /scratch/oracle/dba/sbcnt.dbf
deleted control file copy
control file copy file name=/tmp/cntrl01.dbf RECID=3 STAMP=687543221
deleted control file copy
control file copy file name=/scratch/oracle/dba/sbcnt.dbf RECID=4 STAMP=687890578
Deleted 2 objects
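
If you ever need to script the same cleanup rather than type it interactively, the one-liner style from the DBID post works here too. A sketch, with the env file and the RCATUSER/RCATPASS/REPSID variables assumed from that earlier script:

# assumes the same env file and catalog variables as incarnation.sh
. /u01/appprod/oracle/proddb/11.1.0/PROD_myserver.env
rman target / catalog ${RCATUSER}/${RCATPASS}@${REPSID} <<EOF
list copy of controlfile;
delete force noprompt obsolete device type disk;
EOF

I wouldn’t leave FORCE in a recurring job, though; it only made sense here because I knew exactly why those two copies couldn’t be resolved.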

June 12, 2009

Logical Database Corruption in 11i

Filed under: Oracle — kkempf @ 3:31 pm

It’s a long one… but hopefully you’ll find it amusing, if not helpful…

I rather naively thought that since I used RMAN for backups, I was safe from database corruption issues. Or, to be more precise, if I hit an issue, I’d know about it because my backup would alert me, and I could perform an RMAN block-level restore. Wrong. RMAN covers you for physical corruption, not logical. To check for logical corruption, you need to use the backup validate check logical (database or datafile n) syntax. So there’s my first lesson, and as a result, I now have an Enterprise Manager job which runs this command once a week so I’m not blindsided next time.

What is more interesting is the solution. Logical corruption is, in essence, a block in a state where the header and footer don’t match, so all the data inside is suspect. It’s not that it can’t be read, it’s just “wrong”. From what I read, power issues to the SAN seem to be the most likely culprit, though it’s hard to rule anything out. Regardless, it happened silently, and long enough ago that I didn’t have backups available to roll through to fix it with RMAN (due to my retention policy). This raises an interesting question: how long do you keep an RMAN backup? Really, if you’re reading this I’d be interested to know, as it’s always a balancing act between space and recoverability. I have my retention policy set to 31 days, after which I figure we’re more in a “we need to fix it in place” situation than a “roll back and roll forward” situation.

So from here I’ll walk you through my solution, though I should add that support was utterly useless. I earnestly believed that this had to be something they would be good at. Textbook case where core database support would have me patched up and un-corrupted (is that a word?) in no time. Here’s my solution after playing with it for 4 days to better understand the nature of how logical corruption behaves.

1. RMAN told me I had problems

rman target / nocatalog
Recovery Manager: Release 11.1.0.7.0 - Production on Fri Jun 12 10:32:35 2009
Copyright (c) 1982, 2007, Oracle. All rights reserved.
connected to target database: PROD (DBID=4098162468)
using target database control file instead of recovery catalog
RMAN> backup validate check logical datafile 6;
Starting backup at 12-JUN-09
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: SID=935 device type=SBT_TAPE
channel ORA_SBT_TAPE_1: NetVault Oracle APM v.5.5.11
channel ORA_SBT_TAPE_1: starting full datafile backup set
channel ORA_SBT_TAPE_1: specifying datafile(s) in backup set
input datafile file number=00006 name=/u05/appprod/proddata/ctxd01.dbf
channel ORA_SBT_TAPE_1: backup set complete, elapsed time: 00:00:15
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- -------------
6    FAILED 0              23606        65536           5974945621830
File Name: /u05/appprod/proddata/ctxd01.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data       0              40343
Index      23             1410
Other      0              177
validate found one or more corrupt blocks
See trace file /u01/appprod/oracle/proddb/11.1.0/log/diag/rdbms/prod/PROD/trace/PROD_ora_20982.trc for details
Finished backup at 12-JUN-09

At this point, EM sends an alert if you have the default metric thresholds set; basically, if there are more than zero rows in v$database_block_corruption it notifies you (or, if you don’t have email preferences set, it at least flags it as a critical error):

Number of corrupt data blocks is 27.:Jun 12, 2009 3:32:32 PM EDT

This is a really gut-sinking, oh sh*t moment when it happens on your production system. At least it was for me.
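
If you don’t have EM watching that metric, a rough do-it-yourself equivalent is to poll v$database_block_corruption and mail yourself when anything shows up. A sketch, assuming OS authentication, the env file from the earlier post, and a hypothetical script name:

#!/bin/bash
# hypothetical check_corruption.sh: mail the DBA when v$database_block_corruption has rows
. /u01/appprod/oracle/proddb/11.1.0/PROD_myserver.env
DBA=<your email>

ROWS=$(sqlplus -S "/ as sysdba" <<'EOF'
set heading off feedback off pagesize 0
select count(*) from v$database_block_corruption;
EOF
)

if [ ${ROWS:-0} -gt 0 ]; then
  echo "${ROWS} corrupt block(s) reported on ${ORACLE_SID}" | \
    mail -s "Block corruption warning: ${ORACLE_SID}" $DBA
fi

EM already does this for you; this is just a belt-and-suspenders cron check.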

2. Verified dbv agreed

dbv file=ctxd01.dbf

Page 17090 failed with check code 6401
Block Checking: DBA = 25182923, Block Type = KTB-managed data block
**** row 0: row length 23 past end of block
**** row 0: row skipped so other stats may be wrong
---- end index block validation
Page 17099 failed with check code 6401

DBVERIFY – Verification complete

Total Pages Examined : 131072
Total Pages Processed (Data) : 40018
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 1410
Total Pages Failing (Index): 23
Total Pages Processed (Other): 177
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 89467
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Total Pages Encrypted : 0
Highest block SCN : 627787749 (1391.627787749)

3. Determined affected blocks

select * from v$database_block_corruption;
     FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
         6      17099          1         5.9712E+12 CORRUPT
         6      17089          2         5.9712E+12 CORRUPT
         6      17084          1         5.9712E+12 CORRUPT
         6      17082          1         5.9712E+12 CORRUPT
         6      17068          1         5.9712E+12 CORRUPT
         6      17048          1         5.9712E+12 CORRUPT
         6      17035          1         5.9712E+12 CORRUPT
         6      17032          1         5.9735E+12 CORRUPT
         6       8009          1         5.9712E+12 CORRUPT
         6       8004          1         5.9712E+12 CORRUPT
         6       7958          1         5.9712E+12 CORRUPT
         6       3924          1         5.9712E+12 CORRUPT
         6       3913          1         5.9712E+12 CORRUPT
         6        592          1         5.9712E+12 CORRUPT
         6        590          1         5.9712E+12 CORRUPT
         6        423          1         5.9712E+12 CORRUPT
         6        417          1         5.9712E+12 CORRUPT
         6        337          3         5.9712E+12 CORRUPT
         6        297          1         5.9712E+12 CORRUPT
         6        235          1         5.9712E+12 CORRUPT

4. Determined what segment(s) were affected

select tablespace_name, segment_type, owner, segment_name
from dba_extents where file_id = 6 and 297 between block_id and block_id + blocks -1;
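
Rather than plugging each block number into that query by hand, you can let the corruption view drive it and map every listed block in one pass. A sketch using only standard dictionary views, nothing site-specific:

sqlplus -S "/ as sysdba" <<'EOF'
column owner format a15
column segment_name format a30
-- map every block listed in v$database_block_corruption to its owning segment
select c.file#, c.block#, e.owner, e.segment_name, e.segment_type
  from v$database_block_corruption c, dba_extents e
 where e.file_id = c.file#
   and c.block# between e.block_id and e.block_id + e.blocks - 1
 order by c.file#, c.block#;
EOF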

All the corrupted blocks were in the ctxsys.dr$pending table. Specifically, this is an index-organized table (IOT), and the corruption was on one of its indexes (CTXSYS.SYS_IOT_TOP_218818), which happened to be the primary key. DOH! You can’t disable PK constraints on IOTs (ORA-25188: cannot drop/disable/defer the primary key constraint for index-organized tables or sorted hash cluster), nor can you do an alter index rebuild (ORA-28650: Primary index on an IOT cannot be rebuilt).

CTXSYS is Oracle Text’s owner; the dr$pending table is essentially a queue table, a “to do” list of Oracle Text index updates. Why it doesn’t just update the index directly is beyond me, but lucky for me, this table already feels very non-critical. Back in the day, Oracle had you install Oracle Text (interMedia Text and ConText are, I believe, earlier names for the same product) in its own tablespace. In my case it’s called CTXD.

(Side note, short rant. While I waited for support to not help me for days, I began to explore the possibility of re-installing Oracle Text via a Metalink note. Holy Cats Batman! This is the worst documented process I’ve ever seen from Oracle. I was following Metalink Docs 579601.1 and 312640.1, the latter of which is not just hard to follow, but outright wrong in so many places, it literally took me 2 days to re-create a dozen text indexes in a test environment. If you ever get to the point where you are forced to re-install Oracle Text with 11i, quit your job. Or do an RMAN drop database including backups noprompt. Because trust me, you will go mad trying to follow their document)

5. What is DR$PENDING?

Since an IOT PK index is virtually indestructible and un-alterable, and re-creating Oracle Text was not an option I wanted to pursue, I began trying to understand what was actually in the table ctxsys.dr$pending (which happened to be about 115k rows). When I realized it was a queue of updates to Oracle Text indexes, I figured there had to be a way to empty it. I found some relevant help on Metalink, and by running:

select u.username , i.idx_name from dr$index i, dba_users u where u.user_id=i.idx_owner# and idx_id in (select pnd_cid from dr$pending);

I was able to see what was in the table:

USERNAME IDX_NAME
---------------------------- ----------------------------
APPLSYS FND_LOBS_CTX
JTF JTF_NOTES_TL_C1

With this information, I could run

exec ctx_ddl.sync_index('JTF.JTF_NOTES_TL_C1');
exec ctx_ddl.sync_index('APPLSYS.FND_LOBS_CTX');

and after it churned a while and beat up the SAN pretty good, I had an empty IOT:

select count (*) from dr$pending;
  COUNT(*)
----------
0

6. Recreate the table

After confirming the table was empty, I watched it for a while to see how quickly and how often it refilled. It didn’t seem to be an issue (it can go minutes or more without new rows). So I created a script to drop dr$pending and re-create it from the DDL in the data dictionary, this time putting it in the SYSAUX tablespace.
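
For the drop and re-create, dbms_metadata is the easiest way to pull the existing definition before you touch anything. A sketch of just the extraction step (the output file name is made up, and the actual rebuild in SYSAUX still needs care, since CTXSYS owns dependent objects):

# pull the current DDL for ctxsys.dr$pending before dropping it
sqlplus -S "/ as sysdba" <<'EOF' > dr_pending_ddl.sql
set long 100000 longchunksize 100000 pagesize 0 linesize 200
select dbms_metadata.get_ddl('TABLE','DR$PENDING','CTXSYS') from dual;
EOF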

7. Still corrupt? How can that be?

I thought I had a win, but when I ran RMAN backup validate check logical datafile 6 again, it still showed the same corruption as before. Not understanding how this could be possible, I ran the SQL in step 4 above to determine which segments were affected. It came back with nothing! So I had corrupted blocks in my tablespace, but none of them were tied to any segment (object). Now we’re getting somewhere…

8. Fill ‘er up

I was pretty pleased with my progress, and was thinking along the lines of creating a PL/SQL cursor loop to generate SQL to move all the objects in the CTXD tablespace to SYSAUX. Then I could drop the CTXD tablespace and, presumably with it, the corruption. Then I saw there were LOB segments and LOB indexes in there, and I knew that wasn’t going to make the move any easier. Then I was talking to a colleague, and he said “why not just fill up the tablespace with garbage and, in doing so, overwrite the corrupt blocks?”. Turns out, this is amazingly effective!

a. create table ctxsys.fcr as select * from apps.fnd_concurrent_requests;

b. insert into ctxsys.fcr select * from apps.fnd_concurrent_requests where rownum < 10000;

c. repeat b until the tablespace is full and decrement the 10000 number as required to keep it going

d. drop table ctxsys.fcr;

Perhaps Oracle has a procedure or tool to do this kind of thing in a more elegant way (truncate?), but I was 100% sure that if I asked my analyst this, I wouldn't get anywhere. It makes sense that there could be a procedure to say, in effect, if you find a logically corrupt block, but it's not associated with any objects, re-write the block to look like any normal unused block in a tablespace.
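
If you’d rather not babysit the insert, the repeat-and-decrement in steps (b) and (c) can be wrapped in a PL/SQL loop that simply runs until the tablespace refuses another extent. A sketch, using the same throwaway table as above:

sqlplus -S "/ as sysdba" <<'EOF'
set serveroutput on
-- keep stuffing rows into the throwaway table until the CTXD tablespace is packed;
-- assumes the CTXD datafiles are NOT set to autoextend, or this will run a very long time
begin
  loop
    insert into ctxsys.fcr
      select * from apps.fnd_concurrent_requests where rownum < 10000;
    commit;
  end loop;
exception
  when others then
    -- expecting ORA-01653 (unable to extend) once the tablespace is full
    dbms_output.put_line('stopped: '||sqlerrm);
end;
/
EOF

You may still want to re-run it with a smaller rownum limit to pack the last few blocks, same as step (c).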

9. Lessons Learned

a. If you’re not running RMAN validate with check logical… do it! This is the simple script I now run weekly against all my production databases:

run {
allocate channel d1 type disk;
backup validate check logical database plus archivelog;
release channel d1;
}
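
If you’d rather cron it with an email, same pattern as the DBID script, something like this would do (a sketch; the script name, env file path, and schedule are assumptions):

#!/bin/bash
# hypothetical check_logical.sh: weekly logical corruption sweep, output mailed to the DBA
. /u01/appprod/oracle/proddb/11.1.0/PROD_myserver.env
DBA=<your email>

rman target / nocatalog <<'EOF' | mail -s "RMAN check logical: ${ORACLE_SID}" $DBA
run {
allocate channel d1 type disk;
backup validate check logical database plus archivelog;
release channel d1;
}
EOF

and in crontab, something like:

# logical corruption check, Sundays @ 6am
0 6 * * 0 /scratch/oracle/dba/scripts/check_logical.sh > /dev/null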

b. If you are using Oracle text with 11i, and your dr$pending table has a zillion rows in it, you can run ctx_ddl.sync_index to process these rows. I will probably create a pl/sql job which cursor loops through the schema and object names in the table (#5, above), then runs ctx_ddl.sync_index against that cursor result. Seems like once a week would be sufficient, though admittedly, I will need to better understand why I had rows in there that weren’t being processed. Based on note 382809.1, I believe that this is because concurrent requests aren’t scheduled to do this automatically.
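
For what it’s worth, that cursor loop is only a few lines. A sketch, run from whatever account you used for the manual sync_index calls above (the script name and credential placeholders are made up, and the ctxsys. prefixes may be unnecessary depending on who you connect as):

#!/bin/bash
# hypothetical sync_text_indexes.sh: sync any Text index that has rows waiting in dr$pending
. /u01/appprod/oracle/proddb/11.1.0/PROD_myserver.env
SYNCUSER=<the account you ran the manual sync_index calls as>
SYNCPASS=<its password>

sqlplus -S ${SYNCUSER}/${SYNCPASS} <<'EOF'
begin
  for r in (select distinct u.username, i.idx_name
              from ctxsys.dr$index i, dba_users u
             where u.user_id = i.idx_owner#
               and i.idx_id in (select pnd_cid from ctxsys.dr$pending))
  loop
    ctx_ddl.sync_index(r.username || '.' || r.idx_name);
  end loop;
end;
/
EOF

Cron it weekly like the other scripts and it should keep dr$pending from piling up.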
