Kevin Kempf's Blog

June 30, 2009

Don’t Delete those Standby Controlfiles!

Filed under: Oracle, RMAN — Tags: , — kkempf @ 11:55 am

As I stated earlier, I started using RMAN to run disk-based backups on my ERP systems. I hit this warning at the end of my production backups:

RMAN> delete noprompt obsolete device type disk;
RMAN> retention policy will be applied to the command
RMAN> retention policy is set to recovery window of 31 days
Deleting the following obsolete backups and copies:
Type Key Completion Time Filename/Handle
——————– —— —————— ——————–
Control File Copy 3 22-MAY-09 /tmp/cntrl01.dbf
Control File Copy 4 26-MAY-09 /scratch/oracle/dba/sbcnt.dbf

RMAN-06207: WARNING: 2 objects could not be deleted for DISK channel(s) due
RMAN-06208: to mismatched status. Use CROSSCHECK command to fix status
RMAN-06210: List of Mismatched objects
RMAN-06211: ==========================
RMAN-06212: Object Type Filename/Handle
RMAN-06213: ————— —————————————————
RMAN-06214: Datafile Copy /tmp/cntrl01.dbf
RMAN-06214: Datafile Copy /scratch/oracle/dba/sbcnt.dbf

I’d created these standby control files in a “junk” directory because I was rebuilding my dataguard environment in an attempt to reproduce the Advanced Compression/Dataguard bug I am hitting. Never thought that deleting the files when I was done with them would cause me grief later. Well, turns out RMAN policy would have just cleaned them up for me… if I’d left them there…
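For reference, the 31-day window in the output above is just the standard RMAN retention configuration, set with something along these lines:

RMAN> configure retention policy to recovery window of 31 days;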

Typical Oracle: it tells you to run CROSSCHECK, which in reality just spits out the same error shown above. What you really need to do is remove the references to these files from the repository/controlfile. It took me a few tries, but the syntax that finally fixed it is below:

Recovery Manager: Release 11.1.0.7.0 – Production on Tue Jun 30 10:53:05 2009

Copyright (c) 1982, 2007, Oracle. All rights reserved.

connected to target database: PROD (DBID=4098162468)
connected to recovery catalog database

RMAN> delete force noprompt obsolete device type disk;

RMAN retention policy will be applied to the command
RMAN retention policy is set to recovery window of 31 days
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1029 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=870 device type=DISK
Deleting the following obsolete backups and copies:
Type Key Completion Time Filename/Handle
——————– —— —————— ——————–
Control File Copy 1736660 22-MAY-09 /tmp/cntrl01.dbf
Control File Copy 1754731 26-MAY-09 /scratch/oracle/dba/sbcnt.dbf
deleted control file copy
control file copy file name=/tmp/cntrl01.dbf RECID=3 STAMP=687543221
deleted control file copy
control file copy file name=/scratch/oracle/dba/sbcnt.dbf RECID=4 STAMP=687890578
Deleted 2 objects


June 26, 2009

What’s Safe to Blog?

Filed under: Blog — kkempf @ 12:52 pm

This topic came up at work the other day, and I thought I’d elaborate for a moment on the unofficial conclusion we arrived at. Specifically, the question was this: where is the line drawn between fact, opinion, libel, and liability on a blog?

I have the luxury of knowing an attorney who wouldn’t bill me for the question, and received three solid recommendations:

1. You can mention a company or product by name, as long as your discussion is legitimate.
2. No 3rd party domain names/pieces of domain names or trademarks can be used in the blog title or URL.
3. To protect yourself and your company, don’t specify where you work, thereby keeping the blog personal in nature.

June 25, 2009

Unable to perform the backup because the database is closed

Filed under: Oracle, RMAN — Tags: — kkempf @ 11:08 am

When we decided to convert all possible 32-bit databases to 64-bit, Netvault (the tape management APM for Oracle) decided that we had to re-purchase client licenses (which were under a year old) for each host, because this was a "different" product. Hmmm, thanks for treating a customer so well… instead we'll go with door #2, and you'll get no new licenses and no support…

Door #2

I'm running EM Grid Control 10.2.0.5 (which I think is a significant, slick upgrade from 10.2.0.4, by the way) and am in the process of putting backups into the jobs section there. The end-state goal is to have all RMAN jobs scheduled and run through EM, backed up to NFS disk, then scraped to tape via one filesystem job.

So I'm moving along happily enough, creating weekly fulls and nightly incrementals and testing them, when on one tiny non-ERP database the EM job fails with this in the log:

Unable to perform the backup because the database is closed.

Odd. The database is clearly open, has been up for months, and is completely functional and accessible, so I went to Metalink to see if there was any useful information to be gleaned. It turns out there are a few well-documented "checks" for this error, most of them revolving around the ORACLE_SID being case sensitive, permissions, agent .xml files, etc. All of these checked out, as did a manual bequeath connection to the database. In the end, the fix was pretty Microsoft-ish: remove the database and listener targets from EM, wait a few minutes, add them back, and re-run the RMAN job. Fixed.
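To back up the claim that the database really was open before blaming the agent, a thirty-second sanity check from SQL*Plus is all it takes; a minimal sketch, run as SYSDBA on the database host:

-- confirms what the EM agent ought to be seeing
select d.name, d.open_mode, i.status, i.startup_time
  from v$database d, v$instance i;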

June 18, 2009

Change of Theme

Filed under: Blog — kkempf @ 7:09 pm

I run CentOS 5.3 (the free Red Hat rebuild) on my Dell Latitude D820 at work, and Firefox was not doing a stellar job of rendering the old theme: it was too busy, and the fonts were drawn too large for their boxes. I hope this is an improvement.

11g Dataguard/Advanced Compression Bug

Filed under: 11g, Bugs, Oracle — Tags: , , — kkempf @ 3:50 pm

Ah it was inevitable. I spoke too kindly of RDBMS 11g. Now I’m stuck waiting on Oracle Development to fix a major problem. The gist of it is that after I began compressing tables with 11g Advanced Compression, the dataguard instance would crash.

First, I saw odd ORA-07445 [__INTEL_NEW_MEMCPY()+44] SIGSEGV and ORA-00600 errors in my standby alert log. Eventually the instance crashed. After much research, I enabled the init parameter db_block_checking (highly recommended!) and the error became much clearer: because of db_block_checking, the standby was no longer silently writing garbage, it was failing the check:

Errors in file /u01/appprod/oracle/proddb/11.1.0/log/diag/rdbms/proddg/PROD/trace/PROD_pr0e_22666.trc:
ORA-10562: Error occurred while applying redo to data block (file# 19, block# 445469)
ORA-10564: tablespace APPS_TS_TX_DATA
ORA-01110: data file 19: '/u05/appprod/proddata/apps_ts_tx_data07.dbf'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 1186389
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [kddummy_blkchk], [19], [445469], [6110], [], [], [], [], [], [], [], []

Datafile 19, block 445469 tied back to the APPS_TS_TX_DATA tablespace and the object was BOM.CST_ITEM_COST_DETAILS. Yep, it was one of the handful of tables which I’d compressed so far.

The analyst tied it to bug 8277580 and tells me there are unpublished parts of the bug which mention compression:
Bug 8277580 ORA-7445: [__INTEL_NEW_MEMCPY()+44]
RDBMS Ver: 11.1.0.7
O/S: 226 Linux x86-64

What scares me is that the bug was opened 4 months ago. If they drag their feet too long, I guess I uncompress everything, and they send me a refund for Advanced Compression and the year of support I already paid, right?

One note about db_block_checking: it defaults to FALSE, and the docs say it incurs a 1-10% overhead. With the alternative prospect of silently corrupting my standby (gah!), I can't help but think this is a no-brainer to activate. With db_block_checking enabled, the behavior was what I would consider appropriate: the standby database's managed recovery process stops, and the RDBMS stays up. If you restart managed recovery, it dies on the exact same log with the exact same error.
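Enabling it is a one-liner along these lines (FULL shown purely for illustration; valid values in 11.1 are OFF, LOW, MEDIUM, and FULL, and the parameter is dynamic assuming you're running on an spfile):

-- no bounce required; scope=both also persists the setting in the spfile
alter system set db_block_checking = FULL scope = both;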

One follow-up thought: at this point, I have to ask myself, do I have a recoverable database? In other words, my assumption so far is that Dataguard and the log shipping process are somehow corrupting logs due to Advanced Compression. Interesting trivia note: if you MD5 checksum the same archivelog on the primary and the standby, it will NOT match! I only thought to do this under 11g… anyone know if they match under 10g? Is this my core problem?

I will have to prove this by restoring a backup of my database and recovering until cancel to roll through a few hundred logs. If, on the other hand, the local archivelogs are being written “wrong”… I’m fooked.

*edit* I'm not getting far with Oracle on this bug. It turns out, when I look at the bug status codes for this issue, it's essentially unchanged since it was opened. I sent this email to my sales rep in hopes of "lighting a fire," as we used to say in the Army:

After working further with support, I'm extremely unhappy with this situation. After we spoke last week, I'd mentioned that this bug had existed since late February 2009. Since then, there have been no significant updates to the bug, and when I looked up the bug status code, I was even more dismayed. According to Doc 16660.1, this bug has been in one of two statuses since inception:

10: Used to file a bug without all the needed information (for instance, trace files from the customer), to inform Development that a bug will soon become Status 11.

16: Used by BDE to pre-sort well-formed bugs before development starts fixing them.

This isn’t exactly encouraging! The way I read this is that for 4 months, development has either been trying to gather needed information or pre-sorting the bug before actually working on it. This means that our (certified) configuration has been broken since we bought it.

The only option the analyst gave me was to uncompress my tables, and I’m afraid that I may have to go this route since I have absolutely no feel from support whether this bug is even being worked or when it may be fixed. While I agree this is likely to fix the Dataguard bug, does it come with a refund for Advanced Compression?

*edit* It's been one month since support positively identified my issue as existing bug 8277580. While there is still no resolution, I am being asked for a bunch of trace files and logs this week, which hopefully at least means someone is looking at it. While they don't always tie back to the same object or block, the Dataguard apply always fails with an ORA-00600 and a block reference to a compressed table. Meanwhile, our admin is keeping a live snapshot at our standby site via the SAN, meaning at worst, if our primary failed, we could start up the database on the remote end without losing much (if any) information. It requires a huge amount of storage and bandwidth, however, which Dataguard does not.

*edit* Now support is asking about the possibility of uploading/providing all the components necessary to reproduce the bug on their end. Odd, since they acknowledge the bug, but I'll be curious to see what comes of this. I can't very well upload the entire database to them (at least not easily), and even sending them one offending datafile with the logs to reproduce the issue seems difficult. I break my "data" and "index" datafiles into 8 GB apiece, and currently have about 9 of them (last I looked), so at a minimum we're probably talking about 8-10 GB to reproduce this on their end.

June 16, 2009

Locked Statistics

Filed under: Oracle — Tags: , , , — kkempf @ 8:12 am

I run the concurrent request Gather Schema Statistics weekly with default settings, against the schema ALL. The past two weeks (since shortly after the 11g RDBMS upgrade) the job has failed; I called the first one a fluke, but by the second failure I needed to investigate. When I got in on Monday morning, the job was still running, with large "concurrent program" spikes, and the process was tied up on dictionary lock waits. A little digging on Metalink turned up Note 732082.1, which basically says that, for whatever reason, you may need to unlock schema stats before you run the gather job. In the end, I just scheduled a pl/sql procedure to run before the stats job. I suppose I could have made it a request group for running statistics gathering; that may have been the "best" answer, but I didn't want to pollute canned functionality in the apps. I also have to admit I don't know whether it will ever surface again, or why it happened this time. The truth is I just don't want to have to think about it. 10-20 lines of code later and I'm covered:

create or replace procedure unlock_schema_stats
is
  -- schemas with locked statistics, excluding those we never want to touch
  cursor c1 is
    select distinct owner ownername
      from dba_tab_statistics
     where stattype_locked is not null
       and owner not in ('SYSTEM', 'APPLSYS', 'SYS');
begin
  for c1_rec in c1 loop
    dbms_stats.unlock_schema_stats(c1_rec.ownername);
  end loop;
end unlock_schema_stats;
/
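To actually run it ahead of the weekly stats job, something like the following would do. This is only a sketch: the job name, owning schema, and timing below are made up for illustration, and you could just as easily hang it off your existing scheduling.

begin
  dbms_scheduler.create_job(
    job_name        => 'UNLOCK_STATS_WEEKLY',       -- hypothetical name
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'APPS.UNLOCK_SCHEMA_STATS',  -- assumes the procedure above lives in APPS
    repeat_interval => 'FREQ=WEEKLY; BYDAY=SAT; BYHOUR=22',
    enabled         => TRUE);
end;
/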

June 12, 2009

Logical Database Corruption in 11i

Filed under: Oracle — Tags: , — kkempf @ 3:31 pm

It’s a long one… but hopefully you may find it amusing, if not helpful…

I rather naively thought that since I used RMAN for backups, I was safe from database corruption issues. Or, to be more precise, if I hit an issue, I’d know about it because my backup would alert me, and I could perform an RMAN block level restore. Wrong. RMAN covers you for physical corruption, not logical. To check for logical corruption, you need to use the backup validate check logical (database||datafile X) syntax. So there’s my first lesson, and as a result, I now have an Enterprise Manager job which runs this command once a week so I’m not blindsided next time.

What is more interesting is the solution. Logical corruption is, in essence, a block that is physically readable but internally inconsistent, so the data inside it is all suspect. It's not that it can't be read, it's just "wrong". From what I read, power issues to the SAN seem the most likely culprit, though it's hard to rule anything out. Regardless, it happened silently, and long enough ago that I didn't have backups available to roll through to fix it with RMAN (due to my RMAN retention policy). This raises an interesting question: how long do you keep an RMAN backup? Really, if you're reading this I'd be interested to know, as it's always a balancing act between space and recoverability. I have my retention policy set to 31 days, after which I figure we're more in a "we need to fix it in place" situation than a "roll back and roll forward" situation.

So from here I'll walk you through my solution, though I should add that support was utterly useless. I earnestly believed this had to be something they would be good at: a textbook case where core database support would have me patched up and un-corrupted (is that a word?) in no time. Here's what I ended up doing, after playing with it for 4 days to better understand how logical corruption behaves.

1. RMAN told me I had problems

rman target / nocatalog
Recovery Manager: Release 11.1.0.7.0 – Production on Fri Jun 12 10:32:35 2009
Copyright (c) 1982, 2007, Oracle. All rights reserved.
connected to target database: PROD (DBID=4098162468)
using target database control file instead of recovery catalog
RMAN> backup validate check logical datafile 6;
Starting backup at 12-JUN-09
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: SID=935 device type=SBT_TAPE
channel ORA_SBT_TAPE_1: NetVault Oracle APM v.5.5.11
channel ORA_SBT_TAPE_1: starting full datafile backup set
channel ORA_SBT_TAPE_1: specifying datafile(s) in backup set
input datafile file number=00006 name=/u05/appprod/proddata/ctxd01.dbf
channel ORA_SBT_TAPE_1: backup set complete, elapsed time: 00:00:15
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
— —– ——– ——- ——– ——–
6 FAILED 0 23606 65536 5974945621830
File Name: /u05/appprod/proddata/ctxd01.dbf
Block Type Blocks Failing Blocks Processed
—— ——– ————–
Data 0 40343
Index 23 1410
Other 0 177
validate found one or more corrupt blocks
See trace file /u01/appprod/oracle/proddb/11.1.0/log/diag/rdbms/prod/PROD/trace/PROD_ora_20982.trc for details
Finished backup at 12-JUN-09

At this point, EM sends an alert if you have default thresholds set for metrics; basically, if there are any rows in v$database_block_corruption it notifies you (or, if you don't have email preferences set, it at least marks it as a critical error):

Number of corrupt data blocks is 27.:Jun 12, 2009 3:32:32 PM EDT

This is a really gut-sinking, oh sh*t moment when it happens on your production system. At least it was for me.

2. Verified dbv agreed

dbv file=ctxd01.dbf

Page 17090 failed with check code 6401
Block Checking: DBA = 25182923, Block Type = KTB-managed data block
**** row 0: row length 23 past end of block
**** row 0: row skipped so other stats may be wrong
—- end index block validation
Page 17099 failed with check code 6401

DBVERIFY – Verification complete

Total Pages Examined : 131072
Total Pages Processed (Data) : 40018
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 1410
Total Pages Failing (Index): 23
Total Pages Processed (Other): 177
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 89467
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Total Pages Encrypted : 0
Highest block SCN : 627787749 (1391.627787749)

3. Determined affected blocks

select * from v$database_block_corruption;
FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTIO
—— —— —— ———— ——-
6 17099 1 5.9712E+12 CORRUPT
6 17089 2 5.9712E+12 CORRUPT
6 17084 1 5.9712E+12 CORRUPT
6 17082 1 5.9712E+12 CORRUPT
6 17068 1 5.9712E+12 CORRUPT
6 17048 1 5.9712E+12 CORRUPT
6 17035 1 5.9712E+12 CORRUPT
6 17032 1 5.9735E+12 CORRUPT
6 8009 1 5.9712E+12 CORRUPT
6 8004 1 5.9712E+12 CORRUPT
6 7958 1 5.9712E+12 CORRUPT
6 3924 1 5.9712E+12 CORRUPT
6 3913 1 5.9712E+12 CORRUPT
6 592 1 5.9712E+12 CORRUPT
6 590 1 5.9712E+12 CORRUPT
6 423 1 5.9712E+12 CORRUPT
6 417 1 5.9712E+12 CORRUPT
6 337 3 5.9712E+12 CORRUPT
6 297 1 5.9712E+12 CORRUPT
6 235 1 5.9712E+12 CORRUPT

4. Determined what segment(s) were affected

select tablespace_name, segment_type, owner, segment_name
from dba_extents where file_id = 6 and 297 between block_id and block_id + blocks -1;
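A more general version of that query (just a sketch; it joins straight to v$database_block_corruption so you don't have to plug in each block number by hand, checking the first block of each corrupt range) maps everything in one pass:

select c.file#, c.block#, c.blocks, e.owner, e.segment_name, e.segment_type
  from v$database_block_corruption c, dba_extents e
 where e.file_id = c.file#
   and c.block# between e.block_id and e.block_id + e.blocks - 1
 order by c.file#, c.block#;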

All the corrupted blocks were in the ctxsys.dr$pending table. Specifically, this is an index-organized table (IOT), and the corruption was in its primary key index (CTXSYS.SYS_IOT_TOP_218818), which, for an IOT, is effectively the table itself. DOH! You can't disable PK constraints on IOTs (ORA-25188: cannot drop/disable/defer the primary key constraint for index-organized tables or sorted hash cluster), nor can you do an alter index rebuild (ORA-28650: Primary index on an IOT cannot be rebuilt).

CTXSYS is the Oracle Text owner; the dr$pending table is essentially a queue table, a "to do" list of Oracle Text index updates. Why you don't just update the index is beyond me, but lucky for me, this table already feels very non-critical. Back in the day, Oracle had you install Oracle Text (interMedia Text, ConText; I think these are all name changes of the same product) in its own tablespace. In my case it's called CTXD.

(Side note, short rant. While I waited for support to not help me for days, I began to explore the possibility of re-installing Oracle Text via a Metalink note. Holy Cats Batman! This is the worst documented process I’ve ever seen from Oracle. I was following Metalink Docs 579601.1 and 312640.1, the latter of which is not just hard to follow, but outright wrong in so many places, it literally took me 2 days to re-create a dozen text indexes in a test environment. If you ever get to the point where you are forced to re-install Oracle Text with 11i, quit your job. Or do an RMAN drop database including backups noprompt. Because trust me, you will go mad trying to follow their document)

5. What is DR$PENDING?

Since an IOT PK index is virtually indestructible and unalterable, and re-creating Oracle Text was not an option I wanted to pursue, I began trying to understand what was actually in the ctxsys.dr$pending table (which happened to contain about 115k rows). When I realized it was a queue of updates to Oracle Text indexes, I figured there had to be a way to empty it. I found some relevant help on Metalink, and by running:

select u.username , i.idx_name from dr$index i, dba_users u where u.user_id=i.idx_owner# and idx_id in (select pnd_cid from dr$pending);

I was able to see what was in the table:

USERNAME IDX_NAME
—————————- —————————-
APPLSYS FND_LOBS_CTX
JTF JTF_NOTES_TL_C1

With this information, I could run

exec ctx_ddl.sync_index('JTF.JTF_NOTES_TL_C1');
exec ctx_ddl.sync_index('APPLSYS.FND_LOBS_CTX');

and after it churned a while and beat up the SAN pretty good, I had an empty IOT:

select count (*) from dr$pending;
COUNT
———-
0

6. Recreate the table

After confirming the table was empty, I watched it for a while to see how quickly/frequently it refilled. It seemed not to be an issue (it can go minutes or more without new entries), so I created a script to drop dr$pending and re-create it from the DDL in the data dictionary, this time putting it in the SYSAUX tablespace.
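The DDL itself can be pulled straight from the dictionary with dbms_metadata; a minimal sketch (the output still needs its tablespace clause edited to SYSAUX before you drop the table and re-run the create):

set long 100000 pagesize 0
select dbms_metadata.get_ddl('TABLE', 'DR$PENDING', 'CTXSYS') from dual;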

7. Still corrupt? How can that be?

I thought I had a win, but when I ran RMAN backup validate check logical datafile 6 again, it still showed the same corruption as before. Not understanding how this could be possible, I ran the SQL in step 4 above to determine which segments were affected. It came back with nothing! So I had corrupted blocks in my tablespace, but none of them were tied to any segment (object). Now we're getting somewhere…

8. Fill ‘er up

I was pretty pleased with my progress, and was thinking along the lines of creating a pl/sql cursor loop to generate SQL to move all the objects in the CTXD tablespace to SYSAUX. Then I could drop the CTXD tablespace and, presumably with it, the corruption. Then I saw there were LOB segments and LOB indexes in there, and I knew that wasn't going to make the move any easier. Then I was talking to a colleague, and he said, "Why not just fill up the tablespace with garbage and, in doing so, overwrite the corrupt blocks?" Turns out, this is amazingly effective!

a. create table ctxsys.fcr as select * from apps.fnd_concurrent_requests;

b. insert into ctxsys.fcr select * from apps.fnd_concurrent_requests where rownum < 10000;

c. repeat b until the tablespace is full, decrementing the 10000 number as required to keep it going (a rough PL/SQL sketch of this loop follows the list)

d. drop table ctxsys.fcr;
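For what it's worth, steps b and c can be wrapped in a quick PL/SQL block. This is only a sketch: it simply stops at the first ORA-01653 rather than decrementing the batch size, so it fills CTXD approximately rather than to the very last block.

declare
  e_no_space exception;
  pragma exception_init(e_no_space, -1653); -- ORA-01653: unable to extend table
begin
  loop
    -- step b, repeated until the CTXD tablespace can no longer extend
    insert into ctxsys.fcr
      select * from apps.fnd_concurrent_requests where rownum < 10000;
    commit;
  end loop;
exception
  when e_no_space then
    null; -- earlier batches are already committed; go on to step d and drop ctxsys.fcr
end;
/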

Perhaps Oracle has a procedure or tool to do this kind of thing in a more elegant way (truncate?), but I was 100% sure that if I asked my analyst this, I wouldn't get anywhere. It makes sense that there could be a procedure to say, in effect, if you find a logically corrupt block, but it's not associated with any objects, re-write the block to look like any normal unused block in a tablespace.

9. Lessons Learned

a. If you’re not running RMAN validate with check logical… do it! This is the simple script I now run weekly against all my production databases:

run {
allocate channel d1 type disk;
backup validate check logical database plus archivelog;
release channel d1;
}

b. If you are using Oracle Text with 11i, and your dr$pending table has a zillion rows in it, you can run ctx_ddl.sync_index to process those rows. I will probably create a pl/sql job which cursor-loops through the schema and index names in the table (#5, above), then runs ctx_ddl.sync_index against that cursor result; a rough sketch is below. Seems like once a week would be sufficient, though admittedly, I will need to better understand why I had rows in there that weren't being processed. Based on Note 382809.1, I believe this is because the concurrent requests that would do this automatically aren't scheduled.
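Here's roughly what that job might look like, built directly on the query from step 5. It's a sketch only, and assumes the owner has select on the ctxsys dr$ tables and execute on ctx_ddl:

declare
  -- same query as step 5: which Text indexes have entries queued in dr$pending?
  cursor c_pending is
    select distinct u.username, i.idx_name
      from ctxsys.dr$index i, dba_users u
     where u.user_id = i.idx_owner#
       and i.idx_id in (select pnd_cid from ctxsys.dr$pending);
begin
  for r in c_pending loop
    ctx_ddl.sync_index(r.username || '.' || r.idx_name);
  end loop;
end;
/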

June 5, 2009

Has support really become that bad?

Filed under: Oracle — kkempf @ 3:39 pm

I work regularly with a large software corporation on support cases. We pay a small fortune annually for the privilege of being able to open a case and find resolution on technical issues. I have a major work in progress related to a DR issue, and I opened a case to get a second set of eyes looking over solutions. As per the norm, I researched the details pretty thoroughly and provided them to the vendor. Here's their expert response:

“A colleague gave me this script a while back and I haven’t had an opportunity to test it but it should…”

Really? Your buddy gave you the script and even though you’re the expert, you’ve never used it, but you think it should do something? What’s next? “I found this on Google when I searched on your issue and it should be safe to run – let me know what it tells you”?

We’ve stooped to a new low.
