Kevin Kempf's Blog

January 13, 2011

No more OUI/GUI Agent install for 11.1 on Windows

Filed under: 11g, Enterprise Manager, Windows 2008 — kkempf @ 9:57 am

Oil and Water

Looks like a great Enterprise Solution!

Windows Server and Oracle databases go together like oil and water.  Everything about administering an Oracle database on Windows is annoying.  From the command line interface to having to start services before I can start a database, it fails in many ways.  Yet all of this is just my opinion.  What I found out yesterday is a new, substantial fact and a good reason to hate Windows even more.

Server 2008

We have a non-Oracle application which has an Oracle back end database.  It happens to be certified (only) on Windows.  It’s the only Windows RDBMS server I have to administer, so I suppose I should be grateful.  Still, as a result of an upgrade, we were able to move it to 64-bit Oracle 11.1 for Windows Server 2008.  All in all, it was a nice refresh/update of the technology stack as it had previously been running 32-bit Oracle 10.2 on Windows Server 2003.

Agent Woes

So I go to pull the 11.1 agent from OTN/Metalink/MOS/Oracle/World’s Slowest Support Site and am amused to find it’s now 500+ MB!  Seriously?  It’s an agent!  Back in the dark ages, under 9i, I swear they were like 40 MB.  I put the thing in my $OH/sysman/agent_download/11.1.0.1.0 directory, and sure enough, it shows up as an option under deployments in EM.  I go through the process outlined here to push the install, and it fails because SSH isn’t running on the Windows host.  Who runs SSH on Windows?  I know it’s technically possible, but seriously, who expects that?  Needless to say, I’m annoyed, but I’m not about to go try to get SSH running on a Windows server I don’t want to administer.  So I push the agent download .zip file to the host and run the installer (tried both setup.exe and installer.exe) only to get this error:

Obviously

Time to contact support

I actually did try to create a response file and run it from the Win 2008 CLI.  It failed for an unknown reason, telling me to check the logs.  Of course, the logs weren’t in the directory I was in, and I was beyond annoyed at this point. Reluctantly, I opened an SR to see what I was doing wrong with the GUI install.  It turns out, nothing.  The analyst confirmed that in 11.1, the OUI/GUI installer has been removed.

One step forward, two steps back

Step back and ask yourself, is this a step forward?  Honestly, how many people run SSH on a Windows server?  My only other recourse is to mock up some cryptic response file (in Windows, no less, with notepad!) and then use a command line interface to manually install the agent (silently!).  Seriously, Oracle, this is just plain stupid.  There’s like 4 parameters required in the old GUI: where do you want to install it, what host and port is Grid Control installed on, and what’s the dbsnmp password?  Why not just leave this in the GUI?  Whoever made this call has obviously never worked in the real world.

My Solution

After berating the analyst, I installed the 10.2.0.5 agent (via the OUI GUI) to monitor my 11.1 RDBMS.  Makes more sense than Oracle’s stance.

July 23, 2010

Securing EM 11g with your own SSL cert

Filed under: Enterprise Manager, Oracle — kkempf @ 6:24 pm

A bit of a teaser

I recently had a miserable SR with Oracle regarding how to replace their “canned” certificate (part of the 11g install) with my own, real domain certificate.  In the end, I was told that unless I was using SSO, it was not supported.

This answer didn’t sit right with me, but I would never call SSL and certificates one of my strengths.  When my sysadmin/Linux counterpart heard this answer, he spent the day trying to prove Oracle wrong.  I sat shotgun for the ride.

The Short Answer

I’m sure it’s completely unsupported, but it’s not an external product; it’s only used within IT so it felt safe enough to “play” in.  In the end, he figured out how to get our certificate and key into a p12 wallet file via a pki tool and landed them on the server (.sso and .p12 file) where the existing (Oracle HTTP server) files were.  Yeah, that’s right.  EM 11g runs on Weblogic, which only serves as a container for the old Oracle HTTP Server.  In other words, it’s iAS running inside Weblogic.  The actual certificates are under an OHS directory.  Once replaced, the main website was using “our” certificate (https://hostname:1159/em).  Then the agents stopped communicating with the OMS; the temporary fix was to go do an emctl unsecure agent on all the servers.  Not ideal, we’ll work more on that later.  Consider this a “post in progress”.  Am I the only one who finds it ridiculous that Oracle “doesn’t support” using a real cert for EM?

The Messages I no longer see

Chrome:
The site’s security certificate is not trusted!
You attempted to reach hostname.com, but the server presented a certificate issued by an entity that is not trusted by your computer’s operating system. This may mean that the server has generated its own security credentials, which Google Chrome cannot rely on for identity information, or an attacker may be trying to intercept your communications. You should not proceed, especially if you have never seen this warning before for this site.

IE:
There is a problem with this website’s security certificate.

The security certificate presented by this website was not issued by a trusted certificate authority.

Security certificate problems may indicate an attempt to fool you or intercept any data you send to the server.
We recommend that you close this webpage and do not continue to this website.
Click here to close this webpage.
Continue to this website (not recommended).

Overview of the solution

If you can find a way to get your certificate, key and any intermediate certificates into a standard keystore, you’re in good shape.  We had a problem because apparently Oracle doesn’t like wildcards in the certificate name (*.domainname.com).  So the admin used the pki tool to create a p12, then imported it into Oracle Wallet Manager.  From here, you can select the Wallet menu option, then “Auto Logon”.  Now when you save it, you get the .sso file as well.  Land the cwallet.sso and the ewallet.p12 file in gc_inst/WebTierIH1/config/OHS/ohs1/keystores/upload and restart services.  Remember to unsecure your agents; I’ll update the post if we figure out what’s messing those up.
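As a rough sketch of the p12 step: the post doesn’t say which pki tool was used, so the example below assumes openssl as a stand-in.  The function name make_p12, the file names and the password handling are all hypothetical.  The resulting file would still need to go through Oracle Wallet Manager (Auto Logon) to produce the cwallet.sso, as described above.

```shell
#!/bin/bash
# Sketch: bundle a certificate, private key and optional intermediate chain
# into a PKCS#12 file. Assumption: openssl stands in for the unnamed "pki tool".
# make_p12 is a hypothetical helper name, not an Oracle utility.
make_p12() {
    local cert="$1" key="$2" out="$3" pass="$4" chain="${5:-}"
    if [ -n "$chain" ]; then
        # -certfile adds the intermediate certificates to the bundle
        openssl pkcs12 -export -in "$cert" -inkey "$key" -certfile "$chain" \
            -out "$out" -passout "pass:$pass"
    else
        openssl pkcs12 -export -in "$cert" -inkey "$key" \
            -out "$out" -passout "pass:$pass"
    fi
}

# Example (hypothetical file names; a password on the command line is visible
# to other users on the host, so treat this as illustration only):
# make_p12 em.crt em.key ewallet.p12 'wallet_password' intermediates.pem
```

Naming the output ewallet.p12 simply matches the file name the post lands in the keystores directory.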

Follow Up

It appears that the fix which we stumbled upon can be mostly accomplished from the command line with emctl (though not with a wildcard certificate; see below).

To secure the console, use emctl secure console and point it to your own certificate wallet directory.

To secure the oms, use emctl secure oms and point it to your wallet and trusted cert location.  There is a catch to this, however.  You cannot use a *.domainname.com (wildcard) cert for this, as the agents pull this certificate and attempt to connect to *.domainname.com (instead of a real server name).  In our case, if we had a certificate with the exact server EM is running on, this would have worked as well.

Another oddity is the number of ports opened and said to be in use (which may or may not be the case).  In any case, I recommend typing emctl status oms -details to determine what ports are actually available for login.  They may or may not correspond to what WebLogic says.  In our case, https for our own certificate changed to 4444 by default.  We overrode ssl.conf to change it to 8443, but it’s still a bit of conFUSION as to why WebLogic rarely agrees with EM on what is running where.

June 15, 2010

Deploying Agents via EM Grid Control 11g

Filed under: 11g, Enterprise Manager — kkempf @ 10:06 am

Typical caliber of Oracle agents

EM as I see it (for what it’s worth)

After upgrading to EM Grid Control 11g, I faced the daunting (read: boring) task of updating agents on about 12 hosts from 10.2.0.x to 11.1.0.1.  This action is not required, mind you, but it’s never a bad idea to keep the agents up to date.

I’d been hearing conference after conference about how people were managing their enterprises with EM and how it had become a vital cog in their day to day activity.  I find EM useful, even important, but not vital.  Life would go on if the grid control server blew up, it would just be harder.

There are so many buttons, hyperlinks and tabs I chuckle at in EM.  Meaning the ones I would never hit.  Innocuous looking buttons and tabs that allegedly perform some of the most complicated tasks in Oracle.  Does anyone actually press these things, and use EM to perform the task?  If so I’d love to hear about it.  Things like:

  • Convert to Cluster Database
  • Perform Recovery
  • Clone Database
  • Failover to standby database
  • Migrate to ASM

Regardless, the point is that I decided to give Grid Control 11g the acid test: can it actually upgrade my agents remotely?  More to the point: can I do it without reading the documentation, applying 4 interim patches, and opening SR’s?

Staging Agent Installs

About the only preparation I had to make was to stage the agent installs on the EM server.  This is really easy; simply navigate to $OMS_HOME/sysman/agent_download (/u01/oem/Middleware/oms11g/sysman/agent_download in my case) and make directories to land the agents in as necessary.  You can download the agents here.

For example, under $OMS_HOME/sysman/agent_download, I created two directories: 11.1.0.1.0 and 10.2.0.5.0 for the two possible agent versions I may deploy (Linux and Windows, respectively).  Under 11.1.0.1.0 I had 2 subdirectories: linux and linux_x64; under 10.2.0.5.0 I had 1 subdirectory: win32.
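The staging layout above can be sketched in a few lines of shell.  This is only an illustration of the directory structure described in this post; stage_agent_dirs is a hypothetical helper name, and the agent downloads themselves still have to be landed in the matching directories.

```shell
#!/bin/bash
# Sketch: create the agent staging directories described above.
# stage_agent_dirs is a hypothetical helper, not an Oracle-supplied tool.
stage_agent_dirs() {
    local base="$1"    # e.g. $OMS_HOME/sysman/agent_download
    mkdir -p "$base/11.1.0.1.0/linux" \
             "$base/11.1.0.1.0/linux_x64" \
             "$base/10.2.0.5.0/win32"
}

# Example (adjust the path for your environment):
# stage_agent_dirs "$OMS_HOME/sysman/agent_download"
```

Using mkdir -p means the function can be re-run safely if some of the directories already exist.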

Navigation

Head to the Deployments tab in EM; it will look like this:

Deployment Tab

Choose an action

In this example, I’m doing an Upgrade.  I’ve also tested Fresh Install.  I can’t speak to the other options (Clone, Shared Agent), quite frankly they look unappealing to me.

Fresh or Upgrade?

Fill in the blanks

Of course, you need to select the right version and platform from the drop down box.  What’s available in the drop down box is dependent upon what you staged in the $OMS_HOME in the first step.

One of my favorite screens

Seriously, Oracle, enough!  Stop asking me for my email address!  If you don’t know who I am and what products I have installed by now… too bad.

I always feel like somebody's watching me

Really, I’m sure

Why don't you leave me alone?

Install in progress

This screen starts with some validations against things like attempting the ssh connection username/password, checking inventory permissions, directory structures, etc.  If there’s anything amiss, it will tell you so.

If you watch this portion from the OS (Linux) you will see that the old agent doesn’t shut down until late in the process; there is virtually no gap between the time the old agent goes down and the new one comes up.

Working

Installation Complete

Conclusion

In short: it worked!  Oracle, you don’t do it often, but in this case, I’ll give credit where credit is due and say that you’ve made my life easier this time.  It not only worked, it did it faster, better and easier than I would have been able to do it manually, and that’s all I can ask.  I highly recommend trying this method out for yourself.

It is worth mentioning that I had a Windows host on 10.2.0.4 which I decided to upgrade, out of curiosity, and it worked successfully, in the same way as Linux.

May 8, 2010

EM 11g Grid Control Install

Filed under: Enterprise Manager, Oracle — kkempf @ 7:46 pm

Preparation

For no reason other than I could, I recently decided to undertake installing the new Oracle Enterprise Manager 11g Grid Control (technically, upgrade my 10.2.0.5 environment).   For those not in the know, it was released about 2 weeks ago, and one major change is that it utilizes Weblogic server instead of the old iAS (this is the Fusion model).   In fact, there is no application server included in the EM 11g download; the installer assumes you have a Middleware home with the appropriate version of Weblogic server already installed.   The first thing I did in preparation for the upgrade  was to contact our Oracle techstack sales guy and ensure that I was allowed to install Weblogic server.  While I’m fully licensed to run EM 10.2.0.5 with tuning and diagnostics packs, I didn’t want to make any assumptions about Oracle giving up a new application server for free.  After confirming my legality from a license perspective, I began to update our internal wiki with all of the pieces of EM which would be hard to remember/replace in the event that they were lost.  This included a list of current EM users and their access, scheduled jobs, and user-defined metrics.  Finally, I shut everything down and made a VMWare snapshot of the disk, so that I could revert painlessly, if necessary.

Weblogic Server

The next step was to install Weblogic server, which is a pre-requisite to running the EM 11g installer.  In reality, you could install this web server anytime, whether EM 10g was running or not.  I went to pull it from OTN, and noticed a few things.  First, it appears that Fusion Middleware 11gR1 (10.3.3 – in typical Oracle versioning tradition) has been released very recently.  I was only interested in what I knew was certified, Fusion Middleware 11g (10.3.2 – couldn’t Oracle put one person in charge of making sure that the technical versions match the product names, and that the versions were generally aligned?).  You can find the download page here.  I was looking for linux x86_64, which isn’t on their list, so I pulled the generic platform version.  There’s a small net installer (stub file which goes out and grabs the install later) or a package installer (traditional, everything in there version). Grab whatever floats your boat.  When I grab the generic version, it requires I have my own JDK installed, so to invoke the installer it looks something like this:

java -jar wls1032_generic.jar

The familiar Oracle looking installer starts; I’ll use screenshots from here on as it will probably be easier to follow, just go left to right:

1. Initial splash screen

2. Give us the information we already have!

My personal favorite after leaving email address blank:

Are you sure you want to remain clueless??

You didn't give us your email! Are you really really sure we can't spam you?

OK, now back to the installation… I missed the screenshot, but somewhere you are asked for 2 passwords.  Make sure you jot them down, as you’ll need them later to administer the web server and also to do the actual EM install.

3. Where is your JDK?

4. Where do you want to put the Weblogic App Server?

5. Default components

6. Installer at Work

.

7. Complete (Uncheck Quickstart)

8. Quickstart - If you get to this screen... cancel it and exit!

Enterprise Manager 11g Installation

This installation impressed me; it was straightforward and worked.  I do have to ding it on a few things.  First, Oracle decided that EM 11g would run by default as https using an SSL cert it generates during the install.  So many problems there.  First, our proxy server (where all web traffic goes, by default, even if  redirected back internally) has a security mechanism set up to not allow SSL on non-standard (443, 8443) ports.  Oracle chose something logical: port 1159.  Second, we have an internal certificate I could use, which would be far more legitimate.  How about asking me if I have something for it to chew on, instead of having it generate some garbage cert which I have to figure out how to replace?

I ran into some RDBMS parameter issues.  EM 11g requires you to have the repository running 11.1 or 11.2, and there’s even some patches they want in place before you do the upgrade.   Well worth reviewing the prerequisites and documentation.  Before you run the installer, issue an

alter system set job_queue_processes=0;

and revert it upon completion to its original value.

Finally, I ran into a problem early on when the installer detected the Oracle Applications Management Pack in my EM 10g inventory.  At some point in the past, I’d installed it with the intention of seeing what it brought to the table.  That’s as far as I ever got, never even configured it, but the installer basically says you can’t do an upgrade with that pack in place.  You have to do a full install.  Ick.  So I uninstalled it via OUI from the OMS home and it happily proceeded next time I ran the installer.  I guess if you’re using that APM you’re out of luck.

You can grab EM 11g from OTN here; once extracted & staged you just use the old standby:

runInstaller

1. Welcome to our latest product! Can we have your email please?

2. We really need to know more about you. Can we have your username and password to the world's slowest support site?

3. I chose to upgrade 10.2.0.5

4. Hey! I found your old stuff!

.

5. I see your listener - whats the sys password?

6. Hey Bozo! Didn't you read the upgrade guide? There's some mandatory parameters you need to fix!

7. I found your Middleware home. Where do you want to install EM 11g?

8. What were your Weblogic passwords?

.

.

Did you notice in step 8, it’s adding a tablespace?

9. Do you want to change ports that you don't even understand the use of?

10. Ready to launch!

.

.

11. Chugging along for about an hour

12. The root.sh we all know and love, now available in allroot.sh!

.

13. Read the fine print; it has pertinent URLs and information!

14. More fine print

.

.

.

.

15. Do you agree to tell the truth, the whole truth, and nothing but the truth?

.

.

.

.

.

.

Success & First Impressions

First of all, it works, the installer didn’t error out too much, and after getting past the ssl port/cert issues (use IE it has the best error messages!) I have to say my first impression was that it did everything right.  My users were all still there, all my targets were there, my user-defined metrics survived, and my jobs were all running.  It even emailed me an alert I was subscribed to when I shut down a scrub database.  No tweaks.  The graphics look a little cleaner.  I am guessing that the java pages were compiled on the fly, as it was painfully slow to do anything the first time.  After that, it was noticeably faster than EM 10g.  Must be all the Fusion going on.

Here’s what the screen looks like:

A new look, but familiar in most ways.

.
.
.
.

.

.

.

.

.

.

Starting and Stopping

This has changed a little bit, but I thought I’d just copy my start/stop scripts in here wholesale and you can tweak them for your environment:

Start:

#!/bin/bash
# Start the OMS first, then the local agent, then report OMS status.
ORACLE_HOME=/u01/oem/Middleware/oms11g
AGENT_HOME=/u01/oem/Middleware/agent11g
$ORACLE_HOME/bin/emctl start oms
$AGENT_HOME/bin/emctl start agent
$ORACLE_HOME/bin/emctl status oms -detail

Stop:

#!/bin/bash
# Stop the OMS and its related processes (-all), then the local agent.
ORACLE_HOME=/u01/oem/Middleware/oms11g
AGENT_HOME=/u01/oem/Middleware/agent11g
$ORACLE_HOME/bin/emctl stop oms -all
$AGENT_HOME/bin/emctl stop agent

Final Notes

What’s up with all the periods (.) all over this post? I fought WordPress the whole time to keep my screenshots from overlapping section headers and each other. Eventually I gave up and used periods to occupy white space as a placeholder.
A special note of acknowledgment to sysadmin guy who made me understand why no browser could find the login page (the non-standard SSL port issue noted above).   I literally did the install twice, thinking I’d screwed something up or missed some step.  In the end, I probably had it right the first time, but didn’t see any way to reach the login page so reverted and reinstalled all pieces.
The Weblogic server admin screens have about a million tabs, checkboxes, radio buttons and text-entry areas for parameters.  There is a learning curve there:

There's tabs on the tabs!

Addendum

I feel compelled to add an addendum to this posting because of a few things I’ve noticed after running 11g a few days.  First, it consumes more memory; we had to throw another gig at it (database footprint unchanged).  Second, after rebooting the host, we noticed it trying to start processes; it slipped a gcstartup file in the init.d directory.  This script basically checks your oratab file and tries to start whatever homes it finds in there.  Fine and good (I guess)  but really of no value for me.  Having the Weblogic server and the OMS “running” before the database isn’t really useful to me.

Useful URLs and other errata

main EM login screens:

Weblogic console

It appears that some of the logs aren’t automatically rotated in $ORACLE_INSTANCE/WebTierIH1/diagnostics/logs/OHS/ohs1; I’d double check it if I were you.  My mod_wl_ohs.log was 500 MB.  I fixed this by gzipping the old mod_wl_ohs.log at the end of my stop enterprise manager script; every time I shut down EM, the script gzips it, and once the archive ages past 2 weeks it gets caught by the same find-based purge I use for the other logs.

In addition, you probably want to consider a cron job to delete old access/error/em_upload_https logs; something like this is pretty vanilla in the crontab (delete logs older than about 2 weeks):

/usr/bin/find /u01/oem/gc_inst/WebTierIH1/diagnostics/logs/OHS/ohs1 -type f -mtime +14 -exec rm -f {} \;
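The gzip step mentioned above for the stop script might look roughly like this.  It is a sketch under assumptions: archive_ohs_log is a hypothetical helper name, the timestamp suffix is my own choice so repeated shutdowns don’t overwrite each other, and it should only run after OHS has stopped writing to the log.

```shell
#!/bin/bash
# Sketch: rename the current mod_wl_ohs.log with a timestamp suffix, then gzip it.
# archive_ohs_log is a hypothetical helper; call it at the end of the stop script,
# after the OMS (and with it OHS) has been shut down.
archive_ohs_log() {
    local log_dir="$1"
    local log="$log_dir/mod_wl_ohs.log"
    local stamp
    [ -f "$log" ] || return 0            # nothing to archive
    stamp=$(date +%Y%m%d%H%M%S)
    mv "$log" "$log.$stamp" && gzip "$log.$stamp"
}

# Example (path taken from the find command above):
# archive_ohs_log /u01/oem/gc_inst/WebTierIH1/diagnostics/logs/OHS/ohs1
```

Because the archives stay in the same directory, an age-based purge on that directory will eventually remove them as well.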

September 29, 2009

Using Custom Metrics in Enterprise Manager to Monitor Applications 11i (Part II)

Filed under: EM to monitor 11i, Enterprise Manager, Oracle — kkempf @ 3:26 pm

As I began to look at what was involved in monitoring custom metrics, I forgot how many steps it took to actually set up email notifications.   Therefore, I’m going to devote this Part II to notifications, and spell out some cool metrics in (yet) another future entry.  I should note before I begin here, that I’m not trying to be secretive about anything in the screen shots; if anything is blanked out (using the GIMP, by the way) in the images below, it’s most likely because it has personal or company information in it.

The first step in receiving email notification is setting up an administrator account.  On the top right in EM, above the blue tabs, you need to click on Setup.

Now click on Administrators on the bar on the left, and then the Create button.  This should get you to here:

Create an Administrator Account

Once your account is set up, you should confirm that your global SMTP settings are in place.  Click on Notification Methods on the bar on the left side of the screen

SMTP Setup

Note under Outgoing Mail (SMTP) Server, you need a fully qualified hostname or IP address of your SMTP server.  The Sender’s E-Mail Address is simply who will appear in the From field of the email you receive.

Now, log out and back in as the user you just created (I’m assuming this is you!).  Click on Preferences in the top right, and General in the bar on the left.  You should see the email address you just entered, as well as a preference for Message Format.  I recommend Short Format as most mobile devices don’t really have enough screen real estate to fit all the information it sends on Long Format.  While you’re here, you might as well click on TEST to ensure your SMTP and email settings are right – this should arrive in your Inbox in short order.

General Preferences

Next, let’s set up the notification schedule.  On the left side, click on Schedule (you should already be under Preferences in the top right).  Select an administrator (you!) with the flashlight, and hit the Change button.  Since there is no schedule, the only button to hit is Define Schedule.  From here, you simply follow the bouncing ball, so to speak.  Rotation frequency is for shops where you may not be on the hook every week for after hours notifications or the like.  In my case, it’s a one-man operation, so I use a weekly rotation beginning “today”.  From the next screen, you need to fill in the hours you will be notified.  Again, in my case, it’s me or me, unless I’m on vacation.  So I just put in 12AM to 12AM and check Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday, then hit the Batch Fill-In button.

schedule definition

Click finish and you’re almost there.  The last thing to configure is what you want to be notified about.  Again from the  Preferences selection in the top right, you now need to click on Rules.  The screen will look like this:

Notification Rules

You will notice that sysman owns the “canned” events, and you can choose to check Subscribe (Send E-mail) on any you wish.  In my case, I care about:

  • Agents Unreachable (indicative of host availability)
  • Database Availability and Critical States
  • Host Availability and Critical States
  • Listener Availability
  • User Defined Metrics

All of these are self explanatory, I believe, except the last.  In this case, I’ve created a category for my User Defined Metrics.  The easiest way to do this is select an existing category (Database Availability and Critical States works well) and click the Create Like button.  Under General call it User Defined Metrics.  Under Availability check what you’re interested in.  I’d recommend everything but blackout begin/end events:

Notification Rule2

The Metrics tab is where you need to find your custom metrics.  Hit the Add button, then search (%User%) or click next (next, next…) for User-Defined Numeric Metric.

Notification Rule3

Once you add this, you can hit the Edit pencil on the right, and select exactly which User-Defined Metrics, on which database, you wish to monitor.  I’d also check Critical, Warning and Clear, just for good form.

Edit Metric

That’s a lot of ground covered!  At this point, you can test your User-Defined Metrics and ensure you receive notification when they fall out of tolerance.  Here’s a “gotcha”: as you add new User-Defined Metrics, they do not automatically get added to the selection in the Metrics screen, above.  Meaning you need to go back afterwards and manually add it/them to the already selected metrics, in order to be informed of a change in state on the new metric(s).

Next time: Some handy User-Defined Metrics with 11i, and how to monitor things at the OS level on the front end, such as the number of forms sessions and the availability of the Xvnc server (required for some reports)…

September 23, 2009

Using Custom Metrics in Enterprise Manager to Monitor Applications 11i (Part I)

Filed under: 11i, EM to monitor 11i, Enterprise Manager, Oracle — kkempf @ 9:26 am

Oracle calls me from time to time, and inevitably every few months I’m asked why I’m not using the Applications Management pack in Enterprise Manager to monitor my 11i environment.  I tell them that I have no compelling reason to pay for something which I feel I can code myself through user-defined metrics.  I’ve found these user-defined metrics to be one of the most powerful and flexible features in EM.

At present, I use 30 custom metrics to monitor what I’ve found to be critical pieces of the ERP in a production environment.  It would be a rather long blog entry to go through all of them, but for purposes of this entry, I’ll cover how to set up a custom metric and how to monitor the Standard Managers.

To begin, you can get to the custom metrics screen by navigating to a database target (in this case, the ERP production database) and scrolling all the way to the bottom.  Click User-Defined Metrics and you will get to a screen like this:

User-Defined Metrics Base Screen

At this point, let’s assume there are no existing User-Defined Metrics, and you wish to create a metric which monitors the number of Standard Managers running.  Click on the create button in the top right, and you get a screen like this:

Screenshot-Oracle Enterprise Manager (KKEMPF) - Create User-Defined Metric - Mozilla Firefox

The bottom part of the screen looks like this:

Screenshot-Oracle Enterprise Manager (KKEMPF) - Create User-Defined Metric - Mozilla Firefox-1

It’s really rather self explanatory; you name your metric, and define whether the sql query you wrote will return number or character.  Single value versus column is well explained in the form.  After you enter your SQL:

select to_number(running_processes) from apps.fnd_concurrent_queues where concurrent_queue_name = 'STANDARD'

and the apps username and password, you can hit the TEST button to confirm you receive expected results.  Finally, you define what the critical (and warning, if desired) thresholds are (in this case, I’m using 12 standard managers so anything less than this is a problem).  My alert message uses %Key% which simply means if this alert is triggered, it will tell me what the current number of standard managers is, and finally I tell it how frequently to poll.  That’s it!  Hit OK and it will show up on your user-defined metrics page now (takes a bit of time for the first query to show up on this screen, but it will eventually, or at least it will show a metric collection error if there is a problem).

Next time: More metrics, and how to receive these user-defined metric alerts in e-mail…

August 17, 2009

PC LOAD LETTER

Filed under: Enterprise Manager, Oracle — kkempf @ 12:38 pm

Checker run found 55 new persistent data failures

This message shows up in EM, in a TEST ERP system, with no further information about the issue except when it occurred. It reminded me of PC LOAD LETTER from the movie Office Space. WTF does that mean? Who is checker? Metalink search is not particularly helpful. So I checked the alert log, and found that there were entries back in June when I had logical tablespace corruption, as well as last week, during the time when I was RMAN cloning this environment. That particular clone failed late in the process, because it wanted to create the duplicate (“B”) online redo logs in a directory that didn’t exist ($OH/dbs/blah/blah/blah some stupid default location). Then the clone sat all night and apparently checker came along and said “Hey! You have persistent data failures!”. Problem is, I’ve since redone the clone and run an RMAN backup validate check logical database without incident. So the best I can tell, this is a remnant of a failed RMAN duplicate. This is reinforced by the only relevant Metalink hit which stated this can occur at times during a create database command under some HPUX bug. Oh well, I just cleared it in EM so I didn’t have to look at it, but found it amusing nonetheless.

July 13, 2009

There’s a reason EM is free…

Filed under: Enterprise Manager, Oracle — kkempf @ 7:46 pm

I’m mostly kidding about the title of this entry; on the whole, I really like Enterprise Manager Grid Control.  It simplifies management of my Oracle databases, backups, and Blackberry notifications when something is amiss.  I don’t use the “pay” 11i pack; I rely on custom written (User-Defined Metrics, or UDM’s) SQL to keep me informed if there is something wrong in the apps.  It’s not that I’m opposed to the management packs (we use, and pay for, diagnostics & tuning, and they’ve saved me a lot of time), I just don’t see what it brings to the table for me besides another annual maintenance fee.

Well it turns out this past weekend, for an as yet undetermined reason, EM stopped collecting information from all agents.  This is really obnoxious, as I didn’t even know I was “blind” to my Oracle databases.  It just silently stopped collecting.  Like a union on strike without the picket line.  It happens that I did some minor maintenance Sunday, which required me to bounce my ERP PROD database, and I was curious how quickly my buffer cache recovered, and how it was behaving.  Well surprise, surprise, all my data is stale as of 11pm Saturday night.

A bit of background; I run EM 10.2.0.5 “Grid Control” on a RH5 Linux x86_64, with about 4GB devoted to the SGA and 2 processors, all in a virtual machine.  There’s nothing unusual about this, it’s always performed fine.

Alright, back to the problem at hand.  I did a little bit of checking, bounced some agents, even bounced the whole EM application server and database, just to be sure it was running normally.  Everything checks out.   The agents upload fine (or at least, think they did, as far as I could tell) and I have no problem doing real-time monitoring of any of my systems.  This is puzzling, and I open the perfunctory SR to see if there’s any intelligent life at home today at Oracle support.   Turns out, no.   The analyst asks me for reasonable files, such as logs from the agent.   But it’s going nowhere fast.  So I reinstall the agent on my PROD system, thinking it may be messed up somehow; this has happened from time to time.  No dice. 

Then the analyst starts asking me to do downright dumb things.  She noticed that there was an error message in the log about one of my custom metrics complaining about a trailing semi-colon at the end of my SQL.  Well this has never hurt in the past, and although I will admit that there was an error, and definitely a problem with one of my custom metrics, I failed to understand how this could have run for months and suddenly caused catastrophic failure Saturday night at 11pm (nobody had done anything to EM all day Saturday).   About this time, I gave up on the analyst, and started doing some real digging on my own. 

I looked in the EM application home, and noticed that there were a ton of .xml files in $ORACLE_HOME/sysman/recv/errors.  This didn’t look right; I didn’t really care about the metrics at this point that were “stuck” so I deleted everything in that directory.  Then I found a note which mentioned running this code as sysman against my EM repository, based on some odd “unavailable partition” errors I saw in the logs:

SQL> exec emd_maintenance.analyze_emd_schema('SYSMAN')
SQL> exec emd_maintenance.partition_maintenance

Wouldn’t you know it, agents start reporting in and things return to normal.
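A backlog in that errors directory is easy to check for from cron.  The following is a minimal sketch, assuming the directory location mentioned above; count_recv_errors, warn_on_backlog and the threshold of 100 are hypothetical and not part of EM.

```shell
#!/bin/bash
# Sketch: count .xml files sitting in the OMS receive-errors directory.
# count_recv_errors is a hypothetical helper, not an Oracle-supplied tool.
count_recv_errors() {
    local dir="$1"    # e.g. $ORACLE_HOME/sysman/recv/errors
    find "$dir" -maxdepth 1 -type f -name '*.xml' | wc -l | tr -d '[:space:]'
}

# Print a warning and return non-zero when the backlog exceeds a limit.
# The default limit of 100 is an arbitrary example value.
warn_on_backlog() {
    local dir="$1" limit="${2:-100}" n
    n=$(count_recv_errors "$dir")
    if [ "$n" -gt "$limit" ]; then
        echo "WARNING: $n .xml files in $dir"
        return 1
    fi
    return 0
}

# Example:
# warn_on_backlog "$ORACLE_HOME/sysman/recv/errors" 100
```

The non-zero return code makes it simple to hook the check into whatever alerting is already in place.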

It’s sad, because when I was a rookie, my Oracle mentor taught me to open a TAR/SR on every issue you couldn’t solve in short order, believing that 2 heads were better than one, and this was the analysts’ specialty.  It turns out, as of late, I’m about 0 for 5 on analysts solving my problems.  I don’t know if they’re overworked, underqualified, or just plain incompetent, but I just don’t have any faith anymore in support analysts.  I do much better searching on my own.   Don’t get me wrong, in most cases without Metalink I couldn’t have figured out the solution, but it pains me to see my company shell out so much money for support when all I really need is Metalink access.  I guess that’s why I get paid.

July 8, 2009

When EM Flips Out

Filed under: Bugs, Enterprise Manager, Oracle — kkempf @ 7:35 am

For no apparent reason, yesterday afternoon about 2pm Enterprise Manager went haywire. It was using 100% of 1 CPU, and causing considerable I/O (log writer). The truth is, it doesn’t take much to kill a 1-CPU virtual machine, but this was without precedent.
EM1 EM2

After some digging, the offending SQL was:
SELECT execution_id
FROM MGMT_JOB_EXEC_SUMMARY e, MGMT_JOB j
WHERE e.job_id = j.job_id AND j.is_corrective_action = 0
AND status IN (5,4,3,18,8)
AND (CAST(SYS_EXTRACT_UTC(SYSTIMESTAMP) AS DATE) - e.start_time) > (:tf)
AND ROWNUM < 500 ORDER BY start_time desc

A few searches on Metalink, and I find DOC 839080.1, which tells me to apply patch 8517252. This patch applies against the EM Application Home, not the database. So I shut down the application, opatch in 8517252, but the post script @post_install_script.sql is hanging. Had to hard kill (kill -9) the 100% cpu process at the OS level and this cleared the lock contention. Finished installing the patch, fired up the EM application again, and all was good (this is in the graphs above, about 5pm).
