Kevin Kempf's Blog

August 26, 2011

When Discoverer stops starting

Filed under: 11i, Discoverer — kkempf @ 1:53 pm

A Bad Day

Corrupt memory on a blade in our blade center crashed about 4 virtual machines and 9 Oracle databases on me a few weeks back. As luck would have it, one of the machines was running Oracle Enterprise Manager, so I received no alerts. When I finally got around to starting up WebLogic Server/Discoverer, I received a rather cryptic error and opmn was more or less hung. I’d honestly hoped to post the exact failure log, but I can’t find it anywhere in the WLS and opmn logs I know about. So I’ll post the symptoms here, and perhaps you can tuck them away as a warning for some future day.

Starting Discoverer

As you may know if you’ve read my other entries on Discoverer 11g, I created a script which seemed to work great for me, because apparently Oracle thinks people like to run 4 disparate commands to get Discoverer started.

startdisco.sh

#!/bin/bash

export MIDDLEWARE_HOME=/u01/discoverer/Middleware
export DOMAIN_HOME=$MIDDLEWARE_HOME/user_projects/domains/ClassicDomain
export WL_HOME=$MIDDLEWARE_HOME/wlserver_10.3
export ORACLE_HOME=/u01/discoverer/Middleware/as_1
export ORACLE_INSTANCE=/u01/discoverer/Middleware/asinst_1

rm -rf nohup.out

echo "Ensure NO processes related to disco 11g are running or this will fail"

nohup $DOMAIN_HOME/bin/startWebLogic.sh -Dweblogic.management.username=weblogic -Dweblogic.management.password=pw > /tmp/wls_start.log &

nohup $WL_HOME/server/bin/startNodeManager.sh > /tmp/start_nodemanager.log &

echo "sleeping"
sleep 60

nohup $DOMAIN_HOME/bin/startManagedWebLogic.sh WLS_DISCO t3://myhost.mydomain:7002 > /tmp/start_mgdwls.log &

echo "sleeping"
sleep 60

$ORACLE_INSTANCE/bin/opmnctl startall
$ORACLE_INSTANCE/bin/opmnctl status
echo "If Discoverer doesn't start properly, login to http://myhost-01.mydomain.com:7002/console"
echo "From the home page, click servers (Under Environment), then the control tab.  Check WLS_DISCO then click the start button below the checkbox"

You can take my script, or leave it, but the bottom line is that the following things need to be started to get Discoverer working:

  1. startWebLogic.sh
  2. startNodeManager.sh
  3. startManagedWebLogic.sh
  4. opmnctl startall
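
The script above spaces these four steps out with fixed 60-second sleeps. As a rough sketch of an alternative (not something from my original script), a small helper can poll a TCP port until it accepts connections. The function name wait_for_port and its timeout argument are my own invention, and it relies on bash's /dev/tcp redirection, so it assumes the script really runs under bash.

```shell
#!/bin/bash
# Hypothetical helper: poll HOST:PORT once a second until it accepts a TCP
# connection, or give up after TIMEOUT seconds (default 120).
wait_for_port() {
    local host="$1" port="$2" timeout="${3:-120}"
    local waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # bash treats /dev/tcp/HOST/PORT specially: the redirection only
        # succeeds if the connection does. The subshell closes it again.
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Possible use in place of the first "sleep 60" (host and port taken from
# the t3:// URL in the script above):
#   wait_for_port myhost.mydomain 7002 300 || echo "admin server not listening yet"
```

An open port only tells you something is listening, not that WebLogic has finished starting, so treat this as a floor rather than a guarantee. An unreachable host can also make each connection attempt take far longer than a second.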

When I ran this script after the server crash, some OS processes would start, some would not, and I remember opmnctl status showing “pending” or something like that, instead of “starting” or “alive”.   Somewhere in some log, something pointed me to an error which said “cannot find /usr/lib/jvm/java-1.6.0-sun-1.6.0.22.x86_64” or the like.

The Nature of the Error

In short: the specific, hard-coded JDK on the Linux host which WLS and Discoverer were looking for was no longer there. I don’t know why, but I suspect that in the course of normal RedHat updates, after a new version was installed, either the sysadmin cleaned up the old version (i.e., deleted it) or the update manager removed it as part of its own process. Either way, there’s no way Discoverer was going to start. The reboot had, in fact, forced the issue; in theory the old processes would have continued to run forever “in memory” despite the fact that the JDK version they were using was no longer available on disk.
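
A check like the following at the top of the start script would have turned this into an obvious error rather than a scavenger hunt. It is only a sketch: check_jdk_dir is a name I made up, and the assumption that a usable JDK directory has bin/java or jre/bin/java under it is mine, not something from Oracle's documentation.

```shell
#!/bin/bash
# Hypothetical preflight check: does this directory still look like a JDK?
# Returns 0 if usable, 1 if the directory is gone, 2 if no java binary is found.
check_jdk_dir() {
    local jdk_dir="$1"
    if [ ! -d "$jdk_dir" ]; then
        echo "JDK directory missing: $jdk_dir" >&2
        return 1
    fi
    # Assumption: a JDK layout has bin/java, a JRE-only layout has jre/bin/java.
    if [ -x "$jdk_dir/bin/java" ] || [ -x "$jdk_dir/jre/bin/java" ]; then
        return 0
    fi
    echo "No executable java found under: $jdk_dir" >&2
    return 2
}

# Possible use before starting anything:
#   check_jdk_dir /usr/lib/jvm/java-1.6.0-sun-1.6.0.22.x86_64 || exit 1
```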

The Quick Fix

As it turned out, I was at the airport when I got a call from our help desk, with users complaining that Discoverer wasn’t available. After finally figuring out what the issue really was (which is more confusing than it seems, since a bunch of processes start just fine without a JDK), I did a ghetto-IT fix of creating a symbolic link as root:

cd /usr/lib/jvm && ln -s java-1.6.0-sun-1.6.0.26.x86_64 java-1.6.0-sun-1.6.0.22.x86_64

Basically, this puts a pointer on the disk saying, “if you’re looking for jdk 1.6.0.22, go look in 1.6.0.26 instead”.  It ain’t pretty, but I had a plane to catch.
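
The argument order of ln -s is easy to get backwards: the existing target comes first, and the name of the link to create comes second. Here is a small sketch that wraps this in a function with a couple of sanity checks. The name make_compat_link and the checks are my own additions; the real fix was just the one-liner.

```shell
#!/bin/bash
# Hypothetical wrapper: inside JVM_ROOT, create a link named MISSING that
# points at the directory EXISTING, refusing to clobber anything already there.
make_compat_link() {
    local jvm_root="$1" existing="$2" missing="$3"
    if [ ! -d "$jvm_root/$existing" ]; then
        echo "target does not exist: $jvm_root/$existing" >&2
        return 1
    fi
    if [ -e "$jvm_root/$missing" ] || [ -L "$jvm_root/$missing" ]; then
        echo "refusing to overwrite: $jvm_root/$missing" >&2
        return 1
    fi
    # ln -s TARGET LINK_NAME: the existing directory first, the new name second.
    # The subshell keeps the cd from leaking into the caller.
    ( cd "$jvm_root" && ln -s "$existing" "$missing" )
}

# Possible use, as root:
#   make_compat_link /usr/lib/jvm java-1.6.0-sun-1.6.0.26.x86_64 java-1.6.0-sun-1.6.0.22.x86_64
#   readlink /usr/lib/jvm/java-1.6.0-sun-1.6.0.22.x86_64
```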

Started Discoverer fine immediately afterwards.  Put this on my “to investigate” list.

The Better Fix

When I finally had time, I boiled down my investigation to two points:

  • Where was the JDK version hard coded in the WLS startup scripts
  • What versions of the JDK were certified with WLS/Disco

I got 1 out of 2: the first one. It turns out I’m not the only one to see this problem. Note 1058804.1, “How to Change Type of JDK (Sun/JRockit) for FMW 11g Domain”, explains that the script $MIDDLEWARE_HOME/wlserver_10.3/common/bin/commEnv.sh has a variable called JAVA_HOME in it, which was hard coded to /usr/lib/jvm/java-1.6.0-sun-1.6.0.22.x86_64. I decided my best option was to change it to /usr/lib/jvm/java-1.6.0-sun.x86_64, since it appears that the symbolic link of that name in /usr/lib/jvm always points to whatever the latest installed version of the JDK is. At least this way, I don’t have to go monkey around in a WLS shell script every time the JDK changes versions. Honestly, I don’t know if I pointed the WLS installer at a specific version of the JDK, if it auto-detected it, or if it installed it itself, but hard coding the path was bad policy and needed correcting.
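
I made the change by hand in an editor, but the same edit can be sketched with sed. The function name repoint_jdk is hypothetical, and it assumes GNU sed (for -i.bak) and that the old path appears literally in the file. It replaces every occurrence of the old path, whatever variable it is assigned to.

```shell
#!/bin/bash
# Hypothetical helper: replace every literal occurrence of OLD_PATH with
# NEW_PATH in FILE, keeping the original as FILE.bak.
repoint_jdk() {
    local file="$1" old_path="$2" new_path="$3"
    local old_re
    # Escape the dots so "1.6.0" is matched literally, not as a regex wildcard.
    old_re=$(printf '%s' "$old_path" | sed 's/\./\\./g')
    # "|" is used as the sed delimiter because the paths contain "/".
    sed -i.bak "s|$old_re|$new_path|g" "$file"
}

# Possible use:
#   repoint_jdk "$MIDDLEWARE_HOME/wlserver_10.3/common/bin/commEnv.sh" \
#       /usr/lib/jvm/java-1.6.0-sun-1.6.0.22.x86_64 \
#       /usr/lib/jvm/java-1.6.0-sun.x86_64
```

Diff the file against its .bak copy afterwards, before you restart anything.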

Addendum: After more confusion regarding this same issue, I found that another file, apparently created during installation, also contains a hard-coded reference to your JVM. In addition to commEnv.sh, you must fix the SUN_JAVA_HOME reference in $MIDDLEWARE_HOME/user_projects/domains/ClassicDomain/bin/setDomainEnv.sh
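
Since the path hides in more than one file, a quick scan can save some of that confusion. This sketch lists any /usr/lib/jvm directory mentioned in the files you give it that no longer exists on disk. The function name find_stale_jvm_paths is made up, and the pattern only looks at the first directory level under /usr/lib/jvm, which is an assumption about how the paths are written.

```shell
#!/bin/bash
# Hypothetical helper: print each /usr/lib/jvm/<name> directory referenced
# in the given files that does not exist (a live symlink counts as existing).
find_stale_jvm_paths() {
    # -o prints only the matched text, -h leaves out the file names.
    grep -ohE '/usr/lib/jvm/[A-Za-z0-9._-]+' "$@" 2>/dev/null | sort -u |
    while read -r jdk_path; do
        [ -d "$jdk_path" ] || echo "$jdk_path"
    done
}

# Possible use:
#   find_stale_jvm_paths \
#       "$MIDDLEWARE_HOME/wlserver_10.3/common/bin/commEnv.sh" \
#       "$MIDDLEWARE_HOME/user_projects/domains/ClassicDomain/bin/setDomainEnv.sh"
```

If it prints nothing, every JDK path in those files still resolves to a directory. It says nothing about any other files that might carry the same path.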

The part I couldn’t discern was what version of Java was certified for my OS, WLS and Discoverer version. The analyst was good and sent me a bunch of spreadsheets which supposedly covered this, but it became apparent after a few minutes that it would take a higher power to discern what they meant. I decided that if it’s running now, that’s what’s most important, and that if push ever came to shove, I’d downgrade to fix an issue at the insistence of an analyst.
