Kevin Kempf's Blog

August 26, 2011

When Discoverer stops starting

Filed under: 11i, Discoverer — kkempf @ 1:53 pm

A Bad Day

Corrupt memory on a blade in our blade center crashed about 4 virtual machines and 9 Oracle databases on me a few weeks back. As luck would have it, one of the machines was running Oracle Enterprise Manager, so I received no alerts. When I finally got around to starting up WebLogic Server/Discoverer, I received a rather cryptic error and opmn was kind of hung up. I’d honestly hoped to post the exact failure log, but I can’t find it anywhere in the logs I know about in WLS and opmn. So I’ll post the symptoms here, and perhaps you can tuck them away as a warning for some future day.

Starting Discoverer

As you may know if you’ve read my other entries on Discoverer 11g, I created a script which seemed to work great for me, because apparently Oracle thinks people like to run 4 disparate commands to get Discoverer started.

startdisco.sh

#!/bin/bash

export MIDDLEWARE_HOME=/u01/discoverer/Middleware
export DOMAIN_HOME=$MIDDLEWARE_HOME/user_projects/domains/ClassicDomain
export WL_HOME=$MIDDLEWARE_HOME/wlserver_10.3
export ORACLE_HOME=/u01/discoverer/Middleware/as_1
export ORACLE_INSTANCE=/u01/discoverer/Middleware/asinst_1

rm -rf nohup.out

echo "Ensure NO processes related to disco 11g are running or this will fail"

nohup $DOMAIN_HOME/bin/startWebLogic.sh -Dweblogic.management.username=weblogic -Dweblogic.management.password=pw > /tmp/wls_start.log &

nohup $WL_HOME/server/bin/startNodeManager.sh > /tmp/start_nodemanager.log &

echo "sleeping"
sleep 60

nohup $DOMAIN_HOME/bin/startManagedWebLogic.sh WLS_DISCO t3://myhost.mydomain:7002 > /tmp/start_mgdwls.log &

echo "sleeping"
sleep 60

$ORACLE_INSTANCE/bin/opmnctl startall
$ORACLE_INSTANCE/bin/opmnctl status
echo "If Discoverer doesn't start properly, login to http://myhost-01.mydomain.com:7002/console"
echo "From the home page, click servers (Under Environment), then the control tab.  Check WLS_DISCO then click the start button below the checkbox"

You can take my script, or leave it, but the bottom line is that the following things need to be started to get Discoverer working:

  1. startWebLogic.sh
  2. startNodeManager.sh
  3. startManagedWebLogic.sh
  4. opmnctl startall
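
The script above waits a fixed 60 seconds between steps. One possible alternative, sketched here as an assumption rather than anything Oracle documents, is to poll the server's listen port and move on as soon as it accepts connections. The helper name wait_for_port is hypothetical, and an open port is only a rough signal that the server is ready.

```shell
# wait_for_port is a hypothetical helper: it tries a TCP connection to
# host:port about once per second until one succeeds or the timeout
# (in seconds, default 120) expires.
# It relies on bash's /dev/tcp redirection, so it needs bash, not plain sh.
wait_for_port() {
    local host="$1" port="$2" timeout="${3:-120}" waited=0
    # The subshell keeps file descriptor 3 from leaking into the caller.
    while ! (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In the start script this could stand in for a sleep, for example `wait_for_port myhost.mydomain 7002 300 || echo "port 7002 never opened"`, with the host and port taken from the script above.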

When I ran this script after the server crash, some OS processes would start, some would not, and I remember opmnctl status showing “pending” or something like that, instead of “starting” or “alive”.   Somewhere in some log, something pointed me to an error which said “cannot find /usr/lib/jvm/java-1.6.0-sun-1.6.0.22.x86_64” or the like.

The Nature of the Error

In short: the specific, hard-coded JDK on the Linux host that WLS and Discoverer were looking for was no longer there. I don’t know why, but I suspect that in the course of normal Red Hat updates, after a new version was installed, either the sysadmin cleaned up the old version (i.e., deleted it) or the update manager removed it as part of its own cleanup. Either way, there’s no way Discoverer was going to start. The reboot had, in fact, forced the issue; in theory the running processes would have carried on “in memory” indefinitely, even though the JDK version they were using was no longer on disk.
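
A failure like this is easier to catch before starting anything. Here is a minimal sketch of a pre-flight check, assuming the JDK path you care about is the one quoted in the error; check_jdk is a hypothetical helper, not part of WLS or Discoverer.

```shell
# check_jdk is a hypothetical helper: it succeeds only if the given JDK
# directory exists and contains an executable bin/java.
# A dangling symlink fails the -d test, so that case is caught as well.
check_jdk() {
    local jdk_dir="$1"
    if [ ! -d "$jdk_dir" ]; then
        echo "JDK directory not found: $jdk_dir" >&2
        return 1
    fi
    if [ ! -x "$jdk_dir/bin/java" ]; then
        echo "No executable bin/java under: $jdk_dir" >&2
        return 1
    fi
    return 0
}
```

Calling `check_jdk /usr/lib/jvm/java-1.6.0-sun-1.6.0.22.x86_64 || exit 1` at the top of a start script would stop it with a readable message instead of leaving opmn half started.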

The Quick Fix

As it turned out, I was at the airport when I got a call from our help desk, with users complaining that Discoverer wasn’t available. After finally figuring out what the issue really was (which is more confusing than it seems, since a bunch of processes start just fine without a JDK), I did a ghetto-IT fix of creating a symbolic link as root:

cd /usr/lib/jvm && ln -s java-1.6.0-sun-1.6.0.26.x86_64 java-1.6.0-sun-1.6.0.22.x86_64

Basically, this puts a pointer on the disk saying, “if you’re looking for jdk 1.6.0.22, go look in 1.6.0.26 instead”.  It ain’t pretty, but I had a plane to catch.
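
The argument order of ln -s is easy to get backwards: the existing target comes first and the name of the new link second. A hedged sketch of the same quick fix with a couple of guards follows; alias_jdk is a hypothetical helper name.

```shell
# alias_jdk is a hypothetical helper. Inside jvm_dir it makes the missing
# (old) JDK directory name a symlink to the installed (new) one.
# ln -s takes the existing target first and the link name second.
alias_jdk() {
    local jvm_dir="$1" missing="$2" installed="$3"
    if [ ! -d "$jvm_dir/$installed" ]; then
        echo "installed JDK not found: $jvm_dir/$installed" >&2
        return 1
    fi
    # Refuse to touch anything that already exists, including a dangling link.
    if [ -e "$jvm_dir/$missing" ] || [ -L "$jvm_dir/$missing" ]; then
        echo "refusing to overwrite: $jvm_dir/$missing" >&2
        return 1
    fi
    ( cd "$jvm_dir" && ln -s "$installed" "$missing" )
}
```

Run as root, `alias_jdk /usr/lib/jvm java-1.6.0-sun-1.6.0.22.x86_64 java-1.6.0-sun-1.6.0.26.x86_64` would make the 1.6.0.22 name point at the 1.6.0.26 directory.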

Started Discoverer fine immediately afterwards.  Put this on my “to investigate” list.

The Better Fix

When I finally had time, I boiled down my investigation to two points:

  • Where was the JDK version hard coded in the WLS startup scripts
  • What versions of the JDK were certified with WLS/Disco

I got 1 out of 2: the first one. Turns out, I’m not the only one to see this problem. Note 1058804.1 “How to Change Type of JDK (Sun/JRockit) for FMW 11g Domain” explains that the script $MIDDLEWARE_HOME/wlserver_10.3/common/bin/commEnv.sh has a variable called JAVA_HOME in it, which was hard coded to /usr/lib/jvm/java-1.6.0-sun-1.6.0.22.x86_64. I decided my best option here was to change it to /usr/lib/jvm/java-1.6.0-sun.x86_64, since it appears that the symbolic link of that name in /usr/lib/jvm would forever after point to whatever the latest version of the JDK was. At least this way, I didn’t have to go monkey around in a WLS shell script every time the JDK changed versions. Honestly, I don’t know if I pointed the WLS installer at a specific version of the JDK, if it auto-detected it, or if it installed the JDK itself, but hard coding the version was bad policy and needed correcting.
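
I made the change by hand in an editor, but the same edit can be sketched with sed. This is an illustration only: repoint_jdk is a hypothetical helper, and it assumes the paths contain nothing sed treats specially apart from dots (no |, & or backslashes).

```shell
# repoint_jdk is a hypothetical helper: it replaces every occurrence of one
# JDK path with another in a file, keeping the original as <file>.bak.
repoint_jdk() {
    local file="$1" old="$2" new="$3"
    local old_re
    # Escape the dots so sed matches them literally, not as "any character".
    old_re=$(printf '%s\n' "$old" | sed 's/\./\\./g')
    # Use | as the delimiter because the paths are full of slashes.
    sed -i.bak "s|$old_re|$new|g" "$file"
}
```

For example, `repoint_jdk "$MIDDLEWARE_HOME/wlserver_10.3/common/bin/commEnv.sh" /usr/lib/jvm/java-1.6.0-sun-1.6.0.22.x86_64 /usr/lib/jvm/java-1.6.0-sun.x86_64`. The .bak copy is there so you can diff the result before restarting anything.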

Addendum: After more confusion regarding this same issue, I found that another file, apparently created during installation, also contains a hard-coded reference to your JVM. In addition to commEnv.sh, you must fix the SUN_JAVA_HOME reference in $MIDDLEWARE_HOME/user_projects/domains/ClassicDomain/bin/setDomainEnv.sh
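
Since the path turned up in more than one file, a recursive search is a cheap way to check for stragglers. A minimal sketch; find_jdk_refs is a hypothetical helper, and which directories are worth searching is your call.

```shell
# find_jdk_refs is a hypothetical helper: it lists the files under the given
# directories that contain the literal JDK path.
# -r recurses, -l prints only file names, -F treats the path as a fixed
# string rather than a regular expression.
find_jdk_refs() {
    local needle="$1"
    shift
    grep -rlF -- "$needle" "$@" 2>/dev/null
}
```

For example, `find_jdk_refs /usr/lib/jvm/java-1.6.0-sun-1.6.0.22.x86_64 "$MIDDLEWARE_HOME/wlserver_10.3/common/bin" "$MIDDLEWARE_HOME/user_projects/domains/ClassicDomain/bin"` should come back empty once both files are fixed.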

The part I couldn’t discern was what version of Java was certified for my OS, WLS and Discoverer version. The analyst was good and sent me a bunch of spreadsheets which supposedly covered this, but it became apparent after a few minutes that it would take a higher power to discern what they meant. I decided that if it’s running now, that’s what’s most important, and that if push ever came to shove, I’d downgrade to fix an issue at the insistence of an analyst.

4 Comments

  1. Hi Sir,

    I am a regular reader of your blog. you have helped me with many questions in 2011.

    i have a task now . we are trying to do a performance testing ( load test) on 11g Discoverer with hp load runner for 200 concurrent users .

    i have modified the maxprocs parameter in opmn.xml to 1000 and maxsessions in configuration.xml to 1000 . we are not getting any specific error in logs but in load runner we see that it is not getting into next page ,meaning the login has failed

    we are failing with 20 concurrent users .

    Please advise if we are missing anything . this is a new environment without any patches . 11.1.1.2 Discoverer with 26 gb ram on RHEL 5 .

    we really appreciate your help

    thanks

    Comment by Kanna — January 17, 2012 @ 10:28 am

    • I’m sorry, I have simply no experience tuning opmn and its configuration. It sounds like your environment is pretty vanilla. I’m assuming that before you run load runner, you’re able to log into Discoverer without incident? Meaning that your environment works normally without load testing? If I were you, I’d have already opened a case with Oracle support. Remember, if you’re running a certified version of the application on a certified operating system, it’s your right and expectation that Oracle will solve your problem!

      Comment by kkempf — January 17, 2012 @ 11:54 am

      • Sorry to not mention this before . Oracle support does not support the load runner tests or something related to it

        i have contacted them earlier. but no luck . (they helped me but not that much help they really could) .

        you are the expert so i thought you could help me about this

        thanks

        Comment by Kanna — January 17, 2012 @ 12:03 pm

  2. Could you please guide me to any other expert on this . i really have no options and i would like to take an expert opinion in this . Please help me sir

    Comment by Kanna — January 17, 2012 @ 12:04 pm

