Troubleshoot Grid Infrastructure Startup Issues (Doc ID 1050908.1)

	Purpose

	Scope

	Details

	Start up sequence:

	Cluster status

	Case 1: OHASD does not start

	Case 2: OHASD Agents do not start

	Case 3: OCSSD.BIN does not start

	Case 4: CRSD.BIN does not start

	Case 5: GPNPD.BIN does not start

	Case 6: Various other daemons do not start

	Case 7: CRSD Agents do not start

	Case 8: HAIP does not start

	Network and Naming Resolution Verification

	Log File Location, Ownership and Permission

	In Grid Infrastructure cluster environment:

	In Oracle Restart environment:

	Network Socket File Location, Ownership and Permission

	In Grid Infrastructure cluster environment:

	In Oracle Restart environment:

	Diagnostic file collection

	References

Applies to:

Oracle Database – Enterprise Edition – Version 11.2.0.1 and later

Information in this document applies to any platform.

Purpose


This note provides a reference for troubleshooting 11gR2 and 12c Grid Infrastructure clusterware startup issues. It applies to issues in both new environments (during root.sh or rootupgrade.sh) and unhealthy existing environments. For root.sh issues specifically, see note 1053970.1.

Scope

This document is intended for Clusterware/RAC Database Administrators and Oracle support engineers.

Details

Start up sequence:

In a nutshell, the operating system starts ohasd, ohasd starts agents to start up daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd, asm, etc.), and crsd starts agents that start user resources (database, SCAN, listener, etc.).

For detailed Grid Infrastructure clusterware startup sequence, please refer to note 1053147.1

Cluster status

To find out cluster and daemon status:

$GRID_HOME/bin/crsctl check crs

$GRID_HOME/bin/crsctl stat res -t -init

For 11.2.0.2 and above, there will be two more processes:

For 11.2.0.3 onward in non-Exadata, ora.diskmon will be offline:

ora.diskmon

  1        OFFLINE  OFFLINE       rac1

For 12c onward, ora.storage is introduced:

ora.storage

1 ONLINE ONLINE racnode1 STABLE

To start an offline daemon – if ora.crsd is OFFLINE:
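The usual form of the start command, run as root, is `$GRID_HOME/bin/crsctl start res ora.crsd -init`. The following is a minimal sketch, not an Oracle-supplied tool, that lists OFFLINE init resources and prints the matching start commands without running them. The function names are hypothetical, and the parser assumes the tabular layout shown above (a resource name line followed by an instance line of the form `1 TARGET STATE SERVER`).

```shell
#!/bin/sh
# Hypothetical helper: print the name of every resource whose STATE column
# is OFFLINE. Reads "crsctl stat res -t -init" output on stdin.
offline_init_resources() {
    awk '
        $1 ~ /^ora\./ { name = $1; next }                  # resource name line
        $1 ~ /^[0-9]+$/ && $3 == "OFFLINE" { print name }  # instance line: id TARGET STATE
    '
}

# Hypothetical helper: print (do not execute) one start command per
# resource name read from stdin.
suggest_start_commands() {
    while read -r res; do
        printf '$GRID_HOME/bin/crsctl start res %s -init\n' "$res"
    done
}
```

A possible invocation is `$GRID_HOME/bin/crsctl stat res -t -init | offline_init_resources | suggest_start_commands`. Keep in mind that ora.diskmon is expected to be OFFLINE on non-Exadata systems from 11.2.0.3 onward, so it does not need to be started.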

Case 1: OHASD does not start

As ohasd.bin is responsible for starting all other clusterware processes, directly or indirectly, it must start properly for the rest of the stack to come up. If ohasd.bin is not up, checking its status reports CRS-4639 (Could not contact Oracle High Availability Services). If ohasd.bin is already up, another start attempt reports CRS-4640. If it fails to start, the following will be reported:

Automatic ohasd.bin start up depends on the following:

1. OS is at appropriate run level:

The OS needs to be at the specified run level before CRS will try to start up.

To find out at which run level the clusterware needs to come up:

cat /etc/inittab|grep init.ohasd

Note: Oracle Linux 6 (OL6) and Red Hat Enterprise Linux 6 (RHEL6) have deprecated inittab; init.ohasd is instead configured via upstart in /etc/init/oracle-ohasd.conf. However, the process "/etc/init.d/init.ohasd run" should still be up.

The above example shows that CRS is supposed to run at run levels 3 and 5; note that, depending on the platform, CRS comes up at different run levels.

To find out current run level:

who -r
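The two checks above can be combined in a small script. This is a minimal sketch with hypothetical function names; it assumes a classic inittab entry in the `id:runlevels:action:process` layout (for instance, an entry whose second field is `35`), so it does not apply where init.ohasd is configured through upstart or another init system.

```shell
#!/bin/sh
# Hypothetical helper: print the run-level field of the init.ohasd entry
# in an inittab-style file (default /etc/inittab).
ohasd_runlevels() {
    inittab=${1:-/etc/inittab}
    # Fields are id:runlevels:action:process; skip commented-out entries.
    awk -F: '/init\.ohasd/ && $1 !~ /^#/ { print $2; exit }' "$inittab"
}

# Hypothetical helper: succeed if run level $2 appears in the level string $1.
runlevel_allowed() {
    [ -n "$2" ] || return 1
    case "$1" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

For example, after reading the current run level from `who -r` (its column layout differs between platforms, so it is not parsed here), `runlevel_allowed "$(ohasd_runlevels)" 3` succeeds only if run level 3 is configured for init.ohasd.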

2. "init.ohasd run" is up

On Linux/UNIX, as "init.ohasd run" is configured in /etc/inittab, process init (pid 1, /sbin/init on Linux, Solaris and hp-ux, /usr/sbin/init on AIX) will start and respawn "init.ohasd run" if it fails. Without "init.ohasd run" up and running, ohasd.bin will not start:

ps -ef|grep init.ohasd|grep -v grep

Note: Oracle Linux 6 (OL6) and Red Hat Enterprise Linux 6 (RHEL6) have deprecated inittab; init.ohasd is instead configured via upstart in /etc/init/oracle-ohasd.conf. However, the process "/etc/init.d/init.ohasd run" should still be up.
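For use in a monitoring script, the process check can be wrapped so that it returns an exit status. The sketch below uses a hypothetical function name and assumes `ps -ef` style output, where the second column is the PID.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -ef" output on stdin, print the PID of each
# "init.ohasd run" process, and exit non-zero if none is found.
init_ohasd_pids() {
    awk '
        /init\.ohasd run/ && !/grep/ { print $2; found = 1 }
        END { exit !found }
    '
}
```

A possible use is `ps -ef | init_ohasd_pids || echo "init.ohasd run is not up"`.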

If any rc S_nn_command script (located in rc_n_.d, for example S98gcstartup) is stuck, the init process may not start "/etc/init.d/init.ohasd run"; please engage the OS vendor to find out why the relevant S_nn_command script is stuck.

Error "[ohasd( )] CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started." may be reported if init.ohasd fails to start on time.

If the system administrator cannot identify why init.ohasd is not starting, the following can be used as a very short-term workaround:

cd <location-of-init.ohasd>

nohup ./init.ohasd run &

/bin/sh /etc/init.ohasd run &

3. Clusterware auto start is enabled – it's enabled by default

By default CRS is enabled for auto start upon node reboot. To enable it:

$GRID_HOME/bin/crsctl enable crs

To verify whether it's currently enabled or not:

$GRID_HOME/bin/crsctl config crs
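When this check is scripted, the command output can be classified rather than read by eye. The sketch below is an assumption-laden example: the function name is hypothetical, and it assumes the output contains the phrase "autostart is enabled" or "autostart is disabled"; verify the wording on your version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: classify "crsctl config crs" output read from stdin.
# Assumption: the output states "autostart is enabled" or "autostart is disabled".
autostart_state() {
    awk '
        /autostart is enabled/  { state = "enabled" }
        /autostart is disabled/ { state = "disabled" }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

A possible use is `$GRID_HOME/bin/crsctl config crs 2>&1 | autostart_state`; a result of `unknown` means the output did not match either phrase and should be inspected manually.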

If the following is in the OS messages file

The reason is that the file does not exist or is not accessible. The cause can be that someone modified it manually, or that the wrong opatch was used to apply a GI patch (e.g. opatch for Solaris X64 used to apply a patch on Linux).

4. syslogd is up and OS is able to execute init script S96ohasd

The OS may be stuck in some other S_nn_ script while the node is coming up, and thus never get a chance to execute S96ohasd; if that's the case, the following message will not be in the OS messages:

If you don't see the above message, the other possibility is that syslogd (/usr/sbin/syslogd) is not fully up. Grid may fail to come up in that case as well. This may not apply to AIX.

To find out whether OS is able to execute S96ohasd while node is coming up, modify S96ohasd:

From:

To:

After a node reboot, if you don't see /tmp/ohasd.start._timestamp_ get created, it means the OS is stuck in some other S_nn_ script. If you do see /tmp/ohasd.start._timestamp_ but not "Oracle HA daemon is enabled for autostart" in messages, syslogd is likely not fully up. In both cases, you will need to engage the System Administrator to find the issue at the OS level. For the latter case, the workaround is to "sleep" for about 2 minutes; modify ohasd:

From:

To:
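The reboot check described in this step (timestamp file versus syslog message) can be sketched as a small decision script. The function name is hypothetical, and the messages file path differs by platform (for example /var/log/messages on Linux). The grep matches any occurrence in the file, so entries from earlier boots should be excluded before drawing a conclusion.

```shell
#!/bin/sh
# Hypothetical helper.
#   $1 = directory holding the ohasd.start.<timestamp> files (e.g. /tmp)
#   $2 = OS messages file
diagnose_s96ohasd() {
    tmpdir=$1
    messages=$2
    if ! ls "$tmpdir"/ohasd.start.* >/dev/null 2>&1; then
        # No timestamp file: S96ohasd never ran.
        echo "S96ohasd not executed: OS is likely stuck in another Snn script"
    elif ! grep -q 'Oracle HA daemon is enabled for autostart' "$messages" 2>/dev/null; then
        # Script ran, but its message never reached the messages file.
        echo "S96ohasd executed but message missing: syslogd is likely not fully up"
    else
        echo "S96ohasd executed and autostart message logged"
    fi
}
```

For example, `diagnose_s96ohasd /tmp /var/log/messages` could be run as root shortly after the node comes back up.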

5. The file system where GRID_HOME resides is online when init script S96ohasd is executed; once S96ohasd is executed, the following messages should be in the OS messages file:

If you see the first line but not the last line, the file system containing GRID_HOME was likely not online while S96ohasd was executed.

6. Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is accessible and valid

ls -l $GRID_HOME/cdata/*.olr

If the OLR is inaccessible or corrupted, ohasd.log will likely have messages similar to the following:

no resource

The solution is to restore a good backup of OLR with "ocrconfig -local -restore <ocr_backup_name>".

By default, OLR will be backed up to $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr once installation is complete.
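Locating the most recent backup and assembling the restore command can be sketched as follows. The function names are hypothetical, the backup naming pattern is the one given above, and the restore command is only printed so that it can be reviewed before being run as root with the stack down.

```shell
#!/bin/sh
# Hypothetical helper: print the newest OLR backup under
# <grid_home>/cdata/<host>/, or nothing if none exists.
#   $1 = GRID_HOME, $2 = host name as used in the cdata directory
latest_olr_backup() {
    grid_home=$1
    host=$2
    # ls -t sorts by modification time, newest first.
    ls -t "$grid_home/cdata/$host"/backup_*.olr 2>/dev/null | head -n 1
}

# Hypothetical helper: print (do not run) the restore command for that backup.
olr_restore_command() {
    backup=$(latest_olr_backup "$1" "$2")
    [ -n "$backup" ] || { echo "no OLR backup found" >&2; return 1; }
    printf '%s/bin/ocrconfig -local -restore %s\n' "$1" "$backup"
}
```

For example, `olr_restore_command "$GRID_HOME" rac1` prints the command for node rac1, assuming rac1 is the directory name under cdata.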

7. ohasd.bin is able to access network socket files:

th_listen: CLSCLISTEN failed

In Grid Infrastructure cluster environment, ohasd related socket files should be owned by root, but in Oracle Restart environment, they should be owned by grid user, refer to "Network Socket File Location, Ownership and Permission" section for example output.

8. ohasd.bin is able to access log file location:

OS messages/syslog shows:

Refer to "Log File Location, Ownership and Permission" section for example output, if the expected directory is missing, create it with proper ownership and permission.

9. ohasd may fail to start on SUSE Linux after a node reboot; refer to note 1325718.1 – OHASD not Starting After Reboot on SLES

10. OHASD fails to start, "ps -ef| grep ohasd.bin" shows ohasd.bin is started, but nothing in $GRID_HOME/log/<node>/ohasd/ohasd.log for many minutes, truss shows it is looping to close non-opened file handles:

Call stack of ohasd.bin from pstack shows the following:

The cause is bug 11834289, which is fixed in 11.2.0.3 and above. Another symptom of the bug is that clusterware processes may fail to start with the same call stack and truss output (looping on OS call "close"). If the bug happens when trying to start other resources, "CRS-5802: Unable to start the agent process" could show up as well.

11. Other potential causes/solutions listed in note 1069182.1 – OHASD Failed to Start: Inappropriate ioctl for device

12. ohasd.bin started fine, however, "crsctl check crs" shows only the following and nothing else:

And "crsctl stat res -p -init" shows nothing

The cause is that OLR is corrupted, refer to note 1193643.1 to restore.

13. On EL7/OL7: note 1959008.1 – Install of Clusterware fails while running root.sh on OL7 – ohasd fails to start

14. For EL7/OL7, patch 25606616 is needed: TRACKING BUG TO PROVIDE GI FIXES FOR OL7

15. If ohasd still fails to start, refer to ohasd.log in <grid-home>/log/<nodename>/ohasd/ohasd.log and ohasdOUT.log

Case 2: OHASD Agents do not start

OHASD.BIN will spawn four agents/monitors to start resources:

oraagent: responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc

orarootagent: responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc

cssdagent / cssdmonitor: responsible for ora.cssd(for ocssd.bin) and ora.cssdmonitor(for cssdmonitor itself)

If ohasd.bin cannot start any of the above agents properly, clusterware will not come to a healthy state.

A common cause of agent failure is that the log file or log directory for the agents doesn't have proper ownership or permissions.

Refer to below section "Log File Location, Ownership and Permission" for general reference.

One example is that "rootcrs.pl -patch/postpatch" was not executed during manual patching, resulting in agent start failure:

2015-02-25 15:43:54.350806 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/orarootagent]

2015-02-25 15:43:54.382154 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]

2015-02-25 15:43:54.384105 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]

The solution is to execute the missed steps.

If an agent binary (oraagent.bin or orarootagent.bin etc) is corrupted, the agent will not start, so related resources do not come up:

CRS-5828:Could not start agent

Failed to start the agent

Failed to start the agentno exe permission

The solution is to compare agent binary with a "good" node, and restore a good copy.
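One way to make that comparison is to build a checksum manifest on each node and compare the two files. The sketch below uses hypothetical function names and relies on the POSIX `cksum` utility; a stronger digest can be substituted where available. It only detects differences and does not restore anything.

```shell
#!/bin/sh
# Hypothetical helper: print "name checksum size" for each named file.
#   $1 = directory (e.g. $GRID_HOME/bin); remaining args = file names
bin_manifest() {
    dir=$1
    shift
    for name in "$@"; do
        if [ -f "$dir/$name" ]; then
            # cksum on stdin prints "<crc> <size>".
            cksum < "$dir/$name" | awk -v n="$name" '{ print n, $1, $2 }'
        else
            echo "$name MISSING"
        fi
    done
}

# Hypothetical helper: print names whose line in manifest $2 differs from
# (or is absent in) manifest $1. Both manifests are assumed to be non-empty.
changed_binaries() {
    awk 'NR == FNR { seen[$1] = $0; next } seen[$1] != $0 { print $1 }' "$1" "$2"
}
```

For example, run `bin_manifest "$GRID_HOME/bin" oraagent.bin orarootagent.bin > /tmp/manifest.$(hostname)` on both nodes, copy the file from the good node across, and pass the good manifest first and the suspect manifest second to `changed_binaries`.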

truss/strace of ohasd shows agent binary is corrupted

32555 17:38:15.953355 execve("/ocw/grid/bin/orarootagent.bin",

["/opt/grid/product/112020/grid/bi"…], [/* 38 vars */]) = 0

32555 17:38:15.954151 --- SIGBUS (Bus error) @ 0 (0) ---

Agent may fail to start due to bug 11834289 with error "CRS-5802: Unable to start the agent process", refer to Section "OHASD does not start" #10 for details.

Refer to: note 1964240.1 – CRS-5823:Could not initialize agent framework

Case 3: OCSSD.BIN does not start

Successful ocssd.bin startup depends on the following:

1. GPnP profile is accessible – gpnpd needs to be fully up to serve profile

If ocssd.bin is able to get the profile successfully, ocssd.log will likely have messages similar to the following:

Otherwise, messages like the following will show in ocssd.log:

clsgpnp_getProfile failed

The solution is to ensure gpnpd is up and running properly.

2. Voting Disk is accessible

In 11gR2, ocssd.bin discovers voting disks using settings from the GPnP profile; if not enough voting disks can be identified, ocssd.bin will abort itself.

clssnmvDiskVerify: Successful discovery of 0 disks

ocssd.bin may not come up with the following error if all nodes failed while there's a voting file change in progress:

The solution is to start ocssd.bin in exclusive mode with note 1364971.1

If the voting disk is located on a non-ASM device, ownership and permissions should be:

-rw-r----- 1 ogrid oinstall 21004288 Feb 4 09:13 votedisk1

3. Network is functional and name resolution is working:

If ocssd.bin can't bind to any network, ocssd.log will likely have messages like the following:

failed to open gipc endp

If there's a connectivity issue on the private network (including multicast being off), ocssd.log will likely have messages like the following:

has a disk HB, but no network HB

after a long delay

To validate network, please refer to note 1054902.1

Please also check if the network interface name is matching the gpnp profile definition ("gpnptool get") for cluster_interconnect if CSSD could not start after a network change.

In 11.2.0.1, ocssd.bin may bind to public network if private network is unavailable

4. Vendor clusterware is up (if using vendor clusterware)

Grid Infrastructure provides full clusterware functionality and doesn't need vendor clusterware to be installed; but if you happen to have Grid Infrastructure on top of vendor clusterware in your environment, then the vendor clusterware needs to come up fully before CRS can be started. To verify, as grid user:

$GRID_HOME/bin/lsnodes -n

If vendor clusterware is not fully up, ocssd.log will likely have messages similar to the following:

CSSD signal 11 in thread skgxnmon

Before the clusterware is installed, execute the command below as grid user:

$INSTALL_SOURCE/install/lsnodes -v

One issue on hp-ux: note 2130230.1 – Grid infrastructure startup fails due to vendor Clusterware did not start (HP-UX Service guard)
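The ocssd.log signatures quoted in items 1 to 4 of this case can be checked in one pass. This is a minimal sketch with a hypothetical function name; the pattern list contains only the fragments quoted in this note, and a match is a pointer to the relevant item rather than a definitive diagnosis.

```shell
#!/bin/sh
# Hypothetical helper: report which known signatures occur in an ocssd.log.
#   $1 = path to ocssd.log
scan_ocssd_log() {
    log=$1
    # Each table row is "fixed-string pattern|hint"; grep reads the log file,
    # while the loop reads the table from the here-document.
    while IFS='|' read -r pattern hint; do
        if grep -F -q -- "$pattern" "$log"; then
            printf '%s -> %s\n' "$pattern" "$hint"
        fi
    done <<'EOF'
clsgpnp_getProfile failed|GPnP profile not served; check that gpnpd is up
Successful discovery of 0 disks|not enough voting disks discovered
failed to open gipc endp|ocssd.bin cannot bind to a network
has a disk HB, but no network HB|private network connectivity or multicast problem
CSSD signal 11 in thread skgxnmon|vendor clusterware not fully up
EOF
}
```

For 11.2, a possible use is `scan_ocssd_log $GRID_HOME/log/<node>/cssd/ocssd.log`; the log path is an assumption based on the 11.2 log layout and differs once logs move to ADR in 12.1.0.2.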

5. Command "crsctl" being executed from wrong GRID_HOME

Command "crsctl" must be executed from the correct GRID_HOME to start the stack; otherwise a message similar to the following will be reported:

Case 4: CRSD.BIN does not start

Successful crsd.bin startup depends on the following:

1. ocssd is fully up

If ocssd.bin is not fully up, crsd.log will show messages like following:

2. OCR is accessible

If the OCR is located on ASM, ora.asm resource (ASM instance) must be up and diskgroup for OCR must be mounted, if not, likely the crsd.log will show messages like:

Note: in 11.2 ASM starts before crsd.bin, and brings up the diskgroup automatically if it contains the OCR.

If the OCR is located on a non-ASM device, expected ownership and permissions are:

-rw-r----- 1 root oinstall 272756736 Feb 3 23:24 ocr

If the OCR is located on a non-ASM device and is unavailable, crsd.log will likely show messages similar to the following:

If the OCR is corrupted, crsd.log will likely show messages like the following:

If the owner or group of the grid user was changed, then even if ASM is available, crsd.log will likely show the following:

If the oracle binary in GRID_HOME has wrong ownership or permissions (regardless of whether ASM is up and running), or if the grid user cannot write to ORACLE_BASE, crsd.log will likely show the following:

ORA-12547: TNS:lost contact

PROC-26

The expected ownership and permission of oracle binary in GRID_HOME should be:

If OCR or mirror is unavailable (could be ASM is up, but diskgroup for OCR/mirror is unmounted), likely crsd.log will show following:

3. crsd.bin pid file exists and points to running crsd.bin process

If the pid file does not exist, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have messages similar to the following:

The solution is to create a dummy pid file ($GRID_HOME/crs/init/$HOST.pid) manually as grid user with the "touch" command and restart resource ora.crsd.

If the pid file does exist and the PID in this file references a running process which is NOT the crsd.bin process, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have messages similar to the following:

To verify on OS level:

ls -l /ocw/grid/crs/init/*pid

cat /ocw/grid/crs/init/*pid

ps -ef| grep 1535

Note: process 1535 is not crsd.bin

The solution is to create an empty pid file and to restart the resource ora.crsd, as root:
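A minimal sketch of the check behind this fix follows. The function name is hypothetical; it classifies a pid file using the POSIX `ps -p <pid> -o comm=` form and changes nothing itself.

```shell
#!/bin/sh
# Hypothetical helper: classify a crsd pid file.
#   $1 = pid file, e.g. $GRID_HOME/crs/init/<host>.pid
# Prints one of: missing, empty, stale, crsd.bin, other:<command>
crsd_pidfile_state() {
    pidfile=$1
    [ -f "$pidfile" ] || { echo missing; return; }
    pid=$(tr -d '[:space:]' < "$pidfile")
    [ -n "$pid" ] || { echo empty; return; }
    # comm= prints only the command name; the result is empty when no
    # process with that PID exists.
    comm=$(ps -p "$pid" -o comm= 2>/dev/null || true)
    case "$comm" in
        "")         echo stale ;;
        *crsd.bin*) echo crsd.bin ;;
        *)          echo "other:$comm" ;;
    esac
}
```

When the result is `other:<command>`, the fix described above would typically be, as root: truncate the file with `> $GRID_HOME/crs/init/<host>.pid`, then run `$GRID_HOME/bin/crsctl stop res ora.crsd -init` followed by `$GRID_HOME/bin/crsctl start res ora.crsd -init`. Treat this command sequence as an assumption and confirm it for your version.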

4. Network is functional and name resolution is working:

If the network is not fully functioning, ocssd.bin may still come up, but crsd.bin may fail and the crsd.log will show messages like:

Or:

To validate the network, please refer to note 1054902.1

5. The crsd executables (crsd.bin and crsd in GRID_HOME/bin) have correct ownership/permission and haven't been manually modified. A simple way to check is to compare the output of "ls -l <grid-home>/bin/crsd <grid-home>/bin/crsd.bin" with a "good" node.

6. crsd may not start due to the following:

note 1552472.1 – CRSD Will Not Start Following a Node Reboot: crsd.log reports: clsclisten: op 65 failed and/or Unable to get E2E port

note 1684332.1 – GI crsd Fails to Start: clsclisten: op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)

7. To troubleshoot further, refer to note 1323698.1 – Troubleshooting CRSD Start up Issue

Case 5: GPNPD.BIN does not start

1. Name Resolution is not working

gpnpd.bin fails with following error in gpnpd.log:

In the above example, please make sure the current node is able to ping "node2" and that there is no firewall between them.

2. Bug 10105195

Due to Bug 10105195, gpnp dispatch is single threaded and could be blocked by network scanning etc. The bug is fixed in 11.2.0.2 GI PSU2, 11.2.0.3 and above; refer to note 10105195.8 for more details.

Case 6: Various other daemons do not start

Common causes:

1. Log file or directory for the daemon doesn't have appropriate ownership or permission

If the log file or log directory for the daemon doesn't have proper ownership or permissions, usually there is no new info in the log file and the timestamp remains the same while the daemon tries to come up.

Refer to below section "Log File Location, Ownership and Permission" for general reference.

2. Network socket file doesn't have appropriate ownership or permission

In this case, the daemon log will show messages like:

clsclisten: Permission denied

3. OLR is corrupted

In this case, the daemon log will show messages like (this is a case that ora.ctssd fails to start):

Invalid active version [] retrieved from OLR

Error [19] retrieving active version

CTSS daemon aborting

The solution is to restore a good copy of the OLR; refer to note 1193643.1

Other cases:

note 1087521.1 – CTSS Daemon Aborting With "op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)"
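The "timestamp remains the same" symptom from common cause 1 can be checked with a marker file. The sketch below uses a hypothetical function name and the POSIX `find ... ! -newer` test, so it should also work on platforms whose `find` lacks `-mmin`.

```shell
#!/bin/sh
# Hypothetical helper: list *.log files under a directory that were NOT
# modified after a reference (marker) file.
#   $1 = log directory, $2 = marker file created before the start attempt
logs_not_updated_since() {
    find "$1" -type f -name '*.log' ! -newer "$2" | sort
}
```

A possible sequence: `touch /tmp/start.marker`, attempt to start the daemon, wait a few minutes, then run `logs_not_updated_since $GRID_HOME/log/<node> /tmp/start.marker`. Logs of daemons that were not expected to start will also be listed, so only the entry for the daemon under investigation is meaningful.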

Case 7: CRSD Agents do not start

CRSD.BIN will spawn two agents to start up user resources – the two agents share the same name and binary as the ohasd.bin agents:

orarootagent: responsible for ora.net_n_.network, ora._nodename_.vip, ora.scan_n_.vip and ora.gns

oraagent: responsible for ora.asm, ora.eons, ora.ons, listener, SCAN listener, diskgroup, database, service resource etc

To find out the user resource status:

If crsd.bin cannot start any of the above agents properly, user resources may not come up.

A common cause of agent failure is that the log file or log directory for the agents doesn't have proper ownership or permissions.

Refer to below section "Log File Location, Ownership and Permission" for general reference.

Agent may fail to start due to bug 11834289 with error "CRS-5802: Unable to start the agent process", refer to Section "OHASD does not start" #10 for details.

Case 8: HAIP does not start

HAIP may fail to start with various errors, e.g.

Refer to note 1210883.1 for more details of HAIP


Network and Naming Resolution Verification

CRS depends on a fully functional network and name resolution. If the network or name resolution is not fully functioning, CRS may not come up successfully.

To validate network and name resolution setup, please refer to note 1054902.1

Log File Location, Ownership and Permission

Appropriate ownership and permission of sub-directories and files in $GRID_HOME/log is critical for CRS components to come up properly.

In Grid Infrastructure cluster environment:

Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and two separate RDBMS owners rdbmsap and rdbmsar, here's what it looks like under $GRID_HOME/log in cluster environment:

Please note that most log files in a sub-directory inherit the ownership of the parent directory; the above is just a general reference to tell whether there are unexpected recursive ownership and permission changes inside the CRS home. If you have a working node with the same version, the working node should be used as a reference.

In Oracle Restart environment:

And here's what it looks like under $GRID_HOME/log in Oracle Restart environment:

For 12.1.0.2 onward, refer to note 1915729.1 – Oracle Clusterware Diagnostic and Alert Log Moved to ADR

Network Socket File Location, Ownership and Permission

Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle

When a socket file has unexpected ownership or permissions, the daemon log file (e.g. evmd.log) will usually have the following:

And the following error may be reported:

The solution is to stop GI as root (crsctl stop crs -f), clean up socket files and restart GI.
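The stop, clean up and restart sequence can be laid out as a dry run before anything is removed. This is a minimal sketch with a hypothetical function name; it only prints the steps. The default directories are the three locations named above, `crsctl start crs` is assumed as the restart command, and it is assumed that no other Oracle process on the host is still using the socket files.

```shell
#!/bin/sh
# Hypothetical helper: print the steps for clearing clusterware socket
# files. Nothing is executed.
#   $1 = GRID_HOME; remaining args = socket directories to consider
socket_cleanup_plan() {
    grid_home=$1
    shift
    # Fall back to the locations named in this note.
    [ "$#" -gt 0 ] || set -- /tmp/.oracle /var/tmp/.oracle /usr/tmp/.oracle
    echo "$grid_home/bin/crsctl stop crs -f"
    for dir in "$@"; do
        # Only directories that exist on this host are listed.
        if [ -d "$dir" ]; then
            echo "rm -f $dir/*"
        fi
    done
    echo "$grid_home/bin/crsctl start crs"
}
```

For example, `socket_cleanup_plan "$GRID_HOME"` prints the plan for review; the printed commands would then be run one at a time as root.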

Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and clustername eotcs

In Grid Infrastructure cluster environment:

Below is an example output from cluster environment:

In Oracle Restart environment:

And below is an example output from Oracle Restart environment:

Diagnostic file collection

If the issue can't be identified with this note, please run $GRID_HOME/bin/diagcollection.sh as root on all nodes, and upload all .gz files it generates in the current directory.

References

NOTE:1564555.1 – 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1 CSSD Fails to Start if Multicast Fails on Private Network

NOTE:1068835.1 – What to Do if 11gR2 Grid Infrastructure is Unhealthy

NOTE:1323698.1 – Troubleshooting CRSD Start up Issue

BUG:10105195 – PROC-32 ACCESSING OCR; CRS DOES NOT COME UP ON NODE

NOTE:1325718.1 – OHASD not Starting After Reboot on SLES

NOTE:1427234.1 – autorun file for ohasd is missing

NOTE:1077094.1 – How to fix the "DiscoveryString" in profile.xml or "asm_diskstring" in ASM if set wrongly

NOTE:1915729.1 – 12.1.0.2 Oracle Clusterware Diagnostic and Alert Log Moved to ADR

NOTE:1054902.1 – How to Validate Network and Name Resolution Setup for the Clusterware and RAC

NOTE:1069182.1 – OHASD Failed to Start: Inappropriate ioctl for device

NOTE:942166.1 – How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation

NOTE:10105195.8 – Bug 10105195 – Clusterware fails to start after reboot due to gpnpd fails to start

NOTE:1053147.1 – 11gR2 Clusterware and Grid Home – What You Need to Know

NOTE:1053970.1 – Troubleshooting 11.2 or 12.1 Grid Infrastructure root.sh Issues

BUG:11834289 – OHASD FAILED TO START TIMELY

NOTE:969254.1 – How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure on Linux/Unix