CRS Daemon Fails to Start on Oracle 11g (AIX)

Once, after installing RAC for one of my clients, I rebooted the server and found that the CRS daemon would not start. I wondered what the problem was, since everything had been fine before the reboot.

The error message was:

2010-11-29 11:13:49.817: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203

Below is the explanation of the error message.

In this Document

Symptoms

Changes

Cause

1) AIX-specific cause

2) UNIX-generic cause

Solution

1) Solution for AIX-specific cause

2) Solution for Unix-generic cause

Applies to:

Oracle Server – Enterprise Edition – Version: 11.2.0.2 and later [Release: 11.2 and later ]

Information in this document applies to any platform.

Symptoms

11.2.0.2 grid infrastructure upgrade or install on >1 node cluster

rootcrs.pl fails and the following is found in crsd.log:

…

2010-11-29 10:52:38.603: [GIPCHALO][2314] gipchaLowerProcessNode: no valid interfaces found to node for 2614824036 ms, node 111ea99b0 { host 'racdb1', haName '1e0b-174e-37bc-a515', srcLuid 2612fa8e-3db4fcb7, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [55 : 55], createTime 2614768983, flags 0x4 }

2010-11-29 10:52:42.299: [ CRSMAIN][515] Policy Engine is not initialized yet!

2010-11-29 10:52:43.554: [ OCRMAS][3342]proath_connect_master:1: could not yet connect to master retval1 = 203, retval2 = 203

2010-11-29 10:52:43.554: [ OCRMAS][3342]th_master:110': Could not yet connect to new master [1]

2010-11-29 10:52:43.605: [GIPCHALO][2314] gipchaLowerProcessNode: no valid interfaces found to node for 2614829038 ms, node 111ea99b0 { host 'racdb1', haName '1e0b-174e-37bc-a515', srcLuid 2612fa8e-3db4fcb7, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [60 : 60], createTime 2614768983, flags 0x4 }

2010-11-29 10:52:43.754: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203

2010-11-29 10:52:43.955: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203

…

2010-11-29 11:13:49.817: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203

2010-11-29 11:13:50.018: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203

…

evmd.log shows:

2010-11-29 10:52:38.694: [ GIPCNET][2314] gipcmodNetworkProcessSend: slos op : sgipcnUdpSend

2010-11-29 10:52:38.694: [ GIPCNET][2314] gipcmodNetworkProcessSend: slos dep : Message too long (59)

2010-11-29 10:52:38.694: [ GIPCNET][2314] gipcmodNetworkProcessSend: slos loc : sendto

Changes

Upgrade or install of 11.2.0.2 grid infrastructure on >1 node cluster

Cause

Two causes have been found for this symptom. One is AIX-specific and the other is Unix-generic.

1) AIX-specific cause

udp_sendspace is left at its default of 9216 bytes, which is smaller than the 10240 bytes used by CRS.

# no -o udp_sendspace

shows the current setting.
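The check above can be scripted. The following is a minimal sketch, assuming that `no -o udp_sendspace` prints a single line of the form `udp_sendspace = 9216`. The function name `udp_sendspace_ok` is hypothetical, and the 10240 threshold is the figure quoted in this note.

```shell
#!/bin/sh
# Minimum send buffer in bytes, taken from the note above (size used by CRS).
MIN_UDP_SENDSPACE=10240

# udp_sendspace_ok LINE
# LINE is one line of `no -o udp_sendspace` output, assumed to look like
# "udp_sendspace = 9216". Returns 0 when the value is large enough,
# 1 when it is too small, 2 when no number could be parsed.
udp_sendspace_ok() {
    value=$(printf '%s\n' "$1" | awk -F'=' '{ gsub(/[ \t]/, "", $2); print $2 }')
    case "$value" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$value" -ge "$MIN_UDP_SENDSPACE" ]
}

# Only query the live setting where the AIX `no` command exists.
if command -v no >/dev/null 2>&1; then
    current=$(no -o udp_sendspace)
    if udp_sendspace_ok "$current"; then
        echo "OK: $current"
    else
        echo "Too small or unreadable: $current (need >= $MIN_UDP_SENDSPACE)"
    fi
fi
```

Run it as root on each node. Any node that reports a value below 10240 is a candidate for the AIX-specific cause.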

2) UNIX-generic cause

Netmask mismatch between the nodes. The private interface must have the same netmask on all nodes; a mismatch between nodes can cause this symptom.

Solution

The two causes have two separate solutions.

1) Solution for AIX-specific cause

Increase udp_sendspace to >= 10240.

# no -o udp_sendspace=65536

Note that the 11gR2 documentation instructs you to set udp_sendspace to 65536:

Network tuning parameter	Recommended value
ipqmaxlen	512
rfc1323	1
sb_max	4194304
tcp_recvspace	65536
tcp_sendspace	65536
udp_recvspace	655360
udp_sendspace	65536

For more details, see Oracle Grid Infrastructure Installation Guide 11g Release 2 (11.2) for IBM AIX on POWER Systems (64-Bit), section 2.11.7 "Configuring Network Tuning Parameters":

http://download.oracle.com/docs/cd/E11882_01/install.112/e17210/preaix.htm#CWAIX219

If the problem happens during rootupgrade.sh (usually on the 2nd node), do the following:

1). Increase udp_sendspace to 65536:

# no -o udp_sendspace=65536

2). Stop CRS on both nodes:

# crsctl stop crs -f

# ps -ef | grep d.bin     (to ensure that no CRS processes are left over)

3). Restart CRS on node 1:

# crsctl start crs

Wait until CRS has started on node 1.

4). On node 2, rerun rootupgrade.sh

# rootupgrade.sh

It should complete on node 2 this time.
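Steps 2 and 3 involve two manual checks: that no CRS process is left over, and that CRS has come up on node 1. Below is a rough sketch of both. The helper names are hypothetical, and it assumes that `crsctl check crs` prints a line containing `CRS-4537` once Cluster Ready Services is online.

```shell
#!/bin/sh
# crs_leftovers: reads `ps -ef` output on stdin and prints the lines that
# contain "d.bin". The [d] bracket keeps a grep process in the listing
# from matching its own pattern.
crs_leftovers() {
    grep '[d]\.bin'
}

# wait_until ATTEMPTS DELAY COMMAND...
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 as soon as COMMAND succeeds, 1 if it never does.
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# crs_online: assumed check, looks for the CRS-4537 "online" message.
crs_online() {
    crsctl check crs 2>/dev/null | grep -q 'CRS-4537'
}

# Example use on node 1 (commented out so the sketch has no side effects):
#   ps -ef | crs_leftovers && echo "CRS processes still running"
#   crsctl start crs
#   wait_until 60 10 crs_online && echo "CRS is up on node 1"
```

The attempt count and delay are arbitrary; pick values that match how long CRS normally takes to start in your environment.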

Please note that on any platform, if udp_sendspace (or a similar setting) is < 10240, this problem will occur.

2) Solution for Unix-generic cause

Check that the netmask on the private interface matches on all nodes.

[grid@mynode1 ~]$ ifconfig eth1

eth1 Link encap:Ethernet HWaddr 00:19:B9:1E:6D:97

inet addr:192.168.1.110 Bcast:192.168.1.255 Mask:255.255.255.0

…

[grid@mynode2 ~]$ ifconfig eth1

eth1 Link encap:Ethernet HWaddr 00:19:B9:1E:6D:97

inet addr:192.168.1.111 Bcast:192.168.1.255 Mask:255.255.255.0

…

In case of a mismatch, the system administrator must correct the netmask on the private interface(s) where it is wrong.
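The comparison can also be done mechanically. This is a small sketch that assumes ifconfig output in the format shown above, with the netmask after `Mask:`. Other platforms print it differently, and the function names `extract_mask` and `masks_match` are hypothetical.

```shell
#!/bin/sh
# extract_mask: reads ifconfig output on stdin and prints the first value
# that follows "Mask:" (for example 255.255.255.0).
extract_mask() {
    sed -n 's/.*Mask:\([0-9.]*\).*/\1/p' | head -n 1
}

# masks_match OUTPUT1 OUTPUT2
# Returns 0 only when both ifconfig outputs contain a netmask and the
# two netmasks are identical.
masks_match() {
    m1=$(printf '%s\n' "$1" | extract_mask)
    m2=$(printf '%s\n' "$2" | extract_mask)
    [ -n "$m1" ] && [ "$m1" = "$m2" ]
}

# Example use (assumes ssh access from mynode1 to mynode2):
#   masks_match "$(ifconfig eth1)" "$(ssh mynode2 ifconfig eth1)" \
#       || echo "Netmask mismatch on eth1"
```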

After a long search for the solution, I realized that all of the parameters had reverted after the server was rebooted, so I asked the sysadmin to make the change persistent for every parameter used by the cluster. 🙂 The parameters are listed below:

Network tuning parameter	Recommended value
ipqmaxlen	512
rfc1323	1
sb_max	4194304
tcp_recvspace	65536
tcp_sendspace	65536
udp_recvspace	655360
udp_sendspace	65536
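A plain `no -o name=value` did not survive the reboot here. The sketch below prints commands that should make the values persistent. It rests on two assumptions that are not stated in the note: that the `-p` flag of `no` applies a value both immediately and at the next boot, and that ipqmaxlen is a reboot-only tunable that needs `-r` instead. The script only prints the commands, so they can be reviewed before being run as root.

```shell
#!/bin/sh
# persist_commands: prints one `no` command per parameter in the table above.
# Assumption: -p sets the current and next-boot value, -r the next-boot
# value only; ipqmaxlen is treated as a reboot-only tunable.
persist_commands() {
    while read -r name value; do
        [ -n "$name" ] || continue
        case "$name" in
            ipqmaxlen) flag=-r ;;
            *)         flag=-p ;;
        esac
        echo "no $flag -o $name=$value"
    done <<'EOF'
ipqmaxlen 512
rfc1323 1
sb_max 4194304
tcp_recvspace 65536
tcp_sendspace 65536
udp_recvspace 655360
udp_sendspace 65536
EOF
}

persist_commands
```

After the next reboot, `no -o udp_sendspace` should still report 65536 if the change was made persistent.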