Data Collection for Troubleshooting Oracle Clusterware

In this Document

	Purpose

	File Formats for Data Uploaded to Oracle Support

	Troubleshooting Steps

	1. Data Gathering for All Oracle Clusterware Issues

	2. Data Gathering for Node Reboot/Eviction

	3. Data Gathering for All Real Application Cluster Issues

	4. Data Gathering for Real Application Cluster Performance/Hang Issues

	5. Data Gathering for Oracle Clusterware Installation Issues

	5.1. Failure before executing root script:

	5.2. Failure while or after executing root script

	Appendix A. RDA

	Appendix B. OS logs

	Appendix C. systemstate and hanganalyze in RAC

	References

Applies to:

Oracle Database – Enterprise Edition – Version 10.1.0.2 and later
Oracle Database Exadata Cloud Machine – Version N/A and later
Oracle Cloud Infrastructure – Database Service – Version N/A and later
Oracle Database Cloud Exadata Service – Version N/A and later
Oracle Database Exadata Express Cloud Service – Version N/A and later
Information in this document applies to any platform.

Purpose

This note will become obsolete in the future. It is strongly recommended to use TFA instead to prune and collect files from all nodes:

Reference: note 1513912.1 TFA Collector – Tool for Enhanced Diagnostic Gathering

TFA Collector is installed in the GI HOME and comes with GI 11.2.0.4 and higher. For GI 11.2.0.3 or lower, install TFA Collector by following the download and installation instructions in note 1513912.1.

$GI_HOME/tfa/bin/tfactl diagcollect -from "MMM/dd/yyyy hh:mm:ss" -to "MMM/dd/yyyy hh:mm:ss"

Format example: "Jul/1/2014 21:00:00". Set the "from" time to 4 hours before the time of the error and the "to" time to 4 hours after it.
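The 4-hour window can be computed rather than worked out by hand. The following is a minimal sketch, assuming bash and GNU date (for the -d option); tfa_window is a hypothetical helper name and is not part of TFA. It prints a zero-padded day to follow the "dd" in the format string above, whereas the format example shows an unpadded day, so adjust the output format if your tfactl version expects otherwise.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the -from and -to values for tfactl diagcollect,
# 4 hours either side of the error time. Assumes GNU date.
tfa_window() {
    local error_time="$1"          # e.g. "2014-07-01 21:00:00" (local time)
    local margin=$((4 * 3600))     # 4 hours, in seconds
    local epoch
    # Convert to epoch seconds first so the arithmetic is unambiguous.
    epoch=$(date -d "$error_time" +%s) || return 1
    # LC_ALL=C keeps the month abbreviation in English (Jul, Aug, ...).
    LC_ALL=C date -d "@$((epoch - margin))" "+%b/%d/%Y %H:%M:%S"
    LC_ALL=C date -d "@$((epoch + margin))" "+%b/%d/%Y %H:%M:%S"
}
```

Calling tfa_window "2014-07-01 21:00:00" prints two lines, the first for -from and the second for -to, which can then be quoted into the tfactl diagcollect command shown above.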

This note lists what to collect for different types of Oracle Clusterware and Real Application Cluster issues. Uploading all of the files is not mandatory to open an SR, but resolution will be faster if all relevant information is uploaded.

File Formats for Data Uploaded to Oracle Support

Oracle Support requests that you upload compressed files grouped together by node and labeled as such in a standard format, such as .tar, .gz, .Z or .zip.

Output from older runs of diagcollection, or any other old files (for example, if diagcollection was run a few days or weeks back), may not contain current log information, which can delay the resolution.
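One way to keep uploads grouped and labeled by node is to put the node name and a timestamp into the archive name. This is only a sketch under assumptions: pack_node_files and its label argument are hypothetical names, and the archive naming scheme is illustrative rather than an Oracle Support requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: archive one directory into
# <node>_<label>_<timestamp>.tar.gz in the current directory.
pack_node_files() {
    local src_dir="$1" label="$2"
    local node archive
    node=$(uname -n)                                   # node (host) name
    archive="${node}_${label}_$(date +%Y%m%d%H%M%S).tar"
    # -C keeps only the last path component inside the archive.
    tar cf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" \
        && gzip "$archive" || return 1
    echo "${archive}.gz"                               # name of the file to upload
}
```

Run it on each node so that every archive carries the name of the node it came from, and the timestamp shows how fresh the collection is.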

Troubleshooting Steps

1. Data Gathering for All Oracle Clusterware Issues

Provide current diagcollection output from all nodes in the cluster.

Note 330358.1 – CRS 10gR2/ 11gR1/ 11gR2 Diagnostic Collection Guide
Note 272332.1 – CRS 10gR1 Diagnostic Collection Guide

2. Data Gathering for Node Reboot/Eviction

Provide files in Section "Data Gathering for All Oracle Clusterware Issues" and the following:

Approximate date and time of the reboot, and the hostname of the rebooted node
OSWatcher archives which cover the reboot time at an interval of 20 seconds with private network monitoring configured.

Note 301137.1 – OS Watcher User Guide
Note 433472.1 – OS Watcher For Windows (OSWFW) User Guide

For pre-11.2, zip of /var/opt/oracle/oprocd/* or /etc/oracle/oprocd/*
For pre-11.2, OS logs – refer to Section Appendix B
For 11gR2+, zip of /etc/oracle/lastgasp/* or /var/opt/oracle/lastgasp/*
CHM/OS data that covers the reboot time, for platforms where it is available; refer to Note 1328466.1, section "How do I collect the Cluster Health Monitor data"
If vendor clusterware is being used, upload the vendor clusterware logs

3. Data Gathering for All Real Application Cluster Issues

From all nodes:

Provide instance alert_{$ORACLE_SID}.log, lmon, lmd*, lms*, ckpt, lgwr, lck*, dia*, lmhb (11g only), and all other traces that were modified around the incident time. A quick way to identify all such traces and tar them up is to search on the incident time, as in the following example:

$ grep "2010-09-02 03" *.trc | awk -F: '{print $1}' | sort -u |xargs tar cvf trace.`hostname`.`date +%Y%m%d%H%M%S`.tar

$ gzip trace*.tar

For pre-11g, execute the command in bdump and udump to identify the list of files.

For 11g+, execute the command in ${ORACLE_BASE}/diag/rdbms/$DBNAME/${ORACLE_SID}/trace to identify the list of files

Incident files/packages referenced in alert.log at the time of the incident
If ASM is involved, provide the same set of files for ASM
OS logs – refer to Appendix B
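The trace collection command above can be wrapped so that the trace directory and the incident timestamp are parameters. The following is a sketch, not Oracle's script: collect_traces is a hypothetical name, it uses grep -l in place of the awk and sort steps (grep -l prints each matching file name once), and it assumes trace file names contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tar and gzip every *.trc file in a trace directory
# that contains the given timestamp prefix, e.g. "2010-09-02 03".
collect_traces() {
    local trace_dir="$1" stamp="$2"
    local out="trace.$(uname -n).$(date +%Y%m%d%H%M%S).tar"
    (
        cd "$trace_dir" || exit 1
        # tar fails (and so does this function) if no file matches.
        grep -l -- "$stamp" *.trc | xargs tar cf "$out" && gzip "$out"
    ) || return 1
    echo "$trace_dir/$out.gz"      # path of the archive to upload
}
```

For pre-11g, call it once for bdump and once for udump; for 11g+, pass the trace directory given above. If ASM is involved, repeat it for the ASM trace directory.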

4. Data Gathering for Real Application Cluster Performance/Hang Issues

Provide files in Section "Data Gathering for All Real Application Cluster Issues" and the following:

systemstate and hanganalyze – refer to Appendix C
AWR, ADDM and ASH reports, each covering a period of no more than 60 minutes
OSWatcher archives which cover the hang time

Note 301137.1 – OS Watcher User Guide
Note 433472.1 – OS Watcher For Windows (OSWFW) User Guide

CHM/OS data that covers the hang time, for platforms where it is available; refer to Note 1328466.1, section "How do I collect the Cluster Health Monitor data"

5. Data Gathering for Oracle Clusterware Installation Issues

5.1. Failure before executing root script:

For 11gR2: note 1056322.1 – Troubleshoot 11gR2 Grid Infrastructure/RAC Database runInstaller Issues

For pre-11.2: note 406231.1 – Diagnosing RAC/RDBMS Installation Problems

5.2. Failure while or after executing root script

Provide files in Section "Data Gathering for All Oracle Clusterware Issues" and the following:

root script (root.sh or rootupgrade.sh) screen output
For 11gR2: provide zip of <$ORACLE_BASE>/cfgtoollogs and <$ORACLE_BASE>/diag for grid user.
For pre-11.2: Note 240001.1 – Troubleshooting 10g or 11.1 Oracle Clusterware Root.sh Problems

Appendix A. RDA

It's recommended to provide the latest RDA output for all issues, from all nodes in the cluster.

Note 314422.1 – Remote Diagnostics Agent (RDA)

Appendix B. OS logs

OS logs are found in the following locations, depending on platform:

Linux: /var/log/messages

AIX: /bin/errpt -a (redirect this to a file called messages.out)

Solaris: /var/adm/messages

HP-UX: /var/adm/syslog/syslog.log

Tru64: /var/adm/messages

Windows: save Application Log and System Log as .TXT files using Event Viewer

Note: From 11gR2, OS logs are part of diagcollection on Linux, Solaris, HP-UX.
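When the same collection script runs on mixed platforms, the list above can be expressed as a lookup on the operating system name. This is a sketch only: os_log_source is a hypothetical name, and the uname -s values (SunOS for Solaris, OSF1 for Tru64) come from general knowledge rather than from this note. Windows is not covered because its logs are saved through Event Viewer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the OS log location (or, for AIX, the command
# to run) for a platform. Defaults to the current platform's uname -s.
os_log_source() {
    local os="${1:-$(uname -s)}"
    case "$os" in
        Linux)  echo "/var/log/messages" ;;
        AIX)    echo "/bin/errpt -a > messages.out" ;;   # a command, not a file
        SunOS)  echo "/var/adm/messages" ;;              # Solaris
        HP-UX)  echo "/var/adm/syslog/syslog.log" ;;
        OSF1)   echo "/var/adm/messages" ;;              # Tru64
        *)      echo "unknown platform: $os" >&2; return 1 ;;
    esac
}
```

Calling os_log_source with no argument reports the location for the node it runs on; passing a name such as HP-UX looks up another platform.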

Appendix C. systemstate and hanganalyze in RAC

To collect hanganalyze and systemstate in RAC, execute the following on one instance to generate cluster wide dumps:

a – Connect to sqlplus as sysdba: "sqlplus / as sysdba"; if this does not work, use "sqlplus -prelim / as sysdba"

b – Execute the following commands:

For 11g+

SQL> oradebug setospid <ospid of diag process>
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3

..Wait about 2 minutes

SQL> oradebug -g all hanganalyze 3
SQL> oradebug -g all dump systemstate 258

If possible, take another one at level 266 instead of 258

If the SGA is large, or the fix for bug 11800959 (fixed in 11.2.0.2 DB PSU5, 11.2.0.3 and above) is not applied, level 266 could take a very long time, generate a huge trace file, and may not finish for hours.
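The 11g+ sequence above can be generated as a script so that the two hanganalyze dumps are taken a fixed interval apart. This is a minimal sketch under assumptions: hang_dump_script is a hypothetical name, and the wait is done with the SQL*Plus HOST command calling sleep, which should be verified in your environment, in particular with a -prelim connection. The function only prints the commands; it does not connect to the database.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 11g+ cluster-wide hanganalyze/systemstate
# command sequence for the given diag process OS pid.
hang_dump_script() {
    local diag_ospid="$1"
    local wait_secs="${2:-120}"    # "about 2 minutes" between hanganalyze dumps
    cat <<EOF
oradebug setospid ${diag_ospid}
oradebug unlimit
oradebug -g all hanganalyze 3
host sleep ${wait_secs}
oradebug -g all hanganalyze 3
oradebug -g all dump systemstate 258
exit
EOF
}
```

The output could be reviewed and then piped into "sqlplus / as sysdba" on one instance, with the first argument being the OS pid of that instance's diag process.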

For 10g

SQL> oradebug setospid <ospid of diag process>
SQL> oradebug unlimit
SQL> oradebug -g all dump systemstate 266

..Wait about 2 minutes

SQL> oradebug -g all dump systemstate 266

Please upload *diag* trace from either bdump or trace directory.

If the diag trace is huge or the "oradebug -g all …" command is hanging, please collect a system state dump from each instance individually at a similar time:

SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug hanganalyze 3

..Wait about 2 minutes

SQL> oradebug hanganalyze 3
SQL> oradebug dump systemstate 258
SQL> oradebug tracefile_name

Please upload the trace file listed above.

If "sqlplus -prelim / as sysdba" does not work, refer to note 121779.1

If ASM is involved, collect hanganalyze and systemstate from ASM using the instructions above.

References

NOTE:736752.1 – Introducing Cluster Health Monitor (IPD/OS)
NOTE:314422.1 – Remote Diagnostic Agent (RDA) – Getting Started
NOTE:330358.1 – Oracle Clusterware 10gR2/ 11gR1/ 11gR2/ 12.1.0.1 Diagnostic Collection Guide
NOTE:406231.1 – Diagnosing RAC/RDBMS Installation Problems
NOTE:272332.1 – CRS 10g Diagnostic Collection Guide
NOTE:433472.1 – OS Watcher For Windows (OSWFW) User Guide
NOTE:1328466.1 – Cluster Health Monitor (CHM) FAQ
NOTE:240001.1 – Troubleshooting 10g or 11.1 Oracle Clusterware Root.sh Problems
NOTE:942166.1 – How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation
NOTE:969254.1 – How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure on Linux/Unix
NOTE:301137.1 – OSWatcher (Includes: [Video])
NOTE:1056322.1 – Troubleshoot Grid Infrastructure/RAC Database installer/runInstaller Issues