Node reboot is performed by CRS to maintain consistency in a cluster environment by removing a node that is facing a critical issue.
A critical problem could be a node not responding via the network heartbeat, a node not responding via the disk heartbeat, a hung server, or a hung ocssd.bin process.
Whenever a Database Administrator faces a node reboot issue, the first things to look at are /var/log/messages and the OS Watcher logs of the database node that was rebooted.
/var/log/messages will give you an actual picture of the reboot: the exact time of the restart, the status of resources such as swap and RAM, and so on.
- High Load on Database Server :
One common scenario is that, under high load, the RAM and swap space of the DB node are exhausted; the system stops responding and finally reboots.
So, every time you see a node eviction, start the investigation with /var/log/messages and analyze the OS Watcher logs.
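As a starting point for that investigation, here is a minimal Python sketch that filters syslog lines for signs of memory exhaustion. The marker strings are an assumption based on messages the Linux kernel commonly writes under memory pressure, and the sample lines (host name, PIDs, timestamps) are hypothetical, so adjust both to what your own /var/log/messages contains.

```python
# Substrings the Linux kernel commonly logs under memory pressure.
# This list is an assumption for illustration, not an exhaustive set.
MEMORY_PRESSURE_MARKERS = ("out of memory", "oom-killer", "page allocation failure")


def find_memory_pressure(lines):
    """Return the log lines that mention any memory-pressure marker."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in MEMORY_PRESSURE_MARKERS):
            hits.append(line.rstrip("\n"))
    return hits


# Hypothetical excerpt; real entries on your host will differ.
SAMPLE_LOG = [
    "Mar  3 02:14:07 dbnode1 kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0",
    "Mar  3 02:14:07 dbnode1 kernel: Out of memory: Kill process 4242 (oracle) score 912",
    "Mar  3 02:14:09 dbnode1 ntpd[1201]: synchronized to 10.0.0.1, stratum 3",
]

if __name__ == "__main__":
    for hit in find_memory_pressure(SAMPLE_LOG):
        print(hit)
```

To run it against the real log, pass an open file instead of the sample list, for example find_memory_pressure(open("/var/log/messages")). Reading that file usually requires root.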
How to avoid Node Reboot due to High Load?
The simplest and best way to avoid this is to use Oracle Database Resource Manager (DBRM). DBRM helps by giving the database more control over hardware resources and how they are allocated.
The DBA should set up resource consumer groups and a resource plan and use them as per requirements. On an Exadata system, the Exadata DBA can use IORM to set up resource allocation among multiple database instances.
- Voting Disk not Reachable :
Another reason for node reboot is that the clusterware is not able to access a minimum number of the voting files. When the node aborts for this reason, the node alert log will show a CRS-1606 error.
Here are a few general steps for the DBA to follow.
1. Use command "crsctl query css votedisk" on a node where clusterware is up to get a list of all the voting files.
2. Check that each node can access the devices underlying each voting file.
3. Check that the permissions on each voting file/disk have not been changed.
4. Check OS, SAN, and storage logs for any errors from the time of the incident.
5. Apply the fix for 13869978 if only one voting disk is in use. This is fixed in the 11.2.0.3.4 patch set and above, and in 11.2.0.4 and above.
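To make the "minimum number of voting files" condition concrete, here is a small Python sketch. It assumes the usual rule that a node must reach strictly more than half of the configured voting files to stay in the cluster; the function name is hypothetical and the counts would come from steps 1 and 2 above.

```python
def has_voting_majority(accessible: int, configured: int) -> bool:
    """True when a node can reach strictly more than half of the voting files."""
    if configured <= 0 or not 0 <= accessible <= configured:
        raise ValueError("need configured > 0 and 0 <= accessible <= configured")
    return accessible > configured // 2


if __name__ == "__main__":
    # With 3 voting files, losing one is survivable; losing two is not.
    for accessible in (3, 2, 1):
        print(accessible, "of 3 reachable:", has_voting_majority(accessible, 3))
```

Under this assumption, an even number of voting files adds no extra tolerance: with 2 configured, a node still needs both.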
- Missed Network Connection between Nodes :
In technical terms this is called a Missed Network Heartbeat (NHB). Whenever there is a communication gap or no communication between nodes on the private network (interconnect), due to a network outage or some other reason, a node aborts itself to avoid a "split brain" situation. The most common (but not exclusive) cause of a missed NHB is a network problem on the private interconnect.
Suggestions to troubleshoot a Missed Network Heartbeat.
1. Check OS statistics from the evicted node from the time of the eviction. The DBA can use OS Watcher to look at OS stats at the time of the issue; check oswnetstat and oswprvtnet for network-related issues.
2. Validate the interconnect network setup with the help of the network administrator.
3. Check communication over the private network.
4. Check that the OS network settings are correct by running the RACcheck tool.
- Database Or ASM Instance Hang :
Sometimes a database or ASM instance hang can cause a node reboot. In these cases the instance hangs and is terminated afterwards, which causes either a cluster reboot or a node eviction. The DBA should check the alert logs of the database and ASM instances for any hang situation that might have caused this issue.
In a few cases, bugs could be the reason for the node reboot; the bug may be at the database level, the ASM level, or the Real Application Clusters level. Here, after the initial investigation from the Database Administrator side, the DBA should open an SR with Oracle Support.
To ensure cluster and data integrity, unhealthy nodes should be forcefully evicted from the cluster.
A node eviction will be initiated if:
- A cluster member cannot communicate via the network heartbeat because of network disruption or latency
==> The misscount setting is the maximum network heartbeat latency, in seconds, tolerated from node to node over the interconnect
==> Check the setting using crsctl get css misscount
==> The default timeout is 30 seconds for Linux/Unix
==> You can change the setting: shut down CRS on all nodes but one, then run crsctl set css misscount <seconds> on the remaining node
- Slow interconnect or failures
- Corrupted network packets on the network may also cause CSS reboots on certain platforms
- Delayed or missed disk heartbeats by a cluster member to the majority of voting files
==> Check the setting in your cluster using crsctl get css disktimeout
==> The disktimeout setting is the maximum disk I/O latency, in seconds, tolerated from a node to the voting disks
==> The default value is 200 seconds (disk I/O)
- Server hang or CPU starvation
- Known Oracle Clusterware bugs
- Problems with core Oracle Clusterware processes (e.g.: OCSSD, css, cssdagent, cssdmonitor).
- No space left on the device for the GI or /var filesystem
- Sudden death or hang of CSSD processes
- ORAAGENT/ORAROOTAGENT excessive resource (CPU, Memory, Swap) consumption resulting in node eviction on specific OS platforms
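To illustrate how the two timeouts in the list above relate to an eviction, here is a simplified Python sketch. It only flags gaps between consecutive heartbeats that exceed a timeout; the real CSS logic is more involved, the function name is hypothetical, and the heartbeat timestamps are made-up sample data. The constants use the default values quoted above.

```python
MISSCOUNT_SECONDS = 30      # default network heartbeat timeout quoted above
DISKTIMEOUT_SECONDS = 200   # default disk heartbeat timeout quoted above


def exceeded_gaps(heartbeat_times, timeout_seconds):
    """Return (start, end, gap) for consecutive heartbeats further apart than the timeout."""
    gaps = []
    for earlier, later in zip(heartbeat_times, heartbeat_times[1:]):
        gap = later - earlier
        if gap > timeout_seconds:
            gaps.append((earlier, later, gap))
    return gaps


# Hypothetical heartbeat arrival times, in seconds since an arbitrary start.
network_heartbeats = [0, 1, 2, 3, 40, 41]
disk_heartbeats = [0, 1, 2, 150, 151]

if __name__ == "__main__":
    print(exceeded_gaps(network_heartbeats, MISSCOUNT_SECONDS))   # [(3, 40, 37)]
    print(exceeded_gaps(disk_heartbeats, DISKTIMEOUT_SECONDS))    # []
```

In the sample data, the 37-second network gap is longer than misscount and would be a candidate for eviction, while the 148-second disk gap stays under disktimeout.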