MISSCOUNT DEFINITION AND DEFAULT VALUES
The CSS misscount parameter represents the maximum time, in seconds, that a network heartbeat can be missed before the cluster enters a reconfiguration to evict the node. The following are the default values (in seconds) for the misscount parameter by operating system and Oracle Clusterware version:
OS      | 10g (R1 & R2) | 11g
Linux   | 60            | 30
Unix    | 30            | 30
VMS     | 30            | 30
Windows | 30            | 30
The table below explains the conditions under which an eviction (reboot) will occur:
Network Ping                       | Disk Ping                                                            | Reboot
Completes within misscount seconds | Completes within misscount seconds                                   | N
Completes within misscount seconds | Takes more than misscount seconds but less than disktimeout seconds | N
Completes within misscount seconds | Takes more than disktimeout seconds                                  | Y
Takes more than misscount seconds  | Completes within misscount seconds                                   | Y
* By default Misscount is less than Disktimeout seconds
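To make the eviction matrix concrete, here is a minimal shell sketch of the decision it describes; it is purely illustrative, not Oracle's actual logic, and it uses the 11g default values from the tables above.
MISSCOUNT=30       #### 11g default network heartbeat tolerance (seconds)
DISKTIMEOUT=200    #### default disk heartbeat tolerance in normal operation (seconds)
will_reboot() {
  net_secs=$1      #### seconds taken by the last network heartbeat
  disk_secs=$2     #### seconds taken by the last disk (voting file) heartbeat
  if [ "$net_secs" -gt "$MISSCOUNT" ] || [ "$disk_secs" -gt "$DISKTIMEOUT" ]; then
    echo "Y"       #### node is evicted (rebooted)
  else
    echo "N"       #### no reboot: a disk ping slower than misscount but within disktimeout is tolerated
  fi
}
will_reboot 5 250    #### Y - disk ping took longer than disktimeout
will_reboot 5 100    #### N - disk ping slower than misscount but within disktimeout
will_reboot 40 5     #### Y - network ping took longer than misscount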
CONSIDERATIONS WHEN CHANGING MISSCOUNT FROM THE DEFAULT VALUE
- Customers drive SLAs and cluster availability. The customer ultimately defines the service levels and availability required of the cluster. Before recommending any change to misscount, the full impact of that change should be described and the impact on cluster availability measured.
- Customers may have timeout and retry logic in their applications. Delaying reconfiguration may cause 'artificial' timeouts of the application, reconnect failures and subsequent logon storms.
- Misscount timeout values are version dependent and are subject to change. As we have seen, misscount calculations are variable between releases and between versions within a release. Creating a false dependency on misscount calculation in one version may not be appropriate for later versions.
- Internal I/O timeout interval (DTO) algorithms may change in later releases. As stated above, there is a direct relationship between the internal I/O timeout interval and misscount, and this relationship is subject to change in later releases.
- An increase in misscount to compensate for I/O latencies directly affects reconfiguration times for network failures. The network heartbeat is the primary indicator of connectivity within the cluster. Misscount is the tolerance level of missed 'check ins' that triggers cluster reconfiguration. Increasing misscount will prolong the time to take corrective action in the event of a network failure or other anomalies affecting the availability of a node in the cluster. This directly affects cluster availability.
- Changing misscount to work around voting disk latencies is a temporary measure: the customer needs to document the change and set the parameter back to the default once the underlying storage I/O latency is resolved.
- Do not change the default misscount values if you are running vendor clusterware along with Oracle Clusterware. Modifying misscount in this environment may cause clusterwide outages and potential corruptions.
- Changing the misscount parameter incurs a clusterwide outage in releases prior to 11g R2; the customer will need to schedule that outage to make the change (with 11g R2 the settings can be changed online, as shown below).
- Changing misscount should not be used to compensate for poor configurations or faulty hardware.
- Cluster and RDBMS availability are directly affected by high misscount settings.
- In stretched clusters with stretched storage systems, a site failure that takes out one storage array and N nodes puts the cluster into a reconfiguration state, at which point the ShortDiskTimeOut (SDTO) value becomes the internal I/O timeout for the voting files. Several cases are known with stretched clusters where the storage failover after a site failure cannot complete within the SDTO; if I/O to the voting files is blocked for longer than the SDTO, the result is node evictions on the surviving side.
The next cause of split-brain in a cluster can be voting disk access. From 11g R2 onwards, voting files can be placed on ASM disk groups, and the ASM instance does not need to be up for the cluster to access the voting files. It is recommended to keep an odd number of voting files, such as three or five, so that there is always a majority when a failure occurs. Just as you cannot tell the exact time by having two watches, a vote among an even number of participants can end in a hung jury, so an odd number of voting files should always be used.
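As a quick sanity check, the configured voting files can be listed from any node with crsctl; this is a minimal sketch, assuming $CRS_HOME points to your Grid Infrastructure home:
$CRS_HOME/bin/crsctl query css votedisk #### lists the voting files (and, from 11g R2, the ASM disk group holding them)
With three voting files a node must see at least two, and with five at least three; with an even count such as two, a single failure can leave no strict majority.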
Similar to network failures, voting disk failures are tracked using two timeouts: the long disk timeout (LDTO) and the short disk timeout (SDTO). The long disk timeout, 200 seconds, is used during normal cluster operation when the network heartbeats are good. The short disk timeout, 27 seconds by default, is used during cluster reconfiguration, such as when the cluster is forming or a node is leaving. Typically the cluster will evict a node by rebooting it when that node cannot communicate with the voting disks.
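Both timeouts can be read back from the CSS configuration. A minimal sketch, assuming $CRS_HOME points to your Grid Infrastructure home and the default values described in the MOS notes referenced at the end of this post:
$CRS_HOME/bin/crsctl get css misscount #### default 30 seconds
$CRS_HOME/bin/crsctl get css reboottime #### default 3 seconds
$CRS_HOME/bin/crsctl get css disktimeout #### default 200 seconds
#### LDTO = disktimeout = 200 seconds (normal operation)
#### SDTO = misscount - reboottime = 30 - 3 = 27 seconds (during reconfiguration)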
A node is unable to join the cluster if it cannot access a majority of the voting files, and a node must leave the cluster if it loses access to a majority of the voting files. Prior to 11g R2 the cluster would evict such a node by rebooting it (the node commits suicide), but starting with 11g R2 the cluster protects itself by performing reboot-less fencing of the node, which is explained next.
Reboot-less Node Fencing
Prior to 11g R2, during voting disk failures the node would be rebooted to protect the integrity of the cluster. But the problem is not necessarily just a communication issue: the node may be hanging, or an I/O operation may be hanging, so the decision to reboot can be the incorrect one. Starting with 11g R2, Oracle Clusterware can fence the node without rebooting it. This is a big achievement and a change in the way the cluster is designed.
The reason to avoid a reboot is that during a reboot, resources need to be re-mastered and the cluster must be re-formed among the remaining nodes. In a cluster with many nodes this can be a very expensive operation, so Oracle fences the node by killing the offending processes: the clusterware stack on that node is shut down, but the node itself is not. Once the I/O path or the network heartbeat is available again, the clusterware stack is restarted. The data is still protected, but without the pain of rebooting the node. In cases where a reboot is needed to protect the integrity of the cluster, the clusterware will still decide to reboot the node.
CPU Scheduling & Starvation
Ocssd.bin is responsible for both the disk heartbeat and the network heartbeat. But there are situations where certain processes chew up the CPU and starve the system, blocking the network and disk heartbeat pings. Two processes, cssdmonitor and cssdagent, are responsible for monitoring CPU scheduling, ocssd.bin hangs, and hardware or I/O path hangs. In prior versions, oprocd was responsible for checking CPU scheduling, and Oracle's recommendation was to set diagwait to 13 seconds so that there was enough time both to detect the hanging situation and to dump memory information useful for debugging why the node went down. The diagwait parameter no longer needs to be set explicitly, as Oracle controls cssdmonitor and cssdagent with undocumented values. As you might have guessed, if cssdmonitor and cssdagent are not running on a node (or have been killed) for more than 27 seconds (the SDTO, i.e. misscount minus reboottime), the node will be evicted.
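A simple process listing is enough to confirm that these daemons are alive on a node; a minimal sketch (Linux syntax, process names as in 11g R2):
ps -ef | egrep 'ocssd.bin|cssdagent|cssdmonitor' | grep -v grep #### all three should be running on every node
ps -eo pid,class,rtprio,comm | egrep 'cssdagent|cssdmonitor' #### on Linux these are expected to run in the real-time scheduling class so they can still be scheduled under CPU starvation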
Misscount Parameter Can Be Changed
With 11gR2, these settings can be changed online without taking any node down:
1) Execute crsctl as root to modify the misscount:
$CRS_HOME/bin/crsctl set css misscount <n> #### where <n> is the maximum private network latency in seconds
$CRS_HOME/bin/crsctl set css reboottime <r> [-force] #### (<r> is seconds)
$CRS_HOME/bin/crsctl set css disktimeout <d> [-force] #### (<d> is seconds)
2) Execute crsctl as root to confirm the change:
$CRS_HOME/bin/crsctl get css misscount
$CRS_HOME/bin/crsctl get css reboottime
$CRS_HOME/bin/crsctl get css disktimeout
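For example, to raise misscount and confirm the change (the value 45 is purely illustrative, not a recommendation; run as root and repeat the get on every node). In 11g R2, crsctl should also allow the default to be restored with unset, in line with the consideration above about reverting the change once the underlying issue is fixed:
$CRS_HOME/bin/crsctl set css misscount 45 #### illustrative value only
$CRS_HOME/bin/crsctl get css misscount #### confirm the new value
$CRS_HOME/bin/crsctl unset css misscount #### restore the default later (verify this option in your release)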
CSS Timeout Computation in Oracle Clusterware (Doc ID 294430.1)
Steps To Change CSS Misscount, Reboottime and Disktimeout (Doc ID 284752.1)