4. Altibase Heartbeat Process#
This chapter explains the Altibase Heartbeat process in details. On which criteria Altibase Heartbeat determines failure, and once failure is detected, how Failover is performed are examined.
aheartbeat Status#
aheartbeat defines the status of itself and other aheartbeats.
aheartbeat is in one of the following three statuses, depending on its execution state.
-
Ready: aheartbeat is not yet running
-
Run: aheartbeat has been executed and is in the state of running normally.
-
Error: The node is in a state of failure
aHeartbeat also defines aheartbeats on other nodes to be in one of the following three statuses.
- Ready: aheartbeat has not yet performed an initial handshake with the corresponding node's aheartbeat.
- Run: aheartbeat has successfully performed a handshake with the corresponding node's aheartbeat and is in the state of being connected normally.
- Error: aheartbeat cannot connect to the corresponding node's aheartbeat whose status was previously detected to be 'Run'.
The following figure shows how status transition occurs and the table lists the situations under which each status transition occurs.
[Figure 4-1] Status Transition
Each status transition occurs in the following table.
[Table 4-1] Status Transition
Status Transition | Description |
---|---|
(1) | After starting up, aheartbeat has successfully performed a handshake |
(2) | aheartbeat has terminated due to failure |
(3) | After failure, aheartbeat has restarted and successfully re-performed a handshake |
(4) | aheartbeat has terminated normally |
Determing Failure#
This section examines the criteria on which Altibase Heartbeat determines failure.
Altibase Heartbeat monitors in the following order and detects failure by connecting to three objects. The table below depicts the node (local or remote) which aheartbeat registers to have failed for each monitoring target object when connection fails.
[Table 4-2] Determing Failure
Order | Monitoring Target | Registered Node Failure |
---|---|---|
1 | aheartbeat 0 | Local node failure |
2 | Altibase database server on the local node | Local node failure |
3 | aheartbeat on the remote server | Remote node failure |
Local Node Failure#
When aheartbeats of each node detect failure at #1 or #2 in the above table, they register a local node failure.
aheartbeats of each node initially monitor aheartbeat 0 which resides in the external public network; when connection fails, they register a local node failure. In this case, aheartbeat determines that database services cannot be provided due to network failure between the Altibase server and clients.
If connection to aheartbeat 0 is normal or aheartbeat 0 is nonexistent, aheartbeats of each node monitor the Altibase database server on the local node; when connection fails, they register a database failure. In this case, aheartbeat determines that database services cannot be provided due to database failure.
If aheartbeat acknowledges that a failure has occurred in the above two monitoring targets, it executes the execution file to perform Failover to the local node and terminates itself.
Remote Node Failure#
If aheartbeats of each node detect failure at #3 in the above table, they register a remote node failure. That is, when connection to another node's aheartbeat which has been identified to be in the RUN status fails, they determine that the node has failed and execute the execution file for remote node Failover.
Note: When a local node fails, the aheartbeat of the node shuts itself down, so the aheartbeats of other nodes cannot access the aheartbeats of the node. Therefore, it is determined that a failure has occurred in the Altibase server of the remote node.
Role of aheartbeat 0#
aheartbeat 0's role and features in a distributed database environment are as follows.
-
aheartbeat 0 does not monitor the Altibase server on its node.
-
aheartbeats on nodes other than aheartbeat 0, can verify disconnection with the external network by connecting to aheartbeat 0. Disconnection with the external network indicates disconnection with the clients.
How failure detection results differ in relation to whether or not aheartbeat 0 exists in a distributed database environment is explained below.
When aheartbeat 0 is Nonexistent#
Let's suppose the network has failed in a distributed database environment where aheartbeats reside in the internal network.
[Figure 4-2] Network Failure Between an Internal and External Network
If a network failure occurs between Node A and the external network as shown above, Node A's aheartbeat is not capable of providing services to the client. However, it overlooks the network failure and continues to run. By doing so, Node B's aheartbeat also overlooks the failure that has occurred on Node A and does not perform Failover to Node A.
When aheartbeat 0 is Existent#
Next, let's suppose the network has failed in a distributed database environment where aheartbeat resides in the external network.
[Figure 4-3] Network Failure Between an Internal and External Network
If a network failure occurs between Node A and the external network as shown above, it is impossible for Node A's aheartbeat to connect to aheartbeat 0. Therefore, Node A's aheartbeat registers a local node failure, terminates itself and by doing so, allows other nodes to detect its failure. Since Node B's aheartbeat can't connect Node A's aheartbeat, it executes the remote node failover execution file to perform Failover to Node A.
As seen in the examples above, since aheartbeat 0 can detect network failure between the internal and external networks, the provision of continuous database services can be enforced by using aheartbeat 0.
Failover and Failback#
Failover#
When aheartbeat detects a failure, it executes the failover execution file with the following two arguments to enable the DBA to efficiently perform Failover.
Argument | Description |
---|---|
First Argument | The number of failed nodes |
Second Argument | The IDs of the failed nodes. These are differentiated by blaks and are specified in ascending order. |
For example, let's assume that there is a distributed environment where five nodes with the respective IDs 1, 2, 3, 4, 5, which each have Altibase servers and aheartbeats running, and that the Altibase server has failed on the node with the ID 3. Once aheartbeat on the node ID 3 detects that its Altibase server has failed, it executes the following local node failover script and terminates itself.
altibaseFailureEvent.sh 1 3
Once aheartbeats of other nodes detect that aheartbeat of the node ID 3 has terminated, they execute the following remote node failover script.
remoteNodeFailureEvent.sh 1 3
If the Altibase server on the node ID 1 fails while Node ID 3 is down, the failover script is executed with the following arguments.
altibaseFailureEvent.sh 2 1 3
remoteNodeFailureEvent.sh 2 1 3
This means that two servers have failed and their IDs are 1 and 3.
Failback#
Failback after a node has recovered from failure must be manually performed by the user.
Logging#
Altibase Heartbeat writes the following information to the log file while it is processing.
-
Information of aheartbeat startup
-
Information of connection failure of the Altibase Server
-
Information of the start of an connection to another node
-
Information of aheartbeat failure situation of Altibase server and other nodes
-
Information of each node's aheartbeat status transition
Log files are fixed to $ALTI_HBP_HOME/log/aheartbeat.log.
The output format of log information is as follows.
[YYYY-MM-DD HH:MM:SS T-<threadID>] Log Body