# Failover - Active/Passive

### Introduction

This guide covers managing a product configured in active/passive cluster mode.

### Objective

This procedure pertains to implementing Nodeum in an active-passive configuration.

### Architecture

<figure><img src="https://1946775891-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FgzbGsZqZH8Ro8zxRYgwn%2Fuploads%2FE6lsVItmqUHtQ6J3lFKx%2Fimage.png?alt=media&#x26;token=5a304a78-9f6d-471b-927d-bb768f0d732b" alt=""><figcaption></figcaption></figure>

### Implementation Overview

The failover mechanism is established via an Ansible package during the initial Nodeum deployment. The Ansible inventory defines the cluster members and associated services. Once deployed, the system provides these key services:

* **Service Redundancy**: Ensures Nodeum and its components are redundant.
* **Cache Disk Redundancy**: Guarantees redundancy for the cache disk.
* **Single Namespace IP Address**: Provides a unified access point.

In the event of a node failure, the system automatically redirects access through the second node, ensuring continuous availability.

## Ansible Inventory Overview

Below are the inventory definitions detailing how services are deployed across two servers.

20-main

```bash
[mongodb]
srv01
srv02

[web]
srv01
srv02

[core]
srv01
srv02

[scheduler]
srv01
srv02

[refparser]
srv01
srv02

[mount_point_scanning]
srv01
srv02
```

11-mariadb-cluster

```bash
[mariadb]
; When updating from a single node to a cluster, only the node that
;   previously held the data should be set to `true`
srv01  galera_master=true
srv02

```

31-catalog-indexer

```bash
[zookeeper_nodes]
srv01
srv02

[solr]
srv01
srv02

[solr:vars]

; The product of these two variables should be lower than or equal to the number of hosts

; Number of shards to split the collection into.
; Default: 3
; solr_shards=3
; Number of copies of each document in the collection.
; Default: 1
; solr_replication_factor=1

[catalog_indexer]
srv01
srv02

```

gluster-cache

```bash
[gluster_cache]
srv01  gluster_cache_device=/dev/sdb  gluster_logical_size=100%FREE
srv02  gluster_cache_device=/dev/sdb  gluster_logical_size=100%FREE

[gluster_cache:vars]
; Arbiter count
; gluster_arbiters=

; Disperse count
; gluster_disperses=

; Redundancy count
; gluster_redundancies=

; Replica count
; gluster_replicas=

; Stripe count
; gluster_stripes=

```

#### Resiliency Level, Cluster, and Failover

#### **Definitions**

**Cluster**: A service running in cluster mode operates simultaneously on both nodes without interruption.

**Failover**: This mode automatically restarts services and shifts the virtual IP to the passive server when needed, ensuring continuity.

Below is a list of services and their corresponding resiliency levels:

<table><thead><tr><th width="287">Nodeum Services</th><th>Resiliency Level</th></tr></thead><tbody><tr><td>Notification Manager</td><td>Failover</td></tr><tr><td>Core Manager</td><td>Failover</td></tr><tr><td>Tape Library Manager</td><td>Failover</td></tr><tr><td>Data Mining</td><td>Failover</td></tr><tr><td>File System Virtualization</td><td>Failover</td></tr><tr><td>Watchdog</td><td>Failover</td></tr><tr><td>Ref. File Parsing</td><td>Failover</td></tr><tr><td>Scheduler</td><td>Failover</td></tr><tr><td>File Listing Processing</td><td>Failover</td></tr><tr><td>Indexation Engine</td><td>Failover</td></tr></tbody></table>

<table><thead><tr><th width="288">System Services</th><th>RESILIENCY LEVEL</th></tr></thead><tbody><tr><td>CACHE Disk</td><td>Cluster</td></tr><tr><td>Solr</td><td>Cluster</td></tr><tr><td>NGINX</td><td>Cluster</td></tr><tr><td>MariaDB</td><td>Cluster</td></tr><tr><td>MongoDB</td><td>Cluster</td></tr><tr><td>SMB</td><td>Cluster</td></tr><tr><td>NFS</td><td>Cluster</td></tr><tr><td>MinIO</td><td>Not yet available</td></tr></tbody></table>

### System Troubleshooting

#### Service Status Monitoring

You can monitor the status of each service by accessing the web interface of each node. The active server must have all Nodeum services running, while the passive server should have the "Core Manager," "Scheduler," "File Listing Processing," and "Indexation Engine" services stopped.

| Server “Active”                                                                                                                                                                                                                                            | Server “Passive”                                                                                                                                                                                                                                           |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <img src="https://1946775891-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FgzbGsZqZH8Ro8zxRYgwn%2Fuploads%2FNKMRggo6waW5HrGQ7oBy%2Fimage.png?alt=media&#x26;token=a5919fa9-d76b-462f-8e88-6e7011e39f08" alt="" data-size="original"> | <img src="https://1946775891-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FgzbGsZqZH8Ro8zxRYgwn%2Fuploads%2FJooNWSxkQLTwZHjZ36Ca%2Fimage.png?alt=media&#x26;token=64ec6e43-d37b-4605-8f83-3f36c8b3884d" alt="" data-size="original"> |

| Server “Active”                                                                                                                                                                                                                                            | Server “Passive”                                                                                                                                                                                                                                           |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <img src="https://1946775891-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FgzbGsZqZH8Ro8zxRYgwn%2Fuploads%2FzWUsA0JzGAIGyNVSVVlb%2Fimage.png?alt=media&#x26;token=2e2a6b68-19f3-4d6a-9665-2fc1f8959b70" alt="" data-size="original"> | <img src="https://1946775891-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FgzbGsZqZH8Ro8zxRYgwn%2Fuploads%2FG36gnKJBuejFE3YOUmE5%2Fimage.png?alt=media&#x26;token=d9ab5aea-2777-455e-a237-1ac4322e0191" alt="" data-size="original"> |

#### Cluster Node Maintenance Guide

To perform maintenance on a cluster node while ensuring continuous service, follow these steps:

1. **Objective**: Stop each server one at a time, making sure at least one server remains active at all times.
2. **Options for Shutdown**:
   * **Via Nodeum Console**: Access the Nodeum Console for the server you wish to shut down and initiate the shutdown process.
   * **Via SSH**: Connect to the server using SSH and execute the shutdown command.

By adhering to this process, you can maintain the cluster effectively without service interruption.

To verify which node is active, check which one has the clustered IP assigned. Use the command `ip address show` to display the IP addresses on the network interfaces. The active server will have both its main IP address and the clustered IP address.

In this example:

* The network interface device name on both servers is `ens160`.
* The clustered IP address is `10.3.1.153`.
* The IP address of `nodcluster01` is `10.3.1.154`.
* The IP address of `nodcluster02` is `10.3.1.155`.

On the passive node, only the server's own IP address is assigned:

```bash
$ ip address show ens160
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group 
default qlen 1000
    link/ether 00:50:56:be:f4:1d brd ff:ff:ff:ff:ff:ff
    inet 10.3.1.154/24 brd 10.3.1.255 scope global noprefixroute ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::8216:6cb0:f936:9863/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
```

On the active node, the clustered IP (`10.3.1.153/32` in this example) is also assigned to the interface:

```bash
$ ip address show ens160
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group 
default qlen 1000
    link/ether 00:50:56:be:cc:d4 brd ff:ff:ff:ff:ff:ff
    inet 10.3.1.155/24 brd 10.3.1.255 scope global noprefixroute ens160
       valid_lft forever preferred_lft forever
    inet 10.3.1.153/32 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::641e:f11a:fe5d:e531/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
```
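
This check can be scripted, for example in a monitoring probe. A minimal sketch, assuming the clustered IP from the example above (adjust `CLUSTER_IP` and the interface to your environment):

```bash
# Report whether this node currently holds the clustered IP.
# CLUSTER_IP is taken from the example above; change it for your cluster.
CLUSTER_IP=10.3.1.153
if ip -4 addr show 2>/dev/null | grep -q "inet ${CLUSTER_IP}/"; then
  STATE=active
else
  STATE=passive
fi
echo "This node is ${STATE}"
```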

#### Situation 1: Unexpected Stop of the Active Node

**List of cases:**

* Power outage
* Virtual server downtime
* Operating system failure

**What’s happened:**

* Nodeum switches to the second (passive) node automatically.
* Failover-level services are restarted on the second node.

**Note:** In this scenario, the surviving node may consider itself part of the smaller half of a split cluster, typically after a temporary network issue caused the nodes to momentarily lose contact. As a precaution against data inconsistencies, the node then refuses to serve applications until the cluster state is re-established.

**Result:**

The Nodeum Console is unreachable and returns an “Internal Server Error (500)”.

**Determine the root cause:**

Check the status of MariaDB with `systemctl status mariadb`. The status may display the error message “WSREP has not yet prepared node for application use”.

**Resolution:**

This is a temporary state that can be detected by checking the value of `wsrep_ready`. The node only accepts `SHOW` and `SET` statements during this period.

```bash
$ mysql -u root -padmin
MariaDB [(none)]> SHOW STATUS LIKE 'wsrep_ready';

+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | OFF   |
+---------------+-------+
```

On the server that has the issue, force it to bootstrap a new primary component:

```bash
$ mysql -u root -padmin
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 48

Server version: 10.4.18-MariaDB MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SET GLOBAL wsrep_provider_options='pc.bootstrap=true';
```

#### Situation 2: Both Nodes Stopped at the Same Time

**List of cases:**

* Power outage on both nodes
* (Virtual) servers down
* Virtual cluster down
* Loss of operating system

**What’s happened:**

* All servers are down and must be restarted when systems are back online.
* Once the servers are restarted, they need to elect a master to handle the DB cluster service.

**Note:** If you shut down all nodes at the same time, you have effectively terminated the cluster. The cluster’s data still exists, of course, but the running cluster no longer exists.

**Result:**

MariaDB does not start correctly.

**Resolution:**

Once you restart the servers, you will need to bootstrap the cluster again. If the cluster is not bootstrapped and MariaDB on the first node is simply started normally, the node will try to connect to at least one of the nodes listed in the `wsrep_cluster_address` option. If no nodes are currently running, this will fail; bootstrapping the first node solves the problem.

In some cases, Galera will refuse to bootstrap a node if it detects that it might not be the most advanced node in the cluster. Galera makes this determination if the node was not the last one in the cluster to be shut down, or if the node crashed. In those cases, manual intervention is needed.

If you experience this issue, the `galera_recovery` command may solve it:

```bash
$ /usr/bin/galera_recovery
```

If the cluster cannot be recovered with the `galera_recovery` command, you will have to recover it manually. On the server that you believe has the most up-to-date copy of the databases, edit the file `/var/lib/mysql/grastate.dat` and change the value of `safe_to_bootstrap: 0` to `safe_to_bootstrap: 1`.

```bash
$ vi /var/lib/mysql/grastate.dat
```
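
Before editing the file, you can compare the `seqno` value in `grastate.dat` on both nodes to determine which one has the most recent data: the node with the highest `seqno` is the most advanced, and a value of `-1` means the node crashed without recording its position. The sketch below is illustrative only; it operates on a sample copy of the file with made-up values:

```bash
# Illustrative sample of /var/lib/mysql/grastate.dat (values are made up)
tmpdir=$(mktemp -d)
cat > "$tmpdir/grastate.dat" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    6b1a5c3e-0000-0000-0000-000000000000
seqno:   184403
safe_to_bootstrap: 0
EOF

# Compare this value across nodes; bootstrap the node with the highest seqno
grep 'seqno:' "$tmpdir/grastate.dat"

# Flip safe_to_bootstrap on the chosen node (sed here instead of editing with vi)
sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' "$tmpdir/grastate.dat"
grep 'safe_to_bootstrap' "$tmpdir/grastate.dat"
```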

Then on the same server we execute the following command:

```bash
$ galera_new_cluster 
```

On the other server, start MariaDB normally:

```bash
$ systemctl restart mariadb 
```

With this, our MariaDB cluster should be normalized.

#### Situation 3: Loss of Network Connectivity on Node 1

**List of cases:**

* Network equipment is down
* Network cable(s) connected to the server are faulty
* The network interface of the server is faulty

**What’s happened:**

* The server is unreachable from a network point of view; the failover service of the cluster detects that the server can no longer be reached over the network.
* As a result, the system fails over to the second server and reassigns the clustered IP to it.

#### Situation 4: Unexpected Disconnection of the Cache Storage

**List of cases:**

* The network has been disconnected (flapping)
* Network cable(s) connected to the server are faulty
* The internal disk that serves as cache has been disconnected

**What’s happened:**

* The server is unreachable from a network point of view; the internal volume serving the cache is not available.
* As a result, the container contents cannot operate properly.
* Tasks can display some files with the status ‘NO FILE’.

**Result:**

The `nodeum_file_system_virt` service does not start correctly.

**Resolution:**

On both servers, execute these actions:

Node 1: Unmount the volume manually and remount it

```bash
$ umount /srv/gluster/nodeum_cache_brick
$ mount -a

```

Node 2: Unmount the volume manually and remount it

```bash
$ umount /srv/gluster/nodeum_cache_brick
$ mount -a

```

Afterwards, you can restart the GlusterFS daemon and the Nodeum File System virtualization service.

Node 1:

```bash
$ systemctl restart glusterd
$ systemctl restart nodeum_file_system_virt
```

Node 2:

```bash
$ systemctl restart glusterd 
$ systemctl restart nodeum_file_system_virt
```

At this stage, on both servers, you will be able to list the contents behind each of these mount points:

```bash
$ ls /srv/gluster/nodeum_cache_brick
$ ls /mnt/CACHE
$ ls /mnt/FUSE
```
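
As a quick sanity check, the three paths above can also be verified programmatically. A minimal sketch using `mountpoint` (from util-linux), assuming the same paths as listed in this section:

```bash
# Verify that each cache-related path is an active mount point
checked=0
for m in /srv/gluster/nodeum_cache_brick /mnt/CACHE /mnt/FUSE; do
  if mountpoint -q "$m" 2>/dev/null; then
    echo "$m: mounted"
  else
    echo "$m: NOT mounted"
  fi
  checked=$((checked + 1))
done
echo "$checked paths checked"
```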

If tasks reported files with the ‘NO FILE’ status, restart those tasks; all files should then be processed and the problem resolved.

It is also important to verify the health of the Gluster file system with the following commands. Both servers must return the same results:

```bash
$ gluster volume status nodeum_cache_brick clients
Client connections for volume nodeum_cache_brick
----------------------------------------------
Brick : 10.x.x.1:/srv/gluster/nodeum_cache_brick
Clients connected : 2
Hostname                                BytesRead    BytesWritten       OpVersion
--------                                ---------    ------------       ---------
10.x.x.1:49108                        38043530114    230085103629           70200
10.x.x.2:49144                           37815908       179832112           70200
----------------------------------------------
----------------------------------------------
```

```bash
$ gluster volume status nodeum_cache_brick
Status of volume: nodeum_cache_brick
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.x.x.1:/srv/gluster/nodeum_cache	
_brick                                      49152     0          Y       110881
Brick 10.x.x.2:/srv/gluster/nodeum_cache
_brick                                      N/A       N/A        Y       16258

Task Status of Volume nodeum_cache_brick
------------------------------------------------------------------------------
There are no active volume tasks
```

```bash
$ gluster volume status nodeum_cache_brick clients
Client connections for volume nodeum_cache_brick
----------------------------------------------
Brick : 10.x.x.1:/srv/gluster/nodeum_cache_brick
Clients connected : 1
Hostname                        BytesRead    BytesWritten       OpVersion
--------                        ---------    ------------       ---------
10.x.x.1:49108                  38319242436  231731737617           70200
----------------------------------------------
Brick : 10.x.x.2:/srv/gluster/nodeum_cache_brick
Clients connected : 1
Hostname                        BytesRead    BytesWritten       OpVersion
--------                        ---------    ------------       ---------
10.x.x.2:49147                  372933965      2066922877           70200
----------------------------------------------
```

```bash
$ gluster peer status
Number of Peers: 1

Hostname: 10.x.x.2
Uuid: a3b18da3-5a42-4399-9480-105f1f7032fb
State: Peer in Cluster (Connected)
```

```bash
$ gluster peer status
Number of Peers: 1

Hostname: 10.x.x.1
Uuid: a3b18da3-5a42-4399-9480-105f1f7032fb
State: Peer in Cluster (Connected)
```

### Point of attention

Make sure that the root (`/`) file system has enough space:

```bash
$ df -h
Filesystem                                  Size  Used Avail Use% Mounted on
devtmpfs                                     16G     0   16G   0% /dev
tmpfs                                        16G  576K   16G   1% /dev/shm
tmpfs                                        16G  1.6G   15G  11% /run
tmpfs                                        16G     0   16G   0% 
/dev/mapper/centos_nodeu-root               788G   29G  719G   4% /
/dev/sdb1                                   8.2T  7.5T  254G  97% /mnt/CACHE
/dev/sda1                                  1014M  275M  740M  28% /boot
tmpfs                                       3.2G     0  3.2G   0% 
tmpfs                                       3.2G     0  3.2G   0% /run/user/0
core_fuse                                   102T  8.0T   94T   8% /mnt/FUSE
```
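
This check can be automated with a small sketch like the one below; the 90% threshold is an arbitrary example value, not a Nodeum requirement:

```bash
# Warn when root file system usage crosses a chosen threshold
THRESHOLD=90
# df -P gives POSIX output; field 5 of the data row is the Use% column
USAGE=$(df -P / | awk 'NR==2 {gsub(/%/,""); print $5}')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "WARNING: / is ${USAGE}% full"
else
  echo "OK: / is ${USAGE}% full"
fi
```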

### Backup Feature - Manual Execution

{% hint style="warning" %} <mark style="color:red;">This procedure needs to be applied on each node of the cluster</mark>
{% endhint %}

#### How to execute a backup manually?

A command line must be executed to start a manual backup or restore.

The shell script is `/opt/nodeum/tools/backup_restore.sh`:

```bash
$ /opt/nodeum/tools/backup_restore.sh param1 param2
```

The first parameter: `f` / full backup or `i` / incremental backup (the examples below use `full_backup`).

The second parameter: the target path where the backup will be saved, or where the backup is located for a restore.

If the command line is configured to do an incremental backup and no full backup has been done yet, it will perform a full backup.

The incremental option always increments an existing full backup. This means that the incremental backup is restorable.

Examples:

```bash
$ nohup /opt/nodeum/tools/backup_restore.sh full_backup /root/nodeum_bck_2302 &
```

`nohup` and `&` run the backup script as a background process. The output of the executed command is written to a file named `nohup.out`.

#### How to execute a restore manually?

A command line must be executed to restore a backup.

```bash
$ /opt/nodeum/tools/backup_restore.sh param1 param2
```

param1: `r` for restore (the example below uses `restore`)

param2: the source path where the backup is located

Example:

```bash
$ nohup /opt/nodeum/tools/backup_restore.sh restore /root/nodeum_bck_2302 &
```

`nohup` and `&` run the backup script as a background process. The output of the executed command is written to a file named `nohup.out`.

#### Point of Attention

By default, when the script is running, it uses a temporary folder, `/tmp/bckp/`, in the main file system; this folder stores the backup before it is moved to the final location. The temporary folder can be changed by specifying another folder as the third argument.

**Default temp folder:**

In this example, the backup will be stored in the folder …`/nas/backupnodeum/` and the backup system will implicitly use `/tmp/` as its temporary cache:

```bash
/bin/bash ./backup_restore.sh full_backup /mnt/MOUNT_POINTS/nas/backupnodeum
```

**Another temp folder:**

In this example, the backup will be stored in the folder …`/nas/backupnodeum/` and the backup system will use the directory `/mnt/CACHE/tempbck` as its temporary cache:

```bash
/bin/bash ./backup_restore.sh full_backup /mnt/MOUNT_POINTS/nas/backupnodeum /mnt/CACHE/tempbck
```

#### Point of Attention

If the backup does not run and the console mentions that another `backup_restore.sh` script is already running, there are two things to review:

* Use the `ps -aef` command to verify whether another process is already running.
* A lock file (`nodeum_bkp_lock`) may remain in the `/tmp` folder, even if the temporary folder location has been changed.
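
A hedged sketch of such a cleanup, assuming the lock path `/tmp/nodeum_bkp_lock` mentioned above; the lock should only be removed once no `backup_restore.sh` process is running:

```bash
# Clear a stale lock only when no backup_restore.sh process is found
LOCK=/tmp/nodeum_bkp_lock
if pgrep -f backup_restore.sh >/dev/null 2>&1; then
  echo "backup still running; leaving ${LOCK} in place"
else
  rm -f "$LOCK"
  echo "no backup running; removed ${LOCK} if it existed"
fi
```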
