Master/Master replication with MariaDB

Guide has been updated since 2018-05-31 to new methods used by MariaDB 10.2.

In this guide I'm talking about a three-node cluster. The third node could be an arbitrator or another DB node.

  • Node1 (172.16.1.11)

  • Node2 (172.16.1.12)

  • Node3 (172.16.1.13)

  • Cluster name: mycluster

I'll be using:

  • CentOS 7

  • MariaDB 10.2

Some terminology:

  • SST - State Snapshot Transfer is a full transfer of data to an empty or new node.

  • IST - Incremental State Transfer is a partial data transfer to a rebooted node.

  • gcomm-address - A comma-separated list of all cluster nodes, either by IP or by working DNS names. Can list all nodes even if they're listing themselves.

  • Arbitrator - AKA witness, an extra vote for quorum in a cluster.

  • Quorum - See Wikipedia. A vote between cluster members to determine cluster health.

Note: The idea of having an arbitrator in a small three node cluster is partly to scale the cluster across multpile DCs. But also that it can act as backup server and load balancer.

Installing packages

Not shown here is how to install the MariaDB repository, by default you'll only get 5.5 in CentOS 7 repos.

$ sudo yum install nc socat MariaDB-server MariaDB-devel MariaDB-client MariaDB-common MariaDB-compat MariaDB-backup galera

Additional helpful packages

$ sudo yum install mytop innotop mysqlreport

Configuring CentOS 7 Firewall for Galera

The following ports need to be allowed.

  • 3306/tcp
  • 4567/tcp (galera traffic between all nodes)
  • 4568/tcp (for IST)
  • 4444/tcp (for SST)

Configure datadir

I generally like to put my datadir on its own volume, in CentOS that's best done on ''/var/database''.

Tip: I always use a separate disk (or vdisk) for this volume, and I always use pvcreate directly on the disk without any partitions. Then create a new volume group so you're not combining this pv into someone elses vg. This makes later disk expansion easier to do online.

SElinux

Ensure your datadir has the correct file context in SElinux.

$ sudo mkdir /var/database/mysql
$ sudo chown -R mysql:mysql /var/database/mysql
$ sudo semanage fcontext -a -t mysqld_db_t '/var/database/mysql(/.*)?'
$ sudo restorecon -Rv /var/database/mysql

Configure MariaDB

First configuration file is for replication-related settings.

[mysqld]
datadir         = /var/database/mysql
server-id              = 81

auto-increment-increment = 2
auto-increment-offset = 1
skip-slave-start
log-basename    = mycluster
bind-address = 0.0.0.0
innodb-defragment=1

wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_provider_options="gcache.size=300M; gcache.page_size=1G"
wsrep_cluster_name="mycluster"
wsrep_cluster_address="gcomm://node1,node2,node3"
wsrep_node_name="node1"
wsrep_node_address="node1"

binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_doublewrite=1

#wsrep_sst_method=xtrabackup-v2

#wsrep_sst_auth=backup:secret.
  • ''server-id'' - Needs to be a unique server id on each server in the cluster.
  • ''log-basename'' - Logname, might be necessary to set.
  • ''wsrep_provider'' - Library path that might differ on other distros.
  • ''wsrep_provider_options'' - These settings can affect replication timeouts for slow networks or how long a node can be shutdown before it must take a full SST instead of a smaller and faster IST.
  • ''wsrep_cluster_name'' - The cluster name from earlier.
  • ''wsrep_cluster_address'' - This is the so-called gcomm-address where all nodes must be listed.
  • ''wsrep_node_name'' & ''wsrep_node_address'' - I generally put a usable FQDN for the node here. I always use DNS or failing that, local hosts-file.
  • ''wsrep_sst_auth'' - Backup user and password to use for SST. I've seen issues with passwords containing spaces here so beware.

As of 10.2 wsrep_sst_auth is no longer needed to perform SST.

Run first node

The method of starting a new cluster node has changed much but the latest method is to use ''galera_new_cluster''.

$ sudo galera_new_cluster

This will start mysqld with the proper arguments.

In past versions you would have used ''mysqld --wsrep-new-cluster''.

Add first node

Adding nodes to a healthy cluster is simple, either set --wsrep_cluster_address on the CLI or in the my.cnf configuration and then start it.

If you've set the gcomm address in the my.cnf config you should be able to use your standard init system to start the service.

$ sudo systemctl start mariadb

Check the log because a new node will start an SST.

Repeat for additional nodes.

Restart cluster

Important to know that if you shutdown all nodes then you have destroyed the logical cluster made when those nodes first started. The data is still on disk but you need to re-create the cluster when you start up again.

This is called bootstrapping the cluster in Galera terms.

Ordered shutdown

Look in the file called grastate.dat on each node and one of them will say ''safe_to_bootstrap: 1'' so start that node just like normal.

Ugly shutdown/Crash

If none of the nodes are ''safe_to_bootstrap: 1'' in their grastate.dat file then we must run mysqld with the ''--wsrep-recover'' argument to see which has the most recent data.

# sudo -u mysql mysqld --wsrep-recover
...
2016-11-18 01:42:16 36311 [Note] WSREP: Recovered position: 37bb872a-ad73-11e6-819f-f3b71d9c5ada:345628

The number after the long UUID string is meant to be as high as possible, the highest number is the node with the most recent data.

Now you can edit the grastate.dat file on this node to change ''safe_to_bootstrap: 1'' and after that start the service like normal.

Once your first node is up and healthy, starting another node will initiate an SST from the first node.

Setup backup user grants

This user is no longer needed to perform SST but if you want to do a streaming replication backup you'll need these grants.

mysql> GRANT SELECT, RELOAD, REPLICATION CLIENT, SHOW VIEW, EVENT, TRIGGER ON *.* TO 'backup'@localhost identified by 'secret.';

And remember that xtrabackup/mariabackup/innobackupex, whichever tool responsible for doing an SST, runs from the DONOR node so it needs to connect from localhost. It also logs essential data on the Donor.

But if you're also doing backups with xtrabackup then you need to add the IP of the backup server.

Arbitrator

{{:teknik:guider:database:galera_between_datacenters.png?direct&200 |Galera replication between datacenters}}

An arbitrator is an optional component that can decide which part of a cluster has quorum. The part with quorum is the live cluster that will continue working for your clients.

My personal use for an arbitrator has mainly been when spreading a cluster out between datacenters. The datacenter that still has contact with the arbitrator gets an extra vote in their quorum. See below diagram for example.

So severing the connection between DC 1 and DC3 will cause DC2 to remain in contact with the arbitrator and still have one extra node when voting for quorum. That means you can have two DB nodes in DC1 and DC2 without worrying about split brain issues if one of the datacenters goes down.

To install arbitrator

It's very simple.

$ sudo yum install galera

On Debian-based systems the package is called ''galera-arbitrator-3''.

To configure arbitrator

Mainly the gcomm address is needed, and if you use hostnames DNS resolution must of course work.

On CentOS 7 only two lines in ''/etc/sysconfig/garb'' is what I've used.

GALERA_NODES="node1:4567 node2:4567 node3:4567"
GALERA_GROUP="mycluster"

Tips & workarounds

Systemd uses its own nofile

Don't bother trying to set a max nofiles number in ''/etc/security/limits.d'' when using systemd because systemd overrides it and uses its own.

So instead create systemd override directory in ''/etc/systemd/system/mariadb.service.d'' and insert into the file ''override.conf'' the following.

[Service]
LimitNOFILE=64000

The symptom if you hit this limit might be dropped cluster communication on port 4567, so nodes keep timing out and having to re-sync.

Tip: This is best done with ''systemctl edit mariadb'' command.

SST with large databases

When you have a large database systemd will timeout trying to start mysql and might in that case cause the SST to fail by sending a kill signal to mysqld.

So override TimeoutStartSec for the mariadb.service. When my SST takes about 35-45 minutes I tend to set 2 hours in ''/etc/systemd/system/mariadb.service.d/override.conf''.

[Service]
TimeoutStartSec=7200

Prefer IST over SST with large databases

An SST might take a long time so study the two settings gcache.size and gcache.page_size in the mariadb option wsrep_provider_options.

These settings determine how much data can be written to the global cluster before a dead node must use an SST to sync itself.

For example:

wsrep_provider_options="gcache.size=1G; gcache.page_size=1G"

This means the cluster can handle about 1G of new data while a node is desynced or disabled in some way, before that node must use an SST to re-sync itself.

If you're within that span of data then the node will normally use an IST to sync itself instead of an SST. This depends on other factors too so it's a simplification.

Bug: missing /var/run/mysqld dir

When using systemd the directory /var/run/mysqld might be missing, preventing mysqld from creating a server socket file if one is configured.

This is likely a packaging bug, it appears in both RedHat and Ubuntu so far.

Fix is simply to create ''/etc/tmpfiles.d/mariadb.conf'' with this content.

d /var/run/mysqld 0755 mysql mysql -

Bug MDEV-10753

I've made an SElinux module to help with this bug.

# This is a workaround for mariadb bug MDEV-10753

policy_module(mdev-10753, 1.0)
gen_require(`
 type mysqld_t;
 type initrc_tmp_t;
 #type var_log_t;
')

allow mysqld_t initrc_tmp_t:file { open write };

Binary logs can cause random node crashes

This is reported here.

My workaround was to disable binary logging. I haven't tried with GTID yet.

Deadlocks can happen

Keep your transactions small and to the point instead of trying to aggregate tons of data with many joins that might cause a deadlock if others try to access the same data in the cluster.

See this link for more info on deadlocks.

See also