Saturday, September 14, 2013

How MySQL Replication Works

From MySQL Enterprise Backup: the meta directory's ibbackup_slave_info.txt and backup_variables.txt contain the log position that needs to be set on the slave side to start replication.
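For illustration, restoring such a backup and pointing a new slave at the recorded coordinates looks roughly like this. The file path, coordinates and connection details below are made-up placeholders (the exact file layout varies by MEB version), so treat it as a sketch rather than exact output:

cat /backup/meta/ibbackup_slave_info.txt
# -> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000012', MASTER_LOG_POS=10726;

# On the freshly restored slave, plug those coordinates in and start replication:
mysql -e "CHANGE MASTER TO MASTER_HOST='master.example.com', MASTER_USER='repl',
          MASTER_PASSWORD='...', MASTER_LOG_FILE='mysql-bin.000012',
          MASTER_LOG_POS=10726; START SLAVE;"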

See the image below; it represents how asynchronous replication works in MySQL.


[image: How replication works]
Bearing in mind the numbers in the image, let’s see how the process works:
  • Item 1 in the image represents the clients executing queries on the master; note that the master is capable of handling multiple simultaneous connections (configurable via the max_connections variable). 
  • The master processes these queries and saves them to its binary log (item 2 in the image), so that they can later be replayed on the slave.
  • The slave has 2 threads responsible for dealing with replication (see the sketch after this list):
    • IO_THREAD – responsible for connecting to the master, asking for new transactions (item 3 in the image) and saving them to its own log (the relay log, item 4 in the image).
    • SQL_THREAD – responsible for reading all new transactions from the relay log (item 5 in the image) and applying them to the database server (item 6 in the image).
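A quick way to observe both threads on a running slave (a sketch; the states shown are the idle ones):

# Both replication threads appear as "system user" sessions in the processlist:
mysql -e "SHOW PROCESSLIST;" | grep 'system user'
# IO_THREAD  -> "Waiting for master to send event"
# SQL_THREAD -> "Has read all relay log; waiting for the slave I/O thread to update it"

# SHOW SLAVE STATUS reports on both threads in one place:
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running'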
------------------------------------------------------------------------------------------------------
While we do have many blog posts on replication on our blog, such as on replication being single-threaded, on semi-synchronous replication or on estimating replication capacity, I don’t think we have one that covers the very basics of how MySQL replication really works at a high level. Or it’s been so long ago I can’t even find it. So, I decided to write one now.

Of course, there are many aspects of MySQL replication, but my main focus will be the logistics – how replication events are written on the master, how they are transferred to the replication slave and then how they are applied there. Note that this is NOT a HOWTO on setting up replication, but rather a howstuffworks type of thing.
Replication events

I say replication events in this article because I want to avoid discussion about different replication formats. These are covered pretty well in the MySQL manual here. Put simply, the events can be one of two types:

    Statement based – in which case these are write queries
    Row based – in this case these are changes to records, sort of row diffs if you will
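For completeness, checking and switching the format is trivial (a sketch; the binlog file name is made up):

mysql -e "SHOW VARIABLES LIKE 'binlog_format';"
mysql -e "SET GLOBAL binlog_format = 'ROW';"   # affects new sessions only
# Either way, mysqlbinlog decodes the events; with row-based events,
# --verbose prints the row changes as commented pseudo-SQL:
mysqlbinlog --verbose mysql-bin.000012 | less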

But other than that, I won’t be going back to the differences between replication formats, mostly because there’s very little that’s different when it comes to transporting the data changes.
On the master

Note: Binlog Dump is the name of the replication thread on the master.

So now let me start with what is happening on the master. For replication to work, first of all the master needs to be writing replication events to a special log called the binary log. This is usually a very lightweight activity (assuming events are not synchronized to disk), because writes are buffered and sequential. The binary log file stores the data that the replication slave will be reading later.
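To make this concrete, here is how you inspect the binary log on the master (a sketch; binary logging itself has to be enabled with log-bin and a unique server-id in my.cnf):

mysql -e "SHOW MASTER STATUS;"   # current binlog file and write position
mysql -e "SHOW BINARY LOGS;"     # all binlog files with their sizes
# sync_binlog=0 leaves flushing to the OS (the lightweight mode described
# above); sync_binlog=1 syncs every write to disk at a noticeable IO cost.
mysql -e "SHOW VARIABLES LIKE 'sync_binlog';"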

Whenever a replication slave connects to a master, the master creates a new thread for the connection (similar to the one used for just about any other server client) and then does whatever the client – the replication slave in this case – asks. Most of that is going to be (a) feeding the replication slave with events from the binary log and (b) notifying the slave about newly written events in its binary log.
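You can see those per-slave threads on the master; each connected slave holds one connection running the Binlog Dump command (a sketch):

# One "Binlog Dump" thread per connected slave; when the slave is caught up, its
# state is "Master has sent all binlog to slave; waiting for binlog to be updated".
mysql -e "SHOW PROCESSLIST;" | grep 'Binlog Dump'
mysql -e "SHOW SLAVE HOSTS;"   # lists connected slaves that set report_host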

Slaves that are up to date will mostly be reading events that are still cached in the OS cache on the master, so there are not going to be any physical disk reads on the master in order to feed binary log events to the slave(s). However, when you connect a replication slave that is a few hours or even days behind, it will initially start reading binary logs that were written hours or days ago – the master may no longer have these cached, so disk reads will occur. If the master does not have free IO resources, you may feel a bump at that point.
On the replica

Now let’s see what is happening on the slave. When you start replication, two threads are started on the slave:

1. IO thread

This process, called the IO thread, connects to the master, reads binary log events from the master as they come in and just copies them over to a local log file called the relay log. That’s all.

Even though there’s only one thread reading binary log from the master and one writing relay log on the slave, very rarely copying of replication events is a slower element of the replication. There could be a network delay, causing a steady delay of few hundred milliseconds, but that’s about it.

If you want to see where the IO thread currently is, check the following in “show slave status\G”:

    Master_Log_File – the last file copied from the master (most of the time it is the same as the last binary log written by the master)
    Read_Master_Log_Pos – the binary log from the master has been copied over to the relay log on the slave up to this position.

And then you can compare it to the output of “show master status\G” from the master.
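In practice the check looks like this (the coordinates shown are made up; the host name is a placeholder):

# On the slave: how far the IO thread has read
mysql -e "SHOW SLAVE STATUS\G" | grep -E ' (Master_Log_File|Read_Master_Log_Pos):'
#   Master_Log_File:     mysql-bin.000042
#   Read_Master_Log_Pos: 107355

# On the master: how far the binary log has been written
mysql -h master.example.com -e "SHOW MASTER STATUS;"
#   File: mysql-bin.000042   Position: 107355
# If the two pairs match, the IO thread is fully caught up.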

2. SQL thread

The second process – the SQL thread – reads events from the relay log stored locally on the replication slave (the file written by the IO thread) and applies them as fast as possible.

This thread is what people often blame for being single-threaded. Going back to “show slave status\G”, you can get the current status of the SQL thread from the following variables:

    Relay_Master_Log_File – the binary log from the master that the SQL thread is “working on” (in reality it is working on the relay log, so this is just a convenient way to display information)
    Exec_Master_Log_Pos – the position in the master binary log that the SQL thread is currently executing.


Replication lag

Now I want to briefly touch on the subject of replication lag in this context. When you are dealing with replication lag, the first thing you want to know is which of the two replication threads is behind. Most of the time it will be the SQL thread; still, it makes sense to double-check. You can do that by comparing the replication status variables mentioned above to the master binary log status from the output of “show master status\G” on the master.
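A rough sketch of that comparison as a script (host names are placeholders; it prints the three coordinate pairs so you can see which thread trails):

#!/bin/sh
# Which replication thread is behind? Compare three coordinate pairs.
master=$(mysql -h master.example.com -e "SHOW MASTER STATUS\G" |
         awk '/File:|Position:/ {printf "%s ", $2}')
slave=$(mysql -h slave.example.com -e "SHOW SLAVE STATUS\G")
read_pos=$(echo "$slave" | awk '/ Master_Log_File:| Read_Master_Log_Pos:/ {printf "%s ", $2}')
exec_pos=$(echo "$slave" | awk '/Relay_Master_Log_File:| Exec_Master_Log_Pos:/ {printf "%s ", $2}')

echo "master has written:   $master"
echo "IO thread has copied: $read_pos"   # behind the master? IO thread lags (rare)
echo "SQL thread has run:   $exec_pos"   # behind the IO thread? SQL thread lags (common)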

If it happens to be the IO thread, which, as I mentioned already, is very rare, one thing you may want to try is enabling the slave compressed protocol.
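Enabling it is a one-liner; the IO thread has to reconnect for the setting to take effect (a sketch):

mysql -e "STOP SLAVE IO_THREAD;
          SET GLOBAL slave_compressed_protocol = ON;
          START SLAVE IO_THREAD;"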

Otherwise, if you are sure it is the SQL thread, then you want to understand the reason, and that you can usually observe with vmstat. Monitor server activity over time and see whether it is the “r” or the “b” column that is “scoring” most of the time. If it is “r”, replication is CPU-bound; otherwise, IO-bound. If it is not conclusive, mpstat will give you better visibility per CPU core.
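For example (5-second samples; mpstat comes with the sysstat package):

vmstat 5          # "r": runnable tasks (CPU pressure), "b": blocked on IO
mpstat -P ALL 5   # per-core breakdown; one core pegged near 100% while the
                  # rest sit idle is the classic signature of a CPU-bound SQL thread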

Note this assumes that there is no other activity happening on the server. If there is, then you may also want to look at diskstats or even do a query review for the SQL thread to get a good picture.

If you find that replication is CPU-bound, this may be very helpful.

If it is IO-bound, then fixing it may not be as easy (or rather, as cheap). Let me explain. If replication is IO-bound, most of the time that means the SQL thread is unable to read fast enough, because reads are single-threaded. Yes, you got that right – it is reads that limit replication performance, not writes. Let me explain this further.

Assume you have a RAID10 array with a bunch of disks and a write-back cache. Writes, even though they are serialized, will be fast because they are buffered in the controller cache and because internally the RAID card can parallelize writes to disks. Hence a replication slave with similar hardware can write just as fast as the master.

Now, reads. When your working set does not fit in memory, the data that is about to get modified has to be read from disk first, and this is where replication is limited by its single-threaded nature: one thread will only ever read from one disk at a time.

That being said, one solution for IO-bound replication is to increase the amount of memory so the working set fits in memory. Another is to get an IO device that can do many more IO operations per second even from a single thread – the fastest traditional disks can do up to ~250 iops, SSDs on the order of 10,000 iops.
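For InnoDB, the memory fix usually amounts to sizing the buffer pool to the working set; the value below is purely illustrative:

# my.cnf on the slave (needs a restart on MySQL of this era; leave headroom
# for the OS, connection buffers and the OS cache of the binary/relay logs)
[mysqld]
innodb_buffer_pool_size = 24G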

-----------------------------------------------------------------------------------------------------------------------------------
MySQL Limitations: Replication is Single Threaded
-----------------------------------------------------------------------
The first glaring problem is single-threaded replication. It is severe and getting much worse as servers get more and more CPUs and CPU cores. The replication replay process executes in a single thread on the replicas, and thus has no hope of keeping up with even a moderately busy write load on the primary, where many updates are occurring concurrently.

In a lot of Web applications, this isn’t really seen as a huge limitation. That’s because these apps are mostly read traffic, so a single primary can delegate the read workload across several replicas, and the write workload is still a relatively small fraction of the server’s total capacity. But eventually, it does become a problem if the app gets large enough, no matter how large the read-to-write ratio is.

What are some workarounds? Here are a few I’ve seen:

    Use DRBD replication instead of MySQL replication. Problem: you end up with an idle standby server, which can’t be used for serving reads, only for disaster recovery. That’s costly.
    Shard. Write workload is the single most legitimate reason for sharding. It’s too bad that “the replica can’t keep up with the write workload” becomes the reason to shard. I wish we could avoid that.
    Do the replication in the application. I know of applications where they just don’t use built-in replication. When they modify some data, they do it in both places. That’s a headache.
    Try obscure techniques such as external processes to prefetch the data the replica is trying to modify, so it can do it faster. This rarely works.

I’m not criticizing anyone who does these things — there really isn’t much of a good choice. It’s all a matter of picking the solution that’s least evil.

Why isn’t there multi-threaded replication? I think that a) it’s not as easy as it seems, and b) there are a hundred edge cases that make it nearly impossible to choose a solution that works for all situations. For example, mixtures of transactional and non-transactional tables are a nightmare.

Here are a few ideas:
    1. One thread per database. If the application is built such that each database is fully independent, then on the replica, we could start up a thread for each database we see in the binary log, and simply pass the replication events to the appropriate thread.
    2. A pool of threads on the replica, and a coordinating thread that hands work to each of them. The coordinator would read from the relay log until it has a complete transaction’s worth of events (not a single statement), and hand the whole transaction to the worker thread. The worker thread would run up until the COMMIT, but not actually commit, and then report back to the coordinator thread. The coordinator thread would ensure that all of the transactions begin in the same order as defined in the relay log, and commit in the same order. If any error occurred, such as a deadlock or lock wait timeout, then the coordinator would instruct the workers to roll back, and either retry or make them execute the problematic transactions in serial instead of concurrently.
    3. Multiple binary logs on the primary, one per database; one replication process per binary log on the replica. This would have the advantage of allowing a replica to subscribe to multiple masters, which is currently impossible.

These solutions represent different types of trade-offs. For example, solution 1) only works for specific uses, and I don’t think it’s general-purpose enough. Solution 2) has potentially complex behavior that might not work well in some cases, such as when deadlocks are common; but it is overall the least disruptive or different from the user’s point of view. Solution 3) requires modifying the binary logging code, which is risky. It also requires maintaining many master.info files on the replica, and new SQL syntax for administering replicas, and is generally not something I personally want to administer (replication is fragile enough… imagine recovering after a crash when you have to fix multiple threads that have forgotten where they should be reading in multiple binlogs?).

Regardless of the solution, it is certain that nothing is going to work in all cases; the most common cases will require use of InnoDB with the proper transaction isolation level, at a minimum. This behavior is going to have to default to single-threaded as replication currently does, and only enable the multi-threaded behavior if the user configures it to do so.

I would be in favor of solution 2) with an option to configure it to behave as solution 1).
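Incidentally, a per-database applier along the lines of solution 1) shipped in MySQL 5.6 as the multi-threaded slave; a minimal sketch of enabling it on a 5.6 slave:

# MySQL 5.6+: parallelize the SQL thread per database. Solution 1's trade-offs
# apply: it only helps when writes are spread across independent databases.
mysql -e "STOP SLAVE;
          SET GLOBAL slave_parallel_workers = 4;
          START SLAVE;"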


Why statement-based replication has to be single-threaded: you essentially need serializable isolation to ensure that the slaves produce the same results when just replaying statements. It’s very difficult to ensure consistency otherwise.
 --------------------------------------------------------------------------------------------
MySQL Performance with Replication in Place
----------------------------------------------------------------------------------------------
It is easy for MySQL replication to become the bottleneck even when the master server is not seriously loaded, and the more cores and hard drives servers get, the larger the difference becomes, as long as replication remains a single-threaded process. At the same time, it is a lot easier to optimize your system when replication runs normally – if you need to add/remove indexes and do other schema changes, you would probably be looking at methods involving replication if you can’t take your system down. So here comes the catch in many systems – we find the system is in need of optimization when replication can’t catch up, and yet the optimization process we’re going to use relies on replication being functional and able to catch up quickly.

So the question becomes: how can we estimate replication capacity, so we can deal with replication load before the slave is unable to catch up?

Knowing your replication capacity is not only needed when you’re planning to use replication to perform system optimization; it is needed in other cases too. For example, in a sharded environment you may need to schedule downtime or set an object read-only to move it to another shard. It is much nicer if that can be planned in advance rather than done on an emergency basis when the slave(s) are unable to catch up and the application is suffering because of stale data. This especially applies to Software as a Service providers, which often have very strict SLAs with their customers and which can have a lot of data per customer, so a move can take a considerable amount of time.

So what is replication capacity? I call replication capacity the ability to replicate the master load: if a slave is able to replicate 3 times the write load from the master without falling behind, I call that a replication capacity of 3. When used in the context of applying binary logs (for example, point-in-time recovery from backup), a replication capacity of 1 means you can replay 1 hour worth of binary logs within 1 hour. I call “replication load” the inverse of replication capacity – basically, the percentage of time the replication thread was busy replicating events vs staying idle.

Note you can speak about idle replication capacity, when the box does not do anything else, as well as loaded replication capacity, when the box serves its normal load. Both are important: you care about idle replication capacity when you have no load on the slave and need it to catch up, or when restoring from backup; the loaded replication capacity matters during normal operation.

So we have defined what replication capacity is. There are, however, no tools which can tell you straight away what the replication capacity of a given system is. It also tends to float depending on the load, similar to the loadavg metric. Here are some of the ways to measure it:

1) Use the “UserStats” functionality from the Google patches, which is now available in Percona Server and MariaDB. This is probably the easiest and most accurate approach, but it does not work in the Oracle MySQL Server. Set userstat_running=1 and run the following query:

mysql> select * from information_schema.user_statistics where user="#mysql_system#" \G
*************************** 1. row ***************************
USER: #mysql_system#
TOTAL_CONNECTIONS: 1
CONCURRENT_CONNECTIONS: 0
CONNECTED_TIME: 446
BUSY_TIME: 74
CPU_TIME: 0
BYTES_RECEIVED: 0
BYTES_SENT: 63
BINLOG_BYTES_WRITTEN: 0
ROWS_FETCHED: 0
ROWS_UPDATED: 127576
TABLE_ROWS_READ: 4085689
SELECT_COMMANDS: 0
UPDATE_COMMANDS: 119127
OTHER_COMMANDS: 89557
COMMIT_TRANSACTIONS: 90259
ROLLBACK_TRANSACTIONS: 0
DENIED_CONNECTIONS: 1
LOST_CONNECTIONS: 0
ACCESS_DENIED: 0
EMPTY_QUERIES: 0
1 row in set (0.00 sec)

In this case CONNECTED_TIME is 446 seconds, and out of that the replication thread was busy (BUSY_TIME) for 74 seconds, which means the replication capacity is 446/74 ≈ 6.
You normally would not want to measure it from server start, but rather take the difference in these counters every 5 minutes or over another interval of your choice, as sketched below.
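A sketch of that interval measurement (Percona Server / MariaDB only; 5-minute window, and it assumes the replication thread did at least some work in the interval):

#!/bin/sh
# Replication capacity from USER_STATISTICS deltas over a 5-minute window.
Q="SELECT busy_time FROM information_schema.user_statistics
   WHERE user = '#mysql_system#'"
b1=$(mysql -N -e "$Q"); sleep 300; b2=$(mysql -N -e "$Q")
echo "replication load:     $(echo "($b2 - $b1) / 300" | bc -l)"
echo "replication capacity: $(echo "300 / ($b2 - $b1)" | bc -l)"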

2) Use the full slow query log and mk-query-digest. This method is great for a one-time measurement, especially as it also gives you the list of queries which load replication the most. It works only with statement-level replication, however. You need to set long_query_time=0 and log_slow_slave_statements=1 for this method to work. Get the log file, which will include all queries the MySQL server ran with their times, and run mk-query-digest with a filter to only check queries from the replication thread:

mk-query-digest slow-log --filter '($event->{user} || "") =~ m/[SLAVE_THREAD]/' > /tmp/report-slave.txt

In the report you will see something like this as a header:

# 475s user time, 1.2s system time, 80.41M rss, 170.38M vsz
# Current date: Mon Jul 19 15:12:24 2010
# Files: slow-log
# Overall: 1.22M total, 1.27k unique, 558.56 QPS, 0.37x concurrency ______
# total min max avg 95% stddev median
# Exec time 819s 1us 92s 669us 260us 120ms 93us
# Lock time 28s 0 166ms 23us 49us 192us 25us
# Rows sent 4.27k 0 325 0.00 0 1.04 0
# Rows exam 30.88M 0 1.28M 26.48 0 3.07k 0
# Time range 2010-07-19 14:35:53 to 2010-07-19 15:12:22
# bytes 350.99M 5 1022.34k 301.01 719.66 5.75k 124.25
# Bytes sen 1.94M 0 9.42k 1.67 0 110.38 0
# Killed 0 0 0 0 0 0 0
# Last errn 34.11M 0 1.55k 29.26 0 185.83 0
# Merge pas 0 0 0 0 0 0 0
# Rows affe 875.19k 0 17.55k 0.73 0.99 25.61 0.99
# Rows read 2.20M 0 14.83k 1.88 1.96 24.68 1.96
# Tmp disk 4.15k 0 1 0.00 0 0.06 0
# Tmp table 14.19k 0 2 0.01 0 0.14 0
# Tmp table 8.30G 0 2.01M 7.12k 0 117.75k 0
# 0% (5k) Filesort
# 0% (5k) Full_join
# 0% (7k) Full_scan
# 0% (10k) Tmp_table
# 0% (4k) Tmp_table_on_disk

There is a lot of interesting information you can find in this header, but in relation to replication capacity, you can get the replication load, which is the same as the “concurrency” figure (0.37x). The concurrency as reported by mk-query-digest is the sum of query execution time divided by the time range the log file covers. In this case, as we know there is only one replication thread, it is the same as the replication load. This gives us a replication capacity of 1/0.37 = 2.70.

This method should work with the original MySQL Server in theory, though I have not tested it. Some versions had log_slow_slave_statements unreliable, and you may also need to adjust the regular expression to match the user name the replication thread uses.

3) Processlist polling. This method is simple – the slave SQL thread has a different status in SHOW PROCESSLIST depending on whether it is processing a query or simply waiting. By polling the processlist frequently (for example, 10 times a second) we can compute the approximate percentage of time the thread was busy vs idle. Of course, polling the processlist very aggressively can be an overhead, especially on a busy system with a lot of connections:

mysql> show processlist;
+--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
| Id     | User        | Host      | db   | Command | Time | State                                                                 | Info             |
+--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
| 801812 | system user |           | NULL | Connect | 2665 | Waiting for master to send event                                      | NULL             |
| 801813 | system user |           | NULL | Connect |    0 | Has read all relay log; waiting for the slave I/O thread to update it | NULL             |
| 802354 | root        | localhost | NULL | Query   |    0 | NULL                                                                  | show processlist |
+--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
3 rows in set (0.00 sec)
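A crude polling sketch along these lines (10 samples per second for one minute; it treats the SQL thread as busy whenever the idle “Has read all relay log” state is absent, so don’t run it while replication is stopped):

#!/bin/sh
# Approximate the SQL thread busy percentage by polling the processlist at ~10 Hz.
samples=600; busy=0; i=0            # 600 samples = one minute
while [ $i -lt $samples ]; do
  idle=$(mysql -N -e "SHOW PROCESSLIST;" | grep -c 'Has read all relay log')
  [ "$idle" -eq 0 ] && busy=$((busy + 1))
  i=$((i + 1)); sleep 0.1
done
echo "replication load: $(echo "$busy / $samples" | bc -l)"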

4) Slave catchup / binlog application method. We can just get a spare server with backups restored on it and apply binary logs to it. If 1 hour worth of binary logs applies in 10 minutes, we have a replication capacity of 6. The challenge of course is having a spare server around, and it is quite labor-intensive. At the same time, it can be a good measurement to take during backup recovery trials, when you’re doing this activity anyway. Using this method you can also measure “cold” vs “hot” replication capacity, as well as how long replication warmup takes. It is very typical for servers with cold caches to perform a lot slower than when they are warmed up. Measuring the time for each binary log separately should give you these numbers.

The less intrusive variant, which can be done in production (especially if you have a slave which is used for backups/reporting etc.), is to stop replication for some time and then see how long it takes to catch up. If you paused replication for 10 minutes and it took 5 minutes to catch up, your replication capacity is 3 (not 2), because you not only had to process the events for the outstanding 10 minutes but also those for the 5 minutes it took to catch up. The formula is (Time_Replication_Paused + Time_Took_To_Catchup) / Time_Took_To_Catchup.
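A sketch of the pause-and-measure variant (the pause length is illustrative; catch-up is detected via Seconds_Behind_Master dropping back to 0, which assumes replication keeps running):

#!/bin/sh
# Pause the SQL thread, let lag build, then time the catch-up.
pause=600                                   # 10 minutes
mysql -e "STOP SLAVE SQL_THREAD;"
sleep $pause
mysql -e "START SLAVE SQL_THREAD;"
start=$(date +%s)
lag=1
while [ "$lag" != "0" ]; do
  sleep 5
  lag=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master:/ {print $2}')
done
catchup=$(( $(date +%s) - start ))
# capacity = (pause + catchup) / catchup, per the formula above
echo "replication capacity: $(echo "($pause + $catchup) / $catchup" | bc -l)"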

So how much replication capacity do you need in a healthy system? It depends a lot on many things, including how fast you need to be able to recover from backups and how much your load varies. A lot of systems have special requirements on the time warmup takes too (and there are different things you can do about that). First, I would measure replication capacity over 5-minute intervals (or something similar), because it tends to vary a lot. Then I would suggest ensuring the loaded replication capacity is at least 3 during peak load and 5 during normal load. This applies to normal operational load – if you push heavy ALTER TABLE statements through replication, they will surely drag your replication capacity down for their duration.

One more thing about these methods – methods 1, 2 and 3 work well only if the replication capacity is above 1, i.e. the system is caught up. If it is less than 1, meaning the master writes more binary logs than the slave can process, they will show a number close to 1. Method 4, however, works even if replication can’t ever catch up – if 1 hour worth of binary logs takes 2 hours to apply, your replication capacity is 0.5.
