Monday, October 1, 2012

With MySQL 5.6 coming closer to its release, I got a chance to sneak a look into Row Based Replication (RBR). The 5.6 release of MySQL will be packed with replication features, and it is for this reason that I write another post on the new enhancements in Row Based Replication (RBR). RBR provides a safe way of replicating between master and slave, and that is why RBR enhancements become even more important. RBR in 5.6 is far more optimized than it was before, and we hope it will be fun to deploy, maintain and use.

Following are some interesting enhancements to RBR in MySQL 5.6 Replication.

1. Optimized row image:

Large data transfers between master and slave while using RBR are a thing of the past. Why log complete rows on the master? Why not just the changed columns? MySQL 5.6 enables a DBA to configure the master to log only the changed columns in the row. The Before Image (BI) will contain the columns needed to identify the row, and the After Image (AI) will contain just the columns that have changed.
  

Before MySQL 5.6: 

As one can observe in the illustration, the row images were full, i.e. all the columns in the row were logged for both the Before Image (BI) and the After Image (AI).

In MySQL 5.6:

In MySQL 5.6 the DBA can configure whether a full, minimal or NOBLOB image of the rows should be logged in the binary log, in the following ways:

1. Start the server with --binlog-row-image=MINIMAL|FULL|NOBLOB
2. SET SESSION|GLOBAL binlog_row_image= MINIMAL|FULL|NOBLOB
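
For example, to switch a running server to minimal row images (a quick sketch using the option values listed above):

mysql> SET GLOBAL binlog_row_image = 'MINIMAL';
mysql> SHOW VARIABLES LIKE 'binlog_row_image';
+------------------+---------+
| Variable_name    | Value   |
+------------------+---------+
| binlog_row_image | MINIMAL |
+------------------+---------+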

With a minimal image, if there is a Primary Key (PK) or any kind of Unique Index/Key (UK) that can identify a row uniquely on the slave, then the BI contains just those key columns. In the After Image (AI) only the changed columns are logged (see the example below).
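
To make this concrete, here is a small hypothetical example (table name and data invented for illustration). With binlog_row_image=MINIMAL, the UPDATE below identifies the row by its primary key alone:

mysql> CREATE TABLE t (id INT PRIMARY KEY, a INT, b INT);
mysql> INSERT INTO t VALUES (1, 10, 20);
mysql> UPDATE t SET a = 11 WHERE id = 1;

Here the BI carries only id = 1 (the PK, enough to find the row on the slave) and the AI carries only a = 11; column b does not appear in either image.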


If NOBLOB is used, the BLOB fields will be skipped from logging unless they have changed. If FULL is used, logging will continue as before.
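
A similarly hypothetical illustration for NOBLOB (table invented):

mysql> CREATE TABLE doc (id INT PRIMARY KEY, title VARCHAR(100), body BLOB);
mysql> UPDATE doc SET title = 'revised' WHERE id = 1;

With NOBLOB, the body column is left out of the row images for this update because it did not change; an UPDATE that modified body would log it as usual.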

As you can see, the amount of data transferred can be reduced considerably, especially if you have proper indexes/keys on the tables on the master and the slave. A detailed post on this enhancement can be found here.

2. Batch operations on tables without PK or UK:

I have already mentioned the details of this enhancement in a previous post. This is one of the long awaited optimizations for RBR. It enables the DBA to configure which algorithm will be used on the slave to find the appropriate row for updating/deleting. We have introduced the concept of HASH_SCAN, in which an in-memory hash is created on the slave, providing a way to find and apply the rows in a row event in O(n) time. This enhancement provides great scope for performance improvement on the slave for events logged in row format.

3. Informational Log events: 

One of the difficulties DBAs have faced till now while using RBR is that it is difficult to maintain. The row images are binary in nature and are not suitable for human "consumption". This is where this enhancement comes in handy. There are two ways of using it:

    1. Enabling the switch --binlog-rows-query-log-events at server startup
    2. Setting the corresponding session variable.
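
For instance, to enable it just for the current session (a small sketch; the same setting is also available as a global variable):

mysql> SET SESSION binlog_rows_query_log_events = ON;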

Once enabled, this causes the original query for the row event to be logged in the binary log as a special event, which the DBA can inspect using one of the two methods below.

1. One can check these special events in the binary log by using SHOW BINLOG EVENTS. The query corresponding to the row event is logged as an event of type Rows_query. Below is one such example:

mysql> SHOW BINLOG EVENTS;
+---------------+-----+-------------+-----------+-------------+-----------------------------------------------+
| Log_name      | Pos | Event_type  | Server_id | End_log_pos | Info                                          |
+---------------+-----+-------------+-----------+-------------+-----------------------------------------------+
| master.000001 |   4 | Format_desc |         1 |         114 | Server ver: 5.6.7-m5-debug-log, Binlog ver: 4 |
| master.000001 | 114 | Query       |         1 |         200 | use `test`; CREATE TABLE NUM_INTS (a INT)     |
| master.000001 | 200 | Query       |         1 |         268 | BEGIN                                         |
| master.000001 | 268 | Rows_query  |         1 |         323 | # INSERT INTO NUM_INTS VALUES (1), (2), (3)   |
| master.000001 | 323 | Table_map   |         1 |         364 | table_id: 54 (test.NUM_INTS)                  |
| master.000001 | 364 | Write_rows  |         1 |         408 | table_id: 54 flags: STMT_END_F                |
| master.000001 | 408 | Query       |         1 |         477 | COMMIT                                        |
+---------------+-----+-------------+-----------+-------------+-----------------------------------------------+

In the above terminal output (snipped), one can see that the event at Pos = 268 is an informational event for the Table_map and Write_rows events that follow.


2. Another way is to use the mysqlbinlog utility with the -vv switch (verbosity level 2 or more):

shell> mysqlbinlog -vv master.000001
# at 268
#110401 14:24:29 server id 1  end_log_pos 323   Rows_query
# INSERT INTO NUM_INTS VALUES (1), (2), (3)
# at 323
#110401 14:24:29 server id 1  end_log_pos 364   Table_map: `test`.`NUM_INTS` mapped to number 54
# at 364
#110401 14:24:29 server id 1  end_log_pos 408   Write_rows: table id 54 flags: STMT_END_F

Conclusion: 

So we see how MySQL 5.6 replication provides a host of features that enable a better, faster, more reliable slave that is easy to maintain. So why not go try it out?

Tuesday, August 28, 2012



The Problem:

Some time back, a member of the MySQL support team informed us that slaves become very sluggish when a delete query is run on the master. "What may be the problem?" we wondered. This was with RBR, and it is pretty usual to have slower slaves when using RBR as compared to SBR. We requested some more details and found that it was something else: a problem that shows up when a lot of rows are modified in a table without a PRIMARY KEY.

"UPDATE t1 set a = a+1;" and other queries such as this are generally found to make slaves sluggish if the table does not have proper index. Is it a problem with the fact that a lot of data is being transferred over the network? After all the total amount of data that will be transferred over the network will be more than twice (before and after image + some header) the sizeof data that is modified. Could this be the problem? We decided to confirm this and found that the major chunk of the total time is taken while applying the event present in the relay-log.

Before going forward, let's try to understand how the row events are applied on the slave.

A standard row event looks as follows.

+----------------------------+
|     Header: table id       |
|   column information etc.  |
+--------------+-------------+
| BEFORE IMAGE | AFTER IMAGE |
+--------------+-------------+
| BEFORE IMAGE | AFTER IMAGE |
+--------------+-------------+

As you can see here, each row event has a set of rows which were modified. For each row we store the Before Image (BI) and the After Image (AI)[1]. When these events are read from the relay log, we do the following in case no primary key is present on the slave:

for each row in the event do
{
  search for the correct row to be modified using BI
  replace the row in the table with the corresponding AI
}

Here, finding the appropriate row is done using a table scan, which itself has linear complexity; doing this once per row in the event gives the overall algorithm a complexity of O(n^2).

Pretty costly, right? Yes it is, and it is for this reason that we decided to optimize this approach.

The Fix:

The fix we came up with is pretty simple and yet elegant: we decided to use an in-memory hash. We hash the rows in the event, generating the key from the BI. We store the BI in the hash as well, to bail us out in case of a collision, which is quite possible since the BIT and BLOB fields are not considered while creating the hash key.

If we take HASH() to be the hashing function, then the hash contains:

|---KEY----||---structure containing ----|
            |-----the BI and the AI------| 
+-----------+--------------+-------------+
| HASH (BI) | BEFORE IMAGE | AFTER IMAGE |
+-----------+--------------+-------------+

We then iterate over each row in the table, hash it and check whether the key exists in the hash. If so, we apply the AI corresponding to that key in the hash to the row in the table.

The above can be explained using this pseudo-code.

for each row in the event do
{
  hash the row. 
}

for each row in the table do
{
  key= hash the row;
  if (key is present in the hash)
  {
    apply the AI to the row.
  }
}

This can run in linear time since the search in the hash takes constant time.
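
To put rough, hypothetical numbers on it: for an event carrying 10,000 rows applied against a 10,000-row table, the table-scan approach performs on the order of 10,000 x 10,000 = 10^8 row reads, whereas the hash approach performs about 2 x 10,000 row operations: one pass to build the hash and one pass over the table.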

Along with this, another optimization has been done for the case when the table has a non-unique index. We hash the rows in the event and store the distinct index values in a list. We then iterate over this key set, fetch the rows corresponding to each index value and iterate over them. This prevents us from iterating over all the rows in the table, touching only the rows with the given set of index values. In the rest of this post we will write HASH OVER TABLE (Ht) for the previous scenario and HASH OVER INDEX (Hi) for this one.

This can be explained using this pseudo-code:

for each row in the event do
{
  hash the row.
  store its index value in a list of distinct keys.
}

for each key value in the distinct key list do
{
  for each table row fetched through the index for this key value do
  {
    key= hash the row;
    if (key is present in the hash)
    {
      apply the AI to the row.
    }
  }
}

How to use:

To use this, start the slave with

--slave-rows-search-algorithms=HASH_SCAN|TABLE_SCAN|INDEX_SCAN

or a combination of the three. You may also change the value of the variable @@slave_rows_search_algorithms on a running server.
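
For example (a brief sketch), to let a running slave use hash scans alongside index scans:

mysql> SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';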

For further information regarding usage, check the MySQL developers' page.

The following matrix shows how the algorithm is chosen for a given table.

KEYS AVAILABLE ON THE TABLE
    - PK  --> Primary Key
    - NNUK--> Not Null Unique Key
    - K   --> Key
SCAN ALGORITHMS
    - I   --> Index scan / search
    - T   --> Table scan
    - H   --> Hash scan (resolved to Hi or Ht as below)
    - Hi  --> Hash over index
    - Ht  --> Hash over the entire table

    +--------------+-----------+------+------+------+
    | Index\Option | I , T , H | I, T | I, H | T, H |
    +--------------+-----------+------+------+------+
    | PK / NNUK    | I         | I    | I    | Hi   |
    | K            | Hi        | I    | Hi   | Hi   |
    | No Index     | Ht        | T    | Ht   | Ht   |
    +--------------+-----------+------+------+------+
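
Reading the matrix: with --slave-rows-search-algorithms='INDEX_SCAN,HASH_SCAN' (the "I, H" column), a table with a primary key is still searched through its index (I), a table with only a non-unique key uses hash over index (Hi), and a table with no index at all uses hash over the entire table (Ht).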

Pros and Cons:
As you must have figured out, the implementation of HASH_SCAN uses an in-memory hash, which may cause performance to degrade if the amount of memory available is low. Also, if the average number of rows modified per transaction is small, the performance improvement may not be very high.

Performance:

We ran two slaves, one with HASH_SCAN and one with TABLE_SCAN, and found the results to be as expected.

The following graphs show the improvements in the case of Ht and Hi:

1. Hash over index.
2. Hash over table.
Note: These runs were made as a comparative analysis, with two slaves connected to one master on a single machine, and should not be used for benchmarking.

This feature is available for trial in the latest MySQL 5.6 community edition.


[1] We are assuming that --binlog-row-image=FULL. Even if it is set to MINIMAL|NOBLOB, it won't make a difference to the point I am trying to make. See the developers' page.

Friday, March 9, 2012

A few days back, reading an article on Moore's law, I thought it is not too difficult to extend it to an analogous situation: for example, the amount of data cached by a search engine like Google also grows exponentially. With changing times we need solutions that adapt to changing data requirements, and availability solutions with virtually infinite scale-out, like the cloud. Even though virtualization is not exactly a database concept, it will not be long before we have it as a standard for RDBMSs. This article, however, is about using MySQL replication to scale out reads and make them fault-tolerant.

MySQL replication is essentially asynchronous, i.e. it does not guarantee zero latency between the master and the slave(s). This, by extension, means that in case of a master crash, data may be lost if we decide to promote one of the slaves to be the master, and that is why, while scaling out reads, we need to take care of this. A general design for a master-slave(s) configuration with reads scaled out looks like this.

MySQL Replication design: read off-loading

In the above diagram, the application connects to the web clients connected to the replication setup. The web clients segregate the queries into "reads" and "writes". This enables the web clients to redirect reads to the slaves and writes to the master, effectively off-loading the master. This improves performance and enables read-intensive tasks like reporting and consolidation applications to run without hampering the performance of the master. A problem can however arise if the master has a lot of writes happening, thereby causing the slaves to be loaded as well.

Fault Tolerant Systems: One of the basic requirements of a modern-day system is fault tolerance. A master and slave(s) system is capable of tolerating faults caused by failures on the master. This, however, requires the DBA (or an external program) to monitor for such failures, promote the (most up-to-date) slave to take up the role of the master, and change the topology by re-configuring the slaves to take updates from the new master.

An effective way of monitoring and managing the clients is to create a layer that monitors the topology and sends out the read and write queries to the master-slave(s) system.


Virtual layer for interaction between the RDBMS and the Application

Essentially this layer will have three modules: the load balancer, the connection pool manager and the topology monitor. As the names suggest, the load balancer balances the read load across the various slaves, the connection pool manager holds the records of connections to the various servers, and the topology monitor watches the systems and performs an automatic failover sequence (as described earlier) to restore the system to a working condition with no or minimal loss of data.

Sunday, February 19, 2012

Hello Folks,

I have never been consistent with blogging; most of the time
I create a blog hoping that I will share my thoughts and plans
with everyone, but, as it has to happen, I neglect it.

This new year I made a resolution that I will work on this blog
(actually it was to create one) and maintain it throughout. This
blog will be a reflection of my ideas, and an attempt to get feedback on
my work in MySQL replication, together with providing tips and tricks
on using MySQL in distributed environments.

So, last July I joined the MySQL replication development team, straight out of
college, and it is kind of a big thing for me. In the past 7-odd months I have
worked with great people and learnt a lot of things, and this has finally led
to the idea of starting this blog and posting my thoughts in my spare time.

This blog will be particularly useful for people who work with MySQL, be it
on the internals or on using MySQL in distributed systems. I guess it is
too early to tell what one can expect from this blog, but I will try to
keep it interesting and the posts... umm, informative.