InnoDB recovery is now faster…much faster!

One of the well known and much written about complaint regarding InnoDB recovery is that it does not scale well on high-end systems. Well, not any more. In InnoDB plugin 1.0.7 (which is GA) and plugin 1.1 (which is part of MySQL 5.5.4) this issue has been addressed. Two major improvements, apart from some other minor tweaks, have been made to the recovery code. In this post I’ll explain these issues and the our solution for these.

First issue reported here is about available memory check eating up too much CPU. During recovery, the first phase, called redo scan phase, is where we read the redo logs from the disk and store them in a hash table. In the second phase, the redo application phase, these redo log entries are applied to the data pages. The hash table that stores the redo log entries grows in the buffer pool i.e.: memory for the entries is allocated in 16K blocks from the buffer pool. We have to ensure that the hash table does not end up allocating all the memory in the buffer pool leaving us with no room to read in pages during the redo log application phase. For this we have to keep checking the size of the heap that we are using for allocating the memory for the hash table entries. So why would it kill the performance? Because we do not have the total size of the heap available to us. We calculate it by traversing the list of blocks so far allocated. Imagine if we have gigabytes or redo log to apply (it can be up to 4G). That would mean hundreds of thousands of blocks in the heap! And we have to make a check roughly whenever we are reading in a new redo page during scan. An O(n * m) algorithm where ‘n’ is number of blocks in the heap and ‘m’ is number of redo pages that have to be scanned.

What is the solution we came up with? Store the total size of a heap in its header. Simple and effective. Our algorithm now becomes O(m).

Lets talk about the second issue reported here. During the redo log application phase, data pages to which redo log entries are applied are to be inserted in a list called flush_list which is ordered by the LSN of the earliest modification to a page i.e.: oldest_modification. During the normal working the LSN increases monotonically therefore the insertion to the flush_list always happens at the head. But during recovery we have to linearly search the flush_list to find the appropriate spot for insertion. The length to which the flush_list can grow is the number of modified pages (called dirty pages in db parlance) we had at the time of crash. On high-end system with multi-gigabyte buffer pools this number can be very high i.e.: million or more dirty pages. A linear search for insertion won’t scale.

There has been a talk in the community about how to fix this and various solutions have been suggested and some were implemented by the community as well. What we, at InnoDB, have finally implemented is to have an auxiliary data structure (a red-black tree in this case) which is active only during the recovery phase and which is used to speed up sorted insertions in the the flush_list. The flush list remains a list and after the recovery is over the red black is tree is discarded as during the normal operations we only ever append to the flush_list.

So much for the theory. Now let us see if we can walk the talk. To evaluate the effectiveness of this fix we’d need a crash when there are a lot of redo logs to apply and there are a lot of dirty pages. I requested Michael Izioumtchenko (InnoDB’s QA supremo) to come up with something. And he did the following:
The dataset was obtained by running a 60m sysbench readwrite uniform distribution in memory workload with prewarmed cache using the following configuration parameters:
–innodb-buffer-pool-size=18g
–innodb-log-file-size=2047m
–innodb-adaptive-flushing=0
–innodb-io-capacity=100
The latter two are used to throttle flushing in order to maximize the number of dirty pages.
It took only about 20 min of running a workload to arrive to the test dataset, including cache prewarming.
So at time of crash we had:
Modified db pages  1007907
Redo bytes: 3050455773
And the recovery times were:

Plugin 1.0.7 (also Plugin 1.1): 1m52s scan, 12m04s apply, total 13m56s
Plugin 1.0.6: 31m39s scan, 7h06m21s apply, total 7h38m
1.0.7 (and Plugin 1.1) is better 16.95x on scan, 35.33x on apply, 32.87x overall
Note that all this comes to you transparently. You don’t have to set any parameter to take advantage of this feature. My only suggestion would be to use as large log files (there is a limit of 4G on total log file size) as you can. I know users have been using smaller log files to avoid recovery running into hours. They have taken a hit on throughput during normal running to avoid longer recovery time. You don’t have to do that any more. InnoDB recovery will never run into hours. It’s a guarantee!

17 Responses

  1. VadimTk Says:

    http://www.innodb.com/wp/products/innodb_plugin/download/

    There is no 1.0.7 in downloads, where we can get it

    [Reply]

    Inaam Rana Reply:

    @VadimTk,

    Sorry your comment got stuck in moderation for some reason. As I mentioned to Peter, plugin 1.0.7 is part of mysql 5.1.46 which is available for download from mysql site as of today.

    regards,
    inaam

    [Reply]

  2. Andy Says:

    XtraDB also has their own patch on improved recovery time.

    How does your solution compare to theirs?

    [Reply]

  3. MySQL 5.5.4 is Very Exicting « Jeremy Zawodny's blog Says:

    [...] InnoDB Recovery is WAY FASTER in this release.  This is the result of some small but important changes to the algorithms used when dealing with the redo log and replaying transactions. At least one O(n*n) operation has become O(n log n) and other optimizations have contributed to an order of magnitude improvement in recovery time. [...]

  4. Inaam Rana Says:

    Andy,

    Best thing would be to test both patches. If you are able to do that please share the results with us.

    Having said this, following are my observations about their solution based on what I have read:

    1) It does not deal with the redo scan phase issue that is mentioned above.
    2) The solution for redo application phase is to put the blocks to the flush_list after changing buf_page::oldest_modification such that the block goes to the end of the list. This, to my taste, is a bit hackish.
    3) By changing buf_page::oldest_modification to be the same as that of the last block in the flush_list and then appending to the flush_list may be ok from correctness perspective but that would mean a lot of blocks in the flush_list will have same oldest_modification when the recovery is over. That can result in a very heavy first checkpoint after the recovery.

    If you are able to test it and report your findings here that will be fabulous.

    inaam

    [Reply]

  5. Marcus Bointon Says:

    Is this likely to improve backup times with xtrabackup et al, since that effectively does a recovery?

    [Reply]

  6. Peter Zaitsev Says:

    Inaam,

    Thanks for great explanation on how recovery works and for “official” fix for this issue. I see you mention Innodb Plugin 1.0.7 which is GA, I however could not locate it to download. Was it published ?

    [Reply]

  7. Inaam Rana Says:

    Peter,

    Plugin 1.0.7 will be part of the upcoming MySQL 5.1.46.

    [Reply]

  8. Peter Zaitsev Says:

    Thanks Inaam ! Looking forward to it.

    [Reply]

  9. MC Says:

    Hi Inaam,
    I am using Fedora 9 ,i want to use My sql there , how to do that without an internet connection?

    [Reply]

  10. emlak ilanları Says:

    Thank you very mach.

    [Reply]

  11. Transactions on InnoDB » Blog Archive » InnoDB recovery gets even faster in Plugin 1.1, thanks to native AIO Says:

    [...] on top of what we already have in Plugin 1.0.7. The details on the latter are available in this blog. Yet, when I tried to recover another big recovery dataset I created, I got the following results [...]

  12. Jai Patel Says:

    Hi Inaam,

    Thanks for sharing this.

    Jai Patel, Yaar@KY….
    http://www.kyayaar.com

    [Reply]

  13. aditya Says:

    sir everytime i run mysql-essential5.1.48 from mysql gui tools it displays error
    ‘2006′.
    Sir,please suggest me an idea how to tackle this problem.
    thanks a lot…

    [Reply]

  14. Shelia Says:

    hanks for great explanation on how recovery works and for “official” fix for this issue.

    [Reply]

  15. Cvecara Beograd Says:

    Is there a way for partial table backup. Since we are dealing with large tables it would be helpfull

    [Reply]

  16. S. Kamal Says:

    Hi Inaam,
    Sorry to chime in with an unrelated question.
    I have a table with indices. After I truncate the table and re-populate, will the indices remain intact and be updated automatically as soon as it gets re-populated? If not, what can I do?

    [Reply]

Leave a Comment

Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.