May 17

InnoDB Plugin 1.1 doesn’t add any recovery-specific improvements on top of what we already have in Plugin 1.0.7. The details on the latter are available in this blog. Yet, when I tried to recover another big recovery dataset I had created, I got the following results for total recovery time:

  • Plugin 1.0.7: 46min 21s
  • Plugin 1.1: 32min 41s

Plugin 1.1 recovery is about 1.4 times faster. Why would that happen? The numerous concurrency improvements in Plugin 1.1 and MySQL 5.5 can’t really affect recovery. The honor goes to Native Asynchronous IO on Linux. Let’s try without it:

  • Plugin 1.1 with --innodb-use-native-aio=0: 49min 07s

which is about the same as the 1.0.7 time. My numerous other recovery runs showed that random fluctuations account for 2-3min of a 30-45min test.
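
The ratios can be checked directly from the reported times. The snippet below is only a small arithmetic sketch; the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a 'MMmin SSs' measurement into seconds."""
    return minutes * 60 + seconds

# Total recovery times as reported in the list above.
plugin_107 = to_seconds(46, 21)
plugin_11 = to_seconds(32, 41)
plugin_11_no_aio = to_seconds(49, 7)

# How much faster Plugin 1.1 is with native AIO enabled.
speedup = plugin_107 / plugin_11

# Difference between 1.0.7 and 1.1 without native AIO, in seconds.
gap_without_aio = plugin_11_no_aio - plugin_107

print(f"speedup with native AIO: {speedup:.2f}x")
print(f"1.1 without native AIO vs 1.0.7: {gap_without_aio} s apart")
```

The gap between the last two runs is under three minutes, which is within the run-to-run fluctuation mentioned above.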

Apr 16

What a busy week! Lots of MySQL 5.5 announcements that just happened to coincide with the MySQL Conference and Expo in Silicon Valley. Here are some highlights of the performance and scalability work that the InnoDB team was involved with.

A good prep for the week of news is the article Introduction to MySQL 5.5, which includes information about the major performance and scalability features. That article will lead you into the MySQL 5.5 manual for general features and the InnoDB 1.1 manual for performance & scalability info.

Then there were the conference presentations from InnoDB team members, which continued the twin themes of performance and scalability:

Apr 14
With the exception of Windows, InnoDB has used ‘simulated AIO’ on all platforms to perform certain IO operations. The IO requests performed in the ‘simulated AIO’ way are the write requests and the readahead requests for the datafile pages. Let us first look at what ‘simulated AIO’ means in this context.

We call it ‘simulated AIO’ because it appears asynchronous from the context of a query thread, but from the OS perspective the IO calls are still synchronous. The query thread simply queues the request in an array and then returns to its normal work. One of the IO helper threads, which are background threads, then takes the request from the queue and issues a synchronous IO call (pread/pwrite), meaning it blocks on the IO call. Once it returns from the pread/pwrite call, this helper thread calls the IO completion routine on the block in question, which includes merging buffered operations, if any, in the case of a read. In the case of a write, the block is marked as ‘clean’ and is removed from the flush_list. Some other bookkeeping also happens in the IO completion routine.
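
The pattern described above can be sketched in a few lines of Python. This is only an illustration of the queue-plus-helper-thread idea, not InnoDB code: the names `requests`, `submit_read`, `io_helper` and `io_completion` are made up, a single helper thread is assumed, and only reads are modelled.

```python
import os
import queue
import tempfile
import threading

# Hypothetical names throughout; this mimics the pattern, not InnoDB's code.
requests = queue.Queue()   # where "query threads" queue their IO requests
completed = []             # filled in by the completion routine

def io_completion(offset, data):
    # Stand-in for the IO completion routine (bookkeeping on the block).
    completed.append((offset, data))

def io_helper(fd):
    # Background helper thread: it blocks on each synchronous pread call.
    while True:
        req = requests.get()
        if req is None:        # sentinel: shut the helper down
            break
        offset, length = req
        data = os.pread(fd, length, offset)   # synchronous from the OS's view
        io_completion(offset, data)

def submit_read(offset, length):
    # Called by a "query thread": queue the request and return immediately.
    requests.put((offset, length))

with tempfile.TemporaryFile() as f:
    f.write(b"A" * 16 + b"B" * 16)
    f.flush()
    helper = threading.Thread(target=io_helper, args=(f.fileno(),))
    helper.start()
    submit_read(0, 16)     # returns at once; the helper does the blocking
    submit_read(16, 16)
    requests.put(None)
    helper.join()

print(completed)
```

The caller never blocks on the read itself, which is what makes the IO look asynchronous from its side even though every `pread` is an ordinary blocking call.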

What we have changed in the InnoDB Plugin 1.1 is to use the native AIO interface on Linux. Note that this feature requires libaio to be installed on your system. libaio is a thin wrapper around the kernelized AIO on Linux. It is different from POSIX AIO, which requires user-level threads to service AIO requests. There is a new boolean switch, innodb_use_native_aio, to choose between simulated and native AIO; the default is to use native AIO.

Apr 13
One of the well-known and much-written-about complaints regarding InnoDB recovery is that it does not scale well on high-end systems. Well, not any more. In InnoDB plugin 1.0.7 (which is GA) and plugin 1.1 (which is part of MySQL 5.5.4) this issue has been addressed. Two major improvements, apart from some other minor tweaks, have been made to the recovery code. In this post I’ll explain these issues and our solutions to them.

The first issue, reported here, is that the available-memory check eats up too much CPU. During recovery, the first phase, called the redo scan phase, is where we read the redo logs from the disk and store them in a hash table. In the second phase, the redo application phase, these redo log entries are applied to the data pages. The hash table that stores the redo log entries grows in the buffer pool, i.e., memory for the entries is allocated in 16K blocks from the buffer pool. We have to ensure that the hash table does not end up allocating all the memory in the buffer pool, leaving us with no room to read in pages during the redo log application phase. For this we have to keep checking the size of the heap that we are using to allocate memory for the hash table entries.

So why would it kill performance? Because we do not have the total size of the heap available to us. We calculate it by traversing the list of blocks allocated so far. Imagine that we have gigabytes of redo log to apply (it can be up to 4G). That would mean hundreds of thousands of blocks in the heap! And we have to make this check roughly whenever we read in a new redo page during the scan. The result is an O(n * m) algorithm, where ‘n’ is the number of blocks in the heap and ‘m’ is the number of redo pages that have to be scanned.

What is the solution we came up with? Store the total size of a heap in its header. Simple and effective. Our algorithm now becomes O(m).
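
To make the difference concrete, here is a minimal sketch of the two approaches. The classes `WalkingHeap` and `HeaderHeap` are hypothetical stand-ins, not InnoDB's memory heap; the only figures taken from the post are the 16K block size and the 4G upper bound on the redo log.

```python
BLOCK_SIZE = 16 * 1024   # memory is allocated in 16K blocks

class WalkingHeap:
    """Old behaviour: the size is computed by walking the block list."""
    def __init__(self):
        self.blocks = []

    def add_block(self):
        self.blocks.append(BLOCK_SIZE)

    def size(self):
        return sum(self.blocks)        # O(n) in the number of blocks

class HeaderHeap:
    """New behaviour: the running total is kept in the heap 'header'."""
    def __init__(self):
        self.blocks = []
        self.total_size = 0

    def add_block(self):
        self.blocks.append(BLOCK_SIZE)
        self.total_size += BLOCK_SIZE  # maintained on every allocation

    def size(self):
        return self.total_size         # O(1)

def scan(heap, redo_pages):
    """Simulate the scan phase: one size check per redo page read."""
    last = 0
    for _ in range(redo_pages):
        heap.add_block()
        last = heap.size()
    return last

# Both heaps report the same size; only the cost of asking differs.
size_old = scan(WalkingHeap(), 1000)
size_new = scan(HeaderHeap(), 1000)

# Number of 16K blocks needed for 4G of redo log entries.
blocks_for_4g = (4 * 1024 ** 3) // BLOCK_SIZE
print(size_old, size_new, blocks_for_4g)
```

With `WalkingHeap`, every check walks the whole list, so the total work grows with n * m; with `HeaderHeap`, each check is a single field read, leaving only the m checks themselves.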

Apr 13

Performance Schema Support in InnoDB

With the plugin 1.1 release, InnoDB will have full support for Performance Schema, a new feature of the MySQL 5.5 release. This allows a user to peek into some critical server synchronization events and obtain their usage statistics. On the other hand, to make sense of the instrumented results, you might need some understanding of InnoDB internals, especially in the area of synchronization with mutexes and rwlocks.

With this effort, the following four modules have been instrumented for Performance Schema.

1. Mutex
2. RWLOCKs
3. File I/O
4. Thread

Almost all mutexes (42), rwlocks (10) and 6 types of threads are instrumented. Most mutex/rwlock instrumentations are turned on by default; a few of them are enabled only under a special define. File I/O statistics are categorized into Data, Log and Temp file I/O.

This post gives you a quick overview of this new machinery.

Apr 13
Background

The original motivation behind this patch was the infamous Bug#26590, “MySQL does not allow more than 1023 open transactions”. Actually, the 1024 limit has to do with the number of concurrent update transactions that can run within InnoDB. Where does this magic number come from? 1024 is the total number of UNDO log list slots on one rollback segment header page, and in the past InnoDB created just one rollback segment header page during database creation. This rollback segment header page is anchored in the system header page. There is space there for 128 rollback segments, but only one was being created and used, resulting in the 1024 limit. Each slot in the rollback segment header array consists of {space_id, page_no}, where both space_id and page_no are of type uint32_t. Currently the space id is “unused” and always points to the system tablespace, which is tablespace 0.

Now, onto the rollback segment header page. This page contains a rollback segment header (details of which are outside the scope of this blog entry :-) ), followed by an array of 1024 UNDO slots. Each slot is the base node of a file-based linked list of UNDO logs. Each node in this file-based list contains UNDO log records, containing the data updated by a transaction. A single UNDO log node can contain UNDO entries from several different transactions.

Performance ramifications

When a transaction is started, it is allocated a rollback segment to which it writes its modifications. Multiple transactions can write to the same rollback segment, but only one transaction is allowed to write to any one UNDO slot during its lifetime. This should make clear where the 1024 limit comes from. Each rollback segment is protected by its own mutex, and when we have a single rollback segment this mutex can become a high-contention mutex.

Requirements

Backward compatibility in file formats is something we take very seriously at InnoDB. InnoDB has always had the ability to use up to 128 pages, but before this fix it created only one rollback segment. We had to figure out a way to make the multiple rollback segments change backward compatible, without breaking any assumptions in the code of older versions of InnoDB about absolute locations of system pages and changes to system data. The 128 limit is a result of the latter. While there is space for 256 rollback segments, InnoDB uses only 7 bits from that field. Once we fix that, we could enable 256 rollback segments in the future; however, 128 seems to be sufficient for now. There are other scalability issues that need to be addressed before 128K concurrent transactions become an issue :-) .
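
The limits discussed above follow from simple slot arithmetic, which the toy model below reproduces. It is a sketch only: `RollbackSegment`, `try_assign` and `start_transactions` are hypothetical names, and the round-robin choice of segment is an assumption made for the example, not a description of how InnoDB assigns segments.

```python
UNDO_SLOTS_PER_SEGMENT = 1024   # UNDO slots on one rollback segment header page
MAX_SEGMENTS = 1 << 7           # 7 bits of the field are used: 128 segments

class RollbackSegment:
    """Toy segment: one update transaction per UNDO slot."""
    def __init__(self):
        self.free_slots = UNDO_SLOTS_PER_SEGMENT

    def try_assign(self):
        if self.free_slots == 0:
            return False
        self.free_slots -= 1
        return True

def start_transactions(segments, count):
    """Try to start `count` update transactions; return how many got a slot."""
    started = 0
    n = len(segments)
    for trx in range(count):
        # Assumed policy: start at segment trx % n and probe the rest.
        for k in range(n):
            if segments[(trx + k) % n].try_assign():
                started += 1
                break
        else:
            break   # every segment is full, no point in trying further
    return started

one_segment = start_transactions([RollbackSegment()], 2000)
all_segments = start_transactions(
    [RollbackSegment() for _ in range(MAX_SEGMENTS)], 200_000)

print(one_segment)    # limit with a single rollback segment
print(all_segments)   # limit with 128 rollback segments
```

With one segment the model stops at 1024 concurrent update transactions; with 128 segments it stops at 128 × 1024 = 131072, the “128K” figure mentioned above.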

Apr 1

The InnoDB Plugin manual is now available on the MySQL web site.

Aug 11

Today, the InnoDB team announced the latest release of the InnoDB Plugin, release 1.0.4. Some of the performance gains in this release are quite remarkable!

As noted in the announcement, this release contains contributions from Sun Microsystems, Google and Percona, Inc., for which we are very appreciative. This page briefly describes each of the contributions and the way we treated them. The purpose of this post is to describe the general approach the InnoDB team takes toward third party contributions.

In principle, we appreciate third-party contributions. We simply don’t have the resources to seriously evaluate every change that someone proposes, but when we do undertake to evaluate a patch, we have some clear criteria in mind:

  • The patch has to be technically sound, reliable, and effective
  • The change should fit with the architecture, and our overall plans and philosophy for InnoDB
  • The contribution must be available to us under a suitable license

Let’s consider, in general terms, what these criteria mean in practice.

May 13

Well, it took us a little while (we’ve been busy ;-) !), but we’ve now posted our presentations on InnoDB from the MySQL Conference and Expo 2009. You can download these presentations by Heikki Tuuri, Ken Jacobs and Calvin Sun from the InnoDB website, as follows:

Descriptions of these and other presentations about InnoDB are available here.

Apr 18

Recently, it was reported (see MySQL bug #43660) that “SHOW INDEXES/ANALYZE does NOT update cardinality for indexes of InnoDB table”. The problem appeared to happen only on 64-bit systems, but not 32-bit systems. The bug turns out to be a case of mistaken identity. The real criminal here wasn’t the SHOW INDEXES or the ANALYZE command, but something else entirely. It wasn’t specific to 64-bit platforms, either. Read on for the interesting story about this mystery and its solution …

InnoDB estimates statistics for the query optimizer by picking random pages from an index. Upon detailed analysis, we found that the algorithm that picks random pages for estimation always picked the same page, thus producing the same result every time. This made it appear that the index cardinality was not updated by ANALYZE TABLE. Going deeper, the reason the algorithm always selected the same page was that the random number generator produced numbers that, when divided by 3, always gave the same remainder (2).

The sampling algorithm selects a random leaf page by starting from the root page and then selecting a random record from it, descending into its child page and so on until it reaches a leaf page. In the particular case that was reported in the bug report, the root page contained only 3 records and the tree height was only 2 (i.e., the leaf pages were all just below the root page).

You can already guess what happened. The “random” numbers generated, not being so random, caused the algorithm to always pick the same record from the root page (the second one) and then descend to the leaf page below it. Every time. So, the 8 random pages that were sampled in order to get an estimate of the whole picture were in fact the same page, even in isolated ANALYZE TABLE runs.

So, clearly there was a problem with the random number generator. But why didn’t this problem seem to appear on 32-bit platforms? The random number generator, always generating numbers of the form 3k+2 of type unsigned long, at some point wrapped around 4 billion on 32-bit machines and started generating numbers of the form 3k+1. On 64-bit machines, where unsigned long is much bigger, this wrap did not occur. It would have, had we only had enough time to wait, say, by running the test for 1000 years!
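
The effect is easy to reproduce with a toy generator. The sketch below is not InnoDB’s random number generator: `biased_numbers` is a hypothetical generator whose step is a multiple of 3, and `pick_record` assumes the record is chosen by taking the number modulo the record count. It only illustrates why a fixed remainder picks the same record every time, and why wrapping around 2^32 changes that remainder.

```python
def pick_record(rnd, n_records):
    # Assumed selection rule: random number modulo the number of records.
    return rnd % n_records

def biased_numbers(count, word_bits, start=2, step=3 * 1_000_003):
    """Hypothetical generator: each step is a multiple of 3, so the values
    keep the same remainder mod 3 until the unsigned word wraps around."""
    mask = (1 << word_bits) - 1
    x = start
    out = []
    for _ in range(count):
        out.append(x)
        x = (x + step) & mask
    return out

# 8 samples on a root page with 3 records, using a 64-bit word (no wrap).
picks_64 = [pick_record(r, 3) for r in biased_numbers(8, 64)]

# The same kind of sequence on a 32-bit word, started just below 2**32 so
# that the wrap-around happens after the first value.
near_wrap = biased_numbers(4, 32, start=2**32 - 2, step=3)
remainders_32 = [r % 3 for r in near_wrap]

print(picks_64)        # the same record index eight times
print(remainders_32)   # the remainder changes once the value wraps
```

Because 2^32 leaves remainder 1 when divided by 3, wrapping around shifts a 3k+2 sequence to 3k+1, which matches the behaviour described above. A 64-bit word would wrap in the same way, just after a vastly longer run.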
