Jul 25

InnoDB has an internal file system management module that primarily manages the space in the data files. One of the pain points was the coarse level of locking used when a data file has to be extended. More about this issue can be found here. In the latest labs release we have fixed this problem.
When we need to extend a data file inside InnoDB we write zero filled pages synchronously to the file. The user thread which is extending the data file holds fil_system::mutex during the whole operation. This mutex covers changes to all data structures related to file system management. What this means is that when we do a regular IO we do need to acquire fil_system::mutex though only for a short time. Because the thread doing the data file extension is holding the mutex during the whole IO operation any other thread (user or background) trying to access data file for regular read or write ends up waiting. This brings the whole system to a virtual stand still as no read or write activity can happen. This is true even if a thread is trying to access a data file that is not the one being extended.
We fixed this issue by introducing an internal flag to the data structure indicating that a file is being extended. Now if a user thread needs to extend a data file it does acquire the fil_system::mutex but releases it after setting the flag. Once it is done with the extension IO it resets the flag. This allows other threads to access data files while one of the file is being extended. This also allows multiple files to be extended in parallel. Our tests have shown that the issue of stalls due to file extension is indeed fixed by this approach.
A related feature which can be considered as future work is to off load the file extension to a background thread.

Jul 25

Yes, we are reintroducing random readahead. We took the random readahead code out after our internal benchmarking and community feedback showed that random readahead tends to adversely affect the performance. However, some recent feedback from customers have shown that for some specific workloads random readahead can be good. Based on this information we have decided to reintroduce random readahead albeit with a dynamic switch to turn it on or off. By default, random readahead is not enabled. You can turn it on by setting innodb_random_read_ahead = TRUE.

The term ‘random readahead’ is a bit of an oxymoron. Readahead seems more related to sequential access pattern. To clarify this I’ll try to explain how InnoDB internally triggers a readahead request. Data pages in InnoDB are laid out on the disk in chunks of 64 pages called extents. The decision about readahead is essentially about whether it makes sense to read in the entire extent instead of reading in the pages one by one as and when they are requested by the user. There are two types of readahead mechanisms inside InnoDB. There are some obvious and some subtle differences between the two.

Linear readahead is the one that decides whether or not to read in the ‘next’ extent and the decision is made based not only on the number of pages of the ‘current’ extent present in the buffer pool but also the access pattern. The ‘current’ extent means the extent which is being accessed currently by the user threads. Whenever a user thread tries to access a page in the buffer pool the linear readahead mechanism is triggered. We first see if it is a boundary page of an extent or not. If it is a boundary page we figure out how many pages of this extent are present in the buffer pool and what is their access pattern. If there are more than innodb_read_ahead_threshold pages in the buffer pool and the access pattern is sequential and in the right order, InnoDB will queue asynchronous read requests for the entire next extent.

Random readahead is concerned mainly with the ‘current’ extent. The evaluation is done on each page read (note the difference from linear readahead where conditions are evaluated on every page access). On every page read, if innodb_random_read_ahead is set we try to see how many pages of this extent are in the buffer pool. If we have a certain number of pages and all of them have been accessed recently i.e.: they are closer to the MRU end of the LRU list we deduce that the remaining pages of the extent will be accessed soon as well. Asynchronous read requests for the remaining pages in the current extent are queued.

To cap the difference between the two types of readahead mechanisms, in case of linear readahead the decision is made about next extent, the conditions are evaluated on each page access, the access pattern is evaluated and decision is made if and only if we are accessing a boundary page of an extent. In case of random readahead the decision is made about current extent, the conditions are evaluated only at page read, instead of access pattern just the recentness of access is taken into consideration and the decision is not tied to the page being read in being a boundary page.

Apr 11

In MySQL 5.6.2 we have introduced a new background thread named the page_cleaner in InnoDB. Adaptive flushing of modified buffer pool pages in the main background thread, async flushing by a foreground query thread if the log files are near to wrapping around, idle flushing and shutdown flushing are now moved to this thread, leaving only the last resort sync flushing in foreground query threads. We’ve also added counters for these activities.

As page_cleaner is all about the flushing of dirty pages to disk it’ll do you a world of good if you can go through this post where I have explained different types of flushing that happen inside InnoDB and the conditions that trigger flushing. The page_cleaner thread is only concerned with flush_list flushing (this may change in future releases). So let us dig a bit deeper into why flush_list flushing happens and why it would make sense to do this flushing in a separate thread. As is usually the case I have to skip some details to keep this note simple.

On a busy server flush_list flushing (which I’ll simply call flushing from this point onwards) happens under four conditions. In order to understand these conditions let us familiarize ourselves with the concept of checkpoint_age. The checkpoint_age is the difference between the current_lsn (the latest change to the database) and the last checkpoint_lsn (the lsn when last checkpoint happened). We obviously don’t want to let this difference grow beyond the log file size because if that happens then we end up overwriting redo log entries before the corresponding dirty pages are flushed to the disk, losing the ability to recover them. In order to avoid the above situation we maintain two high water marks to indicate if we are nearing the end of reusable redo log space. Lets call these water marks async_water_mark and sync_water_mark, where the later represents a more urgent situation than the former. Now that we have clarified the checkpoint_age concept let us get back to the four conditions under which flushing of dirty pages happens:

  1. checkpoint_age < async_water_mark
    • This condition means that we have enough reusable redo space. As such there is no hurry to flush dirty pages to the disk. This is the condition where we’d like our server to be most of the time.
    • Based on adaptive_flushing heuristics we flush some dirty pages in this state. This flushing happens in the background master thread.
    • During the flushing no other threads are blocked, so queries continue normally.

    Read the rest of this entry »

Sep 19

Write-heavy workloads can reach a situation where InnoDB runs out of usable space in its redo log files. When that happens, InnoDB does a lot of disk writes to create space and you can see a drop in server throughput for a few seconds. From InnoDB plugin 1.0.4 we have introduced the ‘innodb_adaptive_flushing’ method that uses a heurstic to try to flush enough pages in the background so that it is rare for the very active writing to happen. In this note I’ll try to explain how the heuristic works i.e.: what factors are taken into account when deciding how many dirty pages to flush from the buffer pool in the background. I’ll skip some details for the sake of clarity.

You may find InnoDB glossary useful to understand the terminology used in this note. adaptive_flushing is of consequence to you if your workload involves significant write activity. For example, if you are running MySQL 5.0 or MySQL 5.1 with built-in InnoDB and you see periodic drop in the performance with corresponding spikes in write activity then you should consider upgrading to InnoDB plugin.

Let us first familiarize ourselves with why and how flushing takes place in the buffer pool. Flushing is the activity of writing dirty pages to the disk. We need to flush because of two main reasons. First, we need to read in a page into the buffer pool and there are no free buffers. In this case we select a page as victim (based on LRU) and if the victim is dirty we write it to the disk before reusing it. This, I’ll call, LRU_list flush. Secondly, we need to write dirty pages to the disk when we want to reuse the redo logs. Redo logs in InnoDB (or for that matter in most other database engines as well) are used in a circular fashion. We can only overwrite redo log entries if, and only if, the dirty pages corresponding to these entries have been flushed to the disk. This type of flushing is called flush_list flush.

adaptive_flushing is related to the flush_list flush (though it is also influenced by the LRU_list flush as we’ll see shortly). If we allow the redo logs to fill up completely before we trigger flushing of the dirty pages then it will cause a serious drop in performance during this spike in the I/O. To avoid this we’d like to keep flushing dirty pages in the background at a rate that is just enough to avoid the I/O bursts. Note that we don’t want to flush too aggressively because that will reduce the chances of write combining thus resulting in more I/O than is required. So how do we determine the rate at which we do flush_list flush in the background? That is where the adaptive_flushing heuristic comes into play.

We do know the capacity of the redo logs which is constant and is equal to the combined size of the log files. If we know the rate at which redo log is being generated then we can figure out how many seconds we have before the whole redo log is filled up. Read the rest of this entry »

Apr 14
With the exception of Windows InnoDB has used ‘simulated AIO’ on all other platforms to perform certain IO operations. The IO requests that have been performed in a ‘simulated AIO’ way are the write requests and the readahead requests for the datafile pages. Let us first look at what does ‘simulated AIO’ mean in this context.

We call it ‘simulated AIO’ because it appears asynchronous from the context of a query thread but from the OS perspective the IO calls are still synchronous. The query thread simply queues the request in an array and then returns to the normal working. One of the IO helper thread, which is a background thread, then takes the request from the queue and issues a synchronous IO call (pread/pwrite) meaning it blocks on the IO call. Once it returns from the pread/pwrite call, this helper thread then calls the IO completion routine on the block in question which includes doing a merge of buffered operations, if any, in case of a read. In case of a write, the block is marked as ‘clean’ and is removed from the flush_list. Some other book keeping stuff also happens in IO completion routine.

What we have changed in the InnoDB Plugin 1.1 is to use the native AIO interface on Linux. Note that this feature requires that your system has libaio installed on it. libaio is a thin wrapper around the kernelized AIO on Linux. It is different from Posix AIO which requires user level threads to service AIO requests. There is a new boolean switch, innodb_use_native_aio, to choose between simulated or native AIO, the default being to use native AIO.

Read the rest of this entry »

Apr 13
One of the well known and much written about complaint regarding InnoDB recovery is that it does not scale well on high-end systems. Well, not any more. In InnoDB plugin 1.0.7 (which is GA) and plugin 1.1 (which is part of MySQL 5.5.4) this issue has been addressed. Two major improvements, apart from some other minor tweaks, have been made to the recovery code. In this post I’ll explain these issues and the our solution for these.

First issue reported here is about available memory check eating up too much CPU. During recovery, the first phase, called redo scan phase, is where we read the redo logs from the disk and store them in a hash table. In the second phase, the redo application phase, these redo log entries are applied to the data pages. The hash table that stores the redo log entries grows in the buffer pool i.e.: memory for the entries is allocated in 16K blocks from the buffer pool. We have to ensure that the hash table does not end up allocating all the memory in the buffer pool leaving us with no room to read in pages during the redo log application phase. For this we have to keep checking the size of the heap that we are using for allocating the memory for the hash table entries. So why would it kill the performance? Because we do not have the total size of the heap available to us. We calculate it by traversing the list of blocks so far allocated. Imagine if we have gigabytes or redo log to apply (it can be up to 4G). That would mean hundreds of thousands of blocks in the heap! And we have to make a check roughly whenever we are reading in a new redo page during scan. An O(n * m) algorithm where ‘n’ is number of blocks in the heap and ‘m’ is number of redo pages that have to be scanned.

What is the solution we came up with? Store the total size of a heap in its header. Simple and effective. Our algorithm now becomes O(m).

Read the rest of this entry »