Jul 26
Introduction

InnoDB has had the thread concurrency management code for some years now. Most will be familiar with the three configuration variables associated with this feature:
The problem with the existing code is that the queueing overhead becomes too high and hurts performance, especially as the number of user threads goes up. The queueing code uses the os_event_t library to manage the queuing and dequeuing of user threads in an explicit wait queue. This wait queue is strictly FIFO, and the FIFO constraint ensures that no thread starves.

To overcome this overhead, one experimental feature that we are trying is busy polling for free slots to enter InnoDB, with sleep delays between polls. Unlike the existing concurrency management scheme, the new scheme is not starvation free. Before people start complaining that event driven is better than polling, rest assured I know the theory. If there is a way to reduce the overhead and still keep it event driven, I’m very interested to see your code and experimental data :-). I’ve experimented with futexes too, but no joy there either. It is reasonable to assume that I could have overlooked some better technique, so feedback is important.

In theory, if the sleep delay could be tuned to exactly the time it takes for a slot to become free, scheduling would be perfect. This, as we know, is never going to be the case. However, we can try to get a good enough approximation that varies with the load. This will reduce the optimal or peak TPS, because a thread may be sleeping while a slot is free. However, applications with lots of threads, say more than 256, will be able to maintain a higher TPS at the higher thread count because the scheme doesn’t suffer from the queuing bottleneck.

The results have been very encouraging in our internal experiments on high-end hardware, hosts with >= 16 cores. There are some issues that need ironing out, in particular with Sysbench OLTP read-only tests on 8-core hardware. This feature uses atomics to manage the thread concurrency slots.
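As a rough illustration of the polling idea (not the InnoDB code), the sketch below models a fixed number of slots that threads claim by polling and sleeping, with no wait queue. The class name `ConcurrencySlots`, the helper `run_workers` and the default delay are all made up for this example, and since Python has no atomic integers a small lock stands in for the atomic update of the slot counter.

```python
import threading
import time


class ConcurrencySlots:
    """Hypothetical model of a fixed number of slots for entering InnoDB."""

    def __init__(self, max_slots, sleep_delay=0.0005):
        self.max_slots = max_slots
        self.sleep_delay = sleep_delay  # seconds to sleep between polls
        self.active = 0
        # Assumption: this lock only stands in for an atomic
        # compare-and-swap on the counter; it is not a wait queue.
        self._cas = threading.Lock()

    def try_enter(self):
        """Claim a slot if one is free; never waits for one."""
        with self._cas:
            if self.active < self.max_slots:
                self.active += 1
                return True
            return False

    def enter(self):
        """Poll for a free slot, sleeping between attempts (no FIFO order)."""
        polls = 0
        while not self.try_enter():
            polls += 1
            time.sleep(self.sleep_delay)
        return polls

    def leave(self):
        with self._cas:
            self.active -= 1


def run_workers(slots, n_threads, work_seconds=0.001):
    """Run n_threads workers and return the peak number inside at once."""
    peak = 0
    peak_lock = threading.Lock()

    def worker():
        nonlocal peak
        slots.enter()
        try:
            with peak_lock:
                peak = max(peak, slots.active)
            time.sleep(work_seconds)  # pretend to do work inside InnoDB
        finally:
            slots.leave()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

Calling `run_workers(ConcurrencySlots(4), 32)` never lets more than four workers in at once, but nothing decides which sleeping thread gets the next free slot. That is the starvation trade-off mentioned above.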


Jul 25

Introduction

The InnoDB UNDO entries reside in a special system table called the UNDO log. This log is made up of several segments, called rollback segments. A segment in InnoDB is similar to what a file would be in a file system, e.g., user tables and indexes are also stored as separate segments within the same tablespace; only their format is different. In that sense there is nothing special about InnoDB UNDO logs. This feature allows the UNDO log to be stored across several tablespaces.

Purpose

UNDO logs contain the before image of modified records. There are two types of UNDO records, one for inserts and another for updates. The insert UNDO records can be discarded on transaction rollback. The update records are used for rollback, MVCC and by purge. It is because of purge that we can’t just remove the UNDO log records once the UNDO logs are not referenced by any running transaction’s snapshot view (for MVCC). When a transaction is started it is assigned a rollback segment in a round robin scheme. Multiple transactions can be assigned the same rollback segment to write their changes. Up to a maximum of 1024 transactions can use the same rollback segment. If you have more rollback segments then there is less contention around the rollback segment mutex.
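The round robin assignment can be pictured with a small sketch. This is only a model of the description above, not the InnoDB implementation: the class name `RollbackSegments` is made up, and skipping over a full segment is an assumption, since the text only gives the per-segment limit of 1024.

```python
MAX_TRX_PER_RSEG = 1024  # per-segment limit quoted in the text


class RollbackSegments:
    """Hypothetical round robin allocator for rollback segments."""

    def __init__(self, n_segments):
        self.in_use = [0] * n_segments  # active transactions per segment
        self._next = 0                  # segment to try next

    def assign(self):
        """Return the segment number for a newly started transaction."""
        n = len(self.in_use)
        for _ in range(n):
            seg = self._next
            self._next = (self._next + 1) % n
            # Assumption: a segment that is already full is skipped.
            if self.in_use[seg] < MAX_TRX_PER_RSEG:
                self.in_use[seg] += 1
                return seg
        raise RuntimeError("all rollback segments are full")

    def release(self, seg):
        """Called when a transaction using this segment finishes."""
        self.in_use[seg] -= 1
```

With four segments, six new transactions land on segments 0, 1, 2, 3, 0, 1, so each segment (and its mutex) sees only part of the load.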

Purge

The purge thread(s) run in the background and read the UNDO log records from the oldest to the latest, up to but not including the oldest active snapshot view in the system. They parse the UNDO log records and, for entries that represent delete operations, use the stored index key to search for the records in both the secondary and primary indexes and purge the entries, modifying the index tree structure if required. Normal DML operations simply delete-mark the records but don’t physically purge them, unless it is an insert that is being rolled back. This avoids expensive tree-modifying operations in DML code. Once purge is finished with the UNDO entries, it truncates the UNDO log up to the point where it has processed the entries. For MVCC, user transactions need to follow the DATA_ROLL_PTR pointer to the UNDO log to build a previous version of the row.


Jul 25

Are you running an InnoDB installation with many-gigabyte buffer pool(s)? Does it take too long to get back up to speed after a restart? If so, the following will be interesting to you.

In the latest MySQL 5.6 Labs release we have implemented an InnoDB buffer pool(s) dump and load to solve this problem.

The contents of the InnoDB buffer pool(s) can be saved on disk before MySQL is shut down and then read in after a restart, so that the warm-up time is drastically shortened: the buffer pool(s) go back to the state they were in before the server restart! The time needed for that is roughly the time needed to read an amount of data from disk that is about the size of the buffer pool(s).

Let’s dive straight into the commands to perform various dump/load operations:

The buffer pool(s) dump can be done at any time when MySQL is running by doing:

Jul 25

This summer’s labs release includes an important InnoDB feature: InnoDB Fulltext Search. This feature greatly enhances InnoDB’s capability in text search and retrieval. Since the feature is designed for our transactional storage engine, its underlying architecture and implementation are completely different from those of MyISAM. So it is worth giving a brief technology review of this feature and familiarizing users with some important concepts in InnoDB FTS so that they can make better use of it.

There are a few other posts on the subject. John Russell will give a brief tutorial on the InnoDB fulltext search commands and syntax. I will also discuss some comparisons with MyISAM fulltext search in another post. And Vinay from our server testing team will give some performance numbers from his experiments.

The Design:

To begin with, I will briefly go over some key design concepts, which should help you better understand the feature.


    Jul 25

    With this lab release, InnoDB now has its own Fulltext Search (FTS) support. People might wonder how this FTS differs from that of MyISAM. From the usage standpoint, they seem very similar, with no apparent difference. However, in the underlying FTS engine the difference is huge, since the two are built on different storage engines with totally different designs. So in some aspects, such as DML, it might be difficult to make an apples-to-apples comparison. However, we do try to list and explain the differences, to make it easier for users to understand our design and to ease the transition to InnoDB FTS.

    The Design:

    The most significant difference is the design. Like most transactional storage engines, InnoDB stores the inverted index in auxiliary relational tables. All queries and DML eventually route to these tables. MyISAM, on the other hand, stores the inverted list in a two-level B-tree. The first level contains records of each word and its count, and the second level contains records of the word’s weight and the rowid. So it is more of a special B-tree “index”.

    However, the MyISAM design would not be suitable for a highly concurrent transactional storage engine like InnoDB. Its special structure also limits its extensibility, as it still can’t include position information in its index to support proximity search.

    For InnoDB, storing the inverted index in auxiliary tables brings all the benefits of relational table manipulation to the FullText index. It already applies the partitioning concept, so that multiple tables are used to hold the inverted index. This not only allows index creation to be parallelized, it also allows queries to be parallelized in the future. So the InnoDB design clearly wins in terms of future extensibility and scalability.
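To make the idea of a partitioned inverted index with position information more concrete, here is a toy sketch. It is not the InnoDB FTS table layout: the number of partitions, the partitioning rule (`partition_of`) and the class `InvertedIndex` are all invented for illustration, and plain dictionaries stand in for the auxiliary tables.

```python
import re
from collections import defaultdict

N_PARTITIONS = 4  # hypothetical number of auxiliary "tables"


def partition_of(word):
    """Hypothetical rule: pick a partition from the word's first character."""
    return ord(word[0]) % N_PARTITIONS


class InvertedIndex:
    """Toy inverted index split across several partitions."""

    def __init__(self):
        # One dict per partition: word -> list of (doc_id, position)
        self.tables = [defaultdict(list) for _ in range(N_PARTITIONS)]

    def add_document(self, doc_id, text):
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            self.tables[partition_of(word)][word].append((doc_id, pos))

    def postings(self, word):
        word = word.lower()
        return self.tables[partition_of(word)].get(word, [])

    def near(self, w1, w2, distance):
        """Doc ids where w1 and w2 occur within `distance` words."""
        hits = set()
        for d1, p1 in self.postings(w1):
            for d2, p2 in self.postings(w2):
                if d1 == d2 and abs(p1 - p2) <= distance:
                    hits.add(d1)
        return sorted(hits)
```

Because each word lives in exactly one partition, the partitions could be built or searched independently, and storing the position next to the document id is what makes a proximity query such as `near("inverted", "index", 1)` possible.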


    Jul 25

    One of the features found in the summer 2011 labs release is the ability to select the InnoDB page size without recompiling.  Smaller page sizes may be useful for certain storage media such as SSDs where there is no need to minimize seek time between reads.

    A new global setting called innodb-page-size can be set to 4k, 8k or 16k before creating a new MySQL instance. This sets the page size for all tablespaces used by that InnoDB instance.   This can be done in my.cnf or on the mysqld command line.  It is a read-only variable while the engine is running since it must be set before InnoDB starts up and creates a new system tablespace.  That happens when InnoDB does not find ibdata1 in the data directory.  If your system tablespace already exists using one page size and innodb-page-size is something else, the engine will not start.

    A few bugs were found and fixed related to smaller page sizes in InnoDB tablespaces. Those bug fixes can also be found in the 2011 summer labs release.

    The mysql-test-run can be run like this to test a smaller page size:

    perl mysql-test-run.pl --mysqld=--innodb-page-size=4k --suite=innodb --force


    Jul 25

    InnoDB has an internal file system management module that primarily manages the space in the data files. One of the pain points was the coarse level of locking used when a data file has to be extended. More about this issue can be found here. In the latest labs release we have fixed this problem.
    When we need to extend a data file inside InnoDB we write zero-filled pages synchronously to the file. The user thread which is extending the data file holds fil_system::mutex during the whole operation. This mutex covers changes to all data structures related to file system management. What this means is that when we do a regular IO we do need to acquire fil_system::mutex, though only for a short time. Because the thread doing the data file extension is holding the mutex during the whole IO operation, any other thread (user or background) trying to access a data file for a regular read or write ends up waiting. This brings the whole system to a virtual standstill, as no read or write activity can happen. This is true even if a thread is trying to access a data file that is not the one being extended.
    We fixed this issue by introducing an internal flag in the data structure indicating that a file is being extended. Now if a user thread needs to extend a data file it does acquire fil_system::mutex, but it releases the mutex after setting the flag. Once it is done with the extension IO it resets the flag. This allows other threads to access data files while one of the files is being extended. This also allows multiple files to be extended in parallel. Our tests have shown that the issue of stalls due to file extension is indeed fixed by this approach.
    A related feature which can be considered as future work is to off load the file extension to a background thread.
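The flag-based fix can be sketched as follows. This is a simplified model rather than the InnoDB source: `FilSystem`, `FileNode`, `being_extended` and the `write_zero_pages` callback are hypothetical names, and returning `False` when another thread is already extending the same file is an assumption made to keep the example short.

```python
import threading


class FileNode:
    """Hypothetical per-file bookkeeping entry."""

    def __init__(self, name, size_pages=0):
        self.name = name
        self.size_pages = size_pages
        self.being_extended = False  # the flag described above


class FilSystem:
    """Hypothetical stand-in for the file system management module."""

    def __init__(self):
        self.mutex = threading.Lock()  # plays the role of fil_system::mutex
        self.files = {}

    def add_file(self, name, size_pages=0):
        with self.mutex:
            self.files[name] = FileNode(name, size_pages)

    def extend(self, name, extra_pages, write_zero_pages):
        """Extend a file while holding the mutex only to flip the flag."""
        with self.mutex:
            node = self.files[name]
            if node.being_extended:
                return False  # assumption: the caller retries later
            node.being_extended = True
        try:
            # The slow IO runs with the mutex released, so other threads
            # can keep reading and writing any file in the meantime.
            write_zero_pages(node, extra_pages)
        except Exception:
            with self.mutex:
                node.being_extended = False
            raise
        with self.mutex:
            node.size_pages += extra_pages
            node.being_extended = False
        return True
```

The mutex is now held only for two short bookkeeping updates, and because the flag is per file, two different files can be extended at the same time.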

    Jul 25

    Yes, we are reintroducing random readahead. We took the random readahead code out after our internal benchmarking and community feedback showed that random readahead tends to adversely affect performance. However, some recent feedback from customers has shown that for some specific workloads random readahead can be good. Based on this information we have decided to reintroduce random readahead, albeit with a dynamic switch to turn it on or off. By default, random readahead is not enabled. You can turn it on by setting innodb_random_read_ahead = TRUE.

    The term ‘random readahead’ is a bit of an oxymoron. Readahead seems more related to sequential access pattern. To clarify this I’ll try to explain how InnoDB internally triggers a readahead request. Data pages in InnoDB are laid out on the disk in chunks of 64 pages called extents. The decision about readahead is essentially about whether it makes sense to read in the entire extent instead of reading in the pages one by one as and when they are requested by the user. There are two types of readahead mechanisms inside InnoDB. There are some obvious and some subtle differences between the two.

    Linear readahead is the one that decides whether or not to read in the ‘next’ extent and the decision is made based not only on the number of pages of the ‘current’ extent present in the buffer pool but also the access pattern. The ‘current’ extent means the extent which is being accessed currently by the user threads. Whenever a user thread tries to access a page in the buffer pool the linear readahead mechanism is triggered. We first see if it is a boundary page of an extent or not. If it is a boundary page we figure out how many pages of this extent are present in the buffer pool and what is their access pattern. If there are more than innodb_read_ahead_threshold pages in the buffer pool and the access pattern is sequential and in the right order, InnoDB will queue asynchronous read requests for the entire next extent.

    Random readahead is concerned mainly with the ‘current’ extent. The evaluation is done on each page read (note the difference from linear readahead, where conditions are evaluated on every page access). On every page read, if innodb_random_read_ahead is set, we check how many pages of this extent are in the buffer pool. If we have a certain number of pages and all of them have been accessed recently, i.e., they are closer to the MRU end of the LRU list, we deduce that the remaining pages of the extent will be accessed soon as well. Asynchronous read requests for the remaining pages in the current extent are then queued.

    To recap the differences between the two types of readahead: in the case of linear readahead, the decision is about the next extent, the conditions are evaluated on each page access, the access pattern is evaluated, and the decision is made if and only if we are accessing a boundary page of an extent. In the case of random readahead, the decision is about the current extent, the conditions are evaluated only on a page read, just the recentness of access is considered instead of the access pattern, and the decision is not tied to the page being read being a boundary page.
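The two decisions can be put side by side in a short sketch. It only models the description above and is not taken from the InnoDB source: the function names and the `min_recent` parameter are hypothetical, "sequential and in the right order" is simplified to "strictly ascending", and only the last page of an extent is treated as the boundary page.

```python
EXTENT_SIZE = 64  # pages per extent, as stated in the text


def extent_of(page_no):
    return page_no // EXTENT_SIZE


def linear_readahead(page_no, access_order, threshold):
    """Evaluated on every page access.

    access_order lists the pages of the current extent that are in the
    buffer pool, in the order they were first accessed. Returns the number
    of the extent to prefetch, or None.
    """
    # Simplification: only the last page of the extent counts as a boundary.
    if page_no % EXTENT_SIZE != EXTENT_SIZE - 1:
        return None
    if len(access_order) <= threshold:  # need more than threshold pages
        return None
    ascending = all(a < b for a, b in zip(access_order, access_order[1:]))
    return extent_of(page_no) + 1 if ascending else None


def random_readahead(page_no, resident, recently_used, enabled, min_recent):
    """Evaluated on every page read.

    resident and recently_used are sets of page numbers. Returns the pages
    of the current extent that should be prefetched.
    """
    if not enabled:
        return []
    first = extent_of(page_no) * EXTENT_SIZE
    extent_pages = range(first, first + EXTENT_SIZE)
    in_pool = [p for p in extent_pages if p in resident]
    if len(in_pool) < min_recent:
        return []
    if not all(p in recently_used for p in in_pool):
        return []
    # Everything else in the extent, except the page already being read.
    return [p for p in extent_pages if p not in resident and p != page_no]
```

Note that `linear_readahead` looks at the order of accesses and answers with the next extent, while `random_readahead` ignores order and boundaries and answers with the missing pages of the current extent.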

    May 6

    Last month was one of the busiest months for the InnoDB team:

    The slides of the three talks are now available:

    In addition to the blogs from the InnoDB team, Dimitri published a five-part series on MySQL 5.6 Performance, all on DimitriK’s (dim) Weblog:

    1. MySQL Performance: 5.6 Notes, part 5 – Fixing Adaptive Flushing..
    2. MySQL Performance: 5.6 Notes, part 4 – Fixing Purge issue..
    3. MySQL Performance: 5.6 Notes, part 3 – More in depth..
    4. MySQL Performance: 5.6 Notes, part 2 – Under full dbSTRESS workload…
    5. MySQL Performance: 5.6 Notes, part 1 – Discovery…


    Apr 13

    In our 5.6 release, we introduced a new feature, the “InnoDB Metrics Table”, as part of our effort to extend InnoDB’s diagnostic and monitoring capabilities. The “Metrics Table” feature consolidates all InnoDB performance and resource related counters (as of today, 176 counters) and makes them available through an information schema table. In addition, it gives users the ability to control these counters for their monitoring needs.

    In the following sections, we will go over the feature in detail and focus on its usage to get you started with “metrics counters”.

    1) The user interface – InnoDB Metrics Table

    Before this feature, InnoDB already had dozens of performance counters. They are exposed to the user through “MySQL System Variables”. These counters are permanent counters, and there are no external controls on them. However, they had become important probes that allow users to peek into the server running state and make appropriate decisions to tune the system to its best state.

    In an effort to extend our system monitoring capability, we decided to consolidate the monitoring machinery and created a dedicated monitoring module. The interface to the monitor counters is a relational table, so users can perform SQL queries on the result. Thus we have the “information_schema.innodb_metrics” table as the interface of the “Metrics Table” feature.

