Dec 20

The problem and its cause

There have been several complaints over the years about InnoDB’s inability to scale beyond 256 connections. One of the main issues behind this scalability bottleneck was the read view creation that is required for MVCC (Multi Version Concurrency Control) to work. When the user starts a transaction this is what InnoDB does under the hood:

  • Create or reuse a transaction instance – usually it is reused, the transactions are reused from a pool (trx_sys_t::mysql_trx_list).
  • Initialize the transaction start time and assign a rollback segment
  • Append the transaction to an active  transaction list ordered on trx_t::id in descending order

The append to  the trx_sys_t::trx_list and corresponding remove during commit is covered by trx_sys_t::mutex. After the transaction is “started” and if the transaction has an isolation greater than or equal to REPEATABLE-READ then before the first record/row is accessed by the transaction, InnoDB creates a view (snapshot) of the running system state. It does this by examining the transactions that are active at the time of the MVCC snapshot, so that their changes can be excluded from the creating transaction’s read view. This read view creation is also covered by the trx_sys_t::mutex. As the number of active transactions in the system increases this read view creation takes longer and longer. This increases the wait times on the trx_sys_t::mutex (during transaction start and commit) and once threads are forced to wait on a condition variable (in contrast to simply spinning while waiting for the mutex) the system throughput drops dramatically.

The solution

While investigating this problem there were two observations that I made: Read the rest of this entry »

Jul 26
Introduction

InnoDB has had the thread concurrency management code for some years now. Most will be familiar with the three configuration variables associated with this feature:
The problem with the existing code is that the queueing overhead becomes too much and negatively impacts performance, especially as the number of user threads goes up. The queueing code uses the os_event_t library to manage the queuing and dequeing of user threads in an explicit wait queue. This wait queue is implemented using strict FIFO. The FIFO constraint ensures that there is no thread starvation. To overcome this overhead one experimental feature that we are trying is to use busy polling of  free slots to enter InnoDB using sleep delays. The new scheme unlike the existing concurrency management scheme is not starvation free. Before people start complaining that event driven is better than polling, rest assured I know the theory. If there is a way to reduce the overhead and still keep it event driven, I’m very interested to see your code and experimental data :-) .  I’ve experimented with futexes too but no joy there either. It is reasonable to assume that I could have overlooked some better technique therefore feedback is important. In theory if the sleep delay can be tuned to exactly the amount required for a slot to be free then there should be perfect scheduling. This as we know is never going to be the case. However, we can try to get a good enough approximation that varies with the load. This will reduce the optimal or peak TPS due to the overhead of sleeping while there is potentially a slot that is empty. However, for applications that have lots of threads, greater than say 256 it will be able to maintain a higher TPS at the higher thread count because it doesn’t suffer from the queuing bottleneck. The results have been very encouraging in our internal experiments on high end hardware, hosts with >= 16 cores. There are some issues that need ironing out, in particular Sysbench OLTP read-only tests on 8 core hardware. This feature uses atomics to manage the thread concurrency slots.

Read the rest of this entry »

Jul 25

Introduction

The InnoDB  UNDO entries reside in a special system table called the UNDO log. This log is made up of several segments. These segments are called rollback segments. A segment in InnoDB is similar to what a file would be in a file system,e.g., user tables and indexes are also stored as separate segments within the same tablespace,  only their format is different. In that sense there is nothing special about InnoDB UNDO logs. This feature allows storing of the UNDO log across several tablespaces.

Purpose

UNDO logs  contain the before image of modified records. There are two types of UNDO records, one for insert and another for updates. The insert UNDO records can be discarded on transaction rollback. The update records are used for rollback, MVCC and by purge. It is because of purge that we can’t just remove the UNDO log records  once the UNDO logs are not referenced by any running transaction’s snapshot view (for MVCC). When a transaction is started it is assigned a rollback segment in a round robin scheme. Multiple transactions can be assigned the same rollback segment to write their changes. Up to a maximum of 1024 transactions can use a the same rollback segment. If you have more rollback segments then there is less contention around the rollback segment mutex.

Purge

The purge thread(s) run in the background and they read the UNDO log records from oldest to the latest, up to but not including the oldest active snapshot view in the system. It parses the UNDO log records and for entries that represent delete operations it uses the stored index key to search for the records in both the secondary and primary index and purges the entries, modifying the index tree structure if required. Normal DML operations simply delete mark the records but they don’t physically purge them, unless it is an insert that is being rolled back. This is to avoid expensive tree modifying operations in DML code. Once purge is finished with the UNDO entries it then truncates the UNDO log up to where it has processed the entries. For MVCC user transactions we need to follow the DATA_ROLL_PTR pointer to the UNDO log to build a previous version of the row.

Read the rest of this entry »

Apr 11

For those interested in InnoDB internals, this post tries to explain why the global kernel mutex was required and the new mutexes and rw-locks that now replace it. Along with the long term benefit from this change.

InnoDB’s core sub-systems up to v5.5 are protected by a global mutex called the Kernel mutex. This makes it difficult to do even some common sense optimisations. In the past we tried optimising the code but it would invariably upset the delicate balance that was achieved by tuning of the code that used the global Kernel mutex, leading to unexpected performance regression. The kernel mutex is also abused in several places to cover operations unrelated to the core e.g., some counters in the server thread main loop.

The InnoDB core sub-systems are:

  1. The Locking sub-system
  2. The Transaction sub-system
  3. MVCC  views

For any state change in the above sub-systems we had to acquire the kernel mutex and this would reduce concurrency and made the kernel mutex very highly contended. A transaction that is creating a lock would end up blocking read view creation (for MVCC) and transaction start or commit/rollback. With the the finer granularity mutexes and rw-locks, a transaction that is creating a lock will not block transaction start or commit/rollback. MVCC read view creation will however block transaction create and commit/rollback because of the shared trx_sys_t::trx_list. But MVCC read view creations will not block each other because they will acquire an S lock.

Read the rest of this entry »

Apr 8

In MySQL 5.6 we’ve added a new feature that closes and unloads table instances from the InnoDB internal data dictionary, once a user configurable threshold is reached. This ends the situation where you could have hundreds of megabytes caching rarely used entries until the server was restarted and will be particularly appreciated by hosting and software as a service providers.

For this we’ve used an existing MySQL config variable table-definition-cache.  This cache limit is a soft limit. This means that if the user has more than table-definition-cache tables open then InnoDB will not force eviction of the table from the InnoDB data dictionary cache.

Users don’t really have to do anything to make use of this feature. It should pick up the existing setting (default or explicitly set) and just work. This feature should benefit users that have 1000s of tables and open a subset of tables only rarely. The LRU mechanism will eventually mark them for eviction and remove them from the data dictionary cache.

Other benefits

Some of the additional benefits of this change are internal to InnoDB and help in improving the design. We now have well defined interfaces for opening and closing tables. All internal sub-systems  need to use the same interface as the MySQL table handler. This has helped in loosening the tight coupling between various InnoDB sub-systems and the data dictionary. By loosening this tight coupling, it will help us in adding features to the InnoDB data dictionary more easily.

Read the rest of this entry »

Apr 6

Purpose

What does purge exactly do and why is it needed? If you have ever wondered then read on. It is really a type of garbage collector. When a user issues a DML like “DELETE FROM t WHERE c = 1;”, InnoDB doesn’t remove the matching record. This is what happens under the hood:

  1. It marks the record as deleted by setting a bit in the control bits of the record.
  2. Stores the before image of the modified columns to the UNDO log
  3. Updates the system columns DB_TRX_ID and DB_ROLL_PTR in the clustered index record. DB_TRX_ID identifies the transaction that made the last change, and DB_ROLL_PTR points to the new UNDO log record. This UNDO log record contains the old values of DB_TRX_ID and DB_ROLL_PTR, possibly pointing to an older transaction and undo log entry.

From this you should be able to visualise that the UNDO log records related to the modified clustered record are in a disk based linked list, with the head anchored in the clustered index record. For the sake of simplicity I’ve ignored the UNDO log pages and the case where DML updates a record. This information is required by  rollback and MVCC (multi version concurrency control). For MVCC we need the entry to exist in the clustered index so that we can follow a pointer back to where the “before” changes were written in the UNDO log and use that information to construct a previous version of the record. For rollback we need the before  information of the record  (UNDO entry) so that we can restore it when a transaction is rolled back.

Another benefit of purging separately in the background is that expensive B+Tree block merge operations, if the removal of the record were to lead to underflow, can be done asynchronously by purge and not by user transactions.

Garbage collection

Read the rest of this entry »

Apr 13
Background

The original motivation behind this patch was the infamous Bug#26590MySQL does not allow more than 1023 open transactions. Actually the 1024 limit has to do with the number of concurrent update transactions that can run within InnoDB. Where does this magic number come from ? 1024 is the total number of UNDO log list slots on one rollback segment header page. And in the past InnoDB created just one rollback segment header page during database creation. This rollback segment header page is anchored in the system header page, there is space there for 128 rollback segments but only one was being created and used resulting in the 1024 limit. Each slot in the rollback segment header array comprises of {space_id, page_no}, where both space_id and page_no are of type uint32_t . Currently the space id is “unused” and always points to the system table space, which is tablespace 0. Now, onto the rollback segment header page. This page contains a rollback segment header (details of which are outside the scope of this blog entry :-) ), followed by an array of 1024 UNDO slots. Each slot is the base node of a file based linked list of UNDO logs. Each node in this file based list contains UNDO log records, containing the data updated by a transaction. A single UNDO log node can contain UNDO entries from several different transactions.

Performance ramifications

When a transaction is started it is allocated a rollback segment to write its modifications. Multiple transactions can write to the same rollback segment but only one transaction is allowed to write to any one UNDO slot during its lifetime. This should make clear where the 1024 limit comes from. Each rollback segment is protected by its own mutex and when we have a single rollback segment this rollback segment mutex can become a high contention mutex.

Requirements

Backward compatibility in file formats is something we take very seriously at InnoDB.  InnoDB has always had the ability to use up to 128 pages but before this fix it created only one rollback segment. We had to figure out a way to make the multiple rollback segments change backward compatible, without breaking any assumptions in the code of older versions of InnoDB about absolute locations of system pages and  changes to system data. The 128 limit is a result of the latter. While there is space for 256 rollback segments, InnoDB uses only 7 bits from that field. Once we fix that we could in the future enable 256 rollback segments, however 128 seems to be sufficient for now. There are other scalability issues that need to be addressed first before 128K concurrent transactions will become an issue :-) .

Read the rest of this entry »