Dec 20

Last month we made several improvements to InnoDB memory usage, solving a challenging issue with how InnoDB uses memory in certain places of the code.

The symptom was that under certain workloads the memory used by InnoDB kept growing without bound until the OOM killer kicked in. It looked like a memory leak, but Valgrind wasn’t reporting any leaks, and the issue was not reproducible on FreeBSD; it only happened on Linux (see Bug#57480). The latter fact in particular led us to think that something in InnoDB’s memory usage pattern was revealing a nasty side of Linux’s otherwise good-natured memory manager.

It turned out to be an interesting case of memory fragmentation caused by a storm of malloc/free calls of various sizes. We had to track and analyze each call to malloc during the workload, including the code path that led to it. We collected a huge set of analysis data: some code paths were executed tens of thousands of times! A hurricane of allocations and deallocations! We looked at the hottest ones, hoping that some of them were unnecessary and could be eliminated, avoided, minimized or merged. Luckily there were plenty of them!

After extensive testing we made numerous improvements: allocating the smallest chunks of memory from the stack instead of the heap, grouping allocations together where possible, removing unnecessary allocations altogether, estimating in advance exactly how much memory a given operation will consume and allocating it up front, and many others.
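To make two of these techniques concrete, here is a minimal sketch. It is not the actual InnoDB change: the Record type and both functions are hypothetical names invented for illustration. The first function groups what would otherwise be three separate heap allocations into one block sized in advance; the second takes small scratch space from the stack and falls back to the heap only for large inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical record with two variable-length parts (illustration only).
struct Record {
    std::size_t key_len;
    std::size_t val_len;
    char*       key;  // points into the same block as the struct itself
    char*       val;  // points right after the key bytes
};

// Grouping: one malloc() sized up front replaces three separate calls
// (struct, key buffer, value buffer).
Record* record_create(const char* key, std::size_t key_len,
                      const char* val, std::size_t val_len) {
    assert(key != nullptr && val != nullptr);
    void* block = std::malloc(sizeof(Record) + key_len + val_len);
    if (block == nullptr) return nullptr;
    Record* rec = static_cast<Record*>(block);
    rec->key_len = key_len;
    rec->val_len = val_len;
    rec->key = reinterpret_cast<char*>(rec + 1);
    rec->val = rec->key + key_len;
    std::memcpy(rec->key, key, key_len);
    std::memcpy(rec->val, val, val_len);
    return rec;
}

// A single free() releases the struct and both buffers.
void record_free(Record* rec) { std::free(rec); }

// Stack instead of heap: small inputs use a fixed stack buffer, so the
// common case never touches malloc() at all.
unsigned scratch_checksum(const char* data, std::size_t len) {
    assert(data != nullptr);
    char stack_buf[256];
    char* buf = stack_buf;
    if (len > sizeof(stack_buf)) {
        buf = static_cast<char*>(std::malloc(len));
        if (buf == nullptr) return 0;
    }
    std::memcpy(buf, data, len);  // stands in for real work on the scratch copy
    unsigned sum = 0;
    for (std::size_t i = 0; i < len; ++i) {
        sum += static_cast<unsigned char>(buf[i]);
    }
    if (buf != stack_buf) std::free(buf);
    return sum;
}
```

Fewer and more uniform malloc/free calls give the allocator fewer opportunities to fragment the heap, which is the effect described above.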

This not only fixed Bug#57480 but improved InnoDB memory usage in general.

Read the rest of this entry »

Dec 20

The problem and its cause

There have been several complaints over the years about InnoDB’s inability to scale beyond 256 connections. One of the main issues behind this scalability bottleneck was the read view creation that is required for MVCC (Multi Version Concurrency Control) to work. When the user starts a transaction this is what InnoDB does under the hood:

  • Create or reuse a transaction instance; usually one is reused from a pool (trx_sys_t::mysql_trx_list).
  • Initialize the transaction start time and assign a rollback segment.
  • Append the transaction to the active transaction list (trx_sys_t::trx_list), which is ordered on trx_t::id in descending order.

The append to the trx_sys_t::trx_list and the corresponding removal during commit are covered by trx_sys_t::mutex. After the transaction is “started”, and if it has an isolation level greater than or equal to REPEATABLE-READ, InnoDB creates a view (snapshot) of the running system state before the transaction accesses its first record/row. It does this by examining the transactions that are active at the time of the MVCC snapshot, so that their changes can be excluded from the creating transaction’s read view. This read view creation is also covered by the trx_sys_t::mutex. As the number of active transactions in the system increases, read view creation takes longer and longer. This increases the wait times on the trx_sys_t::mutex (during transaction start and commit), and once threads are forced to wait on a condition variable (in contrast to simply spinning while waiting for the mutex) the system throughput drops dramatically.
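The bottleneck described above can be sketched in a few lines. This is a simplified model under stated assumptions, not InnoDB’s code: SimpleTrxSys and its members are hypothetical names, and a real read view holds more than a list of ids. The point it shows is that start, commit and read view creation all serialize on one mutex, and that the read view copy grows with the number of active transactions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <list>
#include <mutex>
#include <vector>

using TrxId = std::uint64_t;

// Hypothetical stand-in for the transaction system (illustration only).
class SimpleTrxSys {
public:
    // Transaction start: assign an id and add it to the active list.
    TrxId start() {
        std::lock_guard<std::mutex> guard(mutex_);
        TrxId id = next_id_++;
        active_.push_front(id);  // newest first keeps ids in descending order
        return id;
    }

    // Commit: remove the transaction from the active list.
    void commit(TrxId id) {
        std::lock_guard<std::mutex> guard(mutex_);
        assert(id != 0 && id < next_id_);
        active_.remove(id);
    }

    // Read view creation: copy the ids of all active transactions.
    // The copy is O(number of active transactions) and is done while
    // holding the same mutex that start() and commit() need.
    std::vector<TrxId> create_read_view() {
        std::lock_guard<std::mutex> guard(mutex_);
        return std::vector<TrxId>(active_.begin(), active_.end());
    }

private:
    std::mutex       mutex_;
    std::list<TrxId> active_;
    TrxId            next_id_ = 1;
};

// Changes made by a transaction that was active at snapshot time are
// excluded from the view.
bool excluded_from_view(const std::vector<TrxId>& view, TrxId id) {
    return std::find(view.begin(), view.end(), id) != view.end();
}
```

With a few active transactions the copy is cheap; with thousands, every start and commit waits behind it.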

The solution

While investigating this problem there were two observations that I made: Read the rest of this entry »

Dec 20

In the 5.6.4 release it is now possible to create an InnoDB database with 4k or 8k page sizes in addition to the original 16k page size. Previously, this could only be done by recompiling the engine with different values for UNIV_PAGE_SIZE_SHIFT and UNIV_PAGE_SIZE. With this release, you can set --innodb-page-size=n when starting mysqld, or put innodb_page_size=n in the [mysqld] section of the configuration file, where n can be 4k, 8k, 16k, or 4096, 8192, 16384.
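The accepted values and the two compile-time constants are related by a simple power of two: the page size in bytes is 1 shifted left by the shift value. The sketch below only illustrates that relationship; both functions are hypothetical helpers and not InnoDB’s option parser.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: maps an accepted page size value to its shift
// (log2 of the size in bytes). Returns 0 for values that are not accepted.
unsigned page_size_shift(const std::string& value) {
    if (value == "4k"  || value == "4096")  return 12;
    if (value == "8k"  || value == "8192")  return 13;
    if (value == "16k" || value == "16384") return 14;
    return 0;
}

// The page size in bytes is 1 << shift, the same relationship that ties
// UNIV_PAGE_SIZE to UNIV_PAGE_SIZE_SHIFT.
std::size_t page_size_bytes(unsigned shift) {
    assert(shift >= 12 && shift <= 14);
    return static_cast<std::size_t>(1) << shift;
}
```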

Support for smaller page sizes may be useful for certain storage media such as SSDs. Performance results can vary depending on your data schema, record size, and read/write ratio, but this gives you more options for optimizing performance.

When this new setting is used, the page size is set for all tablespaces used by that InnoDB instance. You can query the current value with:

SHOW VARIABLES LIKE 'innodb_page_size';
or
SELECT variable_value FROM information_schema.global_status WHERE LOWER(variable_name) = 'innodb_page_size';

It is a read-only variable while the engine is running, since it must be set before InnoDB starts up and creates a new system tablespace. That happens when InnoDB does not find ibdata1 in the data directory. If you start mysqld with a page size other than the standard 16k, the error log will contain something like this:

Read the rest of this entry »

Apr 11

For those interested in InnoDB internals, this post tries to explain why the global kernel mutex was required, which new mutexes and rw-locks now replace it, and the long-term benefit of this change.

InnoDB’s core sub-systems up to v5.5 are protected by a global mutex called the kernel mutex. This makes it difficult to do even some common sense optimisations. In the past we tried optimising the code, but it would invariably upset the delicate balance that had been achieved by tuning the code that used the global kernel mutex, leading to unexpected performance regressions. The kernel mutex is also abused in several places to cover operations unrelated to the core, e.g., some counters in the server thread main loop.

The InnoDB core sub-systems are:

  1. The Locking sub-system
  2. The Transaction sub-system
  3. MVCC views

For any state change in the above sub-systems we had to acquire the kernel mutex, which reduced concurrency and made the kernel mutex very highly contended. A transaction that was creating a lock would end up blocking read view creation (for MVCC) and transaction start or commit/rollback. With the finer granularity mutexes and rw-locks, a transaction that is creating a lock will not block transaction start or commit/rollback. MVCC read view creation will, however, still block transaction start and commit/rollback because of the shared trx_sys_t::trx_list. But MVCC read view creations will not block each other, because they acquire an S lock.
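The split can be sketched with standard C++ primitives. This is an illustration under stated assumptions, not InnoDB’s implementation: FineGrainedTrxSys and its members are hypothetical names, and std::shared_mutex (C++17) stands in for InnoDB’s own rw-lock. Lock creation takes its own mutex, transaction start and commit take the transaction list latch in exclusive (X) mode, and read view creation takes the same latch in shared (S) mode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical model of the finer-grained scheme (illustration only).
class FineGrainedTrxSys {
public:
    // Start modifies the active list, so it needs the latch in X mode.
    std::uint64_t start_trx() {
        std::unique_lock<std::shared_mutex> x_lock(trx_latch_);
        std::uint64_t id = next_id_++;
        active_.push_front(id);
        return id;
    }

    // Commit also modifies the active list: X mode again.
    void commit_trx(std::uint64_t id) {
        std::unique_lock<std::shared_mutex> x_lock(trx_latch_);
        active_.remove(id);
    }

    // Read view creation only reads the list, so S mode is enough.
    // Several threads can build read views at the same time, but they
    // still exclude start_trx() and commit_trx() while they run.
    std::vector<std::uint64_t> create_read_view() const {
        std::shared_lock<std::shared_mutex> s_lock(trx_latch_);
        return std::vector<std::uint64_t>(active_.begin(), active_.end());
    }

    // Lock creation uses a separate mutex, so it never waits on the
    // transaction list latch and never makes start or commit wait.
    std::size_t create_lock(std::uint64_t trx_id) {
        assert(trx_id != 0);
        std::lock_guard<std::mutex> guard(lock_mutex_);
        locks_.push_back(trx_id);
        return locks_.size();
    }

private:
    mutable std::shared_mutex  trx_latch_;   // protects active_ and next_id_
    std::mutex                 lock_mutex_;  // protects locks_
    std::list<std::uint64_t>   active_;
    std::vector<std::uint64_t> locks_;
    std::uint64_t              next_id_ = 1;
};
```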

Read the rest of this entry »

Sep 19

At MySQL, we know our users want Performance, Scalability, Reliability, and Availability, regardless of the platform they choose to deploy on. We have always had excellent benchmarks on Linux, and with MySQL 5.5, we are also working hard on improving performance on Windows.

The original patch for improving Windows performance was developed by MySQL senior developer Vladislav Vaintroub, with benchmarks by QA engineer Jonathan Miller. We integrated the patch into the MySQL 5.5 release.

The following two charts show the comparison of MySQL 5.5 vs. MySQL 5.1 (plugin) vs. MySQL 5.1 (builtin) using sysbench:

Read the rest of this entry »

Apr 16

What a busy week! Lots of MySQL 5.5 announcements that just happened to coincide with the MySQL Conference and Expo in Silicon Valley. Here are some highlights of the performance and scalability work that the InnoDB team was involved with.

A good prep for the week of news is the article Introduction to MySQL 5.5, which includes information about the major performance and scalability features. That article will lead you into the MySQL 5.5 manual for general features and the InnoDB 1.1 manual for performance & scalability info.

Then there were the conference presentations from InnoDB team members, which continued the twin themes of performance and scalability:

Read the rest of this entry »

Aug 11

Today, the InnoDB team announced the latest release of the InnoDB Plugin, release 1.0.4. Some of the performance gains in this release are quite remarkable!

As noted in the announcement, this release contains contributions from Sun Microsystems, Google and Percona, Inc., for which we are very appreciative. This page briefly describes each of the contributions and the way we treated them. The purpose of this post is to describe the general approach the InnoDB team takes toward third party contributions.

In principle, we appreciate third party contributions. However, we simply don’t have the resources to seriously evaluate every change that someone proposes. When we do undertake to evaluate a patch, we have some clear criteria in mind:

  • The patch has to be technically sound, reliable, and effective
  • The change should fit with the architecture, and our overall plans and philosophy for InnoDB
  • The contribution must be available to us under a suitable license

Let’s consider, in general terms, what these criteria mean in practice.

Read the rest of this entry »

Mar 31

Some months ago, Google released a patch for InnoDB that boosts performance on multi-core servers. We decided to incorporate the change into the InnoDB Plugin to make everybody happy: users of InnoDB don’t have to apply the patch, and Google no longer has to maintain the patch for new versions of InnoDB. And it makes us at Innobase happy because it improves our product (as you can see in this post about InnoDB Plugin release 1.0.3).

However, there are always technical and business issues to address. Given the low-level changes in the patch, was it technically sound? Was the patch stable and as rock solid as is the rest of InnoDB? Although it was written for the built-in InnoDB in MySQL 5.0.37, we needed to adapt it to the InnoDB Plugin. Could we make the patch portable to many platforms? Could we incorporate the patch without legal risk (so it could be licensed under both the GPL and commercial terms)?

Fortunately Google generously donated the patch to us under the BSD license, so there was no concern about intellectual property (so long as we properly acknowledged the contribution, which we do in an Appendix to the manual and in the source code). So, while the folks at Google are known for writing excellent code, we had to thoroughly review and test the patch before it could be incorporated in a release of InnoDB.

The patch improves performance by implementing InnoDB’s mutex and rw-lock operations with atomic memory instructions. The first issue that arose was that the patch assigned the integer value -1 to a pthread_t variable, to refer to a neutral/non-existent thread identifier. This approach worked for Google because they use InnoDB solely on Linux. As it happens, pthread_t is defined as an integer on Linux.

But we had problems when the patch was tested on FreeBSD. We still needed to reference a non-existent thread, but in some environments (e.g. some versions of HP-UX, Mac OS X) pthread_t is defined as a structure, not an integer. As we looked at it, the problem became more complex. For this scheme to work, it must be possible to change thread identifiers atomically, using a Compare-And-Swap instruction. Otherwise, there will be subtle, mysterious and nasty failures.
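To see why the identifier has to be something a Compare-And-Swap can operate on, consider the minimal sketch below. It is not the Google patch or the InnoDB code: OwnerId, kNoOwner and OwnedLatch are hypothetical names, and the sketch assumes each thread has been given an integer identifier of its own rather than using pthread_t directly. It also reserves 0 as the “no owner” value, whereas the patch described above used -1.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical integer thread identifier (illustration only).
// 0 is reserved to mean "no thread owns the latch".
using OwnerId = std::uintptr_t;
constexpr OwnerId kNoOwner = 0;

class OwnedLatch {
public:
    // Acquire succeeds only if the owner field still holds kNoOwner.
    // The compare and the store happen as one atomic step, so two
    // threads can never both see "no owner" and both take the latch.
    bool try_acquire(OwnerId self) {
        assert(self != kNoOwner);
        OwnerId expected = kNoOwner;
        return owner_.compare_exchange_strong(expected, self,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }

    // Release succeeds only if the caller is the current owner.
    bool release(OwnerId self) {
        OwnerId expected = self;
        return owner_.compare_exchange_strong(expected, kNoOwner,
                                              std::memory_order_release,
                                              std::memory_order_relaxed);
    }

    OwnerId owner() const { return owner_.load(std::memory_order_acquire); }

private:
    std::atomic<OwnerId> owner_{kNoOwner};
};
```

An integer of machine word size maps directly onto a hardware Compare-And-Swap. A pthread_t that is a structure offers no such guarantee, and it has no portable value that can play the role of kNoOwner.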

Read the rest of this entry »

Mar 20

Ooops! Mark Callaghan of Google is one of the world’s experts in InnoDB, and a frequent blogger on its performance characteristics. The InnoDB Plugin 1.0.3 is much more scalable on multi-core systems because of the contributions he has made (along with Ben Handy).

Mark will deliver a keynote on Google’s use of MySQL and InnoDB on Tuesday morning at the MySQL Conference, and another talk on Wednesday. As Mark says, “Although Innodb is not in the title, it is prominent in both of the talks I will do”:

Read the rest of this entry »

Mar 19

That should read “Talks, Talks, Talks” … There will be several presentations by InnoDB experts at the upcoming 2009 MySQL Conference and Expo. Whether you’re a newbie or an experienced DBA deeply familiar with InnoDB, you won’t want to miss these important talks about InnoDB:

Note the new times for the last two talks above. Be sure to check the conference schedule! Not much more to say about this topic, at least not here. Hear it all there!
