Achieving a 30x improvement in imap-backup times


Until recently, imap-backup's default strategy was to write emails and metadata to disk one-at-a-time. This approach made sense for small folders, and for machines with few resources, but with large folders, it could get very slow.

Backup Times with Original Strategy

For mailboxes with hundreds of thousands of messages, backups were running for a couple of hours.

In fact, backup times grew quadratically with mailbox size - far worse than linearly.

Mailbox sizes against original backup times on a logarithmic scale

The Cause

imap-backup stores mailbox information in two files. One is a standard MBOXRD flat file of messages. The other is a JSON metadata file, with a .imap extension.

The .imap file records the UID (message id), message flags ('Seen', 'Draft', etc) and the offset and length of the message in the .mbox file.
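For illustration, a .imap file looks something like this (the field names here are indicative, not a guaranteed schema - check the .imap files in your own backup for the exact format):

```json
{
  "version": 3,
  "uid_validity": 123456789,
  "messages": [
    {"uid": 1, "offset": 0, "length": 1536, "flags": ["Seen"]},
    {"uid": 2, "offset": 1536, "length": 2048, "flags": []}
  ]
}
```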

In order to avoid corruption, both files are written in a synchronised way, with the ability to roll back on errors. While writing to the .mbox file is a simple append for each message, the .imap file is completely rewritten on every save. So, backing up a 100 000 message mailbox means 100 000 writes to this file, with each write growing slightly larger and requiring JSON serialization of more data!
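As a toy model (not imap-backup's actual code), this shows why re-serializing the whole message list on every append makes the total bytes written grow quadratically with mailbox size:

```ruby
require "json"

# Total bytes written if the full metadata file is re-serialized
# after every single message append.
def total_metadata_bytes(count)
  messages = []
  total = 0
  count.times do |i|
    messages << {uid: i, offset: i * 100, length: 100, flags: []}
    total += JSON.generate({messages: messages}).bytesize # full rewrite
  end
  total
end

# Ten times the messages costs roughly a hundred times the bytes written:
puts total_metadata_bytes(100)
puts total_metadata_bytes(1_000)
```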


The Solution

The solution was to hold downloaded metadata in memory, write messages to disk as they arrive, and write the .imap file just once. I've called this the "delayed metadata" strategy.

In order to maintain integrity guarantees for the serialized mailbox, I created a transaction system for the Serializer, to wrap the appends to the mailbox and the metadata file.

download_serializer.transaction do
  # ... fetch messages and append them to the mailbox ...
end

From lib/imap/backup/account/folder_backup.rb

If any error or exception occurs, the mailbox is rolled back to its starting point, and no metadata is ever written.
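The transaction mechanics aren't shown in the post, but the core idea can be sketched like this (a simplified illustration, not the actual Serializer code):

```ruby
# Minimal sketch of a savepoint-holding transaction object.
class Transaction
  attr_reader :data

  # Store the savepoint, then run the block. On success the savepoint
  # is discarded; on error it is left in place so the caller can read
  # it back and roll the files back to the saved state.
  def begin(data)
    @data = data
    yield
    @data = nil # commit: the savepoint is no longer needed
  end

  def in_transaction?
    !@data.nil?
  end
end

tx = Transaction.new
begin
  tx.begin({savepoint: {length: 1024}}) do
    raise "simulated write failure"
  end
rescue RuntimeError
  # the savepoint survives the error, ready for rollback
  puts tx.data[:savepoint][:length] if tx.in_transaction? # prints 1024
end
```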

The Other Strategy

While working on this feature, I actually implemented another "delay everything" strategy, which also delayed writing the message data to the mailbox file. Doing so meant holding a lot more data in memory, and resulted in, at most, a marginal speed gain. So after running benchmarks on both strategies, I dropped the "delay everything" strategy - the extra complexity wasn't worth keeping.

Handling Ctrl+C

During the transaction, the on-disk mailbox data gets ahead of the serialized metadata, so I needed to avoid problems caused by manual interruption of a backup run.

To do this, the program traps SignalExceptions alongside the normal StandardErrors in the mailbox writer and the metadata serializer.

tsx.begin({savepoint: {length: length}}) do
  yield
rescue StandardError => e
  message = <<~ERROR
    #{self.class} error #{e}
  ERROR
  Logger.logger.error message
  rollback
  raise e
rescue SignalException => e
  Logger.logger.error "#{self.class} handling #{e.class}"
  rollback
  raise e
end

From lib/imap/backup/serializer/mbox.rb

Rescuing SignalException is unusual in Ruby - you should normally just let the program terminate - but in this case I feel doing so is justified in the name of data integrity.

In the mailbox writer, the rollback actually rewinds the file to its starting position:

File.open(pathname, File::RDWR | File::CREAT, 0o644) do |f|
  f.truncate(tsx.data[:savepoint][:length])
end

While, in the metadata serializer, it just resets itself to the preceding metadata:

@messages = tsx.data[:savepoint][:messages]
@uid_validity = tsx.data[:savepoint][:uid_validity]


Maintaining Two Strategies

As this new system uses more memory than the original one-at-a-time approach, it might have caused problems on smaller systems with limited resources. For example, using a Raspberry Pi to back up email to disk is a common use case for imap-backup.

In order to allow users to keep resource usage to a minimum if needed, I've kept the original strategy and added a configuration option to choose between the new, faster strategy and the old, less resource-hungry one.

Choose a Download Strategy:
1. write straight to disk
2. delay writing metadata <- current
3. help
4. (q) return to main menu

As the new strategy works best on all but the most resource-starved machines, I've made it the default.


The Benchmarks

These benchmarks were run on a Framework Laptop with 16GB RAM and a solid-state disk.

The IMAP server used was a local Docker image running Dovecot.

Mailboxes were filled with the required number of identical messages, each 52 bytes long.

I chose sample sizes that grew exponentially:

1, 3, 7, 20, 55, 148, 403, 1097, 2981, 8103, 22026, 59874, 162755

Doing so means I can plot results on a logarithmic scale and get evenly spaced data points.
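The sizes are successive powers of e, rounded to the nearest integer - in Ruby:

```ruby
# Exponentially growing sample sizes: round(e^n) for n = 0..12
sizes = (0..12).map { |n| (Math::E**n).round }
p sizes
# => [1, 3, 7, 20, 55, 148, 403, 1097, 2981, 8103, 22026, 59874, 162755]
```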

Each sample size was measured 4 times and the results averaged.

These are the results for the "straight to disk" strategy:

Count | Run 1 | Run 2 | Run 3 | Run 4 | Average

While these are the results of the "delayed metadata" strategy:

Count | Run 1 | Run 2 | Run 3 | Run 4 | Average

With small mailboxes, times are more or less the same, but as the number of messages grows, the new strategy is much faster:

The delayed metadata strategy is much better for large mailboxes

As was clear from the start, re-writing the metadata file each time a new message was added was only going to get worse. In fact, the per-message backup time grows linearly with the "straight to disk" strategy, while it remains more or less constant with the new strategy:
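A back-of-the-envelope cost model makes the shapes of the two curves plain. Assuming writing a message costs one unit per metadata record written:

```ruby
# "straight to disk": message k rewrites all k metadata records,
# so the total is 1 + 2 + ... + n = n(n + 1) / 2 - quadratic overall,
# i.e. linear per message.
def rewrite_cost(n)
  (1..n).sum
end

# "delayed metadata": each record is written once, plus one final
# write of the whole metadata file - linear overall.
def delayed_cost(n)
  n + 1
end

[1_000, 10_000, 100_000].each do |n|
  printf("%7d messages: %11d vs %7d record writes\n",
         n, rewrite_cost(n), delayed_cost(n))
end
```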

Adding each new message got slower as the mailbox size grew

How Much Faster?

In these benchmarks, while the new strategy changed nothing for small mailboxes, it showed a more than 30x speed improvement for mailboxes with more than 150 000 messages.

The delayed metadata strategy gives 30x speed improvement for very large mailboxes

It worked!