Achieving a 30x improvement in imap-backup times

2023/08/21

Until recently, imap-backup's default strategy was to write emails and metadata to disk one at a time. This approach made sense for small folders and for machines with few resources, but with large folders it could get very slow.

Backup Times with Original Strategy

For mailboxes with hundreds of thousands of messages, backups were running for a couple of hours.

In fact, backup times grew roughly quadratically with mailbox size.

Mailbox sizes against original backup times on a logarithmic scale

The Cause

imap-backup stores mailbox information in two files. One is a standard MBOXRD flat file of messages. The other is a JSON metadata file, with a .imap extension.

For each message, the .imap file records the UID (message id), the message flags ('Seen', 'Draft', etc.) and the offset and length of the message in the .mbox file.

To avoid corruption, the two files are written in a synchronised way, and changes are rolled back on errors. While writing to the .mbox file is a simple append for each message, the .imap file is completely rewritten on every save. So, backing up a 100 000 message mailbox means 100 000 writes to this file, with each write growing slightly larger and requiring JSON serialization of more data!
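
To illustrate the cost, the original per-message write path looked roughly like this (a simplified sketch, not the actual imap-backup code):

require "json"

# Simplified sketch of the original per-message write path (not the real
# imap-backup code): the mbox append is cheap, but the metadata file is
# re-serialized and rewritten in full for every message.
def append_message(mbox_path, imap_path, uid, flags, body)
  offset = File.exist?(mbox_path) ? File.size(mbox_path) : 0
  File.open(mbox_path, "a") { |f| f.write(body) }    # O(1) append

  metadata =
    if File.exist?(imap_path)
      JSON.parse(File.read(imap_path))
    else
      {"messages" => []}
    end
  metadata["messages"] << {
    "uid" => uid, "flags" => flags, "offset" => offset, "length" => body.bytesize
  }
  File.write(imap_path, JSON.generate(metadata))     # O(n) rewrite on every message
end

With n messages already serialized, each new message costs O(n) work in the metadata rewrite, so a full backup costs O(n²) overall.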

Transactions

The solution was to hold downloaded metadata in memory, write messages to disk as they arrive, and write the .imap file just once, at the end of the run. I've called this the "delayed metadata" strategy.

In order to maintain integrity guarantees for the serialized mailbox, I created a transaction system for the Serializer, to wrap the appends to the mailbox and the metadata file.

download_serializer.transaction do
  downloader.run
  ...
end

From lib/imap/backup/account/folder_backup.rb

If any error or exception occurs, the mailbox is rolled back to its starting point, and no metadata is ever written.
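
Conceptually, the transaction looks something like this (a simplified sketch of the idea, not the real Serializer code):

require "json"

# Simplified sketch of the "delayed metadata" idea, not the actual Serializer:
# metadata is buffered in memory, the .imap file is written once at the end,
# and the mbox is truncated back to its savepoint if anything goes wrong.
def transaction(mbox_path, imap_path)
  savepoint = File.exist?(mbox_path) ? File.size(mbox_path) : 0
  buffered_metadata = []

  yield buffered_metadata   # message appends go to the mbox and to this buffer

  File.write(imap_path, JSON.generate({"messages" => buffered_metadata}))
rescue StandardError
  # Roll the mbox back to its pre-transaction length; the metadata was never written
  File.truncate(mbox_path, savepoint) if File.exist?(mbox_path)
  raise
end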

The Other Strategy

While working on this feature, I actually implemented another "delay everything" strategy, which also delayed writing the message data to the mailbox file. Doing so meant holding a lot more data in memory, and resulted in, at most, a marginal speed gain. So after running benchmarks on both strategies, I dropped the "delay everything" strategy - the extra complexity wasn't worth keeping.

Handling Ctrl+C

During a transaction, the on-disk mailbox data gets ahead of the serialized metadata, so I needed to avoid problems caused by manual interruption of a backup run.

To do this, the program rescues SignalException alongside the normal StandardError in both the mailbox writer and the metadata serializer.

tsx.begin({savepoint: {length: length}}) do
  block.call
rescue StandardError => e
  message = <<~ERROR
    #{self.class} error #{e}
    #{e.backtrace.join("\n")}
  ERROR
  Logger.logger.error message
  rollback
  raise e
rescue SignalException => e
  Logger.logger.error "#{self.class} handling #{e.class}"
  rollback
  raise e
end

From lib/imap/backup/serializer/mbox.rb

Rescuing SignalException is unusual in Ruby - you should normally just let the program terminate - but in this case I feel doing so is justified in the name of data integrity.

In the mailbox writer, the rollback actually rewinds the file to its starting position:

# Truncate the file back to the length recorded when the transaction began
File.open(pathname, File::RDWR | File::CREAT, 0o644) do |f|
  f.truncate(length)
end

In the metadata serializer, meanwhile, the rollback just restores the previous metadata:

@messages = tsx.data[:savepoint][:messages]
@uid_validity = tsx.data[:savepoint][:uid_validity]

tsx.clear

Maintaining Two Strategies

As this new system uses more memory than the original one-at-a-time approach, it might have caused problems on smaller systems with limited resources. For example, backing up email with a Raspberry Pi is a common use case for imap-backup.

To allow users to keep resource usage to a minimum if needed, I've kept the original strategy and added a configuration option to choose between the new, faster strategy and the old, less resource-hungry one.

Choose a Download Strategy:
1. write straight to disk
2. delay writing metadata <- current
3. help
4. (q) return to main menu
?  

As the new strategy works better on all but the most resource-starved machines, I've made it the default.

Benchmarks

These benchmarks were run on a Framework Laptop with 16GB RAM and a solid-state disk.

The IMAP server used was a local Docker image running Dovecot.

Mailboxes were filled with the required number of identical messages, each 52 bytes long.
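
The benchmark harness itself isn't shown here, but a test mailbox can be filled with Ruby's net-imap library; a minimal sketch (the host, credentials, folder and message are placeholders, not the actual harness):

require "net/imap"

# Minimal sketch of filling a local Dovecot mailbox with identical messages.
# Host, credentials, folder and message are placeholders, not the real harness.
MESSAGE = "From: test@example.com\r\nSubject: x\r\n\r\nbenchmark\r\n"

imap = Net::IMAP.new("localhost")
imap.login("test", "test")
1097.times { imap.append("INBOX", MESSAGE, [:Seen]) }
imap.logout
imap.disconnect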

I chose sample sizes that grew exponentially:

1, 3, 7, 20, 55, 148, 403, 1097, 2981, 8103, 22026, 59874, 162755

Doing so means I can plot results on a logarithmic scale and get evenly spaced data points.
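
In fact, the sizes are powers of e rounded to the nearest integer:

# Sample sizes: e^0 .. e^12, rounded to the nearest integer
sizes = (0..12).map { |n| (Math::E ** n).round }
# => [1, 3, 7, 20, 55, 148, 403, 1097, 2981, 8103, 22026, 59874, 162755]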

Each sample size was measured four times and the results were averaged.

These are the results for the "straight to disk" strategy (times in seconds):

| Count  | Run 1      | Run 2      | Run 3      | Run 4      | Average    |
|--------|------------|------------|------------|------------|------------|
| 1      | 0.8107     | 0.8101     | 0.81012    | 0.8159     | 0.8117     |
| 3      | 0.8103     | 0.8124     | 0.8109     | 0.8107     | 0.8111     |
| 7      | 0.8104     | 0.8139     | 0.8122     | 0.8391     | 0.8189     |
| 20     | 0.9109     | 0.8126     | 0.8108     | 0.8103     | 0.8362     |
| 55     | 0.91117    | 0.9114     | 1.01279    | 0.9123     | 0.9369     |
| 148    | 1.2142     | 1.1106     | 1.2127     | 1.1109     | 1.1621     |
| 403    | 1.8156     | 2.0153     | 1.8143     | 1.8159     | 1.8653     |
| 1097   | 4.2251     | 3.8227     | 3.6292     | 4.3566     | 4.0084     |
| 2981   | 13.5807    | 13.1861    | 12.9907    | 13.1866    | 13.2360    |
| 8103   | 46.7991    | 46.6172    | 45.5853    | 48.1261    | 46.7819    |
| 22026  | 379.200    | 308.3147   | 280.3035   | 269.1188   | 309.2343   |
| 59874  | 1891.7602  | 1876.3401  | 1882.6212  | 1903.9310  | 1888.6631  |
| 162755 | 14707.6912 | 14718.1070 | 14666.9172 | 14379.6484 | 14618.0910 |

And these are the results for the "delayed metadata" strategy (times in seconds):

| Count  | Run 1    | Run 2    | Run 3    | Run 4    | Average  |
|--------|----------|----------|----------|----------|----------|
| 1      | 0.8096   | 0.9128   | 0.8115   | 0.8109   | 0.8362   |
| 3      | 0.8102   | 0.8104   | 0.8124   | 0.8114   | 0.8111   |
| 7      | 0.8110   | 0.8106   | 0.8105   | 0.8108   | 0.8107   |
| 20     | 0.8107   | 0.8107   | 0.8123   | 0.8108   | 0.8111   |
| 55     | 0.9129   | 0.9109   | 0.9111   | 0.9107   | 0.9114   |
| 148    | 1.0114   | 1.1116   | 1.0114   | 1.0116   | 1.0365   |
| 403    | 1.6135   | 1.7143   | 1.6170   | 1.6144   | 1.6398   |
| 1097   | 3.1257   | 3.1201   | 3.2215   | 4.6316   | 3.5247   |
| 2981   | 7.8428   | 9.0514   | 9.7556   | 7.9444   | 8.6485   |
| 8103   | 19.5920  | 23.6093  | 22.8553  | 21.0996  | 21.7890  |
| 22026  | 69.7113  | 70.0457  | 65.5418  | 60.7685  | 66.5168  |
| 59874  | 175.1450 | 165.5433 | 163.8310 | 172.4675 | 169.2466 |
| 162755 | 462.1263 | 443.1742 | 478.1179 | 466.9995 | 462.6044 |

With small mailboxes, times are more or less the same, but as the number of messages grows, the new strategy is much faster:

The delayed metadata strategy is much better for large mailboxes

As was clear from the start, rewriting the metadata file each time a new message was added was only going to get worse as mailboxes grew. In fact, the per-message backup time grows linearly with mailbox size under the "straight to disk" strategy, while it remains more or less constant with the new strategy:

Adding each new message got slower as the mailbox size grew

How Much Faster?

In these benchmarks, while the new strategy changed nothing for small mailboxes, it showed a more than 30x speed improvement for mailboxes with more than 150 000 messages.

The delayed metadata strategy gives 30x speed improvement for very large mailboxes
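
Taking the largest sample (162 755 messages) from the tables above, the ratio works out to roughly 31x:

# Average times for 162 755 messages, taken from the benchmark tables above
straight_to_disk = 14_618.0910  # seconds - about 4 hours
delayed_metadata = 462.6044     # seconds - about 8 minutes
straight_to_disk / delayed_metadata
# => 31.6 (approximately)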

It worked!