2023/08/21

Until recently, imap-backup's default strategy was to write emails and metadata to disk one-at-a-time. This approach made sense for small folders, and for machines with few resources, but with large folders, it could get very slow.

Backup Times with Original Strategy

With mailboxes with hundreds of thousands of messages, backup times were running for a couple of hours.

In fact, backup times increased exponentially with mailbox size.

Mailbox sizes against original backup times on a logarithmic scale

The Cause

imap-backup stores mailbox information in two files. One is a standard MBOXRD flat file of messages. The other is a JSON metadata file, with a .imap extension.

The .imap file records the UID (message id), message flags ('Seen', 'Draft', etc) and the offset and length of the message in the .mbox file.

In order to avoid corruption, both files are written to in a synchronised way, and to roll back on errors. While writing to the .mbox file is a simple append for each message, the .imap file is completely rewritten on every save. So, backing up a 100 000 message mailbox means 100 000 writes to this file, with each write growing slightly larger and requiring JSON serialization of more data!

Transactions

The solution was to hold downloaded metadata in memory, write messages to disk and write the .imap file just once. I've called this the "delayed metadata" strategy.

In order to maintain integrity guarantees for the serialized mailbox, I created a transaction system for the Serializer, to wrap the appends to the mailbox and the metadata file.

download_serializer.transaction do
  downloader.run
  ...
end

From /lib/imap/backup/acount/folder_backup.rb

If any error or exception occurs, the mailbox is rolled back to its starting point, and no metadata is ever written.

The Other Strategy

While working on this feature, I actually implemented another "delay everything" strategy, which also delayed writing the message data to the mailbox file. Doing so meant holding a lot more data in memory, and resulted in, at most, a marginal speed gain. So after running benchmarks on both strategies, I dropped the "delay everything" strategy - the extra complexity wasn't worth keeping.

Handling Ctrl+C

As, during the transaction, the on-disk data for the mailbox now gets ahead of the serialized metadata, I needed to avoid problems caused by manual interruption of a backup run.

To do this, the program traps SignalExceptions alongside the normal StandardErrors in the mailbox writer and the metadata serializer.

tsx.begin({savepoint: {length: length}}) do
  block.call
rescue StandardError => e
  message = <<~ERROR
    #{self.class} error #{e}
    #{e.backtrace.join("\n")}
  ERROR
  Logger.logger.error message
  rollback
  raise e
rescue SignalException => e
  Logger.logger.error "#{self.class} handling #{e.class}"
  rollback
  raise e
end

From lib/imap/backup/serializer/mbox.rb

Trapping SignalExceptions is unusual in Ruby - you should normally just let the program terminate, but in this case I feel doing so is justified in the name of data integrity.

In the mailbox writer, the rollback actually rewinds the file to its starting position

File.open(pathname, File::RDWR | File::CREAT, 0o644) do |f|
  f.truncate(length)
end

While, in the metadata serializer, it just resets itself to the preceding metadata:

@messages = tsx.data[:savepoint][:messages]
@uid_validity = tsx.data[:savepoint][:uid_validity]

tsx.clear

Maintaining Two Strategies

As this new system uses more memory that the original one-at-a-time approach, it might have caused problems on smaller systems with limited resources. For example, using a RaspberryPi to back up my email to disk is a common use case for imap-backup.

In order to allow users to keep resource usage to a minimum, if needed, I've kept the original strategy and added a configuration option to choose between the new, faster, strategy, and the old, less resource hungry, one.

Choose a Download Strategy:
1. write straight to disk
2. delay writing metadata <- current
3. help
4. (q) return to main menu
?

As the new strategy works best except for the most resource-starved machines, I've made it the default.

Benchmarks

These benchmarks were run on a Framework Laptop with 16GB RAM and a solid-state disk.

The IMAP server used was a local Docker image running Dovecot.

Mailboxes were filled with the required number of identical messages, each 52 bytes long.

I chose sample sizes that grew exponentially:

1, 3, 7, 20, 55, 148, 403, 1097, 2981, 8103, 22026, 59874, 162755

Doing so means I can plot results on a logarithmic scale and get evenly spaced data points.

Each sample size was measured 4 times and then I took averages.

These are the results for the "straight to disk" strategy:

Count	Run 1	Run 2	Run 3	Run 4	Average
1	0.8107	0.8101	0.81012	0.8159	0.8117
3	0.8103	0.8124	0.8109	0.8107	0.8111
7	0.8104	0.8139	0.8122	0.8391	0.8189
20	0.9109	0.8126	0.8108	0.8103	0.8362
55	0.91117	0.9114	1.01279	0.9123	0.9369
148	1.2142	1.1106	1.2127	1.1109	1.1621
403	1.8156	2.0153	1.8143	1.8159	1.8653
1097	4.2251	3.8227	3.6292	4.3566	4.0084
2981	13.5807	13.1861	12.9907	13.1866	13.2360
8103	46.7991	46.6172	45.5853	48.1261	46.7819
22026	379.200	308.3147	280.3035	269.1188	309.2343
59874	1891.7602	1876.3401	1882.6212	1903.9310	1888.6631
162755	14707.6912	14718.1070	14666.9172	14379.6484	14618.0910

While these are the results of the "delayed metadata" strategy:

Count	Run 1	Run 2	Run 3	Run 4	Average
1	0.8096	0.9128	0.8115	0.8109	0.8362
3	0.8102	0.8104	0.8124	0.8114	0.8111
7	0.8110	0.8106	0.8105	0.8108	0.8107
20	0.8107	0.8107	0.8123	0.8108	0.8111
55	0.9129	0.9109	0.9111	0.9107	0.9114
148	1.0114	1.1116	1.0114	1.0116	1.0365
403	1.6135	1.7143	1.6170	1.6144	1.6398
1097	3.1257	3.1201	3.2215	4.6316	3.5247
2981	7.8428	9.0514	9.7556	7.9444	8.6485
8103	19.5920	23.6093	22.8553	21.0996	21.7890
22026	69.7113	70.0457	65.5418	60.7685	66.5168
59874	175.1450	165.5433	163.8310	172.4675	169.2466
162755	462.1263	443.1742	478.1179	466.9995	462.6044

With small mailboxes, times are more or less the same, but as the number of messages grow, the new strategy is much faster:

The delayed metadata strategy is much better for large mailboxes

As was clear from the start, re-writing the metadata file each time a new message was added was only going to get worse. In fact, the per message backup time grows linearly with the "straight to disk", while it remains more or less constant with the new strategy:

Adding each new message got slower as the mailbox size grew

How Much Faster?

In these benchmarks, while the new strategy changed nothing for small mailboxes, it showed a more than 30x speed improvement for mailboxes with more than 150 000 messages.

The delayed metadata strategy gives 30x speed improvement for very large mailboxes

It worked!