Until recently, imap-backup's default strategy was to write emails and their metadata to disk one at a time. This approach made sense for small folders and for machines with few resources, but with large folders it could get very slow.
Backup Times with Original Strategy
For mailboxes with hundreds of thousands of messages, backups were taking a couple of hours.
In fact, backup times grew much faster than linearly with mailbox size - roughly quadratically, as the benchmarks below show.
Mailbox sizes against original backup times on a logarithmic scale
The Cause
imap-backup stores mailbox information in two files. One is a standard MBOXRD flat file of messages. The other is a JSON metadata file with a .imap extension.
The .imap file records the UID (message id), the message flags ('Seen', 'Draft', etc.) and the offset and length of the message in the .mbox file.
To avoid corruption, writes to the two files are synchronised, and are rolled back on errors. While writing to the .mbox file is a simple append for each message, the .imap file is completely rewritten on every save. So backing up a 100 000 message mailbox means 100 000 writes to this file, with each write growing slightly larger and requiring JSON serialization of more and more data!
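To see why this gets slow, here is a simplified sketch of the original write pattern (a hypothetical append_message helper, not the project's actual code): appending the message itself is cheap, but the metadata rewrite has to serialize everything written so far.

require "json"

# Simplified sketch of the original, per-message write pattern.
# Appending the nth message rewrites metadata for all n messages,
# so a full backup does roughly n * (n + 1) / 2 metadata writes.
def append_message(mbox_path, imap_path, metadata, message)
  offset = File.exist?(mbox_path) ? File.size(mbox_path) : 0

  # Cheap: append the raw message to the .mbox file
  File.open(mbox_path, "a") { |f| f.write(message[:body]) }

  # Expensive: serialize and rewrite the whole .imap file
  metadata[:messages] << {
    uid: message[:uid], flags: message[:flags],
    offset: offset, length: message[:body].bytesize
  }
  File.write(imap_path, JSON.generate(metadata))
end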
Transactions
The solution was to hold downloaded metadata in memory, write messages to disk as they arrive, and write the .imap file just once at the end. I've called this the "delayed metadata" strategy.
In order to maintain integrity guarantees for the serialized mailbox, I created a transaction system for the Serializer to wrap the appends to the mailbox and the metadata file.
download_serializer.transaction do
  downloader.run
  ...
end
From lib/imap/backup/account/folder_backup.rb
If any error or exception occurs, the mailbox is rolled back to its starting point, and no metadata is ever written.
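In outline, the transaction just nests a transaction on each of the two files. This is a simplified sketch of the idea (the mbox and imap objects are the two serializers), not the exact implementation:

# Simplified outline: each file keeps its own savepoint, and the
# metadata is only serialized to disk once the block completes.
def transaction(&block)
  mbox.transaction do    # savepoint: the current length of the .mbox file
    imap.transaction do  # savepoint: a snapshot of the in-memory metadata
      block.call
    end                  # on success, the .imap file is written once here
  end
end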
The Other Strategy
While working on this feature, I actually implemented another "delay everything" strategy, which also delayed writing the message data to the mailbox file. Doing so meant holding a lot more data in memory, and resulted in, at most, a marginal speed gain. So after running benchmarks on both strategies, I dropped the "delay everything" strategy - the extra complexity wasn't worth keeping.
Handling Ctrl+C
During the transaction, the on-disk data for the mailbox gets ahead of the serialized metadata, so I needed to avoid problems caused by manual interruption of a backup run.
To do this, the program traps SignalExceptions alongside the normal StandardErrors in the mailbox writer and the metadata serializer.
tsx.begin({savepoint: {length: length}}) do
  block.call
rescue StandardError => e
  message = <<~ERROR
    #{self.class} error #{e}
    #{e.backtrace.join("\n")}
  ERROR
  Logger.logger.error message
  rollback
  raise e
rescue SignalException => e
  Logger.logger.error "#{self.class} handling #{e.class}"
  rollback
  raise e
end
From lib/imap/backup/serializer/mbox.rb
Trapping SignalExceptions is unusual in Ruby - you should normally just let the program terminate, but in this case I feel doing so is justified in the name of data integrity.
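For context, pressing Ctrl+C raises Interrupt, a subclass of SignalException, so a rescue clause gets a chance to run clean-up code before the process exits. A minimal illustration:

# Press Ctrl+C while this sleeps: the rescue runs, then the
# re-raise lets the process terminate as it normally would.
begin
  sleep 60
rescue SignalException => e
  puts "Caught #{e.class} - rolling back before exiting"
  raise
end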
In the mailbox writer, the rollback actually rewinds the file to its starting position:
File.open(pathname, File::RDWR | File::CREAT, 0o644) do |f|
  f.truncate(length)
end
The metadata serializer, meanwhile, just resets itself to the savepoint's metadata:
@messages = tsx.data[:savepoint][:messages]
@uid_validity = tsx.data[:savepoint][:uid_validity]
tsx.clear
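That savepoint is captured when the metadata transaction begins, roughly like this (a simplified sketch, inferred from the rollback above rather than copied from the code):

# Snapshot the in-memory state before yielding, so that rollback
# can restore it if an error or a signal arrives part way through.
tsx.begin({savepoint: {messages: @messages.dup, uid_validity: @uid_validity}}) do
  block.call
end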
Maintaining Two Strategies
As this new system uses more memory than the original one-at-a-time approach, it might have caused problems on smaller systems with limited resources. For example, using a Raspberry Pi to back up email to disk is a common use case for imap-backup.
To allow users to keep resource usage to a minimum when needed, I've kept the original strategy and added a configuration option to choose between the new, faster strategy and the old, less resource-hungry one.
Choose a Download Strategy:
1. write straight to disk
2. delay writing metadata <- current
3. help
4. (q) return to main menu
?
As the new strategy works best on all but the most resource-starved machines, I've made it the default.
Benchmarks
These benchmarks were run on a Framework Laptop with 16GB RAM and a solid-state disk.
The IMAP server used was a local Docker image running Dovecot.
Mailboxes were filled with the required number of identical messages, each 52 bytes long.
I chose sample sizes that grew exponentially:
1, 3, 7, 20, 55, 148, 403, 1097, 2981, 8103, 22026, 59874, 162755
Doing so means I can plot results on a logarithmic scale and get evenly spaced data points.
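These happen to be e**n for n = 0 to 12, rounded to the nearest integer:

# The sample sizes are e**n for n = 0..12, rounded
sizes = (0..12).map { |n| (Math::E**n).round }
p sizes
# => [1, 3, 7, 20, 55, 148, 403, 1097, 2981, 8103, 22026, 59874, 162755]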
Each sample size was measured 4 times and then I took averages.
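A minimal timing harness for this kind of measurement might look like the sketch below (not the exact script used; the backup directory path is an assumption, and imap-backup is assumed to be installed and configured):

require "benchmark"
require "fileutils"

RUNS = 4
BACKUP_DIR = File.expand_path("~/backup") # wherever the account's local_path points

# Clear the local backup, run a full backup, and record the wall-clock
# time; repeat RUNS times and average.
times = Array.new(RUNS) do
  FileUtils.rm_rf(BACKUP_DIR)
  Benchmark.realtime { system("imap-backup", "backup") }
end

puts "runs:    #{times.map { |t| t.round(4) }.join(', ')}"
puts "average: #{(times.sum / RUNS).round(4)}s"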
These are the results for the "straight to disk" strategy:
Count | Run 1 (s) | Run 2 (s) | Run 3 (s) | Run 4 (s) | Average (s) |
---|---|---|---|---|---|
1 | 0.8107 | 0.8101 | 0.81012 | 0.8159 | 0.8117 |
3 | 0.8103 | 0.8124 | 0.8109 | 0.8107 | 0.8111 |
7 | 0.8104 | 0.8139 | 0.8122 | 0.8391 | 0.8189 |
20 | 0.9109 | 0.8126 | 0.8108 | 0.8103 | 0.8362 |
55 | 0.91117 | 0.9114 | 1.01279 | 0.9123 | 0.9369 |
148 | 1.2142 | 1.1106 | 1.2127 | 1.1109 | 1.1621 |
403 | 1.8156 | 2.0153 | 1.8143 | 1.8159 | 1.8653 |
1097 | 4.2251 | 3.8227 | 3.6292 | 4.3566 | 4.0084 |
2981 | 13.5807 | 13.1861 | 12.9907 | 13.1866 | 13.2360 |
8103 | 46.7991 | 46.6172 | 45.5853 | 48.1261 | 46.7819 |
22026 | 379.200 | 308.3147 | 280.3035 | 269.1188 | 309.2343 |
59874 | 1891.7602 | 1876.3401 | 1882.6212 | 1903.9310 | 1888.6631 |
162755 | 14707.6912 | 14718.1070 | 14666.9172 | 14379.6484 | 14618.0910 |
And these are the results for the "delayed metadata" strategy:
Count | Run 1 (s) | Run 2 (s) | Run 3 (s) | Run 4 (s) | Average (s) |
---|---|---|---|---|---|
1 | 0.8096 | 0.9128 | 0.8115 | 0.8109 | 0.8362 |
3 | 0.8102 | 0.8104 | 0.8124 | 0.8114 | 0.8111 |
7 | 0.8110 | 0.8106 | 0.8105 | 0.8108 | 0.8107 |
20 | 0.8107 | 0.8107 | 0.8123 | 0.8108 | 0.8111 |
55 | 0.9129 | 0.9109 | 0.9111 | 0.9107 | 0.9114 |
148 | 1.0114 | 1.1116 | 1.0114 | 1.0116 | 1.0365 |
403 | 1.6135 | 1.7143 | 1.6170 | 1.6144 | 1.6398 |
1097 | 3.1257 | 3.1201 | 3.2215 | 4.6316 | 3.5247 |
2981 | 7.8428 | 9.0514 | 9.7556 | 7.9444 | 8.6485 |
8103 | 19.5920 | 23.6093 | 22.8553 | 21.0996 | 21.7890 |
22026 | 69.7113 | 70.0457 | 65.5418 | 60.7685 | 66.5168 |
59874 | 175.1450 | 165.5433 | 163.8310 | 172.4675 | 169.2466 |
162755 | 462.1263 | 443.1742 | 478.1179 | 466.9995 | 462.6044 |
With small mailboxes, times are more or less the same, but as the number of messages grows, the new strategy is much faster:
The delayed metadata strategy is much better for large mailboxes
As was clear from the start, re-writing the metadata file each time a new message was added was only going to get worse as mailboxes grew. In fact, the per-message backup time grows linearly with the "straight to disk" strategy, while it remains more or less constant with the new strategy:
Adding each new message got slower as the mailbox size grew
How Much Faster?
In these benchmarks, while the new strategy changed nothing for small mailboxes, it gave a more than 30x speed improvement for mailboxes with more than 150 000 messages (an average of 14 618 seconds down to 463).
The delayed metadata strategy gives 30x speed improvement for very large mailboxes
It worked!