Comment archive - for node operators and technical

@andablackwidow · 2025-08-01 00:32 · HiveDevs

Recent optimizations

This is a continuation of the previous post about the latest optimizations. This article describes the second optimization listed there - the comment archive.

We've come a long way since the split with the legacy chain. Many of the previously applied memory-focused optimizations consisted of identifying and eliminating data that was unnecessary to run consensus code. There were also gains from rearranging data so it is stored more efficiently. There are still some places where that type of change can be made, but the expected gains are nothing compared to the 18GB (15.2GB after the pool allocator optimization) already consumed out of the recommended 24GB of SHM. The biggest one is moving account metadata and the related API to some HAF application (and that's less than 800MB and already optional - just compile without COLLECT_ACCOUNT_METADATA). We've basically run out of data that is never needed. There is one thing that could still give significant savings (even 2.5GB) - the future task mentioned at the end of the previous article - but the effect would only be that big because of one particular index - comments.

The above "double pacman" chart shows the relative sizes of various indexes (with account metadata eliminated). On the left we see a breakdown of the contents of SHM before the comment archive optimization. As you can see, comments take over 3/4 of all consumed memory. The second part shows how it looks afterwards - a justification for the separate follow-up future task.

The idea behind the comment archive optimization is the following: once a comment is paid out, you can't delete it and nothing in its content that is stored in consensus state can be changed; moreover, votes or replies to such comments are relatively rare, so why keep it in memory all the time? Drop it to a database and instead rely on its caching mechanisms to do the work for us.

Note: the text of a comment can still be edited, but hived doesn't care about that, since that data lives in HAF/Hivemind, not in SHM.

One thing has to be emphasized - this is a trade-off. You can't just migrate data that is still occasionally used to disk storage and expect not to pay for it with performance. Depending on the relative frequency of accessing external data as well as the hardware (the disk in particular), the cost might be bigger or smaller. Quite surprisingly, although in hindsight it's kind of obvious, the most frequent event when hived needs to look into the archive is a new comment - it needs to check whether the comment is really new or an edit of an existing one. It is also the slowest of all types of archive access.


Results from applying the second optimization

Just like last time, let's start with some results. Again, for fair comparison, timings are for up to block 93M. This time the baseline measurement is the value from the previous optimization, marked as comment-archive = NONE:
- 17 hours 28 minutes 22 seconds - replay (comment archive = NONE)
- 18 hours 40 minutes 2 seconds - replay (comment archive = ROCKSDB)
- 29 hours 44 minutes 47 seconds - synchronization without checkpoints (comment archive = NONE)
- 30 hours exactly - synchronization without checkpoints (comment archive = ROCKSDB)

One thing to note is that disk performance has a big impact; the results are also a lot less stable (roughly +/- 30 minutes per run - the best replay run was 18 hours 12 minutes 42 seconds). I've also noticed that runs with smaller SHM tend to last longer, but it might be just a coincidence. I've only run one sync with that configuration so far.

Ok, so the version with the optimization runs one to five quarters of an hour longer. So what did we gain for that lost performance? The previously recommended size of SHM was 24GB. After the pool allocator optimization you could run with 20GB with a solid safety margin, since only 15.2GB was actually used. With the comment archive I'd recommend 8GB (that's how I ran the measurements), and half of it is still empty. Unfortunately RocksDB also needs some memory for its cache, so the hived process took 12.6GB of RAM at peak (4.6GB above the size of SHM). For comparison, without the comment archive hived was taking 26GB at peak (2GB above the size of SHM).

In the course of reviewing the main implementation by Mariusz, I've added some detailed statistics - they were needed to make sure we are not introducing some scaling problem with these changes. Now we have a lot more data to draw nice graphs from, although only from replay (performance stats are unfortunately not unified across hived's modes of operation). I'll say more about it below, but to properly read the graphs you need to at least know that the comment archive mechanism has a proper abstraction layer allowing multiple implementations, and there are currently three of them, although they are only (officially) available on mirrornet and testnet. You can switch between them with the config.ini option comment-archive (the option does not exist for mainnet and its use has no effect, unless you comment out two lines of code - don't do that on your production node, especially not without reading the whole article first). The option has three values - NONE turns off the comment archive optimization, MEMORY uses an in-memory (SHM) archive, and ROCKSDB is the default version, the only one officially supported for mainnet. Each version has its own requirements for shared-file-size - I've run them with 24G, 20G and 8G respectively.
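As a sketch of what that looks like in practice, a mirrornet/testnet config.ini for the default variant might contain something like this (the option names are from the article; the pairing of sizes to variants follows the values I used in the measurements):

```ini
# Comment archive variant: NONE, MEMORY or ROCKSDB
# (only recognized on mirrornet/testnet builds)
comment-archive = ROCKSDB

# shared-file-size depends on the chosen variant:
#   NONE    -> 24G (no archiving, everything stays in SHM)
#   MEMORY  -> 20G (archive still lives in SHM)
#   ROCKSDB -> 8G  (comments migrate to RocksDB on disk)
shared-file-size = 8G
```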

Full resolution here

If you want to learn about the strange bumps on the red graph, you'll need to read the technical part below; for now let's focus on the difference between yellow (no comment archive) and blue (archive in RocksDB). The Y axis is megabytes of SHM used, the X axis is the block number. As you can see, all versions start the same until HF19, when comment archiving starts. The allocation in the NONE version also drops despite there being no archive, because that is also the moment when 0.5GB of comment_cashout_ex_index is wiped (one of the optimizations included in the HF24 version). After that the ROCKSDB version fluctuates and permanently grows only with new accounts, while the other versions also grow constantly with new comments.

Execution times are pretty similar for all versions, with MEMORY being the fastest:
- 16 hours 41 minutes 50 seconds (comment archive = MEMORY)

Full resolution here

The anomalies from the previous graph in red are now more prominent, especially the one at the 65.58M block mark. The Y axis is a rolling average (over 200k blocks) of processing time in microseconds per block (that's why the red peak is barely above 2000us, while in reality it is a single block that took 4 minutes to process - one of the reasons why MEMORY did not become an officially available option). The X axis is the block number. Comparison between the version without the optimization (yellow) and the official one (blue) shows that sometimes blue is actually faster. The details are in the technical section below, but I can already give away that the number of new comments is the main factor - the more new comments, the slower the new version is compared to the version without the optimization.

That's all I have for node operators. The rest of the article has some technical info on the implementation(s), contains detailed comparisons between versions and some story about what I did trying to make MEMORY work.


How does official version work?

I'm not going to dive into too much detail on the ROCKSDB version internals, that is, the RocksDB part. Mariusz did a great job of carving out preexisting code of the account_history_rocksdb plugin and reusing a lot of it in his implementation, so when I was reviewing it, I mostly verified that it is indeed the same code that has worked fine for ages. The general structure, the data flow and how it cooperates with the rest of the hived code is the important new part.

The main interface is comments_handler. It has three subgroups of methods. The main one is get_comment(). It handles access to a comment in a uniform way, no matter whether it was found in comment_index in memory (SHM) or in the archive. In order to do so it needs a unified way to represent the returned comment. After all, when it is already in memory, we want to access it via a pointer to the existing object. When a comment is pulled from the archive, the archive needs to construct a temporary representation of the comment from database data, and someone needs to govern its lifetime. That's why get_comment() returns comment - a special wrapper that is either just a regular pointer or also a smart pointer holding the temporary comment representation. The comment instance can be used the same way a direct comment_object would be.
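The idea of such a wrapper can be sketched in a few lines. This is a heavily simplified illustration, not the actual hived code - the type name `comment_object` stands in for the real consensus object and only carries a dummy field:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the real consensus object.
struct comment_object { std::string permlink; };

// Minimal sketch of the `comment` wrapper idea: it either observes an
// object living in SHM, or owns a temporary object reconstructed from
// the archive - callers use both the same way.
class comment
{
public:
  // comment found in comment_index (SHM) - just observe it
  explicit comment( const comment_object* in_shm ) : ptr( in_shm ) {}

  // comment rebuilt from the archive - take ownership of the temporary
  explicit comment( std::unique_ptr< comment_object > from_archive )
    : owned( std::move( from_archive ) ), ptr( owned.get() ) {}

  const comment_object& operator*() const { return *ptr; }
  const comment_object* operator->() const { return ptr; }
  explicit operator bool() const { return ptr != nullptr; }

private:
  std::unique_ptr< comment_object > owned; // empty for SHM comments
  const comment_object* ptr = nullptr;
};
```

Either way the caller just dereferences the wrapper; the temporary representation is destroyed together with it.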

The second group of methods is on_cashout() and on_irreversible_block(). These are callbacks that react to events important for comment handling. The first one is the last moment when all the comment related data is available. After the event the comment_cashout_object part is released and no longer available, so if a particular implementation of the comment archive needs something out of it, it needs to store that somewhere aside.

Important note: the event happens during normal block processing, which means the cashout is not yet irreversible. That means the same comment can be "cashed out" multiple times, as the previous cashout event is undone by a fork switch. Moreover, it is entirely possible for a comment to not experience cashout after a fork switch, because if certain conditions are met, the destination fork can contain a transaction that deletes the comment before it is cashed out again. See the unit test that covers such cases.

The on_irreversible_block() event indicates that a certain block became irreversible. All comments that were cashed out in that block (or earlier) can now be safely moved to the archive.

Again: when the archive moves selected comments to the archive, that is, removes them from comment_index, it needs to take into account that such an operation is normally subject to undo mechanics. That means the moved comments can be reintroduced to SHM during a fork switch. Such a situation is also covered by the above mentioned test.

The third group of methods is open(), close(), wipe() and all methods of the base class external_storage_snapshot. These are called on global events, mostly during application startup and shutdown, exactly during the main database calls of the same names. The names are self explanatory.

The class that constitutes the official (ROCKSDB) implementation of the comment archive is rocksdb_comment_archive. It delegates all database related work to provider - an instance of rocksdb_comment_storage_provider (this is where the RocksDB structure for holding comments is defined; most of the database related work is implemented in its base class, which it shares with account_history_rocksdb) - and to snapshot - an instance of rocksdb_snapshot, for when it needs to handle snapshot related events.

When a comment is requested with get_comment(), it is first looked up in comment_index (SHM). Most of the time it will be found there, returned via a regular pointer, and that's it. Only when it is not found does the access to RocksDB happen. provider tries to pull the comment data out of the external database and, if it is found, the temporary comment is created and returned.

As I was writing this, I noticed we are not using the pool allocator for temporary comments. We should, although it won't make a difference for replay, as the sum of all access times to comments that exist in the archive amounts to just over 4 minutes. The result of a fix should show in detailed measurements and specific stress tests though, so that is something to be corrected.

A comment can be in one of three phases:
- a fresh comment, not yet cashed out
- after cashout, but the block when it happened is not yet irreversible
- cashout of the comment happened in an irreversible block

The on_cashout() method is called right after the comment was cashed out. The ROCKSDB version of the comment archive stores all necessary data in a newly constructed volatile_comment_object (a similar concept to the volatile_operation_object known from account_history_rocksdb). The object holds all the data on the comment that is going to be written to the database later, as well as block_number, to know when it is safe to actually store it in the database. The existence of that object indicates that the related comment is ready for archiving. Once on_irreversible_block() is called, all comments that have their volatile counterparts marked with a block_number at or below the new LIB are saved to RocksDB by provider and then removed from SHM (both the comment_object and its volatile representation). It is not done with every on_irreversible_block() call though. Only when there are enough comments ready to be archived does the process start. That grouping is why the migration process is surprisingly fast when calculated as a "per comment" rate.

It is a "kind of bug", because the code looks at the whole number of volatile comments, not at the number of comments that are at or below LIB, so in some cases (e.g. OBI turned off) it would execute migration more frequently than intended. However, calculating the latter would take more time than it would save.
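The batching logic described above can be sketched like this. Everything here is a simplification under stated assumptions - the type and method names mimic the ones from the article, but the threshold value, the plain std::vector storage and the "archive" being a vector of ids are all illustrative, not the real hived implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the volatile representation: just enough data to know
// which comment it refers to and when its cashout happened.
struct volatile_comment_object
{
  uint32_t block_number; // block containing the cashout
  uint64_t comment_id;
};

class archive_sketch
{
public:
  // called right after a comment is cashed out
  void on_cashout( uint32_t block_number, uint64_t comment_id )
  {
    volatile_comments.push_back( { block_number, comment_id } );
  }

  // called when `lib` became irreversible; returns how many comments
  // were migrated to the (mock) archive in this call
  size_t on_irreversible_block( uint32_t lib )
  {
    // Like the "kind of bug" mentioned in the article, the trigger
    // checks the total count of volatile comments, not only those
    // at or below LIB - counting the latter would cost more than it saves.
    if( volatile_comments.size() < BATCH_THRESHOLD )
      return 0;
    size_t migrated = 0;
    auto it = volatile_comments.begin();
    while( it != volatile_comments.end() )
    {
      if( it->block_number <= lib )
      {
        archived.push_back( it->comment_id ); // stands in for the RocksDB write
        it = volatile_comments.erase( it );   // drop the volatile counterpart
        ++migrated;
      }
      else
        ++it;
    }
    return migrated;
  }

  std::vector< uint64_t > archived; // mock archive content

private:
  static constexpr size_t BATCH_THRESHOLD = 3; // tiny value, demo only
  std::vector< volatile_comment_object > volatile_comments;
};
```

Note how a comment cashed out above the new LIB survives the batch and waits for a later call; that is what keeps the migration safe with respect to fork switches.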

The implementation of the last method group is just redirections to common functions of provider and/or snapshot - open the database (optionally creating it when not present), flush and close it, remove all storage files, and finally load or save a snapshot to/from the selected location. Most of it is completely independent of the content of the RocksDB database, and the parts that are comment specific (definitions of columns) are handled inside virtual methods.

There is one slight drawback of reusing code from account_history_rocksdb (or maybe not, as it is still just one place to fix). Account History has an outdated concept of LIB. LIB used to be part of consensus (calculated from blocks that act as confirmations of previous blocks), but that changed with the introduction of OBI. Now each node can have its own LIB value reflecting its own state, based on the presence or absence of OBI confirmation transactions. Consensus only defines the smallest value of the last irreversible block number. There used to be a problem with LIB for AH. Since the plugin has an option to exclude reversible data from its API responses, it needs to know which block is LIB. But during replay the value of LIB as indicated by state was smaller than the actual LIB. After all, the whole replay happens exclusively on irreversible blocks. Now all replayed blocks are immediately irreversible (same for synced blocks at or below the last checkpoint), but when that was not yet the case, AH worked around the problem by introducing a "reindex point" that was something like "the LIB that results from replay". I'm only mentioning it here because if you were to read the code of rocksdb_comment_storage_provider, you would inevitably run into that confusing thing.


Statistics

I've said that access to the archive is relatively rare, and that assumption is what enabled the whole optimization. How rare is it exactly?

Full resolution here

In the above chart the scale on the Y axis is logarithmic. The blue graph shows the most common access type - from comment_index. It is an order of magnitude more frequent than the next one - access to a comment that does not exist (yellow line). Access to a nonexistent comment happens pretty much exclusively when a new comment is being processed. That in turn is another order of magnitude more frequent than access to an existing but archived comment, although there are some moments when such accesses become much more frequent. Overall, throughout replay up to block 93M, fresh comments were accessed almost 1.2 billion times, comments were not found 126 million times, and archived comments were accessed just 15 million times.

Full resolution here

Here we have average access times in nanoseconds. For access to comment_index the total average is just 1512ns; access to the archive is ten times that - 15156ns. The slowest is the check that a comment does not exist, which costs 38106ns on average, but can cross 70000ns. The chart reveals one interesting thing. RocksDB does "compacting" once in a while, optimizing its indexes, so the access time does not just grow with the growth of the database - at uneven intervals (probably because the database does not grow evenly either) the access time drops back to shorter levels. There are also two more measurements, but adding them to the chart just made it less readable:
- processing cashout events (creation of the volatile comment representation) takes 5348ns on average
- migration of comments from SHM to the archive costs 2794ns per comment

The chart shows a potential minor replay optimization - before archiving starts at HF19, we know that a comment not found in SHM won't be found in the (still empty) archive either, so we could take a shortcut.


How does it compare to version with no optimization?

Since the mechanism has an interface, it is possible to implement it differently. One such implementation is placeholder_comment_archive, used when comment-archive = NONE. That implementation is really simple - it does nothing in most of its methods. In particular, cashed out comments just remain in SHM, even after the cashout became irreversible. Effectively it turns off the comment archive optimization. But having such an implementation allowed measuring the "old way".
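A no-op implementation behind an abstract interface is a classic null-object pattern. A minimal sketch of the idea, with a heavily simplified hypothetical interface (the real comments_handler signatures in hived differ):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified archive interface in the spirit of the
// abstraction layer described above; not the real hived signatures.
struct comment_archive_interface
{
  virtual ~comment_archive_interface() {}
  virtual void on_cashout( uint64_t comment_id ) = 0;
  virtual void on_irreversible_block( uint32_t block_num ) = 0;
  // in the NONE variant nothing is ever moved out of SHM
  virtual bool is_archived( uint64_t comment_id ) const = 0;
};

// The comment-archive = NONE variant: every callback is a no-op,
// so cashed out comments simply stay in comment_index (SHM) forever.
struct placeholder_comment_archive : comment_archive_interface
{
  void on_cashout( uint64_t ) override {}
  void on_irreversible_block( uint32_t ) override {}
  bool is_archived( uint64_t ) const override { return false; }
};
```

The rest of the code calls the same interface regardless of the variant, which is exactly what makes the "old way" measurable without a separate build.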

Full resolution here

#hive-139531 #hive #optimization #technical #documentation #comment-archive