Technical Report: December 5 Inscriptions Launch on TON

Problem

On December 5, after the launch of the TON-20 promo, the load on the blockchain increased rapidly, and the blockchain split into 11 shardchains.

The TON blockchain sets a maximum amount of time for creating each new block. Block creation is also subject to other limits, such as the maximum amount of gas and the maximum number of transactions per block. These limits are a necessary part of the design.

Under normal transaction volume, each block contains as many transactions as can be processed within these limits. When the load increases, the TON blockchain splits into shardchains, and each split must occur within a single block. At this point, not only the state but also the message queue is split, and the order of messages in the queue must be preserved.
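To make the queue split concrete, here is a minimal Python sketch. It is not the validator implementation. It assumes a simplified model in which each queued message has an integer destination address and the child shard is chosen by the address bit that follows the parent's prefix. The names `Message`, `split_queue`, and `ADDR_BITS` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

ADDR_BITS = 8  # hypothetical address width, chosen small for readability


@dataclass(frozen=True)
class Message:
    seqno: int  # position in the parent queue (defines the ordering)
    dest: int   # destination address as an integer


def split_queue(queue: List[Message], prefix_len: int) -> Tuple[List[Message], List[Message]]:
    """Split a parent shard's queue into two child queues.

    The child is selected by the address bit right after the parent's
    prefix of length `prefix_len`. A single in-order pass keeps the
    relative order of messages inside each child queue.
    """
    bit_pos = ADDR_BITS - prefix_len - 1  # index of the deciding bit
    left: List[Message] = []
    right: List[Message] = []
    for msg in queue:  # every queued message is visited exactly once
        if (msg.dest >> bit_pos) & 1:
            right.append(msg)
        else:
            left.append(msg)
    return left, right


parent = [
    Message(0, 0b10000000),
    Message(1, 0b00000001),
    Message(2, 0b11000000),
    Message(3, 0b01000000),
]
left, right = split_queue(parent, prefix_len=0)
```

In this model the work is proportional to the queue length, which is one way to see why a very long queue is hard to split inside a single block's time budget.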

Around 18:00 UTC on December 5, two shardchains attempted to split under load, but the validators of these shardchains were not performant enough to traverse the message queue within the allotted time. This resulted in endless split attempts, while the message queue kept growing as new messages arrived.
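The feedback loop described above can be illustrated with a toy model in Python. It assumes, as a simplification, that a split succeeds only if the validator can walk the whole queue within one block, and that nothing drains from the queue while split attempts fail. The function name and all numbers are hypothetical and are not measurements from the incident.

```python
from typing import Tuple


def attempt_splits(queue_len: int, scan_capacity: int,
                   arrivals_per_block: int, max_blocks: int) -> Tuple[bool, int, int]:
    """Return (split_succeeded, blocks_elapsed, final_queue_len).

    scan_capacity is how many queued messages a validator can walk
    through within the time limit of one block (an assumed parameter).
    """
    for block in range(max_blocks):
        if queue_len <= scan_capacity:
            return True, block, queue_len  # the queue fits: the split completes
        queue_len += arrivals_per_block    # the attempt fails; new messages still arrive
    return False, max_blocks, queue_len


# A queue already above capacity never recovers in this model.
stuck = attempt_splits(queue_len=1000, scan_capacity=800,
                       arrivals_per_block=50, max_blocks=5)
# A queue below capacity splits on the first attempt.
fine = attempt_splits(queue_len=500, scan_capacity=800,
                      arrivals_per_block=50, max_blocks=5)
```

Once the queue exceeds what can be scanned in one block, each failed attempt only makes the next one harder, which matches the behavior reported here.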

Shardchain validators in TON rotate periodically, but the load and the message queue had grown so much that even validators on performant hardware could not stay within the limit. Because of the resulting large queue, a similar problem later occurred when some shardchains tried to merge, and deleting already processed messages from the queue became slow.

Impact

From the evening of December 5 through December 8, there were periods when the transaction processing speed in some shardchains dropped to 1 TPS and periods when some shardchains did not make blocks at all.

Solution

First, the team took action to stabilize the situation and reduce the number of new messages. We asked the TON-20 project to suspend its promotional campaign and asked services to suspend message sending where possible.

While there were still alternative ways to send new messages to the blockchain, this had some effect.

The TON Core development team then released a preliminary update with changes to the limits and some optimizations.

The queue growth slowed down significantly.

Identified Issues

1. Validator hardware issues

The long-term solution is to improve the penalty system for poorly performing validators, motivating all validators to run on performant hardware. This is planned, but it will take time. During the incident, we simply asked in the public channel that validators with fewer than 15 CPUs exit, and about 98 of 342 validators voluntarily left validation.

Although this situation should not occur in the first place, if it does, we need a mechanism that stops the queue from growing and keeps the situation from getting worse. We have added such a mechanism in the validator code update.
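The report does not describe how this mechanism works internally. As a generic illustration only, one common way to keep a queue from growing without bound is admission control: new external messages are refused while the backlog is above a threshold. The Python sketch below shows that idea. The class `BoundedInbox` and the `soft_limit` parameter are hypothetical and are not taken from the validator code.

```python
from collections import deque
from typing import Any, Deque, List


class BoundedInbox:
    """Toy queue that rejects new external messages once a limit is reached."""

    def __init__(self, soft_limit: int) -> None:
        self.soft_limit = soft_limit          # assumed threshold, not a TON parameter
        self.queue: Deque[Any] = deque()

    def offer_external(self, msg: Any) -> bool:
        """Accept a new external message only while below the soft limit."""
        if len(self.queue) >= self.soft_limit:
            return False                      # the sender has to retry later
        self.queue.append(msg)
        return True

    def process(self, n: int) -> List[Any]:
        """Process up to n messages in FIFO order and return them."""
        done = []
        for _ in range(min(n, len(self.queue))):
            done.append(self.queue.popleft())
        return done


inbox = BoundedInbox(soft_limit=2)
accepted = [inbox.offer_external(m) for m in ("a", "b", "c")]
processed = inbox.process(1)
accepted_after = inbox.offer_external("c")
```

The trade-off in such a scheme is that some users see rejected messages during overload, in exchange for a backlog that stays small enough to process.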

Recovery time depended on the size of the message queue. After several updates to the validator code, we made block production independent of the queue size.

Having promptly prepared the validator code update, we faced another problem: validator operators were not ready to update their nodes quickly. Most validators are anonymous, so we first needed to establish a communication channel with most of them. We made a public request for validators to get in contact.

After most validators had updated, the message queue was processed in full. Queue processing could be monitored in real-time on a public dashboard.

After that, the TON-20 project completed its mint, and new TON-20 projects began to launch. From December 9 to the time of writing, December 21, the network has processed over 75 million transactions.

2. APIs, infrastructure, and third-party services

Although many third-party API services experienced problems during the load, their work was restored within a few hours, indicating the absence of conceptual issues.

Some services and exchanges were not designed to work when the blockchain contained multiple shardchains. We provided the necessary consulting assistance to modernize their systems.
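As a rough illustration of what being designed for multiple shardchains means for a service, the Python sketch below scans every shard in a snapshot rather than assuming a single one. The data layout (`ShardSnapshot`) and the function `find_deposits` are hypothetical and do not correspond to any particular TON API.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical snapshot: shard identifier -> list of (account, tx_hash) pairs
ShardSnapshot = Dict[str, List[Tuple[str, str]]]


def find_deposits(snapshot: ShardSnapshot, watched_accounts: Iterable[str]) -> List[str]:
    """Return transaction hashes that touch watched accounts, across all shards."""
    watched = set(watched_accounts)
    found: List[str] = []
    for shard_id in sorted(snapshot):  # visit every shard, however many exist
        for account, tx_hash in snapshot[shard_id]:
            if account in watched:
                found.append(tx_hash)
    return found


snapshot = {
    "shard-0": [("alice", "tx1"), ("bob", "tx2")],
    "shard-1": [("carol", "tx3"), ("alice", "tx4")],
}
deposits = find_deposits(snapshot, ["alice"])
```

A service that only looked at `shard-0` in this example would miss `tx4`, which is the kind of gap that appears when the number of shardchains changes under load.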

It also turned out that many popular products use public liteservers to access the blockchain, although public liteservers are intended for a quick start and cannot guarantee high uptime. We have published guidelines along with new hardware recommendations. We believe that heavy activity on the TON blockchain will continue and increase, so now is the time for the developer community to adapt their products to heavy load.
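One way a product can reduce its dependence on a single public endpoint is to try several endpoints in order and fall back when one fails. The Python sketch below shows that pattern in a transport-agnostic form. It is an assumption about a reasonable client design, not a summary of the published guidelines, and `call_with_failover`, the endpoint names, and the `request` callable are all hypothetical.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[str],
                       request: Callable[[str], T],
                       retries_per_endpoint: int = 2,
                       backoff_s: float = 0.0) -> T:
    """Try each endpoint in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        for attempt in range(retries_per_endpoint):
            try:
                return request(endpoint)
            except Exception as exc:  # any failure moves on to the next attempt
                last_error = exc
                time.sleep(backoff_s * (attempt + 1))  # linear backoff between tries
    raise RuntimeError("all endpoints failed") from last_error


calls = []


def fake_request(endpoint: str) -> str:
    """Stand-in for a real network call; the first endpoint is always down."""
    calls.append(endpoint)
    if endpoint == "public-1":
        raise ConnectionError("endpoint unavailable")
    return "ok from " + endpoint


result = call_with_failover(["public-1", "own-node"], fake_request)
```

Listing a self-hosted node alongside public endpoints, as in the usage above, keeps the product working when a public liteserver is overloaded.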

Summary

This rapid and sudden increase in activity has uncovered areas that need improvement in the blockchain, infrastructure, and services. The changes described above have already been made, allowing all transactions sent by users to be processed; there have been 75M such transactions in the last 11 days.
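For scale, the average throughput implied by these figures can be worked out directly. The short Python calculation below uses only the two numbers stated in this report (75 million transactions, 11 days); the helper name `average_tps` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def average_tps(transactions: int, days: int) -> float:
    """Average transactions per second over a period of whole days."""
    return transactions / (days * SECONDS_PER_DAY)


tps = average_tps(75_000_000, 11)
print(round(tps, 1))  # about 78.9
```

This is a network-wide average over the whole period, so it is not directly comparable to the per-shardchain low of 1 TPS seen during the incident, and peak rates are not captured by it.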

However, improvements will continue: a better penalty system for poorly performing validators, validator hardware upgrades, improved external message delivery, and more. We found no flaws in the blockchain architecture. The TON architecture can process millions of transactions per second.

We thank the community for its patience and understanding. Despite the obstacles, we are moving towards mass adoption of decentralized technologies and cryptocurrencies.