This is primarily the work of @Kubuxu and @zenground0; I just happened to be the one to discover that the built-in market actor is at fault.
Background
Filecoin tipset execution includes a cron-like facility for scheduled execution of actor code at the end of every epoch. This cron activity is real work, which we can account for in gas units, but is not paid for by any external party (no tokens are burnt as a gas fee). Cron is very convenient for some maintenance operations, but is essentially a subsidy from the network to whichever actors get the free execution (which is only ever built-in actors).
As network activity has grown, the amount of work done in cron has increased. Recent analysis shows that cron execution is frequently consuming 80 billion gas units each epoch. For context, the block gas target is 5 billion, so a tipset with the expected five blocks targets about 25 billion. Cron is consuming a multiple of the target computational demands of block validation.
A fast and predictable block validation time is important to the ability of validating nodes to sync the chain quickly (especially when catching up), and critical for block producers to be able to produce new blocks for timely inclusion. Although there is significant buffer to account for network delays and the variability of expected consensus, cron execution is beginning to threaten chain quality and minimum hardware specs for validators.
The built-in market actor is consuming 85% of cron execution (73B gas / epoch). It performs deal maintenance (mainly transferring incremental payments) on a regular interval of 1 day for each deal. This is far from the kind of critical network service for which cron was intended, and a complete waste for the 98% of deals that have zero fee.
The number of deals brokered by this built-in market has increased greatly over the past year. This is a 🍾 problem, but we must address this risk to chain quality on the expectation that the deal count will grow a lot more.
Proposal
The proposed resolution to this problem is in two steps: a quick short-term mitigation to buy time, then a permanent reworking of the built-in market actor.
Short-term mitigation
Increase the interval at which the built-in market performs deal maintenance from 1 day to 30 days (or perhaps longer). We expect this to reduce the per-epoch gas consumption by a factor of ~30 (73B → 2.4B).
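As a rough check of that estimate, the sketch below assumes the market's cron gas scales in proportion to deal count and inversely with the maintenance interval (each deal is visited once per interval). The function name and the `deal_growth` parameter are illustrative assumptions, not part of any actor API.

```python
MARKET_CRON_GAS_PER_EPOCH = 73e9  # figure quoted above, at a 1-day interval


def projected_gas_per_epoch(current_gas, old_interval_days, new_interval_days,
                            deal_growth=1.0):
    """Estimate per-epoch market cron gas after an interval change.

    Assumes gas is proportional to (number of deals / interval), which is a
    simplification: it ignores fixed per-epoch overhead in cron.
    """
    return current_gas * deal_growth * old_interval_days / new_interval_days


# Moving from 1 day to 30 days with the deal count unchanged.
after = projected_gas_per_epoch(MARKET_CRON_GAS_PER_EPOCH, 1, 30)
print(f"{after / 1e9:.1f}B gas per epoch")

# A 30x growth in deal count would bring the cost back to today's level.
regrown = projected_gas_per_epoch(MARKET_CRON_GAS_PER_EPOCH, 1, 30, deal_growth=30)
print(f"{regrown / 1e9:.1f}B gas per epoch")
```

Under this model the mitigation buys time in proportion to how long the deal count takes to grow by the same factor as the interval.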
We think we can do this without a migration of the market actor’s state by performing the rescheduling in actor execution during the first day after code changes are released in a network upgrade. The algorithm for doing this is TBC, but must maintain the property of uniformly distributing the work over the period, robust to client or provider attempts to manipulate the schedule.
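Since the algorithm is still TBC, the following is only one possible shape for it, not the chosen design. It assumes deal IDs are assigned sequentially by the market actor (so neither client nor provider can pick one freely) and uses the ID modulo the interval as a fixed slot. The names `next_update_epoch`, `EPOCHS_PER_DAY` and `NEW_INTERVAL` are hypothetical.

```python
from collections import Counter

EPOCHS_PER_DAY = 2880                # assumes 30-second epochs
NEW_INTERVAL = 30 * EPOCHS_PER_DAY   # proposed 30-day maintenance interval


def next_update_epoch(deal_id: int, current_epoch: int,
                      interval: int = NEW_INTERVAL) -> int:
    """Return the first epoch strictly after current_epoch in the deal's slot.

    The slot is deal_id % interval, so it depends only on the deal ID and
    not on when the deal was published or last processed.
    """
    offset = deal_id % interval
    wait = (offset - current_epoch) % interval
    if wait == 0:
        wait = interval  # already in the slot now; schedule the next cycle
    return current_epoch + wait


# With sequential IDs, deals spread evenly across the slots.
slots = Counter(next_update_epoch(d, 0, interval=10) % 10 for d in range(1000))
print(sorted(slots.values()))  # every slot holds the same number of deals
```

Because the slot is a pure function of the deal ID, rescheduling a deal the first time cron touches it after the upgrade needs no extra state. Whether this is sufficiently robust to manipulation depends on the sequential-ID assumption holding.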
Long-term fix
Add new deal settlement methods to the built-in market actor and remove automatic deal maintenance, and hence use of cron, entirely. This is a permanent fix to the market actor’s cron costs. Automatic deal payment processing is not something that Filecoin can support at greater scale.
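To illustrate what explicit settlement could look like, here is a minimal sketch in which payment is computed lazily when a party calls a settlement method, rather than by cron. The `Deal` fields and `settle_deal_payment` are hypothetical and do not reflect the market actor's real state schema or method names; slashing, early termination and escrow balances are ignored.

```python
from dataclasses import dataclass


@dataclass
class Deal:
    start_epoch: int
    end_epoch: int
    price_per_epoch: int      # zero for the large majority of deals today
    last_settled_epoch: int   # payments have been transferred up to here


def settle_deal_payment(deal: Deal, current_epoch: int) -> int:
    """Return the payment owed since the last settlement and record it.

    The cost of this work is paid by whoever sends the message, not by cron.
    """
    settle_to = min(current_epoch, deal.end_epoch)
    settle_from = max(deal.last_settled_epoch, deal.start_epoch)
    elapsed = max(0, settle_to - settle_from)
    if elapsed > 0:
        deal.last_settled_epoch = settle_to
    return elapsed * deal.price_per_epoch


deal = Deal(start_epoch=100, end_epoch=200, price_per_epoch=3,
            last_settled_epoch=100)
first = settle_deal_payment(deal, 150)    # covers epochs 100 to 150
second = settle_deal_payment(deal, 150)   # nothing new to settle
final = settle_deal_payment(deal, 500)    # capped at the deal's end epoch
print(first, second, final)
```

A zero-fee deal never needs this call at all, which is where most of the savings over periodic cron processing would come from.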
Note that other uses of cron (e.g. miner deadline maintenance) are also growing with time, and we expect to address them too in the future.
Discussion
Urgency for short-term mitigation
As the cron workload is expected to increase as more deals come on board, a short-term mitigation is necessary. Dividing the problem by 30 will give us some months to develop good APIs and code for the longer-term solution, but attempting to implement that long-term solution now would take longer than we are comfortable with, given the current growth rates.
As a short-term fix, we’re aiming for maximum simplicity. Leaving the state schema and all essential operations intact makes this a very tightly scoped change that we can deliver with minimum risk.
We strongly recommend that the short-term mitigation be scheduled for network version 19, the upgrade immediately following the introduction of FEVM.
Choice of update period
One idea might be to make the deal processing period much longer, in the hope of avoiding the need to implement a more permanent fix. We’ve declined this option because:
- any length of period can be overcome by sufficient growth in deal count – it’s not permanent, just less temporary;
- if the period is too long, manual settlement methods will be needed by clients/providers in any case;
- automatic processing represents a subsidy from the whole network to the built-in market actor, which is counter to our goals of promoting user-programmed actors as markets.
Simplification
In addition to resolving a growing risk and removing a privilege enjoyed by the built-in market actor, removing cron processing from the built-in market will simplify that actor’s code, reducing maintenance burdens and the possibility of error.
Alternatives
We also considered an alternative of splitting the market’s deals into two groups: those that have non-zero payments to process (about 2% of deals today), and those that don’t.
We rejected this option because, like the proposed short-term fix, it’s not a permanent solution to the problem. We might expect the fraction of paid deals to increase over time. It’s also more complex than either the short-term fix or the proposed permanent resolution. This alternative would have the advantage of maintaining the built-in market’s current service of automatic payment processing, but we don’t believe that service is sustainable or appropriate for the future anyway.