🦺

Cron safety - survey of the problem and solutions

Authors

Alex North

Creator

Alex North

Created

Feb 6, 2023 8:42 PM

This note is a survey of the problems and solutions associated with Filecoin network cron, by @ and @Alex North.

Network carrying capacity

The Filecoin network has some maximum capacity of sectors and deals that it can stably support, without further technical development. This capacity is governed by the total computational work (represented by gas) to perform the required maintenance operations, including submission of Window PoSt, deal payments, handling faults, sector expiration etc.

Since sectors have a finite lifetime, there is also a necessary onboarding rate to maintain that capacity, to balance the expiration rate. These two branches of carrying capacity compete for the same scarce resource of gas. There is some complexity here: when maintenance costs get very high, there’s no room to onboard naturally, and some risk that powerful SPs can squeeze out the maintenance of less powerful SPs.

This carrying capacity is independent of the use of cron. If we moved deadline maintenance etc. out of cron, it would still need to be performed. However, moving it out of cron would improve safety. As long as some maintenance is performed in cron, block validation times could blow out with no negative feedback on the onboarding rate.

Gas used for operations other than storage maintenance and onboarding also reduces the carrying capacity, to the extent those operations are willing to pay a higher price. The idea of a separate gas lane for storage operations is a possible mitigation for this.

We don’t currently have a good estimate of the network’s carrying capacity. ACTION: develop a model for this, so we can understand what scale of network growth would lead us to run into capacity constraints. If max capacity turns out to be uncomfortably close, we might focus efforts on improving the capacity rather than merely moving things out of cron.
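As a starting point for such a model, here is a minimal steady-state sketch. It is an illustration only, not the model this note calls for: it assumes a fixed gas budget per epoch, a constant maintenance cost per sector per epoch, a constant onboarding cost per sector, and a uniform sector lifetime. All function names, parameter names and numbers are hypothetical.

```python
def carrying_capacity(gas_per_epoch, maint_gas_per_sector_epoch,
                      onboard_gas_per_sector, lifetime_epochs):
    """Largest steady-state sector count N satisfying
    N * maint + (N / lifetime) * onboard <= gas_per_epoch."""
    # Each sector costs its maintenance gas every epoch, plus its onboarding
    # gas amortised over its lifetime (it must be replaced when it expires).
    per_sector = maint_gas_per_sector_epoch + onboard_gas_per_sector / lifetime_epochs
    return int(gas_per_epoch // per_sector)


def required_onboarding_rate(sectors, lifetime_epochs):
    """Sectors per epoch that must be onboarded to offset expirations."""
    return sectors / lifetime_epochs


# Made-up numbers, chosen only so the arithmetic is easy to follow.
capacity = carrying_capacity(1_000_000, 1.0, 1000.0, 1000)
rate = required_onboarding_rate(capacity, 1000)
print(capacity, rate)
```

With these made-up inputs, half of the gas budget goes to maintenance and half to replacement onboarding. Under this toy model, raising the maintenance cost shrinks both the capacity and the room left for onboarding, which is the competition described above.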

Motivations to do something about cron

Cron represents a huge subsidy by the network to perform certain operations for free. The gas consumption of cron currently exceeds the total gas consumption of paid-for operations. The work demanded of cron can grow without bound, and will sooner or later run into the network’s carrying capacity. Because the cost is hidden, there’s significant risk of us approaching this limit without realising it.

Without any price signals, there is no incentive to reduce the use of cron. This situation is bound to end in tears.

A second motivation to do something about cron is to eventually make it available as a network service to user-programmed applications. But we need to bound the use of cron by built-in actors in order to safely make space for user-programmed ones.

Cron for miner deadline maintenance

A major use of cron is miner deadline maintenance: checking that all Window PoSts were submitted and performing fault accounting if not, including eventually terminating sectors.

This is high-risk code. Programmer error could result in the halting of one or more miners, threatening network security and progress. And as noted above, the work done here currently grows without bound as the network grows.

Here is a plan for making deadline maintenance safe from failure, and introducing a price signal to exert pressure against unbounded growth in work.

Think through and develop a plan, possibly code, to recover miner actors in case a programmer error in cron does cause them to halt. The assumption of programmer error means this recovery would probably involve a migration, since we also need to fix the error that caused a problem. NEEDS DESIGN.
Develop a miner-initiated deadline recovery method, to be invoked by a miner in case their deadline cron does not complete, and without a network migration. If/when a miner’s deadline cron does fail, we immediately remove power for the deadline (or miner?). Together, these make deadline cron “safe to fail”. The recovery method probably amounts to running the incomplete cron epochs, but with the miner paying the gas. NEEDS DESIGN. This “safe cron” means we could then introduce a gas limit on the call to each miner’s deadline cron. The gas limit might be based on the power maintained by the deadline. If the gas limit is exceeded, cron fails and the miner must manually recover it. This would be effective if state inefficiency (e.g. many near-empty partitions) is a significant contributor to the cost of deadline cron. The miner would be incentivised to compact their state in order to enjoy free maintenance from cron execution. ACTION: test this hypothesis about the impact of state efficiency.
Taking this further, we could remove deadline maintenance from cron almost entirely, instead providing a method for miners to call explicitly that performs the same function. The cron call would only check whether this maintenance has been performed, and simply remove power if it has not been (there is some design space to explore here). The miner-initiated deadline recovery mechanism could then recover the deadline’s power. NEEDS DESIGN. This would move the cost of deadline maintenance entirely onto the miner operator, and thus provide price signals. Gas price dynamics would also motivate the dispersion of active deadlines throughout each proving period, while we could increase operators’ control over what time of day the maintenance is scheduled. Deadline cron work would then be constant time per miner actor, and no longer a disaster waiting to happen.
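The "safe to fail" flow in the plan above can be sketched as a small simulation. This is a hedged illustration, not actor code: the `Deadline` type, the cost model (gas proportional to partition count) and both constants are assumptions made up for the example, and the real design is still marked NEEDS DESIGN.

```python
from dataclasses import dataclass

# Hypothetical constants, chosen only to make the example concrete.
GAS_PER_PARTITION = 10
GAS_LIMIT_PER_POWER_UNIT = 1


@dataclass
class Deadline:
    power: int                # power currently credited to this deadline
    partitions: int           # partition count, including near-empty ones
    cron_failed: bool = False
    suspended_power: int = 0  # power removed after a failed cron run


def maintenance_gas(dl: Deadline) -> int:
    """Assumed cost model: work scales with partition count, not power."""
    return dl.partitions * GAS_PER_PARTITION


def cron_process_deadline(dl: Deadline) -> bool:
    """Free cron call with a gas limit based on the deadline's power."""
    if dl.cron_failed:
        return False  # already awaiting miner-initiated recovery
    limit = dl.power * GAS_LIMIT_PER_POWER_UNIT
    if maintenance_gas(dl) > limit:
        # Safe failure: record it and remove the deadline's power.
        dl.cron_failed = True
        dl.suspended_power, dl.power = dl.power, 0
        return False
    return True


def recover_deadline(dl: Deadline, gas_paid: int) -> None:
    """Miner-initiated recovery: the same work, but the miner pays the gas."""
    if not dl.cron_failed:
        raise ValueError("deadline does not need recovery")
    if gas_paid < maintenance_gas(dl):
        raise ValueError("insufficient gas for recovery")
    dl.power, dl.suspended_power = dl.suspended_power, 0
    dl.cron_failed = False
```

Under these assumptions, a deadline with the same power but many near-empty partitions exceeds its limit and must pay for recovery, while a compacted one keeps its free maintenance. That is the incentive the hypothesis above relies on.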

Cron for built-in market deal maintenance

A second major use of cron is built-in market deal maintenance. Every deal is processed every 100 epochs to make incremental payments. If this turns out to be a major part of cron costs or carrying capacity, there are probably some easy short-term mitigations. E.g. we could significantly increase the update interval. With more design and implementation work, we could probably remove the automatic payments entirely. These changes would lightly impair the built-in market’s functionality, but that functionality is only available today due to the network’s subsidised cron. We are aiming to support user-programmed markets, and it’s appropriate to reduce the built-in market’s privilege here.
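The effect of the update interval on cron load is simple arithmetic, sketched below. The 100-epoch interval comes from the text above; the deal count and the longer interval are made-up inputs, and the function name is hypothetical.

```python
def deal_updates_per_epoch(active_deals: int, update_interval: int) -> float:
    """Average number of deals cron processes per epoch when every deal
    is visited once per update_interval epochs."""
    return active_deals / update_interval


# Hypothetical deal count, for illustration only.
deals = 1_000_000
print(deal_updates_per_epoch(deals, 100))    # current interval
print(deal_updates_per_epoch(deals, 1000))   # a ten-times longer interval
```

Average cron work for deal payments falls in inverse proportion to the interval, so a ten-times longer interval means roughly a tenth of the per-epoch deal processing. The cost is that incremental payments become coarser.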

ACTION: work out the carrying cost of deals.

‼️ We don’t have a plan to get built-in deal maintenance out of cron entirely yet

Update: See

Cron for deferred proof validation

Single PoRep proofs are validated asynchronously via a cron call to the power actor. The asynchrony is to enable the use of multiple CPU cores by verifying all proofs submitted that epoch in parallel. Blockchain execution is usually single-threaded, so multiple processor cores are seen as an otherwise wasted resource.

We charge gas for the proof validation when it is submitted, so it’s both bounded and paid for. However, the state updates after validation are not metered. These state updates include deal activation. This is only ok because the built-in market is the only market right now. In order to support user-programmed markets, deal activation must be able to call into untrusted market code, which we cannot do from cron.

We can solve this with a slight change to the proof validation flow. The cron call can validate the proof, but instead of then activating the sector and deals, it can just write a bit into state indicating whether the proof was valid. Then the activation can be deferred to an explicit miner-initiated method, which can activate deals and generally notify whatever on-chain contracts the miner wishes, since they are paying gas. NEEDS DESIGN.
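The proposed split between validation in cron and activation by the miner can be sketched as follows. This is a minimal model under stated assumptions, not the actual actor interface: `MinerState`, `submit_proof`, `cron_validate` and `activate_sector` are hypothetical names, and `verify_proof` is a stand-in rule rather than real PoRep verification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class MinerState:
    pending: List[Tuple[int, bytes]] = field(default_factory=list)  # (sector, proof)
    proof_valid: Dict[int, bool] = field(default_factory=dict)      # bit written by cron
    active: Set[int] = field(default_factory=set)


def verify_proof(proof: bytes) -> bool:
    """Stand-in verifier (assumption): real verification is far more expensive."""
    return proof.startswith(b"valid")


def submit_proof(state: MinerState, sector: int, proof: bytes) -> None:
    """Miner message: verification gas would be charged here (not modelled)."""
    state.pending.append((sector, proof))


def cron_validate(state: MinerState) -> None:
    """Cron: verify queued proofs and record one bit per sector.
    It performs no activation and calls no other actor."""
    for sector, proof in state.pending:
        state.proof_valid[sector] = verify_proof(proof)
    state.pending.clear()


def activate_sector(state: MinerState, sector: int,
                    notify: List[Callable[[int], None]]) -> None:
    """Miner-initiated and miner-paid: activate, then notify chosen contracts."""
    if not state.proof_valid.get(sector, False):
        raise ValueError("no valid proof recorded for sector")
    state.active.add(sector)
    for callback in notify:
        callback(sector)  # may be untrusted code; the miner is paying gas
```

The point of the shape is that the unmetered cron step touches only the miner's own state, while any call into market code happens inside a message whose gas the miner pays.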

This separation of proof validation from sector activation may also mesh nicely with sealing-as-a-service proposals. One challenge with such proposals is that the sealer can’t prove their sector and then take a long time to ship it to the host miner, because it almost immediately needs Window PoSt. See Decoupling ProveCommit and Deadline Assignment (#570). Maybe we want to introduce this separation for aggregated proofs too, but with an option to prove-and-activate for integrated operations. NEEDS DESIGN.

Interaction with programmable markets and notifications

The built-in market actor is notified whenever a sector with deals terminates, including if this happens in cron. User-programmed markets in general may need to rely on some notifications of changes in the state of sectors that are hosting deals: terminations, changes in data (re-snap), and maybe faults.

The programmable markets design includes notifications on sector activation. The challenge is getting notifications on termination, which the miner operator may not willingly provide. But we can’t use cron to call untrusted market actor code.

These notifications can be sent during the deadline maintenance method, even if it is called explicitly by the miner. However, we need a fallback in case a miner operator abandons their operation completely. We can achieve this with an additional miner method that a market can use to check if a sector is still maintained. This would need to be triggered by a market operator, or client, or some other external party who can observe the chain state to detect abandoned sectors, and is incentivised somehow to notify the market contract. Note that some markets could be designed entirely around such pull-based methods, with no need for push-based notifications of state change. But most would benefit from synchronous notification. NEEDS DESIGN.
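The pull-based fallback could look roughly like the sketch below. Everything here is hypothetical: the `Miner` and `Market` classes, the `is_sector_maintained` query and the `poke` entry point are illustrative names, and the incentive for the external caller is not modelled.

```python
from typing import Dict, Set


class Miner:
    """Toy miner exposing a read-only liveness query for sectors."""

    def __init__(self, live_sectors: Set[int]) -> None:
        self._live = set(live_sectors)

    def terminate(self, sector: int) -> None:
        self._live.discard(sector)

    def is_sector_maintained(self, sector: int) -> bool:
        return sector in self._live


class Market:
    """Toy user-programmed market that learns of terminations by polling."""

    def __init__(self, miner: Miner) -> None:
        self.miner = miner
        self.deals: Dict[str, int] = {}   # deal id -> hosting sector
        self.terminated: Set[str] = set()

    def poke(self, deal_id: str) -> bool:
        """Callable by any external party, who pays the gas.
        Returns True if the deal was newly found to be terminated."""
        sector = self.deals[deal_id]
        if deal_id in self.terminated or self.miner.is_sector_maintained(sector):
            return False
        self.terminated.add(deal_id)
        return True
```

In this shape no miner cooperation is needed after abandonment: the market only reads miner state, and the party that observes the abandoned sector triggers the update.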

In order to guarantee that a miner sends appropriate termination messages during deadline maintenance, the miner state would need to know which actors to send messages to. But we don’t want to store per-deal information in miner chain state. A possible class of solutions here is to store a commitment to the subscribing parties, and require the miner to provide the actual data of the subscriptions only when processing a termination - see Private Sector Info (#530). NEEDS DESIGN as part of #298. We need to think carefully about how to handle notification messages succeeding or failing.
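One simple instance of the commitment idea is a hash over the subscriber list, sketched below. This is an assumption for illustration, not the scheme proposed in #530: the choice of SHA-256, the encoding and the address strings are all made up, and handling of notification failure is left out.

```python
import hashlib
from typing import Callable, List


def commit(subscribers: List[str]) -> bytes:
    """Order-independent hash commitment to a set of subscriber addresses."""
    h = hashlib.sha256()
    for addr in sorted(subscribers):
        data = addr.encode()
        h.update(len(data).to_bytes(4, "big"))  # length prefix avoids ambiguity
        h.update(data)
    return h.digest()


def process_termination(stored_commitment: bytes, provided: List[str],
                        send: Callable[[str], None]) -> None:
    """The miner supplies the subscriber list; state holds only the commitment."""
    if commit(provided) != stored_commitment:
        raise ValueError("subscriber list does not match commitment")
    for addr in sorted(provided):
        send(addr)  # notification; success and failure handling not modelled


# Made-up addresses, for illustration only.
stored = commit(["f0200", "f0100"])
sent: List[str] = []
process_termination(stored, ["f0100", "f0200"], sent.append)
print(sent)
```

Chain state here holds one fixed-size digest regardless of how many parties subscribe, and the miner cannot omit a subscriber at termination without the check failing.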