Also, my flash device supports an in-device block-to-block page
copy command to speed up this recovery operation. Neither Linux MTD nor
YAFFS support this concept as far as I can tell.

The problem I've seen with the block-to-block operations is that you have no way to check the ECC of the data as you move it. Thus, you could read a page with a single-bit ECC error from the old block and write it out, error and all, to the new block. When you later read the data from the new block, you detect the error (if you're lucky) and declare the new block ready for retirement... If you're NOT lucky, the page sits unread with its single-bit error long enough that a second bit has flipped by the time you finally read it, giving you an uncorrectable memory error (UCME).
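To make the failure path concrete, here is a toy sketch in Python. It is not MTD or YAFFS code: the Page class, copy_back(), host_copy() and the single-bit-correcting ECC are all made up for illustration, and a page is reduced to a count of flipped bits.

```python
from dataclasses import dataclass

ECC_CORRECTABLE_BITS = 1  # assumption: ECC corrects one bit per page


class UncorrectableError(Exception):
    """Raised when a page holds more bit errors than the ECC can fix (UCME)."""


@dataclass
class Page:
    bit_errors: int = 0  # number of flipped bits currently stored in the page


def host_read(page: Page) -> int:
    """Read through the host ECC; return how many bits were corrected."""
    if page.bit_errors > ECC_CORRECTABLE_BITS:
        raise UncorrectableError("UCME")
    return page.bit_errors


def copy_back(src: Page) -> Page:
    """In-device copy: raw bits never pass the host ECC, so errors move too."""
    return Page(bit_errors=src.bit_errors)


def host_copy(src: Page) -> Page:
    """Read to the host, correct via ECC, then write clean data."""
    host_read(src)  # raises if the source is already uncorrectable
    return Page(bit_errors=0)
```

With these definitions, copy_back(Page(bit_errors=1)) yields a new page that still carries the error, so one more flipped bit makes host_read() raise, while host_copy() of the same source starts the new page clean.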

To keep latent single-bit errors from growing into UCMEs, you could have a process read each page of NAND at some very slow rate, looking for and correcting single-bit errors. The additional reads could, of course, increase the rate at which errors crop up, due to NAND read-disturb effects. It'd be pretty straightforward to create a Markov model of the system failures to determine the best rate at which to scrub for correctable memory errors (CMEs). Given the wide range of NAND configurations, this would probably need to be a tuning parameter, unless the unavailability curves are very flat in the region around the optimum value.
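As a rough illustration of such a model, here is a minimal three-state sketch in Python for a single page: clean, one correctable error, UCME. Everything in it is an assumption for the sake of the example: bit flips arrive at a constant rate, scrubbing repairs a single-bit error at rate scrub_rate, and read disturb adds disturb_per_read * scrub_rate to the flip rate. Solving the two mean-time equations for that chain gives a mean time to UCME of (2*lam + s) / lam^2, which is what the function returns. The parameter names and the numbers in the usage note are hypothetical.

```python
def mean_time_to_ucme(scrub_rate, base_flip_rate, disturb_per_read):
    """Mean time from a clean page to an uncorrectable error.

    States: 0 = clean, 1 = one (correctable) bit error, 2 = UCME (absorbing).
    Transitions: 0 -> 1 and 1 -> 2 at the flip rate lam, 1 -> 0 at scrub_rate.
    Assumed read-disturb model: lam grows linearly with the scrub rate.
    """
    lam = base_flip_rate + disturb_per_read * scrub_rate
    return (2.0 * lam + scrub_rate) / (lam * lam)


def best_scrub_rate(base_flip_rate, disturb_per_read, candidates):
    """Pick the candidate scrub rate that maximises mean time to UCME."""
    return max(
        candidates,
        key=lambda s: mean_time_to_ucme(s, base_flip_rate, disturb_per_read),
    )
```

You could call it with, say, best_scrub_rate(1e-6, 1e-3, [i * 1e-5 for i in range(1001)]), with rates per hour, and then compare mean_time_to_ucme() at neighbouring candidates to see how flat the curve is around the optimum. A real model would need measured flip and read-disturb rates for the particular part.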

William