(This is a resend of a message I sent last week, but I wasn't subscribed 
to the list at the time so it ended up in list moderator approval land. 
Apologies in advance if the original message does eventually show up as 
a duplicate, but I suspect it is buried in a bucket of endless spam 
and will never be heard from again...)

As part of testing yaffs2 with mtd nandsim with various error simulation 
options turned on, we discovered some issues with the error handling of 
yaffs. I've posted some simple patches to correct those issues and now 
yaffs appears to be correctly doing ecc and scrubbing blocks when 
corrected errors are detected.

However, I noticed that the actual scrubbing (i.e. prioritized garbage 
collection) is usually deferred until the next write operation. Given 
that our flash usage patterns vary considerably and may be free of 
writing for very long periods of time, I thought it would be wise in our 
case to trigger garbage collection after reads as well. That appears to 
work fine; I'm sure it degrades performance to some degree but it seems 
acceptable. Perhaps a conditional triggering of the gc based on the 
prioritized flag would be better, but anyway.
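
To illustrate the conditional variant, here is a minimal sketch. The 
struct, its fields and the sketch_* functions are made up for 
illustration and are not the real yaffs definitions; sketch_check_gc 
merely stands in for yaffs_CheckGarbageCollection.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the yaffs device state (not the real struct). */
struct sketch_dev {
    bool has_pending_prioritised_gc; /* set when a corrected ECC error
                                        marks a block for scrubbing */
    int gc_runs;                     /* counts GC calls, illustration only */
};

/* Stand-in for yaffs_CheckGarbageCollection(): pretend the scrub ran. */
static void sketch_check_gc(struct sketch_dev *dev)
{
    dev->gc_runs++;
    dev->has_pending_prioritised_gc = false;
}

/* Called at the end of a read: only pay for GC when a scrub is pending. */
static void sketch_after_read(struct sketch_dev *dev)
{
    if (dev->has_pending_prioritised_gc)
        sketch_check_gc(dev);
}
```

With this shape, reads that hit no corrected errors would cost only one 
flag test, and the gc would run only when a block is actually waiting to 
be scrubbed.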

When I ran this test on nandsim with bitflips=1 (which assures a 
constant stream of single bit errors, basically insane pathological 
conditions), the expected behavior resulted -- blocks were being 
rewritten and moved around like crazy just by reading a file. However, 
during an extended run of this process, memory usage steadily grew until 
the oom killer eventually started going ballistic on everything in 
sight, and the system ground to a total halt.

I'm not sure if the problem is actually a memory leak of some kind in 
yaffs_CheckGarbageCollection or if it's an artifact of the different 
context in which I'm having it called (from 
yaffs_ReadChunkDataFromObject), but I thought I'd mention it anyway for 
the record.

Also, another observation (I think this was noted recently on the list 
already) is that an MTD -EBADMSG result (or YAFFS_ECC_RESULT_UNFIXED) 
doesn't appear to translate into an error condition at the userspace 
level -- from what I can tell, bad data is returned to userspace with no 
indication of its badness.

Obviously we would all prefer that bad data never happen at all, but 
pretending that bad data is good seems perhaps a little too zealous. :-) 
In practice most of the time if we have bad data on our flash it's 
disastrous anyway and it doesn't really matter much in the end if it's 
returned as an EIO error or as bad data to userspace, but for some bits, 
like configuration data, we could take reasonable steps (e.g. restoring 
defaults) if we can detect bad data, whereas the results of processing 
bad data are undefined.
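
Roughly what I have in mind for the read path, as a sketch only. The 
enum and function below are hypothetical; the enum just mirrors the idea 
of YAFFS_ECC_RESULT_UNFIXED and does not use the real yaffs values or 
call signatures.

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative ECC outcomes, loosely modelled on the yaffs result codes. */
enum sketch_ecc_result {
    SKETCH_ECC_NO_ERROR,
    SKETCH_ECC_FIXED,   /* corrected: data is good, block wants scrubbing */
    SKETCH_ECC_UNFIXED  /* uncorrectable: data cannot be trusted */
};

/*
 * Return the number of bytes read on success, or -EIO when the ECC
 * could not repair the data, so the failure reaches the caller instead
 * of being passed along as if it were good data.
 */
static long sketch_read_result(enum sketch_ecc_result ecc, size_t n_bytes)
{
    if (ecc == SKETCH_ECC_UNFIXED)
        return -EIO;
    return (long)n_bytes;
}
```

A userspace reader of a config file could then treat a failed read() as 
the cue to fall back to defaults.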

One final point, related to the last one: as far as I can tell, yaffs will 
in most places process tag data from blocks where the tag ecc has 
failed, and this appears to sometimes lead to system hangs. I think it 
would be desirable to avoid handling the tag data entirely in this case, 
since we know it to be corrupt in some way. I'm not sure exactly what 
you can do with a chunk in such a case; presumably some kind of recovery 
would be required given that some chunk/object ID is basically going 
away but there is no reliable way to know which ID it is based on 
inspecting the (invalid in some way) metadata itself. That sounds like 
it might be a significant project, though as my understanding of yaffs 
internals is quite limited I don't really know for sure.
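
Just to make the "avoid handling the tag data" part concrete, a toy 
sketch of a scan loop that refuses to interpret tags whose ecc failed. 
All names here are invented for illustration and do not correspond to 
the real yaffs scan code; it also says nothing about the recovery step, 
which is the hard part.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-chunk tag record (not the real yaffs tags layout). */
struct sketch_tags {
    bool tag_ecc_failed;  /* true if the tag ECC could not be corrected */
    unsigned object_id;   /* only meaningful when tag_ecc_failed is false */
    unsigned chunk_id;    /* only meaningful when tag_ecc_failed is false */
};

/*
 * Walk the scanned tags and count the chunks whose tags can be trusted.
 * Chunks with failed tag ECC are skipped without looking at their IDs
 * and are tallied in *n_skipped so a later recovery pass knows about them.
 */
static size_t sketch_scan(const struct sketch_tags *tags, size_t n,
                          size_t *n_skipped)
{
    size_t used = 0;

    *n_skipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (tags[i].tag_ecc_failed) {
            (*n_skipped)++;
            continue; /* never read object_id/chunk_id from bad tags */
        }
        used++;
    }
    return used;
}
```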

Thanks,
-Yeasah Pell