Author: Yeasah Pell
Date:
To: yaffs
Subject: [Yaffs] possible gc memory leak/error handling comments

(This is a resend of a message I sent last week, but I wasn't subscribed
to the list at the time so it ended up in list moderator approval land.
Apologies in advance if the original message does eventually show up as
a duplicate, but I suspect it is buried in a bucket of endless spam
and will never be heard from again...)
As part of testing yaffs2 with mtd nandsim with various error simulation
options turned on, we discovered some issues with the error handling of
yaffs. I've posted some simple patches to correct those issues and now
yaffs appears to be correctly doing ecc and scrubbing blocks when
corrected errors are detected.
However, I noticed that the actual scrubbing (i.e. prioritized garbage
collection) is usually deferred till the next write operation. Given
that our flash usage patterns vary considerably and may be free of
writing for very long periods of time, I thought it would be wise in our
case to trigger garbage collection after reads as well. That appears to
work fine; I'm sure it degrades performance to some degree but it seems
acceptable. Perhaps a conditional triggering of the gc based on the
prioritized flag would be better, but anyway.
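
A minimal sketch of the conditional trigger mentioned above, in plain C so it can be compiled outside the kernel. Everything here is a stand-in: struct sketch_dev, its has_pending_prioritised_gc field, and sketch_check_gc() are hypothetical names for illustration, not the real yaffs device structure or yaffs_CheckGarbageCollection. Clearing the flag after a single pass is also a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the yaffs device state (not the real struct). */
struct sketch_dev {
    bool has_pending_prioritised_gc; /* assumed: set when a corrected ECC error flags a block */
    int gc_runs;                     /* counts gc invocations, for illustration only */
};

/* Stand-in for the garbage collection check; the real one does the scrubbing. */
static void sketch_check_gc(struct sketch_dev *dev)
{
    dev->gc_runs++;
    /* Simplification: assume one pass handles all prioritised blocks. */
    dev->has_pending_prioritised_gc = false;
}

/* Hook to call at the end of a read: run gc only if a scrub is pending. */
static void sketch_after_read(struct sketch_dev *dev)
{
    assert(dev != NULL);
    if (dev->has_pending_prioritised_gc)
        sketch_check_gc(dev);
}
```

With this shape, reads that hit no corrected errors pay only for one flag test, and gc runs after a read only when a block is actually waiting to be scrubbed.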
When I ran this test on nandsim with bitflips=1 (which assures a
constant stream of single bit errors, basically insane pathological
conditions), the expected behavior resulted -- blocks were being
rewritten and moved around like crazy just by reading a file. However,
during an extended run of this process memory usage steadily grew until
the oom killer eventually started going ballistic on everything in
sight, and the system ground to a total halt.
I'm not sure if the problem is actually a memory leak of some kind in
yaffs_CheckGarbageCollection or if it's an artifact of the different
context in which I'm having it called (from
yaffs_ReadChunkDataFromObject), but I thought I'd mention it anyway for
the record.
Also, another observation (I think this was noted recently on the list
already) is that an MTD -EBADMSG result (or YAFFS_ECC_RESULT_UNFIXED)
doesn't appear to translate into an error condition at the userspace
level -- from what I can tell, bad data is returned to userspace with no
indication of its badness.
Obviously we would all prefer that bad data never happen at all, but
pretending that bad data is good seems perhaps a little too zealous. :-)
In practice most of the time if we have bad data on our flash it's
disastrous anyway and it doesn't really matter much in the end if it's
returned as an EIO error or as bad data to userspace, but for some bits,
like configuration data, we could take reasonable steps (e.g. restoring
defaults) if we can detect bad data, whereas the results of processing
bad data are undefined.
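
To illustrate the configuration case, here is a rough userspace sketch of what restoring defaults could look like if an uncorrectable read were reported as EIO (which, as noted above, does not appear to happen today). The struct app_config layout, the default values, load_config(), and simulated_eio_read() are all hypothetical and exist only for this example; the reader is passed as a function pointer so the failure can be simulated without real flash.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical configuration record and its defaults. */
struct app_config {
    int volume;
    int brightness;
};
static const struct app_config default_config = { 5, 50 };

/* Same signature as read(2), so the real call or a simulation can be used. */
typedef ssize_t (*read_fn)(int fd, void *buf, size_t count);

/* Stand-in reader that simulates an uncorrectable ECC error surfacing as EIO. */
static ssize_t simulated_eio_read(int fd, void *buf, size_t count)
{
    (void)fd; (void)buf; (void)count;
    errno = EIO;
    return -1;
}

/*
 * Returns 0 if the config was read in full, 1 if the read failed with EIO
 * and defaults were restored, -1 on any other error or a short read.
 */
static int load_config(int fd, read_fn do_read, struct app_config *out)
{
    assert(do_read != NULL && out != NULL);
    ssize_t n = do_read(fd, out, sizeof *out);
    if (n == (ssize_t)sizeof *out)
        return 0;
    if (n < 0 && errno == EIO) {
        *out = default_config; /* bad data detected: fall back to defaults */
        return 1;
    }
    return -1;
}
```

In normal use the caller would pass read itself, as in load_config(fd, read, &cfg); passing simulated_eio_read exercises the fallback path.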
One final point, related to the last one: as far as I can tell, yaffs will
in most places process tag data from blocks where the tag ecc has
failed, and this appears to sometimes lead to system hangs. I think it
would be desirable to avoid handling the tag data entirely in this case,
since we know it to be corrupt in some way. I'm not sure exactly what
you can do with a chunk in such a case; presumably some kind of recovery
would be required given that some chunk/object ID is basically going
away but there is no reliable way to know which ID it is based on
inspecting the (invalid in some way) metadata itself. That sounds like
it might be a significant project, though as my understanding of yaffs
internals is quite limited I don't really know for sure.
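
As a rough illustration of "avoid handling the tag data entirely", the sketch below skips any chunk whose tag ECC result is unfixed and never looks at its IDs. The enum is only modelled on the YAFFS_ECC_RESULT_* naming; the struct, the function names, and the scan loop are hypothetical and do not reflect the actual yaffs scanning code. It does not attempt the recovery step discussed above, it only shows the guard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local enum modelled on the YAFFS_ECC_RESULT_* names; values are illustrative. */
enum sketch_ecc_result {
    SKETCH_ECC_NO_ERROR,
    SKETCH_ECC_FIXED,
    SKETCH_ECC_UNFIXED
};

/* Hypothetical, simplified view of a chunk's tags. */
struct sketch_tags {
    enum sketch_ecc_result ecc_result;
    unsigned object_id;
    unsigned chunk_id;
};

/* Tags are trusted only when the ECC check passed or the error was corrected. */
static bool sketch_tags_usable(const struct sketch_tags *t)
{
    assert(t != NULL);
    return t->ecc_result != SKETCH_ECC_UNFIXED;
}

/*
 * Walks an array of tags as a scan would. Chunks with unusable tags are
 * skipped without reading object_id or chunk_id. Returns how many were skipped.
 */
static size_t sketch_count_skipped(const struct sketch_tags *tags, size_t n)
{
    size_t skipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (!sketch_tags_usable(&tags[i])) {
            skipped++; /* IDs are known to be unreliable; do not act on them */
            continue;
        }
        /* A real scan would use tags[i].object_id and tags[i].chunk_id here. */
    }
    return skipped;
}
```

What to do with the skipped chunks afterwards (for example, treating their block as a candidate for collection) is the open question raised above and is left out here.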