gc causing kernel panic (WAS Re: [Yaffs] very simple example of YAFFS2 "forgetting" a file)

Mon Jan 16 20:21:01 GMT 2006

On Mon, 16 Jan 2006, Michael Schmidt wrote:

> Charles Manning wrote:
> > What is almost certainly happening is that the tags are getting
> > corrupted.
> >
> > You might like to try Sergey's patch if you have not already done so.
>
> I'm not sure why my own fix suddenly stopped working, but Sergey's patch
> fixed the problem, thanks Sergey!
>
> However, now when the GC cycle runs I get a couple of messages like
> "page 5568 in gc has no object" and then the kernel panics. It looks like
> the code that detects this problem isn't doing anything other than just
> printing that message, so I assume something is messed up (tags still
> getting corrupted???) that cannot be fixed if it gets to that point?

That's a longstanding problem with the treatment of hardlinks when YAFFS is
used as a read-write root filesystem. Somewhere in the FS there is a hardlink
pointing nowhere, so it references a nonexistent list entry, and the kernel
panics when it tries to access that memory location. This is fatal because
there is no way to fix it on a production system whatsoever.
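
To make the failure mode concrete, here is a minimal sketch of the kind of
check that would catch a dangling hardlink before anything dereferences it.
It is NOT the YAFFS code: struct fs_object, equivalent_id, find_object() and
resolve_hardlinks() are all hypothetical names invented for illustration, and
the flat array stands in for whatever list the real scan builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified object model; not the real YAFFS structures. */
enum obj_type { OBJ_FILE, OBJ_HARDLINK };

struct fs_object {
    int id;                    /* object id found during the scan */
    enum obj_type type;
    int equivalent_id;         /* hardlinks only: id of the target object */
    struct fs_object *target;  /* resolved pointer, NULL if target is missing */
};

/* Linear lookup by id; returns NULL when no such object exists. */
static struct fs_object *find_object(struct fs_object *objs, size_t n, int id)
{
    for (size_t i = 0; i < n; i++)
        if (objs[i].id == id)
            return &objs[i];
    return NULL;
}

/*
 * Resolve every hardlink to its target. A link whose target cannot be found
 * is left with target == NULL and counted, so callers can skip or delete it
 * instead of following a pointer into nowhere.
 */
static size_t resolve_hardlinks(struct fs_object *objs, size_t n)
{
    size_t dangling = 0;

    assert(objs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (objs[i].type != OBJ_HARDLINK)
            continue;
        objs[i].target = find_object(objs, n, objs[i].equivalent_id);
        if (objs[i].target == NULL) {
            fprintf(stderr, "hardlink %d points at missing object %d\n",
                    objs[i].id, objs[i].equivalent_id);
            dangling++;
        }
    }
    return dangling;
}
```

Any later code would then have to test target for NULL before using it. Whether
a check like this can be retrofitted into the existing design is exactly the
open question.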

I did report it several times to the list with decoded oopses etc., but it
looks like the developers don't bother to fix anything, pretending
everything's fine and I'm just ranting and raining on their parade.

Unfortunately it is NOT a simple bug. It looks more like a logic bug or
design flaw, so there is nothing to patch; it requires a rewrite. I personally
don't have time to rewrite it right now because I'm very busy with other
things, but I will return to it in the foreseeable future. As a temporary fix
we use tmpfs for the root FS: on bootup we copy everything into it from an R/O
YAFFS2 partition and then unmount that partition. The second YAFFS2
partition is mounted R/O under /usr. This is ugly but it works for the time
being.

There is another problem with the initial NAND scan in YAFFS: it holds a big
lock for the entire duration of the scan, which doesn't allow anything else
in the kernel to run. This also requires rewriting because it's not just
annoying, it makes YAFFS unusable. We have a hardware watchdog timer with
circa 1.6 seconds latency that makes the system reboot constantly because
the kernel thread that is supposed to toggle it while booting up does not
get a chance to run. And there is no way for us to stop that WDT; there is
no reset or disable pin on it. It's a MAX6735 chip that has an initial latency
of 54 seconds, which changes to 1.68 seconds on the first access and stays at
that value until the next power cycle.
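
The arithmetic behind the reboot loop can be sketched as follows. This is a
user-space model under stated assumptions, not kernel or YAFFS code: the
function names, the block count and the per-block scan time are made up for
illustration, and only the 1.68 second timeout comes from the paragraph above.
It computes the longest stretch during which the watchdog thread cannot run if
the scan gives up the lock once every kick_every blocks.

```c
#include <assert.h>

/*
 * Longest interval (ms) between two chances for the watchdog thread to run,
 * assuming the scan holds the lock for ms_per_block per block and releases
 * it once every kick_every blocks. All parameters are illustrative.
 */
static unsigned long max_kick_gap_ms(unsigned long n_blocks,
                                     unsigned long ms_per_block,
                                     unsigned long kick_every)
{
    unsigned long gap = 0, worst = 0;

    assert(kick_every > 0);
    for (unsigned long b = 0; b < n_blocks; b++) {
        gap += ms_per_block;              /* time spent scanning this block */
        if ((b + 1) % kick_every == 0) {  /* lock dropped, watchdog can run */
            if (gap > worst)
                worst = gap;
            gap = 0;
        }
    }
    if (gap > worst)                      /* trailing partial batch */
        worst = gap;
    return worst;
}

/* Nonzero if the worst gap stays below the watchdog timeout. */
static int scan_survives_watchdog(unsigned long n_blocks,
                                  unsigned long ms_per_block,
                                  unsigned long kick_every,
                                  unsigned long timeout_ms)
{
    return max_kick_gap_ms(n_blocks, ms_per_block, kick_every) < timeout_ms;
}
```

With an assumed 4096 blocks at 2 ms each, a scan that never releases the lock
(kick_every = 4096) leaves an 8192 ms gap, far beyond 1680 ms. Releasing it
every 512 blocks would bring the gap down to 1024 ms in this model.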

All that means that YAFFS is NOT suitable for any production use as of now,
despite all its developers' claims. This is not a problem per se because
everything's fixable. The main problem is the developers' attitude: they do
NOT want to fix anything and keep beating the same old "Fix your MTD" drum
instead. They don't even accept patches from those they don't like, so the
patches are not applied and their CVS tree still holds the same source that
has NEVER EVER worked with ANY MTD version.

---
******************************************************************
*  KSI at home    KOI8 Net  < >  The impossible we do immediately.  *
*  Las Vegas   NV, USA   < >  Miracles require 24-hour notice.   *
******************************************************************