[Yaffs] Re: [YAFFS1] Some bits are changed - systematically

Author: Martin Egholm Nielsen
Date:  
To: yaffs
Subject: [Yaffs] Re: [YAFFS1] Some bits are changed - systematically
>> > Can you reproduce the problem? Does the corruption hit the same
>> > file? Is it similar in other files? Do you know it's not a NAND
>> > or MTD problem -- i.e. a corrupted write or a bad device? Have
>> > you seen this problem on other instances of the h/w? etc.
>>
>>That's the only device I've seen it with - out of 20-30 pieces having
>>had the same "treatment" :-)
>>And no I haven't tried that device any more - I didn't want to ruin the
>>possibility to analyse what has happened...
>>
>>And I don't know if it's a NAND or MTD problem - I was hoping that someone
>>could guide me...
>>
>>Can this occur, say, with a bad NAND? Would YAFFS/MTD puke up with a lot
>>of checksum errors?
>
>
> A few things that I can think of:
>
> 1) A gross NAND failure. YAFFS/mtd are not magic and need reasonably reliable
> media to do anything. ECC can fix single-bit errors, but nothing more. It
> can't fix gross NAND errors any more than ReiserFS can work with a disk with
> a 6 inch nail through it.
>
> 2) Iffy timing. Check your NAND access timing. Marginal timing has a habit of
> making some parts work OK and others not.
>
> 3) Check that the ECC code is actually working OK. A poor ECC implementation
> could cause more damage than it fixes.
>
> 4) Bad block handling. If a bad block is not being flagged correctly then you
> could end up retrying it on every mount. That would be a problem.


I haven't had the time to dig further into this - we've been
struggling with other critical issues - namely bad powerup and, most
notably of all, memory failures! Some of our boards crash, and in
"lightweight" situations the memory is just modified slightly. So for
now I put all my faith in this being the reason for the systematic
bit-changing...
But I guess, in order for this to be The Plausible Real Explanation
(TM), the bits would have to have been modified while writing the file.
However, the error only occurred after several reboots and additional
writes to the NAND. But perhaps the additional writing could trigger new
instructions/code from the altered file (libc.so)?!
Does this sound likely?
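
One way to check whether the changes really are systematic would be to compare the damaged libc.so against a known-good copy, bit by bit, on a host machine. The following is only a minimal sketch: the function names are made up for illustration, and the 512-byte page size is an assumption (adjust it to the actual NAND geometry). It is not part of YAFFS or MTD.

```python
import sys
from collections import Counter


def bit_diffs(good: bytes, bad: bytes):
    """Yield (byte_offset, bit_index, direction) for every differing bit."""
    for offset, (g, b) in enumerate(zip(good, bad)):
        delta = g ^ b  # set bits mark positions where the two bytes differ
        for bit in range(8):
            mask = 1 << bit
            if delta & mask:
                direction = "1->0" if g & mask else "0->1"
                yield offset, bit, direction


def summarize(good: bytes, bad: bytes, page_size: int = 512):
    """Count flips per bit index, per direction and per page (page_size is assumed)."""
    by_bit, by_direction, by_page = Counter(), Counter(), Counter()
    for offset, bit, direction in bit_diffs(good, bad):
        by_bit[bit] += 1
        by_direction[direction] += 1
        by_page[offset // page_size] += 1
    return {"bit": by_bit, "direction": by_direction, "page": by_page}


# Hypothetical usage: python bitdiff.py good_libc.so bad_libc.so
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], "rb") as f_good, open(sys.argv[2], "rb") as f_bad:
        report = summarize(f_good.read(), f_bad.read())
    for name, counts in report.items():
        print(name, dict(counts))
```

If nearly all flips hit the same bit index or go in the same direction, that would point towards a stuck data line or a memory fault; flips clustered in a few pages would look more like a NAND or bad-block problem. Note that the sketch only compares up to the length of the shorter file.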

BR,
Martin Egholm