Re: [Yaffs] yaffs2 and power fail problems

Author: Blair Barnett
Date:  
To: Charles Manning
CC: ian, yaffs
Subject: Re: [Yaffs] yaffs2 and power fail problems
Thanks for the feedback!

I failed to mention that we NEVER see a data ECC error, which suggests to
me that the data is getting written out to the NAND correctly, along with
the oob/data ECC.

A scenario we do see is an oops when we reboot, although it points to
something that corrupted the file system before "this" reboot:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
Internal error: Oops: 7
CPU: 0
pc : [<c00844f0>]    lr : [<0000000d>]    Not tainted
sp : c0c95ef4  ip : 00000000  fp : c0c95f04
r10: 401e564c  r9 : c0c94000  r8 : c001a6c4
r7 : c01464e8  r6 : c0146474  r5 : c3ee8000  r4 : c3ee8000
r3 : 00000000  r2 : 00000005  r1 : 00000001  r0 : 00000000
Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  Segment user
Control: 397F  Table: A3F64000  DAC: 00000015
Stack: (0xc0c95ef4 to 0xc0c96000)
5ee0: 00000000 c0c95f1c c0c95f08
5f00: c007ea4c c008444c c3ee8000 c01462c4 c0c95f34 c0c95f20 c007ec18 c007ea2c
5f20: c3ee8000 c3ee9000 c0c95f4c c0c95f38 c0079160 c007ebcc c3ee9000 c3ee9044
5f40: c0c95f6c c0c95f50 c0050ccc c0079124 00000000 c0c95f70 00000000 00000016
5f60: c0c95fa4 c0c95f70 c0063804 c0050b94 c3edeca0 c02fe420 00000000 401e564c
5f80: c0c95fac 00000009 00000001 00149ad0 00149ae8 00149ae8 00000000 c0c95fa8
5fa0: c001a520 c00637c0 00149ad0 c0020754 00149ae8 00149aea 00149ad2 00000075
5fc0: 00149ad0 00149ae8 00149ae8 00000000 0000cf5c 00000000 401e564c 00000000
5fe0: 000779cc bfffee58 000553cc 40195454 a0000010 00149ae8 00000000 00000000
Backtrace:
Function entered at [<c0084440>] from [<c007ea4c>]
 r4 = 00000000
Function entered at [<c007ea20>] from [<c007ec18>]
 r5 = C01462C4  r4 = C3EE8000
Function entered at [<c007ebc0>] from [<c0079160>]
 r5 = C3EE9000  r4 = C3EE8000
Function entered at [<c0079118>] from [<c0050ccc>]
 r5 = C3EE9044  r4 = C3EE9000
Function entered at [<c0050b88>] from [<c0063804>]
 r7 = 00000016  r6 = 00000000  r5 = C0C95F70  r4 = 00000000
Function entered at [<c00637b4>] from [<c001a520>]
 r6 = 00149AE8  r5 = 00149AE8  r4 = 00149AD0
Code: e153000e a59000e4 aa000015 e59000e4 (e7903103)
Error (Oops_bfd_perror): /tmp/ksymoops.ZwvjnG Invalid bfd target



>>PC; c00844f0 <yaffs_CheckpointClose+b8/11c> <=====


>>r8; c001a6c4 <sys_call_table+0/0>
>>r7; c01464e8 <yaffs2_fs_type+0/1c>
>>r6; c0146474 <yaffs_super_ops+0/50>


Trace; c0084440 <yaffs_CheckpointClose+8/11c>
Trace; c007ea4c <yaffs_WriteCheckpointData+34/a0>
Trace; c007ea20 <yaffs_WriteCheckpointData+8/a0>
Trace; c007ec18 <yaffs_CheckpointSave+60/7c>

>>r5; c01462c4 <yaffs_traceMask+0/4>


Trace; c007ebc0 <yaffs_CheckpointSave+8/7c>
Trace; c0079160 <yaffs_put_super+50/bc>
Trace; c0079118 <yaffs_put_super+8/bc>
Trace; c0050ccc <kill_super+14c/168>
Trace; c0050b88 <kill_super+8/168>
Trace; c0063804 <sys_umount+58/9c>
Trace; c00637b4 <sys_umount+8/9c>
Trace; c001a520 <ret_fast_syscall+0/38>

Does this look familiar or ring any bells?

Charles Manning wrote:
> On Friday 16 February 2007 05:12, wrote:
>
>> On Wednesday 14 February 2007 18:49, Blair Barnett wrote:
>>
>>> We're experiencing what appears to be file system corruption
>>> due to power fail. I just got done looking at a nanddump of a
>>> yaffs2 file system that looks like the block header (first
>>> page in the block) was overwritten with garbage data (can't
>>> tell whether it's "good" data from somewhere else yet).
>>>
>> If power is lost during an erase, the memory cells are left in an
>> undefined state with a "half bucket" of electrons which can be
>> read as a one or a zero depending on the temperature, supply
>> voltage, age and the wind ;-) The garbage data would very
>> probably produce a data ECC error (mtd ecc), but this is not
>> guaranteed.
>>
>
> This scenario seems to make the most sense.
>
> While this sort of thing is pretty easy to see on NOR (which takes a long time
> to erase), there is only a very small window for this to happen on NAND since
> an erase only takes 2 or 3 msec or so. How often do you see this?
>
> I would suggest that you check the power rails etc. If your CPU can run at
> voltages where the NAND is marginal, then you have the potential to be
> telling the NAND to do stuff which it can't do properly, i.e. during a power
> failure you'd ideally be shutting down the CPU while the NAND still has
> enough power to be sane.
>
> Also look carefully at the WP pin. Yanking the WP pin during a program/erase
> can cause problems.
>
> You should ideally be doing a power-OK check in the NAND driver before
> starting an erase or write, to ensure that the system will have enough
> residual power to complete the operation.
>
>
>> If an error is detected by a failed read (mtd ecc), I would
>> expect Yaffs to recover. If not, and Yaffs' test of its own
>> spare/tags "mini ECC" looks good, then the data would be assumed
>> valid and presented as part of a Yaffs file.
>> [Have I got this right Charles?]
>>
>
> There are places where the ecc is ignored.
>
>
>> If you compute the ECC data for the "corrupted" block header
>> page, is it correct?
>> Is the mini-ecc that runs over the Yaffs tags correct?
>>
>>
>>> I'm running 2.4.27 linux with the latest yaffs2 tarball. I'm
>>> unable to quickly move to 2.6.
>>>
>> Linux 2.4 vs. 2.6 should not matter; there are many others
>> in the same boat (real world).
>>
>
> Correct. That's why we consider it important to support 2.4.x rather than give
> out abuse!
>
>
>> -imcd
>>
>> _______________________________________________
>> yaffs mailing list
>>
>> http://lists.aleph1.co.uk/cgi-bin/mailman/listinfo/yaffs
>>
>
>
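
The power-OK check suggested in the quoted reply (test the supply before
starting an erase or write) could be sketched roughly as below. This is
only an illustration, not code from YAFFS or the MTD layer: nand_guard,
guarded_erase, fake_board, MIN_SAFE_MV and ERR_POWER_LOW are all
hypothetical names, and the 2700 mV threshold is an assumed value that
would have to come from the NAND datasheet and the board design.

```c
#include <stdbool.h>

#define MIN_SAFE_MV   2700   /* assumed threshold; take the real one from the datasheet */
#define ERR_POWER_LOW (-1)   /* hypothetical error code: operation was not started */

/* Stand-in for real hardware so the sketch can be exercised on a host. */
struct fake_board {
    int supply_mv;   /* simulated supply voltage in millivolts */
    int erases;      /* number of erases actually issued */
};

/* Hypothetical hooks a driver would fill in with board-specific routines. */
struct nand_guard {
    bool (*power_ok)(void *ctx);                    /* supply monitor */
    int  (*erase_block)(void *ctx, unsigned block); /* real erase routine */
    void *ctx;
};

/* Simulated supply monitor: reports OK only at or above the threshold. */
bool fake_power_ok(void *ctx)
{
    struct fake_board *b = ctx;
    return b->supply_mv >= MIN_SAFE_MV;
}

/* Simulated erase: just counts how many erases were issued. */
int fake_erase(void *ctx, unsigned block)
{
    struct fake_board *b = ctx;
    (void)block;
    b->erases++;
    return 0;
}

/* Refuse to start an erase unless the supply monitor reports OK. */
int guarded_erase(const struct nand_guard *g, unsigned block)
{
    if (!g->power_ok(g->ctx))
        return ERR_POWER_LOW;   /* never touch the NAND on a sagging rail */
    return g->erase_block(g->ctx, block);
}
```

A driver would point power_ok at whatever supply monitor the board has and
wrap its program path the same way. A check like this only helps if the
supply can hold up for the duration of one erase or program after the check
has passed.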