Thanks for the feedback!

I failed to mention that we NEVER see a data ECC error, which suggests to me
that the data is getting written out to the NAND correctly, along with the
oob/data ECC.

A scenario we do see, although it points to something that corrupted the
file system before "this" reboot, is an oops when we reboot:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
Internal error: Oops: 7
CPU: 0
pc : []    lr : [<0000000d>]    Not tainted
sp : c0c95ef4  ip : 00000000  fp : c0c95f04
r10: 401e564c  r9 : c0c94000  r8 : c001a6c4
r7 : c01464e8  r6 : c0146474  r5 : c3ee8000  r4 : c3ee8000
r3 : 00000000  r2 : 00000005  r1 : 00000001  r0 : 00000000
Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  Segment user
Control: 397F  Table: A3F64000  DAC: 00000015
Stack: (0xc0c95ef4 to 0xc0c96000)
5ee0:                                              00000000 c0c95f1c c0c95f08
5f00: c007ea4c c008444c c3ee8000 c01462c4 c0c95f34 c0c95f20 c007ec18 c007ea2c
5f20: c3ee8000 c3ee9000 c0c95f4c c0c95f38 c0079160 c007ebcc c3ee9000 c3ee9044
5f40: c0c95f6c c0c95f50 c0050ccc c0079124 00000000 c0c95f70 00000000 00000016
5f60: c0c95fa4 c0c95f70 c0063804 c0050b94 c3edeca0 c02fe420 00000000 401e564c
5f80: c0c95fac 00000009 00000001 00149ad0 00149ae8 00149ae8 00000000 c0c95fa8
5fa0: c001a520 c00637c0 00149ad0 c0020754 00149ae8 00149aea 00149ad2 00000075
5fc0: 00149ad0 00149ae8 00149ae8 00000000 0000cf5c 00000000 401e564c 00000000
5fe0: 000779cc bfffee58 000553cc 40195454 a0000010 00149ae8 00000000 00000000
Backtrace:
Function entered at [] from []
 r4 = 00000000
Function entered at [] from []
 r5 = C01462C4  r4 = C3EE8000
Function entered at [] from []
 r5 = C3EE9000  r4 = C3EE8000
Function entered at [] from []
 r5 = C3EE9044  r4 = C3EE9000
Function entered at [] from []
 r7 = 00000016  r6 = 00000000  r5 = C0C95F70  r4 = 00000000
Function entered at [] from []
 r6 = 00149AE8  r5 = 00149AE8  r4 = 00149AD0
Code: e153000e a59000e4 aa000015 e59000e4 (e7903103)

Error (Oops_bfd_perror): /tmp/ksymoops.ZwvjnG Invalid bfd target

>>PC; c00844f0   <=====
>>r8; c001a6c4
>>r7; c01464e8
>>r6; c0146474
Trace; c0084440
Trace; c007ea4c
Trace; c007ea20
Trace; c007ec18
>>r5; c01462c4
Trace; c007ebc0
Trace; c0079160
Trace; c0079118
Trace; c0050ccc
Trace; c0050b88
Trace; c0063804
Trace; c00637b4
Trace; c001a520

Does this look familiar or ring any bells?

Charles Manning wrote:
> On Friday 16 February 2007 05:12, ian@brightstareng.com wrote:
>
>> On Wednesday 14 February 2007 18:49, Blair Barnett wrote:
>>
>>> We're experiencing what appears to be file system corruption
>>> due to power fail. I just got done looking at a nanddump of a
>>> yaffs2 file system that looks like the block header (first
>>> page in the block) was overwritten with garbage data (can't
>>> tell whether it's "good" data from somewhere else yet).
>>>
>> If power is lost during an erase, the memory cells are left in an
>> undefined state with a "half bucket" of electrons which can be
>> read as a one or a zero depending on the temperature, supply
>> voltage, age and the wind ;-) The garbage data would very
>> probably produce a data ECC error (mtd ecc), but this is not
>> guaranteed.
>>
>
> This scenario seems to make the most sense.
>
> While this sort of thing is pretty easy to see on NOR (which takes a long time
> to erase), there is only a very small window for this to happen on NAND since
> an erase only takes 2 or 3 msec or so. How often do you see this?
>
> I would suggest that you check the power rails etc. If your CPU can run at
> voltages where the NAND is marginal, then you have the potential to be
> telling the NAND to do stuff which it can't do properly, i.e. during a power
> failure you'd ideally be shutting down the CPU while the NAND still has
> enough power to be sane.
>
> Also look carefully at the WP pin. Yanking the WP pin during a program/erase
> can cause problems.
>
> You should ideally be doing a power OK check in the NAND driver before starting
> an erase or write to ensure that the system will have enough residual power to
> complete the operation.
>
>
>> If an error is detected by a failed read (mtd ecc), I would
>> expect Yaffs to recover. If not, and Yaffs' test of its own
>> spare/tags "mini ECC" looks good, then the data would be assumed
>> valid and presented as part of a Yaffs file.
>> [Have I got this right Charles?]
>>
>
> There are places where the ecc is ignored.
>
>
>> If you compute the ECC data for the "corrupted" block header
>> page, is it correct?
>> Is the mini-ecc that runs over the Yaffs tags correct?
>>
>>
>>> I'm running 2.4.27 linux with the latest yaffs2 tarball. I'm
>>> unable to quickly move to 2.6.
>>>
>> Linux 2.4 vs. 2.6 should not matter; there are many others
>> in the same boat (real world).
>>
>
> Correct. That's why we consider it important to support 2.4.x rather than give
> out abuse!
>
>
>> -imcd
>>
>> _______________________________________________
>> yaffs mailing list
>> yaffs@lists.aleph1.co.uk
>> http://lists.aleph1.co.uk/cgi-bin/mailman/listinfo/yaffs
>>
>
>
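
For reference, the power OK check suggested in the quoted reply (testing the supply before starting an erase or write) might look roughly like the sketch below. This is only an illustration under stated assumptions, not code from yaffs or MTD: `nand_power_guard`, `guarded_erase`, the `power_ok` and `do_erase` hooks, and the fake board used to exercise them are all hypothetical names, and returning -EIO on a failed check is an arbitrary choice. On real hardware `power_ok` would read a power-good or brown-out signal, which is assumed to exist on the board.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical board hooks; none of these names exist in MTD or yaffs. */
struct nand_power_guard {
    bool (*power_ok)(void *ctx);                 /* true if the supply can carry one more operation */
    int  (*do_erase)(void *ctx, unsigned block); /* the driver's real erase routine */
    void *ctx;                                   /* opaque board/driver state passed to both hooks */
};

/* Start an erase only if the power-good check passes first. */
int guarded_erase(const struct nand_power_guard *g, unsigned block)
{
    assert(g != NULL && g->power_ok != NULL && g->do_erase != NULL);

    if (!g->power_ok(g->ctx))
        return -EIO; /* assumption: refuse rather than risk a half-erased block */

    return g->do_erase(g->ctx, block);
}

/* Fake board used only to exercise the guard in user space. */
struct fake_board {
    bool power_good;      /* stands in for a power-good GPIO */
    unsigned erase_count; /* number of erases actually started */
};

bool fake_power_ok(void *ctx)
{
    return ((struct fake_board *)ctx)->power_good;
}

int fake_erase(void *ctx, unsigned block)
{
    (void)block; /* the fake does not model individual blocks */
    ((struct fake_board *)ctx)->erase_count++;
    return 0;
}
```

The same wrapper shape would apply to page writes. Note that the check only narrows the window: it cannot help if power drops after the erase has already started, so the hold-up time of the supply still has to cover one full erase.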