On Friday 16 February 2007 11:07, Blair Barnett wrote:
> Thanks for the feedback!
>
> I failed to mention we NEVER see a data ECC error, which means to me
> that the data is getting written out to the NAND along with the
> oob/data ECC correctly.
>
> A scenario we do see, although it points to something that corrupted the
> file system before "this" reboot, is an oops when we reboot:
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> Internal error: Oops: 7
> CPU: 0
> pc : [] lr : [<0000000d>] Not tainted
> sp : c0c95ef4 ip : 00000000 fp : c0c95f04
> r10: 401e564c r9 : c0c94000 r8 : c001a6c4
> r7 : c01464e8 r6 : c0146474 r5 : c3ee8000 r4 : c3ee8000
> r3 : 00000000 r2 : 00000005 r1 : 00000001 r0 : 00000000
> Flags: Nzcv IRQs on FIQs on Mode SVC_32 Segment user
> Control: 397F Table: A3F64000 DAC: 00000015
> Stack: (0xc0c95ef4 to 0xc0c96000)
> 5ee0: 00000000 c0c95f1c c0c95f08
> 5f00: c007ea4c c008444c c3ee8000 c01462c4 c0c95f34 c0c95f20 c007ec18 c007ea2c
> 5f20: c3ee8000 c3ee9000 c0c95f4c c0c95f38 c0079160 c007ebcc c3ee9000 c3ee9044
> 5f40: c0c95f6c c0c95f50 c0050ccc c0079124 00000000 c0c95f70 00000000 00000016
> 5f60: c0c95fa4 c0c95f70 c0063804 c0050b94 c3edeca0 c02fe420 00000000 401e564c
> 5f80: c0c95fac 00000009 00000001 00149ad0 00149ae8 00149ae8 00000000 c0c95fa8
> 5fa0: c001a520 c00637c0 00149ad0 c0020754 00149ae8 00149aea 00149ad2 00000075
> 5fc0: 00149ad0 00149ae8 00149ae8 00000000 0000cf5c 00000000 401e564c 00000000
> 5fe0: 000779cc bfffee58 000553cc 40195454 a0000010 00149ae8 00000000 00000000
> Backtrace:
> Function entered at [] from []
> r4 = 00000000
> Function entered at [] from []
> r5 = C01462C4 r4 = C3EE8000
> Function entered at [] from []
> r5 = C3EE9000 r4 = C3EE8000
> Function entered at [] from []
> r5 = C3EE9044 r4 = C3EE9000
> Function entered at [] from []
> r7 = 00000016 r6 = 00000000 r5 = C0C95F70 r4 = 00000000
> Function entered at [] from []
> r6 = 00149AE8 r5 = 00149AE8 r4 = 00149AD0
> Code: e153000e a59000e4 aa000015 e59000e4 (e7903103)
> Error (Oops_bfd_perror): /tmp/ksymoops.ZwvjnG Invalid bfd target
>
> >>PC; c00844f0 <=====
> >>
> >>r8; c001a6c4
> >>r7; c01464e8
> >>r6; c0146474
>
> Trace; c0084440
> Trace; c007ea4c
> Trace; c007ea20
> Trace; c007ec18
>
> >>r5; c01462c4
>
> Trace; c007ebc0
> Trace; c0079160
> Trace; c0079118
> Trace; c0050ccc
> Trace; c0050b88
> Trace; c0063804
> Trace; c00637b4
> Trace; c001a520
>
> Does this look familiar or ring any bells?

Is this the same problem as the corruption you mention below?

Can you please explain the sequence that led up to this?

Also please provide the following:
1) A line number for this oops. I have an idea, but I want to be sure.
2) A binary dump of the broken page.

I suggest you also turn on more yaffs tracing to check this out.

From what I read here, this crash happened during a umount due to a strange
block value being used in the checkpoint.

>
> Charles Manning wrote:
> > On Friday 16 February 2007 05:12, ian@brightstareng.com wrote:
> >> On Wednesday 14 February 2007 18:49, Blair Barnett wrote:
> >>> We're experiencing what appears to be file system corruption
> >>> due to power fail. I just got done looking at a nanddump of a
> >>> yaffs2 file system that looks like the block header (first
> >>> page in the block) was overwritten with garbage data (can't
> >>> tell whether it's "good" data from somewhere else yet).
> >>
> >> If power is lost during an erase, the memory cells are left in an
> >> undefined state with a "half bucket" of electrons which can be
> >> read as a one or a zero depending on the temperature, supply
> >> voltage, age and the wind ;-) The garbage data would very
> >> probably produce a data ECC error (mtd ecc), but this is not
> >> guaranteed.
> >
> > This scenario seems to make the most sense.
> >
> > While this sort of thing is pretty easy to see on NOR (which takes a long
> > time to erase), there is only a very small window for this to happen on
> > NAND since an erase only takes 2 or 3 msec or so. How often do you see
> > this?
> >
> > I would suggest that you check the power rails etc. If your CPU can run
> > at voltages where the NAND is marginal, then you have the potential to be
> > telling the NAND to do stuff which it can't do properly, i.e. during a
> > power failure you'd ideally be shutting down the CPU while the NAND still
> > has enough power to be sane.
> >
> > Also look carefully at the WP pin. Yanking the WP pin during a
> > program/erase can cause problems.
> >
> > You should ideally be doing a power OK check in the NAND driver before
> > starting an erase or write to ensure that the system will have residual
> > power to complete an erase.
> >
> >> If an error is detected by a failed read (mtd ecc), I would
> >> expect Yaffs to recover. If not, and Yaffs' test of its own
> >> spare/tags "mini ECC" looks good, then the data would be assumed
> >> valid and presented as part of a Yaffs file.
> >> [Have I got this right Charles?]
> >
> > There are places where the ecc is ignored.
> >
> >> If you compute the ECC data for the "corrupted" block header
> >> page, is it correct?
> >> Is the mini-ecc that runs over the Yaffs tags correct?
> >>
> >>> I'm running 2.4.27 linux with the latest yaffs2 tarball. I'm
> >>> unable to quickly move to 2.6.
> >>
> >> Linux 2.4 vs. 2.6 should not matter; there are many others
> >> in the same boat (real world).
> >
> > Correct. That's why we consider it important to support 2.4.x rather than
> > give out abuse!
> >
> >> -imcd
> >>
> >> _______________________________________________
> >> yaffs mailing list
> >> yaffs@lists.aleph1.co.uk
> >> http://lists.aleph1.co.uk/cgi-bin/mailman/listinfo/yaffs
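
The "power OK check" suggested in the thread (test the supply before starting an erase or write) could be sketched roughly as below. This is only an illustration under stated assumptions: `struct nand_guard`, `guarded_nand_op()`, `POWER_OK_MIN_MV` and the `fake_*` helpers are hypothetical names invented for this sketch and are not part of MTD or yaffs, and the 3000 mV threshold is a placeholder rather than a figure for any real NAND part.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical threshold: lowest supply voltage (in mV) at which we are
 * still willing to start an erase or program. The real figure depends on
 * the NAND part and on how long the rail holds up after power fail. */
#define POWER_OK_MIN_MV 3000

/* Returned when the operation was refused because the supply is too low. */
#define GUARD_POWER_LOW (-1)

/* Hypothetical board hooks; nothing here is an MTD or yaffs API. */
struct nand_guard {
    int (*read_supply_mv)(void);   /* current supply voltage in mV */
    int min_mv;                    /* refuse to start below this level */
};

/* Run op(block) only if the supply looks healthy right now. */
static int guarded_nand_op(const struct nand_guard *g, int block,
                           int (*op)(int block))
{
    assert(g != NULL && g->read_supply_mv != NULL && op != NULL);

    if (g->read_supply_mv() < g->min_mv)
        return GUARD_POWER_LOW;    /* leave the flash untouched */

    return op(block);
}

/* Stand-ins for real hardware so the sketch can be exercised on a host. */
static int fake_supply_mv = 3300;
static int fake_erase_calls;

static int fake_read_supply_mv(void)
{
    return fake_supply_mv;
}

static int fake_erase(int block)
{
    (void)block;                   /* a real driver would erase this block */
    fake_erase_calls++;
    return 0;
}
```

With `struct nand_guard g = { fake_read_supply_mv, POWER_OK_MIN_MV };`, a call to `guarded_nand_op(&g, 5, fake_erase)` runs the erase while the fake supply reads 3300 mV and returns `GUARD_POWER_LOW` without calling it once the supply reads 2700 mV. A check like this only narrows the window: the threshold has to leave enough hold-up margin to cover the 2 or 3 msec that the erase itself takes.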