Hi Charles:

The following is a proposed patch that checks the block state after discovering a checkpoint block, and continues the search if the block state is DEAD:
From b08b8c5fc21c2820f66454968be3a5115477fc96 Mon Sep 17 00:00:00 2001
From: Peter Lin <peter.lin@gdc-tech.com>
Date: Mon, 21 May 2012 20:25:55 +0800
Subject: [PATCH] yaffs: ignore checkpt if it is in bad block

---
 yaffs_checkptrw.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/yaffs_checkptrw.c b/yaffs_checkptrw.c
index 997a618..0d63e74 100644
--- a/yaffs_checkptrw.c
+++ b/yaffs_checkptrw.c
@@ -13,6 +13,8 @@
 
 #include "yaffs_checkptrw.h"
 #include "yaffs_getblockinfo.h"
+#include "yaffs_nand.h"
+#include "yaffs_guts.h"
 
 static int yaffs2_checkpt_space_ok(struct yaffs_dev *dev)
 {
@@ -117,6 +119,15 @@ static void yaffs2_checkpt_find_block(struct yaffs_dev *dev)
  tags.ecc_result);
 
  if (tags.seq_number == YAFFS_SEQUENCE_CHECKPOINT_DATA) {
+ enum yaffs_block_state  state = 0;
+ u32 seq_number = 0;
+ yaffs_query_init_block_state(dev, i, &state, &seq_number);
+ if( YAFFS_BLOCK_STATE_DEAD == state )
+ {
+ yaffs_trace(YAFFS_TRACE_CHECKPOINT,
+ "ignore bad checkpt block %d", i);
+ continue;
+ }
  /* Right kind of block */
  dev->checkpt_next_block = tags.obj_id;
  dev->checkpt_cur_block = i;
-- 
1.7.6.1
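
For anyone who wants to try the idea outside the driver, here is a small standalone model of the search loop with the extra check. It is only a sketch: `model_block`, `find_checkpt_block` and the `SEQ_CHECKPOINT_DATA` value are made up for illustration and are not the real yaffs definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the yaffs types (not the real ones). */
enum block_state {
    BLOCK_STATE_UNKNOWN,
    BLOCK_STATE_FULL,
    BLOCK_STATE_DEAD    /* block is marked bad */
};

#define SEQ_CHECKPOINT_DATA 0x21u   /* made-up marker value for this model */

struct model_block {
    unsigned seq_number;      /* sequence number read from the block's tags */
    enum block_state state;   /* state reported by the bad block query */
};

/*
 * Return the index of the first usable checkpoint block at or after 'start',
 * or -1 if none is found. Blocks that carry the checkpoint marker but are
 * DEAD are skipped and the search continues.
 */
static int find_checkpt_block(const struct model_block *blocks,
                              int n_blocks, int start)
{
    int i;

    assert(blocks != NULL);

    for (i = start; i < n_blocks; i++) {
        if (blocks[i].seq_number != SEQ_CHECKPOINT_DATA)
            continue;   /* not a checkpoint block */

        if (blocks[i].state == BLOCK_STATE_DEAD) {
            printf("ignore bad checkpt block %d\n", i);
            continue;   /* stale checkpoint data in a bad block */
        }

        return i;       /* right kind of block */
    }

    return -1;
}
```

With a layout where block 1 is a bad block holding old checkpoint data and block 3 holds the good checkpoint, the model skips block 1 and returns 3.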


On Fri, May 18, 2012 at 1:02 PM, peterlingoal <peterlingoal@gmail.com> wrote:
I did a quick test using yaffs HEAD to search for a checkpoint block on my NAND, and it returned the same one in the bad block area. Even if it is rejected by the later checks, the checkpoint will fail every time if this bad block comes first in the search.

Should the checkpoint be ignored and the search continue if it is in a bad block area?


On Fri, May 18, 2012 at 9:39 AM, peterlingoal <peterlingoal@gmail.com> wrote:
Yes, we are using a pretty old version (from Sep 2010), and we are now trying to upgrade to the latest.
Could you please point out which checksum would prevent an old checkpoint from being used? Right now I cannot simply try a new version, as the version mismatch would always force a re-scan.

BTW, HowYaffsWorks is a great document, but there's no download link on yaffsDotnet. I didn't find the doc until I googled for the file directly. Could this be fixed so newbies like me can read the document before asking questions?

Thanks,
Peter


On Fri, May 18, 2012 at 5:26 AM, Charles Manning <manningc2@actrix.gen.nz> wrote:
On Thursday 17 May 2012 22:29:42 peterlingoal wrote:
> After spending some time looking around in my corrupted NAND, I think I am
> clear about what's going on there:
>
> There's an *outdated* checkpoint block in the bad blocks portion, and the
> real good one is located at a later block. During mounting, yaffs first
> found the *outdated* checkpoint block and loaded from there. That's why
> loading from the checkpoint always results in a corrupted FS, even after
> re-scanning all the blocks with no-checkpoint-read.
>
> now the question part:
>
>    1. Why is there a checkpoint block 'left over' in the bad blocks in
>    the first place? Should it be erased?
It is generally a bad idea to erase bad blocks.
>    2. While looking for a checkpoint block, should the block status be
>    checked? Or is there a better way to handle this situation? I simply
>    used mtd->block_isbad to skip the block and continue searching, and
>    it seemed to work.
That should be happening. I'll fix it if that is broken.

Now my question :-):
Are you using an old version of yaffs or the latest? There are various
checksums on the checkpoint data which should fail if old data is found.
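
To illustrate the kind of check meant here, below is a minimal sketch of a running sum plus XOR over the checkpoint bytes, compared against a stored value. The names (`checkpt_check`, `check_update`, `checkpt_data_ok`) and the way the value is packed are assumptions for the example and not the actual yaffs code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical running check over checkpoint bytes (model only). */
struct checkpt_check {
    uint32_t sum;       /* running sum of all bytes */
    uint8_t  xor_acc;   /* running XOR of all bytes */
};

static void check_init(struct checkpt_check *c)
{
    assert(c != NULL);
    c->sum = 0;
    c->xor_acc = 0;
}

/* Fold 'len' bytes of checkpoint data into the running check. */
static void check_update(struct checkpt_check *c,
                         const uint8_t *data, size_t len)
{
    size_t i;

    assert(c != NULL && (data != NULL || len == 0));

    for (i = 0; i < len; i++) {
        c->sum += data[i];
        c->xor_acc ^= data[i];
    }
}

/* Pack both accumulators into one value that can be stored with the data. */
static uint32_t check_value(const struct checkpt_check *c)
{
    return (c->sum << 8) | c->xor_acc;
}

/* Return 1 if the data matches the stored check value, 0 otherwise. */
static int checkpt_data_ok(const uint8_t *data, size_t len, uint32_t stored)
{
    struct checkpt_check c;

    check_init(&c);
    check_update(&c, data, len);
    return check_value(&c) == stored;
}
```

In this model a check only fails when the bytes read back do not match the value stored alongside them. A complete old checkpoint that is still internally consistent would pass it.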

>
> regards,
> Peter
>
> On Mon, May 7, 2012 at 3:08 AM, Charles Manning <manningc2@actrix.gen.nz> wrote:
> > On Friday 04 May 2012 00:30:55 peterlingoal wrote:
> > > Hi Charles,
> > >
> > > Thanks for the reply.
> > >
> > > I am quite confused about the bad block management methodology; it
> > > seems both MTD and yaffs2 have some kind of bad block control. The
> > > problem in my case is that, after some period of usage, the yaffs2
> > > file system on some NANDs begins to fail. Remounting while ignoring the
> > > checkpoint can recover the file system, but only once. The file system
> > > is still broken after a reboot and mount (with checkpoint).
> > >
> > > I tried to read the yaffs2 code that scans when the checkpoint is
> > > ignored, and got confused. It seems the yaffs2 driver queries the
> > > status of each block (in function yaffs2_scan_backwards). My questions
> > > are:
> >
> > I suggest you read the HowYaffsWorks doc. You can find that on
> > yaffs.net or find the openoffice doc on the yaffs git.
> >
> > >    1. what does function yaffs2_scan_backwards do?
> >
> > This function scans the nand partition if there is no checkpoint. It
> > reads the tags and builds up the file system state.
> >
> > >    2. MTD keeps a BBT (in NAND in my case); how does the yaffs2 module
> > >    obtain the BBT information? Why is a backward rescan needed in my
> > >    case in order to recover the file system?
> >
> > Yaffs calls the MTD function to determine if a block is good or bad.
> > Yaffs does not know or care if mtd used a bad block table or not.
> >
> > >    3. After recovering the system, it seems the bad block information
> > >    is not saved, so a re-scan is still needed after a reboot. This is
> > >    my guess; please correct me if I am wrong.
> > >
> > > Also I am using a quite old version of yaffs2 ( back in 2010). What's
> > > the most recommended stable version of yaffs2,
> >
> > I suggest using a more recent version. I would recommend using the
> > current HEAD.
> >
> > > and the kernel MTD driver
> > > version?
> >
> > Sorry, I don't keep current with all mtd changes and can't advise on
> > that off the top of my head.
> >
> > > To cut some boot-up time I am saving the BBT on NAND and reusing it
> > > after reboot. Will this have any negative impact?
> >
> > I don't see that this will cause any problems. yaffs does not care how
> > or if you store bbt info.
> >
> > > I am interested in block
> > > summaries, but I would like to stick to checkpoint at the moment.
> >
> > If you use the new code you will get summaries as part of the
> > improvement.
> >
> > > I am new to kernel level debugging, so I am quite lost here. Any help
> > > is appreciated. Thanks!
> >
> > We've all been there.
> >
> > > regards,
> > > Peter
> > >
> > > On Mon, Apr 30, 2012 at 7:41 AM, Charles Manning <manningc2@actrix.gen.nz> wrote:
> > > > On Saturday 28 April 2012 05:26:23 Peter Lin wrote:
> > > > > I have several NANDs where the yaffs2 module considers itself
> > > > > successfully recovered from checkpointing and skips scanning, but
> > > > > the filesystem is not usable. Mounting with the option
> > > > > no-checkpoint-read can recover the filesystem.
> > > > >
> > > > > I understand that bad block management is provided by the MTD
> > > > > layer, and the fact that rescanning fixes the problem shows that
> > > > > MTD is doing its job. But I do have some questions:
> > > > >
> > > > > 1. Why did the checkpoint restore succeed in the first place but
> > > > > leave a corrupted filesystem?
> > > >
> > > > It is impossible to say with so little info.
> > > >
> > > > > 2. What would happen if a used block becomes a bad block?
> > > >
> > > > That block will not be scanned. But blocks don't just "go bad". We
> > > > have to mark them as bad. That normally means we have time to extract
> > > > the useful data first.
> > > >
> > > > > Will the whole filesystem go crazy?
> > > >
> > > > No. Yaffs uses a log structure with tags. That means there is no
> > > > "master table" or such which holds all the information.
> > > >
> > > > > Any way to recover from it?
> > > > >
> > > > > 3. Is there any way to check for or indicate an inconsistency in
> > > > > the filesystem, so the mounting script could try the option
> > > > > no-checkpoint-read?
> > > >
> > > > There is no such provision at present. Since there is no scanning if
> > > > the checkpoint works, it is really hard to see how you would decide
> > > > that the checkpoint was bad.
> > > >
> > > > If you are having problems with checkpoint, then consider just
> > > > turning it off.
> > > > Since block summaries were introduced, the boot speed-up benefits of
> > > > checkpointing are not as dramatic as they were.
> > > >
> > > > > Thanks for your work and help. Please let me know if there's any
> > > > > mistake in my understanding.
> > > > >
> > > > > regards,
> > > > > Peter
> > > > >
> > > > > Does the official kernel have this function enabled, or is there
> > > > > any option that controls it?
> > > > >
> > > > > On 2010-03-04 20:55, Charles Manning wrote:
> > > > > > On Friday 05 March 2010 07:14:59 Shivdas Gujare wrote:
> > > > > > > Hi Charles,
> > > > > > >
> > > > > > > Thanks lot for your help.
> > > > > > >
> > > > > > > On Wed, Mar 3, 2010 at 12:34 PM, Charles Manning wrote:
> > > > > > > > On Wednesday 03 March 2010 23:33:31 Sven Van Asbroeck wrote:
> > > > > > > >> Hello Shivdas,
> > > > > > > >>
> > > > > > > > >> > So, what does "check pointing" actually save on
> > > > > > > > >> > unmount?
> > > > > > > >>
> > > > > > > > >> It's my understanding that the check point consists of the
> > > > > > > > >> RAM data structure which is assembled when a yaffs
> > > > > > > > >> partition is scanned. It consists of meta-information
> > > > > > > > >> associated with each chunk and block. If you'd like to
> > > > > > > > >> know more, I recommend reading the 'How Yaffs works'
> > > > > > > > >> document, which is available in CVS.
> > > > > > > >
> > > > > > > > > A full scan builds up a set of data structures that define
> > > > > > > > > the file system state. A checkpoint captures a reduced
> > > > > > > > > version of that, enough to reconstitute the main part of
> > > > > > > > > the state, and the rest can be built up on a lazy basis.
> > > > > > > >
> > > > > > > >> > and Is it
> > > > > > > >> > safe to use check-pointing always in final product?
> > > > > > > >>
> > > > > > > > >> According to Charles, checkpointing is designed to be used
> > > > > > > > >> in the way you describe. To my knowledge, no open
> > > > > > > > >> checkpointing issues exist, but you should search the
> > > > > > > > >> archives. If you are concerned about the checkpoint
> > > > > > > > >> diverging from the meta-information on flash, you could
> > > > > > > > >> a) disable checkpointing altogether, or b) submit a patch
> > > > > > > > >> implementing a checkpoint counter ;-)
> > > > > > > >
> > > > > > > > You can also choose to mount ignoring checkpointing with
> > > > > > > >
> > > > > > > > mount -t yaffs2 -o"no-checkpoint-read" ..
> > > > > > >
> > > > > > > > This is not an option for me, since in the final product the
> > > > > > > > end user should not be able to change system data (i.e. mount
> > > > > > > > flags). Nor can I change it unless the rootfs is flashed on
> > > > > > > > the device, since yaffs2/nand partitions are mounted from the
> > > > > > > > rcS script.
> > > > > >
> > > > > > You don't need to do this. Just leave checkpointing on.
> > > > > >
> > > > > > -- Charles
> > > > >
> > > > > -Peter
> > > > > _______________________________________________
> > > > > yaffs mailing list
> > > > > yaffs@lists.aleph1.co.uk
> > > > > http://lists.aleph1.co.uk/cgi-bin/mailman/listinfo/yaffs