Re: [Yaffs] File corruption - read and write problems

Author: zheng shi
Date:
To: William Juul
CC: Charles Manning, yaffs
Subject: Re: [Yaffs] File corruption - read and write problems

It sounds like a NAND driver problem.
I think you should first replay your test case at the MTD level,
i.e. verify the write and read-back of a NAND page directly.
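
A rough sketch of what such a page-level check could look like in C. It only
assumes a file descriptor that supports pwrite()/pread(); on a real board that
would be something like /dev/mtdX opened O_RDWR, with the target block erased
beforehand and the page size taken from the writesize field returned by the
MEMGETINFO ioctl. The function name verify_page and its return codes are made
up for this illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one page of `pattern` at `offset` through `fd`, read it back and
 * compare. Returns 0 if the data matches, 1 on a mismatch, -1 on an I/O
 * or allocation error.
 *
 * Assumption: on an MTD character device the caller has already erased
 * the block and `offset`/`page_size` are page aligned.
 */
int verify_page(int fd, off_t offset, const unsigned char *pattern,
                size_t page_size)
{
    unsigned char *readback;
    int result;

    assert(pattern != NULL && page_size > 0);

    readback = malloc(page_size);
    if (readback == NULL)
        return -1;

    if (pwrite(fd, pattern, page_size, offset) != (ssize_t)page_size ||
        pread(fd, readback, page_size, offset) != (ssize_t)page_size)
        result = -1;
    else
        result = memcmp(pattern, readback, page_size) == 0 ? 0 : 1;

    free(readback);
    return result;
}
```

Running this in a loop over a few blocks, with and without a delay before a
reboot, should show whether pages go missing below yaffs. On a regular file
the read-back may be served from the page cache, so the check is only
meaningful against the MTD device itself.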

On Fri, Dec 16, 2011 at 2:38 AM, William Juul <william@juul.no> wrote:
> We have now done some further investigation into this matter, and it's getting
> even more peculiar.
>
> We can now reproduce this error condition and here is what we do:
> 1) Erase nand in U-boot
> 2) Write lots of files to yaffs FS from U-boot
> 3) Boot to linux (using the files just written to nand)
> 4) Wait 10 seconds
> 5) reboot
> 6) in linux check SHA1 sum of file with object_id 264
>
> If we do exactly this, everything is fine; but if we skip 4) or reduce that
> delay, we get a missing chunk (with chunkid in the range 30-60) in the file
> mentioned in 6).
> If we check the SHA1 sum before rebooting in step 5), it is correct
> (regardless of any delay).
>
> The reboot is done properly, and trace shows us that yaffs_do_sync_fs is
> being called and that yaffs background thread is being shut down.
>
> The yaffs version we are using in U-boot is from 2010-04-26.
>
> Any ideas?
>
> Best regards
> William
>
> On Wed, Nov 23, 2011 at 00:14, Charles Manning <manningc2@actrix.gen.nz>
> wrote:
>>
>> On Wednesday 23 November 2011 11:36:00 William Juul wrote:
>> > Hello, we have been using yaffs on a PPC running linux for several years.
>> > We have multiple boards and a complete install base of tens of thousands.
>> > And in our QA lab we have several hundred devices that are upgraded
>> > several times an hour 24/7.
>> >
>> > We are currently on kernel 3.0.4/3.0.7 and yaffs as of August 15th.
>> >
>> > So far so good, and thanks by the way :-)
>> >
>> > Now for the problem.
>> >
>> > From time to time (read: seldom) we experience file corruption. To get
>> > to the bottom of this I have written a utility that I can use to analyze
>> > a yaffs image we have "dd'ed" from the NAND.
>> >
>> > At least one occurrence had a missing chunk (chunkid going from 51 to 53,
>> > skipping 52) in the middle of a large file (several MB). The missing
>> > chunk was in the middle of a sequence of chunks in the middle of a block
>> > (as seen on the physical NAND). The missing chunk could not be found
>> > anywhere on the flash. Not even when looking for OOB data with several
>> > bit errors.
>> > From the linux file system point of view, the file has correct size, but
>> > when read the missing chunk has its data replaced with all zeroes. There
>> > is no error or warning during read of this file.
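
The kind of check described above (spotting that chunkid jumps from 51 to 53)
can be sketched in C roughly as follows. This is not the utility mentioned in
the mail; the function name and interface are invented for illustration. It
assumes the chunk ids for one object have already been collected from the OOB
tags of the image, that id 0 is the object header and data chunks are numbered
from 1, and that the file is not intentionally sparse (otherwise a gap may be a
legitimate hole).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for unsigned chunk ids */
static int cmp_uint(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a;
    unsigned y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/*
 * Report data chunk ids that never appear for an object.
 *
 * `ids` holds every chunk id found for the object, in any order; it is
 * sorted in place. Duplicates are tolerated, since stale copies of a
 * chunk may still be present on flash. Ids from 1 up to the highest id
 * seen are expected; up to `max_missing` absent ids are stored in
 * `missing`. Returns the total number of absent ids.
 */
size_t find_missing_chunks(unsigned *ids, size_t n,
                           unsigned *missing, size_t max_missing)
{
    size_t count = 0;
    unsigned expected = 1;

    assert(ids != NULL || n == 0);
    if (n > 0)
        qsort(ids, n, sizeof ids[0], cmp_uint);

    for (size_t i = 0; i < n; i++) {
        if (ids[i] == 0)
            continue;               /* object header, not a data chunk */
        while (expected < ids[i]) { /* every id skipped here is absent */
            if (count < max_missing)
                missing[count] = expected;
            count++;
            expected++;
        }
        if (ids[i] == expected)
            expected++;             /* otherwise it is a duplicate */
    }
    return count;
}
```

For the case above, feeding in ids 0..54 without 52 would report exactly one
missing id, 52.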
>> >
>> > After some code inspection I can explain what happens during read:
>> > in "yaffs_rd_data_obj(...)" there is a comment saying "get sane (zero)
>> > data if you read a hole" followed by a
>> > memset(buffer, 0, in->my_dev->data_bytes_per_chunk);
>> > "yaffs_rd_data_obj(...)" returns 0 when this error occurs, but the
>> > return value from this function is never used or checked.
>> >
>> > I would have thought that yaffs should have notified the user of this
>> > error in such a way that the user read() resulted in EIO.
>> > Why is it not so?
>>
>> yaffs does not fill the holes in files with zeros.
>>
>> e.g. consider the following sequence:
>> write 1MB of data
>> seek to 2MB
>> write 1MB of data
>>
>> You will have a 3MB file of which there is only 2MB of actual data and
>> there is a 1MB hole in the middle.
>>
>> Since yaffs does not record holes, it cannot tell the difference between a
>> missing page due to some error or a valid hole. It therefore does not
>> report EIO.
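
The write/seek/write sequence above can be reproduced with plain POSIX calls.
The sketch below is generic C, not yaffs code, and the helper names are made
up. It creates the 3MB file with a 1MB hole and checks that reading the hole
returns zeros without any error, which is the behaviour a reader sees for a
valid hole.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MB (1024 * 1024)

/* Write `len` bytes of value `fill` at the current offset; 0 on success. */
static int write_fill(int fd, unsigned char fill, size_t len)
{
    unsigned char buf[4096];

    memset(buf, fill, sizeof buf);
    while (len > 0) {
        size_t n = len < sizeof buf ? len : sizeof buf;
        if (write(fd, buf, n) != (ssize_t)n)
            return -1;
        len -= n;
    }
    return 0;
}

/* Write 1MB, seek to 2MB, write 1MB: a 3MB file with a 1MB hole. */
int make_file_with_hole(int fd)
{
    assert(fd >= 0);
    if (write_fill(fd, 0xAA, MB) != 0)
        return -1;
    if (lseek(fd, 2 * MB, SEEK_SET) != 2 * MB)
        return -1;
    return write_fill(fd, 0xBB, MB);
}

/* Returns 1 if `len` bytes at `offset` are all zero, 0 if not, -1 on error. */
int range_is_zero(int fd, off_t offset, size_t len)
{
    unsigned char buf[4096];

    while (len > 0) {
        size_t n = len < sizeof buf ? len : sizeof buf;
        if (pread(fd, buf, n, offset) != (ssize_t)n)
            return -1;
        for (size_t i = 0; i < n; i++)
            if (buf[i] != 0)
                return 0;
        offset += (off_t)n;
        len -= n;
    }
    return 1;
}
```

From the reader's side a chunk lost to an error in the middle of a file looks
exactly like the hole here: the size is right and the data is zero.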
>>
>> >
>> > I still do not understand what happens during write, and as I stated
>> > this happens very seldom. It can be due to a hard to trigger bug in HW,
>> > the driver, MTD or Yaffs; or any combination of these alternatives.
>> > Has anyone else experienced anything similar?
>> > Do you have any suggestions on how to debug this further or how to work
>> > around this?
>>
>> This sounds pretty strange, and if I were to guess I would say it is most
>> likely due to some problem in the driver.
>>
>> > As it happens, it seems most/all of these (few) occurrences happen during
>> > upgrade of our application (which is fair due to our use case), so we
>> > can to some degree work around it by verifying the checksum of each file
>> > after installation and rewrite if necessary. But I would really like a
>> > cleaner solution.
>>
>> Verification of files during an upgrade is always a good idea.
>> I would also recommend that you drop the cache before you verify so that
>> you verify against the flash and not what is in the VFS cache.
>>
>> i.e.
>> write upgrade files to yaffs
>> sync
>> echo 3 > /proc/sys/vm/drop_caches
>> verify files
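
The same sequence could be driven from the upgrade program itself. The
following C sketch is only an illustration under some assumptions: the function
names are invented, 64-bit FNV-1a is used as a stand-in checksum because the C
standard library has no SHA1, and writing to /proc/sys/vm/drop_caches needs
root, so drop_caches() simply reports failure when it cannot do so.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define FNV_OFFSET 0xcbf29ce484222325ULL
#define FNV_PRIME  0x100000001b3ULL

/* Fold `len` bytes into a running 64-bit FNV-1a hash. */
uint64_t fnv1a_update(uint64_t h, const unsigned char *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= FNV_PRIME;
    }
    return h;
}

/* Checksum a whole file. Returns 0 and sets *out, or -1 on error. */
int file_checksum(const char *path, uint64_t *out)
{
    unsigned char buf[4096];
    uint64_t h = FNV_OFFSET;
    size_t n;
    FILE *f;

    assert(path != NULL && out != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        h = fnv1a_update(h, buf, n);
    if (ferror(f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    *out = h;
    return 0;
}

/* Equivalent of "echo 3 > /proc/sys/vm/drop_caches"; needs root. */
int drop_caches(void)
{
    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fputs("3\n", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * sync, drop caches, then re-read the file and compare checksums.
 * Returns 0 if it matches `expected`, 1 if it differs, -1 on error
 * (including not being allowed to drop the caches).
 */
int verify_after_upgrade(const char *path, uint64_t expected)
{
    uint64_t actual;

    sync();
    if (drop_caches() != 0)
        return -1;
    if (file_checksum(path, &actual) != 0)
        return -1;
    return actual == expected ? 0 : 1;
}
```

A return value of 1 would be the cue to rewrite the file and verify again.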
>>
>> -- Charles
>>
>>
>
>
>
> --
> ----------------------------------------------
> William Juul
> Gullhaugveien 53
> N-1354 Bærums Verk, Norway
>
> Tel: +47 67 56 16 67 Mob: +47 95 79 32 53
>
> william@juul.no
> www.juul.no
> ----------------------------------------------

--
Regards, Shizheng