Re: [Yaffs] File corruption - read and write problems

Author: Charles Manning
Date:
To: yaffs
Subject: Re: [Yaffs] File corruption - read and write problems

On Wednesday 23 November 2011 11:36:00 William Juul wrote:
> Hello, we have been using yaffs on a PPC running linux for several years.
> We have multiple boards and a complete install base of tens of thousands.
> And in our QA lab we have several hundred devices that are upgraded several
> times an hour 24/7.
>
> We are currently on kernel 3.0.4/3.0.7 and yaffs as of august 15th.
>
> So far so good, and thanks by the way :-)
>
> Now for the problem.
>
> We do from time to time (read: seldom) experience file corruption. To try
> to find the cause, I have written a utility that I can use to analyze a
> yaffs image we have "dd'ed" from the NAND.
>
> At least one occurrence had a missing chunk (chunkid going from 51 to 53,
> skipping 52) in the middle of a large file (several MB). The missing chunk
> was in the middle of a sequence of chunks in the middle of a block (as seen
> on the physical NAND). The missing chunk could not be found anywhere on the
> flash. Not even when looking for OOB data with several bit errors.
> From the linux file system point of view, the file has correct size, but
> when read the missing chunk has its data replaced with all zeroes. There is
> no error or warning during read of this file.
>
> After some code inspection I can explain what happens during read:
> in "yaffs_rd_data_obj(...)" there is a comment saying "get sane (zero)
> data if you read a hole" followed by a memset(buffer,
> 0, in->my_dev->data_bytes_per_chunk);
> "yaffs_rd_data_obj(...)" returns 0 when this error occurs, but the return
> value from this function is never used or checked.
>
> I would have thought that yaffs should have notified the user of this error
> in such a way that the user read() resulted in EIO.
> Why is it not so?

yaffs does not fill holes in files by writing zeros.

E.g. consider the following sequence:
write 1MB of data
seek to 2MB
write 1MB of data

You will have a 3MB file of which only 2MB is actual data, with a 1MB hole
in the middle.

Since yaffs does not record holes, it cannot tell the difference between a
missing page due to some error or a valid hole. It therefore does not report
EIO.
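
The sequence above can be reproduced from user space. The following is only
a sketch of the access pattern, not yaffs code: the helper names and the fill
bytes are made up for illustration, and it behaves the same on any POSIX file
system. It shows that the hole reads back as zeros with no error, which is
exactly what a reader sees for a lost chunk as well.

```python
import os
import tempfile

MB = 1024 * 1024


def make_file_with_hole(path):
    """Write 1MB, seek to 2MB, write 1MB; return the resulting file size."""
    with open(path, "wb") as f:
        f.write(b"\xaa" * MB)   # first 1MB of real data
        f.seek(2 * MB)          # skip 1MB without writing anything
        f.write(b"\xbb" * MB)   # second 1MB of real data
    return os.path.getsize(path)


def read_hole(path):
    """Read back the 1MB region that was never written."""
    with open(path, "rb") as f:
        f.seek(MB)
        return f.read(MB)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "hole.bin")
        size = make_file_with_hole(path)
        hole = read_hole(path)
        print("file size in MB:", size // MB)
        print("hole is all zeros:", hole == bytes(MB))
```

Since the read of the hole succeeds, an application that needs to detect a
lost chunk has to compare the data against something it knows independently,
such as a checksum.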

>
> I still do not understand what happens during write, and as I stated this
> happens very seldom. It can be due to a hard to trigger bug in HW, the
> driver, MTD or Yaffs; or any combination of these alternatives.
> Has anyone else experienced anything similar?
> Do you have any suggestions on how to debug this further or how to work
> around this?

This sounds pretty strange and if I were to guess I would say it is most
likely due to some problem in the driver.

> As it happens, it seems most/all of these (few) occurrences happen during
> upgrade of our application (which is fair due to our use case), so we can
> to some degree work around it by verifying the checksum of each file after
> installation and rewrite if necessary. But I would really like a cleaner
> solution.

Verification of files during an upgrade is always a good idea.
I would also recommend that you drop the cache before you verify so that you
verify against the flash and not what is in the VFS cache.

i.e.
write upgrade files to yaffs
sync
echo 3 > /proc/sys/vm/drop_caches
verify files
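
A rough sketch of the sync / drop caches / verify steps follows. It makes
several assumptions: SHA-256 is used as the checksum (any checksum would do),
the expected digests come from a manifest that you ship with the upgrade, and
the function names are invented for this example. Writing to
/proc/sys/vm/drop_caches needs root, so drop_caches() reports whether it
succeeded instead of raising.

```python
import hashlib
import os


def sha256_of(path, bufsize=64 * 1024):
    """Return the hex SHA-256 digest of a file, read in blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def drop_caches(control="/proc/sys/vm/drop_caches"):
    """Flush dirty data, then ask the kernel to drop clean caches.

    Returns True if the control file could be written, False otherwise
    (for example when not running as root).
    """
    os.sync()
    try:
        with open(control, "w") as f:
            f.write("3\n")
    except OSError:
        return False
    return True


def verify(expected):
    """Compare files against expected digests.

    expected maps path -> hex digest (a hypothetical upgrade manifest).
    Returns the list of paths that are unreadable or do not match.
    """
    bad = []
    for path, digest in expected.items():
        try:
            if sha256_of(path) != digest:
                bad.append(path)
        except OSError:
            bad.append(path)
    return bad
```

A caller would write the upgrade files, call drop_caches(), then call
verify() and rewrite whatever paths it returns. If drop_caches() returns
False, the verification may be satisfied from the VFS cache rather than from
the flash.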

-- Charles
