Author: Charles Manning
Date:
To: yaffs
Subject: Re: [Yaffs] File corruption - read and write problems
On Wednesday 23 November 2011 11:36:00 William Juul wrote:
> Hello, we have been using yaffs on a PPC running linux for several years.
> We have multiple boards and a complete install base of tens of thousands.
> And in our QA lab we have several hundred devices that are upgraded several
> times an hour 24/7.
>
> We are currently on kernel 3.0.4/3.0.7 and yaffs as of august 15th.
>
> So far so good, and thanks by the way :-)
>
> Now for the problem.
>
> We do from time to time (read: seldom) experience file corruption. To try
> and find out more about this, I have written a utility that I can use to
> analyze a yaffs image we have "dd'ed" from the NAND.
>
> At least one occurrence had a missing chunk (chunkid going from 51 to 53,
> skipping 52) in the middle of a large file (several MB). The missing chunk
> was in the middle of a sequence of chunks in the middle of a block (as seen
> on the physical NAND). The missing chunk could not be found anywhere on the
> flash. Not even when looking for OOB data with several bit errors.
> From the linux file system point of view, the file has correct size, but
> when read the missing chunk has its data replaced with all zeroes. There is
> no error or warning during read of this file.
>
> After some code inspection I can explain what happens during read:
> in "yaffs_rd_data_obj(...)" there is a comment saying "get sane (zero)
> data if you read a hole" followed by a
> memset(buffer, 0, in->my_dev->data_bytes_per_chunk);
> "yaffs_rd_data_obj(...)" returns 0 when this error occurs, but the return
> value from this function is never used or checked.
>
> I would have thought that yaffs should have notified the user of this error
> in such a way that the user read() resulted in EIO.
> Why is it not so?
yaffs does not write zeros to flash to fill the holes in files.
For example, consider the following sequence:
write 1MB of data
seek to 2MB
write 1MB of data
You will have a 3MB file of which only 2MB is actual data, and there
is a 1MB hole in the middle.
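
The sequence above can be reproduced from user space with plain POSIX calls.
This is only a minimal sketch and is not yaffs-specific: the helper names
make_file_with_hole() and range_reads_as_zero() and the 0xAA fill byte are
made up for illustration.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MB (1024 * 1024)

/* Write exactly len bytes, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: create a 3MB file made of 1MB of data, a 1MB hole
 * and another 1MB of data. Returns 0 on success, -1 on failure.
 */
static int make_file_with_hole(const char *path)
{
    char *data = malloc(MB);
    int fd, rc = -1;

    if (!data)
        return -1;
    memset(data, 0xAA, MB);

    fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd >= 0) {
        if (write_all(fd, data, MB) == 0 &&          /* write 1MB of data */
            lseek(fd, 2 * MB, SEEK_SET) == 2 * MB && /* seek to 2MB */
            write_all(fd, data, MB) == 0)            /* write 1MB of data */
            rc = 0;
        if (close(fd) != 0)
            rc = -1;
    }
    free(data);
    return rc;
}

/*
 * Hypothetical helper: returns 1 if the len bytes starting at offset all
 * read back as zero, 0 if any byte is non-zero, -1 on error or short file.
 */
static int range_reads_as_zero(const char *path, off_t offset, size_t len)
{
    unsigned char buf[4096];
    int fd = open(path, O_RDONLY);
    int result = 1;

    if (fd < 0)
        return -1;
    if (lseek(fd, offset, SEEK_SET) != offset) {
        close(fd);
        return -1;
    }
    while (len > 0 && result == 1) {
        size_t want = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t n = read(fd, buf, want);
        ssize_t i;

        if (n <= 0) {            /* error or unexpected end of file */
            result = -1;
            break;
        }
        for (i = 0; i < n; i++) {
            if (buf[i] != 0) {
                result = 0;
                break;
            }
        }
        len -= (size_t)n;
    }
    close(fd);
    return result;
}
```

Calling make_file_with_hole() on a path and then range_reads_as_zero() on the
middle megabyte should show the hole reading back as zeros, with no error
from read().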
Since yaffs does not record holes, it cannot tell the difference between a
missing page due to some error or a valid hole. It therefore does not report
EIO.
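
To make that concrete, here is a toy model in C. It is not yaffs code and does
not reflect the real data structures; struct toy_file, toy_write_chunk() and
toy_read_chunk() are invented for illustration. It only shows that when
nothing records "this chunk is deliberately absent", a hole and a lost chunk
look the same to the reader.

```c
#include <string.h>

#define CHUNK_SIZE 8   /* tiny, for illustration only */
#define N_CHUNKS   4

/* Toy file: a chunk either has a record or it does not. Nothing more. */
struct toy_file {
    int present[N_CHUNKS];                     /* 1 if a chunk is recorded */
    unsigned char data[N_CHUNKS][CHUNK_SIZE];
};

/* Record a chunk filled with the given byte at index idx. */
static void toy_write_chunk(struct toy_file *f, int idx, unsigned char fill)
{
    memset(f->data[idx], fill, CHUNK_SIZE);
    f->present[idx] = 1;
}

/*
 * Read one chunk into buf. If no chunk is recorded, hand back zeros.
 * Returns 1 if a chunk was found, 0 if the buffer was zero-filled.
 * The reader cannot tell whether "not found" means a hole or a loss.
 */
static int toy_read_chunk(const struct toy_file *f, int idx,
                          unsigned char *buf)
{
    if (f->present[idx]) {
        memcpy(buf, f->data[idx], CHUNK_SIZE);
        return 1;
    }
    memset(buf, 0, CHUNK_SIZE);
    return 0;
}
```

In this model, a file where chunk 1 was never written and a file where
chunk 1 was written and then lost give identical results from
toy_read_chunk(), which is the ambiguity described above.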
>
> I still do not understand what happens during write, and as I stated this
> happens very seldom. It can be due to a hard to trigger bug in HW, the
> driver, MTD or Yaffs; or any combination of these alternatives.
> Has anyone else experienced anything similar?
> Do you have any suggestions on how to debug this further or how to work
> around this?
This sounds pretty strange, and if I were to guess I would say it is most
likely due to some problem in the driver.
> As it happens, it seems most/all of these (few) occurrences happen during
> upgrade of our application (which is fair due to our use case), so we can
> to some degree work around it by verifying the checksum of each file after
> installation and rewrite if necessary. But I would really like a cleaner
> solution.
Verification of files during an upgrade is always a good idea.
I would also recommend that you drop the cache before you verify so that you
verify against the flash and not what is in the VFS cache.
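
A rough sketch of such a verification step in C follows. It assumes a Linux
system where the process runs as root, since it uses the standard
/proc/sys/vm/drop_caches interface. The helper names drop_vfs_caches(),
file_crc32() and verify_file() are made up, and CRC-32 is just one possible
checksum; substitute whatever your installer already uses.

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Flush dirty data, then ask the kernel to drop the clean page cache,
 * dentries and inodes. Needs root. Returns 0 on success, -1 on failure.
 */
static int drop_vfs_caches(void)
{
    FILE *f;
    int ok;

    sync();
    f = fopen("/proc/sys/vm/drop_caches", "w");
    if (!f)
        return -1;
    ok = (fputs("3\n", f) >= 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), one byte at a time. */
static uint32_t crc32_update(uint32_t crc, const unsigned char *buf,
                             size_t len)
{
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        crc ^= buf[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Compute the CRC-32 of a whole file. Returns 0 on success, -1 on error. */
static int file_crc32(const char *path, uint32_t *out)
{
    unsigned char buf[4096];
    uint32_t crc = 0xFFFFFFFFu;
    size_t n;
    int err;
    FILE *f = fopen(path, "rb");

    if (!f)
        return -1;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        crc = crc32_update(crc, buf, n);
    err = ferror(f);
    fclose(f);
    if (err)
        return -1;
    *out = ~crc;
    return 0;
}

/*
 * Drop caches, then re-read the file and compare against the expected
 * checksum. Returns 0 on match, 1 on mismatch, -1 on any error
 * (including not being allowed to drop the caches).
 */
static int verify_file(const char *path, uint32_t expected)
{
    uint32_t actual;

    if (drop_vfs_caches() != 0)
        return -1;
    if (file_crc32(path, &actual) != 0)
        return -1;
    return actual == expected ? 0 : 1;
}
```

Note that dropping caches affects the whole system, so on a busy device it
is probably better to drop them once after the install and then verify all
files, rather than once per file as verify_file() does here.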