Hi,
I'm running into a kernel oops during the YAFFS2 scan. It has happened
only once, on one of our units, but I really want to get to the
bottom of it. I've been looking into this for two days now, and it
seems I've hit a dead end...
Basically, during the YAFFS2 backwards scan, a deleted file was seen
first. Everything went fine except that one (yes, only one) of its
chunks carried the wrong objectId (0x220 instead of 0x2e0). Because
this was the first time object 0x220 was seen in the scan, a file
object was created for it. The scan then continued without any issue
until a directory object whose real objectId is 0x220 was found. A new
directory object wasn't created in this case because a file object
with the same id already existed. That object was never initialized as
a directory, so it eventually crashed in yaffs_AddObjectToDirectory()
when other files were added to that directory.
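
To make the failure mode concrete, here is a minimal model of a
find-or-create lookup keyed only by object id. This is a sketch, not
the yaffs code: the model_* names, the fixed-size table and the
two-value type enum are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the yaffs object types (hypothetical names). */
enum model_type { MODEL_FILE = 1, MODEL_DIRECTORY };

struct model_obj {
    unsigned id;
    enum model_type type;
    int in_use;
};

#define MODEL_MAX_OBJS 16
static struct model_obj model_table[MODEL_MAX_OBJS];

/* Look an object up by id only; the type plays no part in the search. */
static struct model_obj *model_find(unsigned id)
{
    size_t i;

    for (i = 0; i < MODEL_MAX_OBJS; i++)
        if (model_table[i].in_use && model_table[i].id == id)
            return &model_table[i];
    return NULL;
}

/* Return the existing object for this id, or create one of the given type. */
static struct model_obj *model_find_or_create(unsigned id, enum model_type type)
{
    struct model_obj *o = model_find(id);
    size_t i;

    assert(type == MODEL_FILE || type == MODEL_DIRECTORY);
    if (o)
        return o; /* the requested type is never compared with o->type */

    for (i = 0; i < MODEL_MAX_OBJS; i++) {
        if (!model_table[i].in_use) {
            model_table[i].in_use = 1;
            model_table[i].id = id;
            model_table[i].type = type;
            return &model_table[i];
        }
    }
    return NULL; /* table full */
}
```

In this model, calling model_find_or_create(0x220, MODEL_FILE) for the
stray data chunk and then model_find_or_create(0x220, MODEL_DIRECTORY)
for the real directory header returns the same object both times, and
it is still typed as a file.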
My first suspicion was that the flash was playing tricks. I dumped
the tag area containing the wrong objectId and manually calculated
the CRC of the tag; it turned out that the CRC matches the tag.
Although it's possible that the tag and the tag CRC were both
corrupted, it is highly unlikely that they would still match each other.
Before I move on, here are the tags I dumped from the flash during the
scan, which made me believe that the first objectId 0x220 encountered
was wrong:
obj 0x2e0: seqNum 4718, chunkId 0, byteCount 0, address 0x08af7000
obj 0x2e0: seqNum 4705, chunkId 0, byteCount 0, address 0x08791000
obj 0x2e0: seqNum 4705, chunkId 2256, byteCount 1372, address 0x08790000
obj 0x2e0: seqNum 4705, chunkId 2255, byteCount 4096, address 0x0878f000
obj 0x2e0: seqNum 4705, chunkId 2254, byteCount 4096, address 0x0878e000
obj 0x2e0: seqNum 4705, chunkId 2253, byteCount 4096, address 0x0878d000
obj 0x2e0: seqNum 4705, chunkId 2252, byteCount 4096, address 0x0878c000
...
obj 0x2e0: seqNum 4699, chunkId 1869, byteCount 4096, address 0x0860d000
obj 0x2e0: seqNum 4699, chunkId 1868, byteCount 4096, address 0x0860c000
obj 0x2e0: seqNum 4699, chunkId 1867, byteCount 4096, address 0x0860b000
obj 0x2e0: seqNum 4699, chunkId 1866, byteCount 4096, address 0x0860a000
obj 0x2e0: seqNum 4699, chunkId 1865, byteCount 4096, address 0x08609000
obj 0x2e0: seqNum 4699, chunkId 1864, byteCount 4096, address 0x08608000
obj 0x220: seqNum 4699, chunkId 1863, byteCount 4096, address 0x08607000 <----------
obj 0x2e0: seqNum 4699, chunkId 1862, byteCount 4096, address 0x08606000
obj 0x2e0: seqNum 4699, chunkId 1861, byteCount 4096, address 0x08605000
obj 0x2e0: seqNum 4699, chunkId 1860, byteCount 4096, address 0x08604000
obj 0x2e0: seqNum 4699, chunkId 1859, byteCount 4096, address 0x08603000
obj 0x2e0: seqNum 4699, chunkId 1858, byteCount 4096, address 0x08602000
obj 0x2e0: seqNum 4699, chunkId 1857, byteCount 4096, address 0x08601000
obj 0x2e0: seqNum 4699, chunkId 1856, byteCount 4096, address 0x08600000
obj 0x2e0: seqNum 4698, chunkId 1855, byteCount 4096, address 0x085ff000
obj 0x2e0: seqNum 4698, chunkId 1854, byteCount 4096, address 0x085fe000
obj 0x2e0: seqNum 4698, chunkId 1853, byteCount 4096, address 0x085fd000
...
obj 0x2e0: seqNum 4670, chunkId 3, byteCount 4096, address 0x07ec3000
obj 0x2e0: seqNum 4670, chunkId 2, byteCount 4096, address 0x07ec2000
obj 0x2e0: seqNum 4670, chunkId 1, byteCount 4096, address 0x07ec1000
obj 0x2e0: seqNum 4670, chunkId 0, byteCount 0, address 0x07ec0000
So I'm trying to understand what would cause this to happen. In that
particular tag, everything else is correct: the sequenceNumber is the
same as in the chunks surrounding it, which indicates that no major
action such as GC happened in between, and the chunkId increases
properly, which tells me the VFS was doing the right thing, such as
passing the right pos argument. From reading the code, the only
explanation I can see is that in yaffs_file_write(), f->f_dentry was
pointing to something else just for that particular write call. But I
couldn't figure out why.
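
As a cross-check of the chunkId reasoning, here is a small sketch of
the position-to-chunk mapping. It assumes, as the dump suggests, that
chunkId 0 is the object header, that data chunks are numbered from 1,
and that each chunk holds 4096 bytes. The function names are made up
for illustration and are not yaffs functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: chunk 0 is the object header, data chunks start at 1. */
static uint32_t pos_to_chunk_id(uint64_t pos, uint32_t chunk_size)
{
    assert(chunk_size > 0);
    return (uint32_t)(pos / chunk_size) + 1;
}

/* First file offset covered by a data chunk (chunk_id must be >= 1). */
static uint64_t chunk_id_to_pos(uint32_t chunk_id, uint32_t chunk_size)
{
    assert(chunk_id >= 1);
    return (uint64_t)(chunk_id - 1) * chunk_size;
}
```

Under these assumptions, the bad chunk (chunkId 1863) covers file
offsets starting at 1862 * 4096 = 7626752, right between its
neighbours 1862 and 1864, so the pos passed down for that write looks
consistent with the rest of the file.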
I don't know what operations the test team performed to cause this,
but I believe it was just some directory renames, file copies,
removals, gzipping and so on. My yaffs2 version is about a year
behind, but it appears to me that this is not an isolated issue,
although it is rare. One recent post I could relate to is
http://lists.aleph1.co.uk/lurker/message/20100219.181429.cc5e0f4a.en.html
Another one is quite old and happened with yaffs1, but is very similar:
http://www.aleph1.co.uk/lurker/message/20071009.205512.c73e2ef4.pt.html
Also, I think we shouldn't really crash on this type of error. In
yaffs_FindOrCreateObjectByNumber(), it would make sense to check
whether the type of the object returned by yaffs_FindObjectByNumber()
matches the type we pass in. We might end up losing some files in an
extreme case like this, but at least it won't crash.
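
Roughly, the check I have in mind looks like the sketch below. It is
not a patch against the yaffs source: the enum, struct and function
names are hypothetical, and what the scanner should do on a mismatch
(skip the chunk, discard the bogus object, and so on) is left to the
caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the real yaffs definitions. */
enum sketch_type { SK_FILE = 1, SK_DIRECTORY };

struct sketch_obj {
    unsigned id;
    enum sketch_type type;
};

enum lookup_result {
    LOOKUP_CREATE,        /* id not seen yet: create a new object */
    LOOKUP_REUSE,         /* id seen and type matches: reuse it */
    LOOKUP_TYPE_MISMATCH  /* id seen with another type: do not reuse */
};

/* Decide what to do with the result of a by-number lookup during scan. */
static enum lookup_result classify_lookup(const struct sketch_obj *found,
                                          enum sketch_type wanted)
{
    assert(wanted == SK_FILE || wanted == SK_DIRECTORY);
    if (found == NULL)
        return LOOKUP_CREATE;
    if (found->type != wanted)
        return LOOKUP_TYPE_MISMATCH;
    return LOOKUP_REUSE;
}
```

With a guard like this, the directory header for 0x220 would get
LOOKUP_TYPE_MISMATCH instead of silently reusing the file object, so
the scan could log the inconsistency and carry on.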
Any feedback would be appreciated.
--
Rong