I don't know much about debugging yaffs specifically, but this seems like a good clue:
root@errored_system:/root# echo "-all+gc">/proc/yaffs
root@errored_system:/root# dmesg | tail -f
[ 3419.220000] yaffs_block_became_dirty block 3048 state 8
[ 3419.220000] GC Selected block 3437 with 1 free, prioritised:0
[ 3419.220000] yaffs: GC erasedBlocks 5 aggressive 1
[ 3419.260000] yaffs_block_became_dirty block 3437 state 8
[ 3419.270000] GC Selected block 2201 with 1 free, prioritised:0
[ 3419.270000] yaffs: GC erasedBlocks 5 aggressive 1
[ 3419.310000] yaffs_block_became_dirty block 2201 state 8
[ 3419.310000] GC Selected block 2231 with 1 free, prioritised:0
[ 3419.310000] yaffs: GC erasedBlocks 5 aggressive 1
[ 3419.370000] yaffs_block_became_dirty block 2231 state 8
These "yaffs_block_became_dirty" messages rapidly fill up the kernel log. (On the good systems I do not get these messages, at least not in rapid succession.)
Here's the interesting thing: the messages all cycle through the same few blocks (2201, 2357, 2819, 3048 and 3437; the other errored systems show the same pattern, just with different block numbers). So my hunch is that something is preventing these blocks from being GC'd, and they are somehow backlogging the other blocks that need freeing. Is that even possible? I am able to delete legitimate files and regain their space, which doesn't address the mystery disk usage, but it does tell me that the GC is working at least in part.
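For what it's worth, here's the pipeline I used to confirm the GC keeps selecting the same handful of blocks (just text-munging the trace output above, nothing yaffs-specific):

```shell
# Pull the block number out of each "GC Selected block N" line
# and count how many times each block gets picked.
dmesg \
  | sed -n 's/.*GC Selected block \([0-9]*\).*/\1/p' \
  | sort | uniq -c | sort -rn | head
```

On the errored systems the top few counts dwarf everything else; on the good systems the counts are roughly even.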
Incidentally, I cross-referenced the blocks in the yaffs_block_became_dirty messages against the "bad eraseblocks" and "blocks marked bad" reported at startup, and they were different (i.e. the blocks in the above GC messages were not known bad blocks).
Is there any way to forcibly verify the integrity of the data blocks in yaffs and free up this mystery space? I have remote access to these systems, but re-flashing isn't really an option since they are geographically distributed (they're data collectors). Can I somehow force those block numbers to be marked as bad? (I'd imagine that wouldn't be wise, even if I could.)
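One thing I did try, to rule out where the space is going (a rough sketch; the exact ls output format varies between busybox and coreutils): checking for files that were unlinked but are still held open by some process, since those keep consuming flash even though du can't see them:

```shell
# Space held by deleted-but-still-open files is counted by df
# but invisible to du, so a large gap between the two is a hint.
du -sk / 2>/dev/null
df -k /

# List any open file descriptors that point at unlinked files.
ls -l /proc/[0-9]*/fd 2>/dev/null | grep '(deleted)'
```

I didn't find anything obvious this way, but I mention it in case someone spots a flaw in the approach.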
Some more (hopefully) useful information:
I am not writing much to flash: about once an hour I update a couple of values in a SQLite database. I don't know if the update operation is the culprit (since it behaves on the other systems), but it's the only source of disk I/O I can think of.
There are not many reported bad blocks (5-20, depending on which system I'm looking at).
The board manufacturer apparently changed the NAND chip somewhere along the line. I am experiencing the issue on both chip types, so I'm fairly certain it's not exclusively hardware related.
One system:
[ 2.520000] NAND device: Manufacturer ID: 0xad, Chip ID: 0x76 (Hynix NAND 64MiB 3,3V 8-bit)
Another system:
[ 2.510000] NAND device: Manufacturer ID: 0xec, Chip ID: 0x76 (Samsung NAND 64MiB 3,3V 8-bit)
Here is the /proc/yaffs from a good system (one that isn't full):
root@good_system:/root# cat /proc/yaffs
Multi-version YAFFS built:May 25 2012 01:48:02
Device 0 "rootfs"
start_block.......... 0
end_block............ 3711
total_bytes_per_chunk 512
use_nand_ecc......... 1
no_tags_ecc.......... 0
is_yaffs2............ 0
inband_tags.......... 0
empty_lost_n_found... 0
disable_lazy_load.... 0
refresh_period....... 500
n_caches............. 10
n_reserved_blocks.... 5
always_check_erased.. 0
data_bytes_per_chunk. 512
chunk_grp_bits....... 0
chunk_grp_size....... 1
n_erased_blocks...... 294
blocks_in_checkpt.... 0
n_tnodes............. 7901
n_obj................ 5653
n_free_chunks........ 55483
n_page_writes........ 597395
n_page_reads......... 906712
n_erasures........... 11524
n_gc_copies.......... 22402
all_gcs.............. 32575
passive_gc_count..... 32575
oldest_dirty_gc_count 0
n_gc_blocks.......... 11524
bg_gcs............... 0
n_retired_writes..... 0
nRetireBlocks........ 0
n_ecc_fixed.......... 0
n_ecc_unfixed........ 0
n_tags_ecc_fixed..... 0
n_tags_ecc_unfixed... 0
cache_hits........... 100140
n_deleted_files...... 37
n_unlinked_files..... 22541
refresh_count........ 0
n_bg_deletions....... 0
And here's one from a "disk full" system:
root@errored_system:/root# cat /proc/yaffs
Multi-version YAFFS built:May 25 2012 01:48:02
Device 0 "rootfs"
start_block.......... 0
end_block............ 3711
total_bytes_per_chunk 512
use_nand_ecc......... 1
no_tags_ecc.......... 0
is_yaffs2............ 0
inband_tags.......... 0
empty_lost_n_found... 0
disable_lazy_load.... 0
refresh_period....... 500
n_caches............. 10
n_reserved_blocks.... 5
always_check_erased.. 0
data_bytes_per_chunk. 512
chunk_grp_bits....... 0
chunk_grp_size....... 1
n_erased_blocks...... 5
blocks_in_checkpt.... 0
n_tnodes............. 23388
n_obj................ 19836
n_free_chunks........ 206
n_page_writes........ 2912051
n_page_reads......... 2907834
n_erasures........... 88768
n_gc_copies.......... 2751776
all_gcs.............. 88768
passive_gc_count..... 0
oldest_dirty_gc_count 0
n_gc_blocks.......... 88768
bg_gcs............... 0
n_retired_writes..... 0
nRetireBlocks........ 0
n_ecc_fixed.......... 1
n_ecc_unfixed........ 0
n_tags_ecc_fixed..... 0
n_tags_ecc_unfixed... 0
cache_hits........... 1851
n_deleted_files...... 2
n_unlinked_files..... 35064
refresh_count........ 0
n_bg_deletions....... 0
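Doing the arithmetic on those two dumps (assuming the 512-byte chunk size reported above), the GC copy amplification on the errored system looks pathological:

```shell
# Chunks copied per block reclaimed (n_gc_copies / n_gc_blocks),
# and remaining free space (n_free_chunks * 512 bytes), using the
# numbers from the two /proc/yaffs dumps above.
awk 'BEGIN {
  printf "good:    %.1f copies per GC block\n", 22402 / 11524
  printf "errored: %.1f copies per GC block\n", 2751776 / 88768
  printf "errored free space: %d KiB\n", 206 * 512 / 1024
}'
```

That works out to roughly 1.9 copies per reclaimed block on the good system versus about 31 on the errored one, with only ~103 KiB of free chunks left, which matches my impression that the GC is spinning its wheels copying nearly full blocks.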
Thanks,
Walter