In Windows programming, for maximum efficiency, people use the native file API like ReadFile, which has no fixed buffer size. The question is: what is the best buffer size for reading a file sequentially from beginning to end? This calls for a controlled experiment. I took a 1.13 MB JPG file and read it in several times with various buffer sizes. Reading a 1MB file nowadays is done in no time, so to obtain meaningful results I read the file many times repeatedly. The file was opened like this:
CreateFile("pic.jpg", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
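Here is a minimal sketch of the experiment in portable C, using stdio (fopen/fread) as a stand-in for the Win32 CreateFile/ReadFile calls so it compiles anywhere; the file name "pic.jpg", the buffer sizes, and the 1000-repetition count mirror the setup described in this post but are placeholders you would adjust:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Read the whole file sequentially with the given buffer size;
   returns total bytes read, or -1 on error. */
static long read_file(const char *path, size_t bufsize)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    char *buf = malloc(bufsize);
    if (!buf) { fclose(f); return -1; }
    long total = 0;
    size_t n;
    while ((n = fread(buf, 1, bufsize, f)) > 0)
        total += (long)n;
    free(buf);
    fclose(f);
    return total;
}

/* Time `reps` complete sequential reads of the file;
   returns elapsed CPU time in milliseconds. */
static double time_reads(const char *path, size_t bufsize, int reps)
{
    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        read_file(path, bufsize);
    return 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

A driver would then loop over candidate buffer sizes (e.g. 16, 512, 4096, 32768 bytes) and print `time_reads("pic.jpg", size, 1000)` for each, reproducing the shape of Table 1.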
Table 1. Reading speed (ms) of a 1MB file depending on buffer size used
The timings in table 1 take into account the disk formatting: the first half of the table is for NTFS (4096-byte cluster size) and the second half is for a FAT USB stick (8192-byte cluster size). Obviously the hardware is different, so there's not much point comparing NTFS with FAT here. It is clear that reading the picture with a small buffer, say 16 bytes at a time, is very slow, despite Windows' internal buffering. Increasing the buffer size from 16 to 2048 bytes results in a 100-fold increase in read speed!
As the buffer reaches the disk cluster size (4096 bytes and above), the reading speed becomes nearly instantaneous. Reading the file 10 times takes next to no time, so to get better results I read the file 1000 times (see the 3rd and 5th columns in table 1). We can see that there are some differences up to the 32KB buffer size, but from then onwards the speed remains the same (within the accuracy provided by GetTickCount).
What do we infer from these timing results? If you are reading a file sequentially, then the bigger the buffer size the merrier. Do not use the FILE_FLAG_NO_BUFFERING flag, which disables the internal Windows buffering. As you can see from the last column, the performance is dreadful: 100 times slower than buffered I/O (reading the file 10 times unbuffered takes as long as 1000 buffered reads).
Here's another twist: say you are scanning a file to find some text (keyword) in it. The keyword may happen to be at the beginning of the file or at the very end (or it may be absent!). What's the best reading strategy in this situation, where we don't want to read the whole file? The maximum sensible chunk size in this case is 32KB. The reading speed is identical to reading the whole file in one go, but if we happen to hit the keyword in the first chunk, we saved ourselves tons of time reading the remaining bytes in the file. In fact, as disk access is orders of magnitude slower than memory access, I recommend reading only 8KB at a time; whatever we lose in disk access speed we gain by the chance of finding the keyword earlier in the file. This opportunistic strategy is used in xplorer².
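The chunked, early-exit scan described above could be sketched like this in portable C (again using stdio rather than ReadFile for portability; the 8KB chunk size follows the recommendation above, and the function name `find_in_file` is my own). One subtlety worth noting: the keyword might straddle a chunk boundary, so each iteration carries over the last few bytes of the previous chunk:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Scan a file for `keyword`, reading 8KB at a time and stopping
   as soon as a match is found. The last strlen(keyword)-1 bytes
   of each chunk are kept so a match that straddles a chunk
   boundary is still detected.
   Returns 1 if found, 0 if not found, -1 on error. */
static int find_in_file(const char *path, const char *keyword)
{
    const size_t CHUNK = 8192;          /* 8KB, as recommended above */
    size_t klen = strlen(keyword);
    if (klen == 0 || klen > CHUNK) return -1;

    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    char *buf = malloc(CHUNK + klen);   /* carry-over tail + fresh chunk */
    if (!buf) { fclose(f); return -1; }

    size_t carry = 0;                   /* bytes kept from previous chunk */
    int found = 0;
    size_t n;
    while (!found && (n = fread(buf + carry, 1, CHUNK, f)) > 0) {
        size_t have = carry + n;
        /* plain byte-wise search; memcmp copes with binary data */
        for (size_t i = 0; i + klen <= have; i++) {
            if (buf[i] == keyword[0] && memcmp(buf + i, keyword, klen) == 0) {
                found = 1;              /* early exit: skip the rest of the file */
                break;
            }
        }
        if (!found) {
            carry = (klen - 1 < have) ? klen - 1 : have;
            memmove(buf, buf + have - carry, carry);
        }
    }
    free(buf);
    fclose(f);
    return found;
}
```

If the keyword sits in the first 8KB, only one disk read happens and the rest of the file is never touched, which is exactly the opportunistic payoff described above.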
And now you know why some file managers are better than others — and that's scientifically proven <g>