xplorer² blog: From ASCII to UNICODE text

[xplorer²] — Plain text files and searches
25 March 2007

"A file that big? It might be very useful; But now it is gone." - Random haiku

How plain are plain text files? In a previous article we saw that all files are stored as numbers which get interpreted as "documents" through an agreed protocol. In plain text files the protocol is very simple: there's a 1-to-1 mapping between characters and bytes, the famous ASCII standard. The letter "A" maps to number 65, so when xplorer² reads the number 65 from an assumed text file it knows that it represents the letter "A". Confusingly the character "0" (zero) is number 48 on disk (and in memory too but that's another blog article).
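The mapping is easy to check for yourself. Here is a minimal sketch in Python (chosen only because it is convenient for poking at bytes; it has nothing to do with how xplorer² is written, and the helper name `ascii_code` is made up for illustration):

```python
# Look up the number that ASCII assigns to a single character.
def ascii_code(ch: str) -> int:
    """Return the byte value stored on disk for one ASCII character."""
    return ch.encode("ascii")[0]

print(ascii_code("A"))  # 65
print(ascii_code("0"))  # 48: the digit zero is not stored as the number 0

# And the other way round: numbers read from a file become characters.
print(bytes([65, 48]).decode("ascii"))  # A0
```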

Since each byte (eight 0-1 bits) can store at most 256 values, a one-byte encoding scheme can handle up to 256 different characters, and that has to cater for capital and lowercase letters, accents, even control characters (e.g. "end of line" is number 10). ASCII proper only defines the first 128 of them. These basic 256 slots can and do run out in no time. When computers started invading non-English speaking countries, the first extension of the ASCII scheme was code pages. The low half of the mapping table was fixed to English letters and common symbols (e.g. brackets), whereas the high half (128 and above) held code-page specific characters. Each code page targeted a specific country or region. The Greek code page is 1253, and there the number 195 corresponds to the letter Γ. In the Central European code page (1250) the same number 195 is the character Ă.
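To see the code page effect in action, the sketch below decodes the same byte under different code pages. It uses Python's built-in codecs `cp1253` and `cp1250`; the `cp1252` line (Western European) is added purely for comparison, and the helper name `decode_byte` is hypothetical.

```python
# One stored number, several possible letters depending on the code page.
def decode_byte(value: int, code_page: str) -> str:
    """Interpret a single byte value under the given code page."""
    return bytes([value]).decode(code_page)

for code_page in ("cp1253", "cp1250", "cp1252"):
    print(code_page, decode_byte(195, code_page))

# The low half of the table is the same in every code page.
print(decode_byte(65, "cp1253"), decode_byte(65, "cp1250"))
```

Running it shows Γ for 1253 and Ă for 1250, while number 65 stays "A" everywhere.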

Code page extensions maintained the "one byte one character" principle but made text deciphering complicated. To know which letter a stored number >127 corresponded to you had to know the code page. Since text files don't store anything on top of the text, it was left up to the user to figure out and request the encoding. Not tubby custard as you can imagine.

As computers conquered the world, even more complicated languages from the Far East had to be accommodated and the one-byte rule had to go. Text-aware applications now have to deal with multi-byte encoding schemes like UTF-8 and UNICODE as well as single-byte text files. xplorer² is internally UNICODE (what Windows calls UNICODE is the UTF-16 encoding), which means that each character is normally represented by two bytes. Two bytes can store 65536 different characters and symbols. Is that enough for all the world's different characters? Not quite: the Unicode standard has room for over a million, and in UTF-16 the rarer ones take two 2-byte units each. Still, it's an improvement over code pages, albeit at the expense of doubling the memory and disk storage requirements for English text. Now "A" is stored as 65 00, which is a bit of a waste of space but makes things easier for programmers!
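If you want to see the bytes for yourself, here is a small Python sketch. It assumes that the 2-byte UNICODE described above corresponds to Python's `utf-16-le` codec (little-endian UTF-16, the Windows convention); `byte_dump` is a made-up helper name. Note that the output is in hexadecimal, so 65 shows up as 41.

```python
# Show how the same text is laid out in bytes under different encodings.
def byte_dump(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated hex pairs."""
    return text.encode(encoding).hex(" ")

print(byte_dump("A", "utf-16-le"))  # 41 00  (65 00 in decimal)
print(byte_dump("A", "utf-8"))      # 41     (plain ASCII stays one byte)
print(byte_dump("Γ", "utf-16-le"))  # 93 03
print(byte_dump("Γ", "utf-8"))      # ce 93  (two bytes in UTF-8)
```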

With all this background you may appreciate what xplorer² has to go through every time you search for text in files (e.g. the Mark > Containing text command, or any hyperfilter with a non-empty "Containing text" box). The text string you type and want found is UNICODE (as it's within xplorer²) and most of the time text files are not, so behind the scenes the xplorer² search algorithms have to do some juggling. They have to figure out how to interpret each text file according to this table:

| Forced encoding? | BOM `FF FE` | BOM `EF BB BF` | No BOM |
| --- | --- | --- | --- |
| yes | UNICODE | UTF-8 | forced encoding |
| no | UNICODE | UTF-8 | plain (ANSI) text |

The first thing checked is the presence of a Byte Order Mark (BOM for short). This is a little tag at the beginning of the file, a couple of numbers. If there is one then it settles the encoding, e.g. if a file starts with the numbers FF FE (255, 254 in decimal notation) it is assumed to be 2-byte UNICODE. These BOMs are not compulsory though, so a file may be UNICODE without this starting sequence. Without a recognizable BOM, xplorer² is at a loss so it assumes a plain text file, unless the user has forced an assumed encoding from the <Ctrl+G> dialog (that's the purpose of the Text file encoding drop list control). In that case xplorer² will take your word for it and try its best!
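The decision logic just described can be sketched in a few lines of Python. This is only an illustration of the rules in this article, not the actual xplorer² code: the function name `guess_encoding`, its parameters, and the `cp1252` default standing in for "the system ANSI code page" are all assumptions.

```python
from typing import Optional

def guess_encoding(head: bytes, forced: Optional[str] = None,
                   ansi: str = "cp1252") -> str:
    """Pick a codec name from the first bytes of a file.

    A recognized BOM always wins; otherwise the user's forced
    encoding is used, and failing that the plain ANSI code page.
    """
    if head.startswith(b"\xff\xfe"):
        return "utf-16"       # 2-byte UNICODE; the codec consumes the BOM
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"    # UTF-8; the codec skips the BOM
    if forced:
        return forced         # take the user's word for it
    return ansi               # assumed system default (hypothetical)

print(guess_encoding(b"\xff\xfeA\x00"))                    # utf-16
print(guess_encoding(b"\xef\xbb\xbfA", forced="cp1253"))   # utf-8-sig
print(guess_encoding(b"plain text", forced="utf-8"))       # utf-8
print(guess_encoding(b"plain text"))                       # cp1252
```

The returned name can be passed straight to `bytes.decode`, which then hands back ordinary text that a search can work on.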

There is plenty of room for confusion here: files that happen to start with a known BOM without being text at all, wrong forced encodings set by the user, and so on. When you force the encoding you must confine the search to text files that share that encoding; otherwise they will all be assumed to have the same encoding and the results will not be as intended.
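Here is a small Python sketch of what goes wrong with a mismatched forced encoding. The `contains` helper is hypothetical and simply decodes the raw bytes before searching, which may or may not resemble what xplorer² does internally.

```python
def contains(data: bytes, needle: str, encoding: str) -> bool:
    """Decode raw file bytes with the given encoding, then look for needle."""
    text = data.decode(encoding, errors="replace")  # bad bytes become U+FFFD
    return needle in text

# Three Greek letters saved as a BOM-less file in the Greek code page.
greek_file = "αβγ".encode("cp1253")

print(contains(greek_file, "αβγ", "cp1253"))  # True: right encoding
print(contains(greek_file, "αβγ", "utf-8"))   # False: wrong forced encoding
```

The bytes on disk are identical in both calls; only the assumed encoding differs, and with the wrong one the text is simply not there to be found.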

From the user point of view, here are a few tips for successful searching in text files:

If your text is English or a major Latin-alphabet language (German, French, etc.) then you don't have to worry about anything, unless you want to search in unidentified UNICODE files.
If your encoding is more complex (Greek, Slovak, Japanese, etc.) make sure you use text files with BOMs to identify their encoding.
To find text in files without encoding markers, force the most likely encoding type from the xplorer² drop list control. This forcing only makes sense if your file is known to be multibyte without a BOM; for plain text files you just have to rely on your system-wide default ANSI encoding setup.
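If you generate text files yourself, adding a BOM is cheap insurance. Below is a minimal Python sketch that prepends the two BOMs discussed in this article; `with_bom` is a hypothetical helper, and the constants come from Python's standard `codecs` module.

```python
import codecs

def with_bom(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text and prepend the matching BOM (UTF-8 or UTF-16 LE only)."""
    if encoding == "utf-8":
        return codecs.BOM_UTF8 + text.encode("utf-8")          # EF BB BF
    if encoding == "utf-16-le":
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # FF FE
    raise ValueError(f"no BOM handled for {encoding!r}")

print(with_bom("A").hex(" "))               # ef bb bf 41
print(with_bom("A", "utf-16-le").hex(" "))  # ff fe 41 00
```

Write the returned bytes to disk in binary mode and any BOM-aware tool can identify the file without guessing.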

In a following article we'll see how it is possible (and, you'll be glad to hear, easier) to search for text in other kinds of documents (e.g. MS Word or PDF files) that employ proprietary encodings not covered by anything in the present article. We'll probably need yet another blog entry for all the advanced goodies that the xplorer² search engine can deal with, like regular expressions.
