date 14.Aug.2016

■ Find text in DOCX files regardless of installed Office bit depth (32/64 bit)


Personally, as a matter of habit, I only use the old DOC format for MS Word documents. The newer DOCX is a compressed format (an XML mini filesystem really) that takes less space on disk, but it is not a fully mature format. So whereas you can always use xplorer² to find text in DOC files, for DOCX it will only work if the bit depth of your installed Office matches that of xplorer² (i.e. both 32 bit or both 64 bit). If you have Office 365 in the recommended 32 bit version and your xplorer² is 64 bit, or if you run 32 bit xplorer² (because of other shell extensions) alongside 64 bit Office, you will not be able to search inside DOCX files for keywords.

Why does "bitness" only hurt DOCX and not DOC files? Because the Office installer registers two different text filter components for DOC (one 32 bit and one 64 bit), but for DOCX it only adds the "natural" component, the one that matches Office's own bit depth. To work around this problem I have seen people recommend installing the other Office filter pack on a 32 bit PC, then copying files and registry tweaks to the 64 bit system. The idea works, but it is too much hassle, isn't it?

Registry fix for DOCX text filter
If you are not interested in understanding how the fix works, just skip this section and get the download below. Otherwise please continue reading. Windows search filters are shell extensions implemented as DLLs (COM objects exposing the IFilter interface). As a general principle, a process can only load a DLL that matches its own bit depth, so a 64 bit process cannot load a 32 bit DLL and vice versa. In an older article I explained how shell extensions are generally registered under the HKEY_CLASSES_ROOT\CLSID registry key. On 64 bit Windows this key is redirected for 32 bit software, so that 32 bit COM objects (DLLs) are registered under HKCR\Wow6432Node\CLSID.
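To see which bit depth a particular filter DLL has, you can read the machine field in its PE header. The following is a minimal Python sketch of that check, not something xplorer² does itself; the function names `pe_bitness` and `file_bitness` are made up for this illustration, and only the two common x86/x64 machine codes are recognised.

```python
import struct

# Machine codes from the PE/COFF file header (only the two common ones).
MACHINE_NAMES = {0x014C: "32 bit (x86)", 0x8664: "64 bit (x64)"}


def pe_bitness(data: bytes) -> str:
    """Report the bit depth of a DLL/EXE image given its leading bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not a DOS/PE executable (missing MZ signature)")
    # Offset 0x3C of the DOS header holds the position of the PE header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 16 bit Machine field follows the 4 byte PE signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown machine 0x{machine:04X}")


def file_bitness(path: str) -> str:
    """Convenience wrapper; assumes the PE header sits within the first 4 KB."""
    with open(path, "rb") as f:
        return pe_bitness(f.read(4096))
```

Calling `file_bitness` on the DLL named in a filter's registry entry tells you whether a 32 bit or a 64 bit process can load it directly.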

When a process like xplorer² tries to read a DOCX file's contents, it needs the file type's IFilter interface, which for DOCX is an object with class identifier {5A98B233-3C59-4B31-944C-0E560D85E6C3} (use REGEDIT to snoop around the registry for the IFilter registration details). If there is a bit mismatch, xplorer² cannot load the DLL (say the DLL is 32 bit and xplorer² is 64 bit). The only solution is to have a 32 bit process load the matching DLL, then have this process talk to xplorer². This is exactly what the COM surrogate does: DLLHOST.EXE effectively converts the COM object from an in-process (inproc) object to a standalone server.
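Instead of browsing with REGEDIT, the same lookup can be scripted. The sketch below uses Python's standard `winreg` module to ask both the 32 bit and the 64 bit registry view which DLL is registered for the DOCX filter class. It assumes the filter is registered like a typical in-process COM object, with an `InprocServer32` subkey whose default value is the DLL path; the helper names are made up for this illustration.

```python
try:
    import winreg  # standard library, available on Windows only
except ImportError:
    winreg = None

# Class identifier of the DOCX text filter, as quoted in the text above.
DOCX_FILTER_CLSID = "{5A98B233-3C59-4B31-944C-0E560D85E6C3}"


def inproc_subkey(clsid: str) -> str:
    """Path (relative to HKEY_CLASSES_ROOT) where an inproc COM DLL is named."""
    return "CLSID\\" + clsid + "\\InprocServer32"


def registered_dll(clsid: str, view: int):
    """Return the DLL registered for clsid in the 32 or 64 bit view, or None."""
    if winreg is None:
        raise OSError("registry access needs Windows")
    # These flags select the registry view regardless of Python's own bitness.
    flag = {32: winreg.KEY_WOW64_32KEY, 64: winreg.KEY_WOW64_64KEY}[view]
    try:
        with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, inproc_subkey(clsid),
                            0, winreg.KEY_READ | flag) as key:
            value, _value_type = winreg.QueryValueEx(key, "")  # default value
            return value
    except FileNotFoundError:
        return None  # key or default value absent in this view


if __name__ == "__main__" and winreg is not None:
    for view in (32, 64):
        print(view, "bit view:", registered_dll(DOCX_FILTER_CLSID, view))
```

If only one of the two views reports a DLL, a program of the other bit depth has nothing it can load directly, which is the mismatch described above.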

This explanation probably did your head in, but the good news is that the fix is quite simple. You can have the object instantiated automatically through DLLHOST by adding a few registry keys and values, the most important being the AppID with a DllSurrogate value. I have adapted this solution to the peculiarities of the IFilter interface (it needs an intermediate PersistentHandler object), and provide two REG files: one if your Office is 32 bit and one if your Office is 64 bit. Most people will only need the former, for 32 bit Office, I reckon.
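For a feel of what the AppID and DllSurrogate entries look like, here is a minimal Python sketch that prints the generic COM surrogate pattern as REG file text. This is not the content of the REG files in the download: it leaves out the PersistentHandler part entirely, the two GUIDs in the usage lines are hypothetical placeholders, and the function name `surrogate_reg_text` is made up. It also assumes a 32 bit object registered under Wow6432Node, as described earlier.

```python
def surrogate_reg_text(clsid: str, appid: str, wow64: bool = True) -> str:
    """Build REG file text that ties a COM class to the default surrogate.

    clsid -- class identifier of the in-process COM object
    appid -- application identifier to associate with it (often a GUID)
    wow64 -- True to write under the 32 bit (Wow6432Node) branch
    """
    root = "HKEY_CLASSES_ROOT\\Wow6432Node" if wow64 else "HKEY_CLASSES_ROOT"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        # Link the class to an AppID ...
        f"[{root}\\CLSID\\{clsid}]",
        f'"AppID"="{appid}"',
        "",
        # ... and give that AppID an empty DllSurrogate value, which asks
        # COM to host the DLL in the system surrogate (DLLHOST.EXE).
        f"[{root}\\AppID\\{appid}]",
        '"DllSurrogate"=""',
        "",
    ]
    return "\r\n".join(lines)


# Hypothetical placeholder GUIDs, for illustration only.
EXAMPLE_CLSID = "{00000000-0000-0000-0000-000000000001}"
EXAMPLE_APPID = "{00000000-0000-0000-0000-000000000002}"
print(surrogate_reg_text(EXAMPLE_CLSID, EXAMPLE_APPID))
```

Treat the output as a reading aid rather than something to import; the REG files in the download are the ones adapted to the IFilter case.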

Click to download DOCX text extraction registry fix (3 KB)
Unpack the ZIP archive then read the instructions
(also included are fixes for the Excel and PowerPoint formats, XLSX and PPTX)

At present I am working on a solution that will let xplorer² do all the internal 32/64 bit juggling without modifying the registry. It ain't easy but I am optimistic. More on that later!

ps. Starting with version 4.5, xplorer² (64 bit version) uses a 32 bit text filter broker process to solve this and similar problems in text search automatically.


©2002-2016 ZABKAT LTD, all rights reserved