Reading the Document Imaging Blog about how Google are now indexing PDFs and using OCR to make them searchable made me think about PDFs in general, and about whether searchable PDFs and other document formats are a good thing for intranets. At the moment I would guess that search in most intranets doesn’t index PDFs, but there are search products that will let you search every document in your intranet.
As you can probably guess from the title, I come down on the side of not making PDFs and other documents searchable within an intranet, for the reasons I give below.
I am not against using PDFs, in fact I find them extremely useful. Bundling up content in discrete sets as PDF documents in intranets makes absolute sense in many cases.
The advantages are –
- Simpler navigation by reducing the number of web pages required
- In content management systems it allows some content to be free of the straitjacket of CMS uniformity. Content can be arranged differently and graphics used in a way that makes sense within the document itself and takes into account the type of content being displayed
- If the document is large, good internal navigation will allow users to navigate as if they were still using HTML web pages and the back button will always return them to the landing page
- Links to other parts of the intranet and the internet can still be preserved within the PDF
- PDFs look nice and crisp when viewed on screen
- They can be protected using different levels of security
- PDFs are searchable through the PDF reader
- They are a lot less work. If the content owner needs to change something in the document, they simply change the Word version they hold and send it to the intranet team, who check the changes, turn the Word document into a PDF, remove the previous version and upload the latest one. This usually only takes minutes, whereas large changes to content on web pages can mean hours of work
However you must make sure that each document has its own landing page. This is to ensure that the context for each document can be made explicit and to allow for metadata to be attached that is relevant only to the document.
OK, I’ve hopefully sold you on PDFs, but what about not making them searchable? I have been told that searchable PDFs will be a very good thing for intranets, but I just don’t get it. The poor users put their search terms in and, as all documents are searchable, get a mountain of results back. Then, when they click on a result, it lands them on a document containing the search term. This can be a problem, as a lot of documents aren’t set up like the best web pages and, if they are PDFs from external sources (e.g. legislation or health and safety advice), you won’t be able to change them anyway. The problem is context.
In good web pages you should be able to land on any page and have an idea of where you are and what the page is about. This is not true for a lot of documents. Users also need to know the status of a document, e.g. is it mandatory or for guidance only? So is the answer to put more and more information in the document, so the user knows where they are when they land on a document or a document page? I think there is a simpler answer than that. Make documents non-searchable and non-accessible except through their landing pages. If the user can only reach a document through its landing page and the intranet team has done its work correctly, the landing page will be findable and the information on it will give the user the context they need. This way, when they open a document, they will know what they are letting themselves in for.
Context is the key. As a by-product of not making your documents searchable, you should be able to keep down the number of search results. And if the intranet team has done its stuff on the metadata for document landing pages, the quality of search results should also improve, as only web pages will be indexed.
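For anyone wondering what "only web pages will be indexed" might look like in practice, here is a minimal sketch in Python. It is not tied to any particular search product. The extension list, the function names and the `intranet.example` URLs are all made up for illustration, and it assumes your search tool lets you filter the list of URLs before they are crawled.

```python
import posixpath
from urllib.parse import urlparse

# Assumption: these are the document types you want kept out of the index.
# Adjust the set to match whatever formats your intranet actually holds.
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}


def is_indexable(url: str) -> bool:
    """Return True for web pages, False for standalone documents."""
    # Look only at the path, so query strings such as "?v=2" are ignored.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension not in DOCUMENT_EXTENSIONS


def filter_crawl_list(urls):
    """Keep only the URLs that should be handed to the search indexer."""
    return [url for url in urls if is_indexable(url)]


if __name__ == "__main__":
    # Hypothetical URLs: a landing page and the document it introduces.
    candidates = [
        "https://intranet.example/policies/expenses.html",
        "https://intranet.example/docs/expenses-policy.pdf",
    ]
    for url in filter_crawl_list(candidates):
        print(url)
```

With a filter like this the landing page goes into the index and the PDF does not, so the only route to the document is through the page that gives it context.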
You can learn more about metadata from James Robertson’s article in the Step Two blog.
(Many thanks to EJeffson for his CC Flickr PDF icon)