
Searching Scanned PDF Files in SharePoint


If you need to search scanned PDF files in SharePoint, be aware that the free Adobe IFilter cannot search through a scanned PDF file.

Let’s first clarify what a scanned PDF is, for those who are scratching their heads over the term.

Here you go… a scanned PDF is one created by scanning physical paper, such as pages of a book or legal documents. The result is an image of each page, with no text that a search indexer can read.

While doing a proof-of-concept exercise for a prospect, we ran into this behavior (the inability to search through scanned PDF files). If you are interested in how we overcame the issue, please read on.

After analyzing how best to enable searching of scanned PDFs in SharePoint, we zeroed in on the following three options. The crux is to turn the scanned PDF into a searchable dual-layer PDF, which holds not only the scanned image but also a layer of the text recognized from that image. The technology for reading text from an image is known as OCR (Optical Character Recognition).

  1. Use an OCR tool that converts the scanned PDF directly to a dual-layer PDF (image + OCR text), then upload the resulting PDF to SharePoint; the Adobe IFilter will take care of indexing the document.

    A few such products:

    Nuance PDF Converter Enterprise

    Solid PDF Tools

    X-Key

    These products also come with an API, which means you can automate the complete process.

  2. Use an open source OCR tool to retrieve the text from the PDF file and store it as metadata for the PDF document inside a SharePoint document library. Do a full crawl of the site and you are up and running with a solution for searching scanned PDFs. Two such tools:

    OCRopus

    Tesseract

  3. Use an IFilter specifically targeted at such PDF documents.

    Captaris
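
To make option 2 a little more concrete, here is a minimal Python sketch of the text-extraction step. It is not the code from our proof of concept, just an illustration under a few assumptions: Tesseract does not read PDF files directly, so the pages are first rendered to PNG images with Poppler's pdftoppm (a tool this post does not otherwise cover), and both command-line tools are assumed to be installed and on the PATH. The function names (rasterize_cmd, ocr_cmd, normalize_text, extract_text) are hypothetical. Writing the resulting text into a SharePoint column is deliberately left out.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders each PDF page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, lang="eng"):
    """Build the Tesseract command; 'stdout' as output base prints the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def normalize_text(raw, max_chars=None):
    """Collapse whitespace runs so the text fits a single metadata field."""
    text = re.sub(r"\s+", " ", raw).strip()
    return text if max_chars is None else text[:max_chars]


def extract_text(pdf_path, lang="eng", max_chars=None):
    """OCR every page of a scanned PDF and return the combined text.

    Assumes pdftoppm and tesseract are installed (not checked here).
    """
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf_path, prefix), check=True)
        # pdftoppm names its output page-1.png, page-2.png, ... and pads
        # the page number with zeros, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(ocr_cmd(image, lang), check=True,
                                    capture_output=True, text=True)
            pages.append(result.stdout)
    return normalize_text(" ".join(pages), max_chars)
```

Calling extract_text("contract.pdf", max_chars=5000) on a hypothetical file would return one whitespace-normalized string, which you could then store in a text column of the document library before running the full crawl. The max_chars limit is only a placeholder; pick it to match whatever size the target column allows.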

List of OCR Software

Free:

  1. CuneiForm
  2. GOCR
  3. Ocrad
  4. OCRopus
  5. Tesseract

Proprietary:

  1. Expervision
  2. FineReader
  3. Microsoft Office Document Imaging
  4. OmniPage
  5. Readiris

Till next time…
