ocr and powerpoint

January 30, 2008 4 comments

One of the projects I’m working on requires reading text off of a PowerPoint slide, or any type of presentation material for that matter. In a previous post I released a Perl script that can parse the text out of a native .ppt file using some clunky OLE automation. But in this new scenario, the slide is recorded as a JPEG image. I’m probably the last one on the planet to find GOCR, the open source OCR program. There is even a Windows binary that you can download. To replicate the problem I’m trying to solve, I do the following:

  1. Save the .ppt deck as .jpg. This feature will save either all the slides or just the current slide as .jpg files.
  2. Next, convert the image to greyscale in the .pnm format, because GOCR works best with greyscale .pnm images. For this step you’ll need to download djpeg.exe.
  3. Then run the following:
    > djpeg -grey -pnm test.jpg test.pnm
    > gocr test.pnm
    and that’s it!
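
If you saved the whole deck as .jpg files, running those two commands by hand for every slide gets old fast. Here is a rough Python sketch that wraps them. The function names (build_commands, ocr_slide, ocr_folder) are made up for illustration, and it assumes that djpeg and gocr are both on your PATH and that your djpeg build accepts the -outfile switch, which I use here instead of a second file name so the output file is named explicitly.

```python
import subprocess
from pathlib import Path


def build_commands(jpg_path):
    """Return the djpeg and gocr command lines for one slide image."""
    jpg = Path(jpg_path)
    pnm = jpg.with_suffix(".pnm")
    # Assumption: this djpeg build supports -outfile, so the output file
    # is named explicitly rather than passed as a second file argument.
    djpeg_cmd = ["djpeg", "-grey", "-pnm", "-outfile", str(pnm), str(jpg)]
    gocr_cmd = ["gocr", str(pnm)]
    return djpeg_cmd, gocr_cmd


def ocr_slide(jpg_path):
    """Convert one JPEG slide to greyscale PNM and return GOCR's text."""
    djpeg_cmd, gocr_cmd = build_commands(jpg_path)
    subprocess.run(djpeg_cmd, check=True)
    # gocr prints the recognized text to stdout, so capture it.
    result = subprocess.run(gocr_cmd, check=True,
                            capture_output=True, text=True)
    return result.stdout


def ocr_folder(folder):
    """Run ocr_slide over every .jpg in a folder, keyed by file name."""
    return {p.name: ocr_slide(p) for p in sorted(Path(folder).glob("*.jpg"))}
```

Calling ocr_folder on the directory where PowerPoint saved the slides should give you a dictionary of slide file name to recognized text, with a .pnm file left next to each .jpg.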

Have a look at this site :

This is a great example where gocr and djpeg are being used.

Categories: ocr, powerpoint