Archive for the ‘powerpoint’ Category

ocr and powerpoint

January 30, 2008

One of the projects I’m working on requires reading text off a PowerPoint slide, or any kind of presentation material for that matter. In a previous post I released a Perl script that parses the text out of a native .ppt file using some clunky OLE automation. In this new scenario, though, the slide is recorded as a JPEG image. I’m probably the last one on the planet to find GOCR, the open source OCR program. There is even a Windows binary that you can download. To replicate the problem I’m trying to solve, I do the following:

  1. Save the .ppt deck as .jpg. This feature will save either all the slides or just the current slide as jpg files.
  2. Next, convert the image to a greyscale .pnm file. This is needed because of two quirks of GOCR: it works best with greyscale images, and it wants them in the .pnm format. For this step you’ll need to download djpeg.exe.
  3. Then run the following:
    > djpeg -grey -pnm test.jpg test.pnm
    > gocr test.pnm
    And that’s it!
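A deck usually exports to more than one slide image, so the two commands above are worth wrapping in a small script. The following is only a rough sketch in Python, not part of the original workflow. The function names (`build_commands`, `ocr_slide`, `ocr_folder`) are made up for illustration, and it assumes that `djpeg` and `gocr` are on the PATH and that your djpeg build accepts an input and an output file name exactly as in step 3. Some builds write to standard output instead and need `-outfile`.

```python
import subprocess
from pathlib import Path


def build_commands(jpg_path):
    """Return the djpeg and gocr command lines for one slide image."""
    jpg = Path(jpg_path)
    pnm = jpg.with_suffix(".pnm")
    # Same two commands as in step 3, with the file names filled in.
    # Assumption: this djpeg build takes "inputfile outputfile" arguments.
    djpeg_cmd = ["djpeg", "-grey", "-pnm", str(jpg), str(pnm)]
    gocr_cmd = ["gocr", str(pnm)]
    return djpeg_cmd, gocr_cmd


def ocr_slide(jpg_path):
    """Convert one JPEG slide to greyscale PNM and return what gocr prints."""
    djpeg_cmd, gocr_cmd = build_commands(jpg_path)
    subprocess.run(djpeg_cmd, check=True)
    result = subprocess.run(gocr_cmd, check=True, capture_output=True, text=True)
    return result.stdout


def ocr_folder(folder):
    """OCR every .jpg in a folder, returning {file name: recognized text}."""
    return {jpg.name: ocr_slide(jpg) for jpg in sorted(Path(folder).glob("*.jpg"))}
```

Calling `ocr_folder("slides")` on a folder of exported slides (the folder name is just an example) would give one block of recognized text per image, keyed by file name.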

Have a look at this site: http://www.seeingwithsound.com/ocr.htm

It is a great example of gocr and djpeg being used together.

Categories: ocr, powerpoint

Extract Text from PowerPoint using Perl – convert ppt to text

February 3, 2007

I need to unleash this on the world. I’ve been working on a project where the goal is to integrate PowerPoint slides into a Flash presentation. All the ppt to swf products seem to convert each slide into an image (maybe vector based, I’m not sure), so the end result is a swf whose text is not selectable.

The point is to automate grabbing the text out of the original PowerPoint deck so it can be used for other purposes, such as search engine optimization and text search. I’m sure there’s an easy way to do this in VB or C#, but I’m a Perl native and have been obsessed with Win32::OLE for the last 3 hours.

Here’s the breakdown:
The Problem: How do I get text out of the text boxes in a powerpoint presentation and into a text file?

1. First get yourself Win32::OLE for Perl.

2. Take a look at this Roth Consulting presentation. It took me a few reads to “get it”, and I think there are some inconsistencies in some of the examples.

3. Make yourself a PowerPoint presentation. Just use the default text boxes. This script is highly experimental and I’ve only been playing with it on simple slides.

4. After several attempts at pasting code into this blog, I’ve given up. You can download the sample file instead: ppt-parse.txt (PPT Parser)

Basically, this is an exercise in Win32::OLE.
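For readers who just want the shape of the idea without the download, here is a rough sketch of the same kind of OLE automation. It is written in Python with the pywin32 package rather than Perl and Win32::OLE, and it is not the script from ppt-parse.txt. The function names `extract_text` and `ppt_to_text` are made up for illustration. It assumes Windows with PowerPoint installed, and, like the Perl script, it has only simple slides with ordinary text boxes in mind.

```python
import os


def extract_text(presentation):
    """Collect the text of every text-bearing shape, slide by slide."""
    slides = []
    for slide in presentation.Slides:
        texts = []
        for shape in slide.Shapes:
            # Pictures, lines and similar shapes have no text frame; skip them.
            if shape.HasTextFrame and shape.TextFrame.HasText:
                texts.append(shape.TextFrame.TextRange.Text)
        slides.append(texts)
    return slides


def ppt_to_text(ppt_path, txt_path):
    """Open a deck through PowerPoint itself and dump its text to a file."""
    # Assumption: pywin32 is installed; imported here so that extract_text
    # can still be used (and tested) on machines without it.
    import win32com.client

    app = win32com.client.Dispatch("PowerPoint.Application")
    # Arguments: FileName, ReadOnly, Untitled, WithWindow.
    # PowerPoint wants an absolute path.
    presentation = app.Presentations.Open(os.path.abspath(ppt_path), True, False, False)
    try:
        with open(txt_path, "w", encoding="utf-8") as out:
            for number, texts in enumerate(extract_text(presentation), start=1):
                out.write(f"Slide {number}\n")
                for text in texts:
                    out.write(text + "\n")
                out.write("\n")
    finally:
        presentation.Close()
        app.Quit()
```

Something like `ppt_to_text("deck.ppt", "deck.txt")` would then write one block of text per slide. Grouped shapes, tables and notes pages are not handled here.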

Categories: perl, powerpoint