Ricky,
I don't know for sure, but I suspect they could be using OCRopus:
http://code.google.com/p/ocropus/
which is sponsored by Google.
I saw (and was impressed by) Thomas Breuel speaking about this at a
'Million Books Workshop' - if you're interested I blogged more about
this here:
http://blog.paulwalk.net/2008/03/16/digital-library-pipeline-for-a-million-books/
Cheers,
Paul
On 4 Nov 2008, at 12:56, Richard Rankin wrote:
> Does anyone know what OCR software they are using? We currently have
> a project to OCR a large number of PDFs per month and would like to
> automate the process.
>
> Ricky
> ______________________
> Principal Analyst
> Information Services
> Queen's University Belfast
>
> tel: 02890 974824
> fax: 02890 976586
> email: [log in to unmask]
>
> From: Repositories discussion list [mailto:[log in to unmask]] On Behalf Of Peter Millington
> Sent: 04 November 2008 10:14
> To: [log in to unmask]
> Subject: Google Indexes Non-searchable PDFs
>
> The following development from Google could have a big impact on
> institutional repositories. PDFs from scanned documents and/or from
> low-end software are often just images that can be read by humans,
> but cannot be searched by keyword or indexed by search engines. I am
> sure that most if not all repositories hold such PDFs. This Google
> initiative will lift the cloak of invisibility from them.
>
> Peter Millington
> SHERPA, University of Nottingham
>
> * Google sheds light on 'Dark Web' by searching scanned documents
> http://cwflyris.computerworld.com/t/3821061/247711/148332/2/
>
> Using optical character recognition (OCR) technology, Google's search
> engine now can convert scanned PDF documents into text that can be
> searched and indexed, the company said. Thus, government reports,
> academic papers and other scanned documents can now show up in search
> results. Search engines generally interpret PDF documents as images of
> text rather than text.
>
--------------------------------------------
Paul Walk
Technical Manager
UKOLN (University of Bath)
http://www.ukoln.ac.uk/
[log in to unmask]
+44(0)1225383933
--------------------------------------------