The following development from Google could have a big impact on
institutional repositories. PDFs from scanned documents and/or from low-end
software are often just images that can be read by humans, but cannot be
searched by keyword or indexed by search engines. I am sure that most if not
all repositories hold such PDFs. This Google initiative will lift the cloak
of invisibility from them.

Peter Millington
SHERPA, University of Nottingham

* Google sheds light on 'Dark Web' by searching scanned documents
http://cwflyris.computerworld.com/t/3821061/247711/148332/2/

Using optical character recognition (OCR) technology, Google's search engine
now can convert scanned PDF documents into text that can be searched and
indexed, the company said. Thus, government reports, academic papers and
other scanned documents can now show up in search results. Search engines
generally interpret PDF documents as images of text rather than text.