ApacheCon US 2009 Session
Implementing an Information Retrieval Framework for an Organizational Repository
- Sithu D Sudarsan
- Fri, 06 November 2009 09:00
- No Materials Available
Successful Information Retrieval (IR) frameworks for large repositories have been reported in recent years. Invariably, all of them have used machine-readable repositories, where the availability of plain text is the norm. However, organizations with legacy archives need a framework that first converts the non-electronic archive to an electronic archive and then extracts machine-readable text with an acceptable error rate. The Food and Drug Administration (FDA) has electronic images of its document repository. These documents were collected as part of its charter to approve and monitor products related to health care. The documents date back multiple decades and are in formats ranging from microfiche through early optical character recognition to recent electronic formats. We believe that a large knowledge base hidden in the FDA's document repository could be mined. To mine this knowledge base, we are developing a semantic-mining framework using open source tools such as Apache Lucene, Apache PDFBox, Apache Solr, Apache POI, and Java. Challenges include determining the quality of the extracted text and handling documents that contain formatted text only in part. The text itself may contain specific vocabularies from the medical, legal, engineering, and scientific domains, as well as terminology that evolves over time. Careful thought needs to be given to selecting analyzers for indexing and retrieval, and to implementing a framework of heuristics useful to domain experts as well as novices. An initial prototype, with a sample of over 100,000 documents and 70 GB of data, is currently being evaluated with different extractors, analyzers, and search heuristics, with multiple indices for each document stored in a distributed fashion.
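
The abstract names "determining the quality of text being extracted" as a challenge but does not say how the prototype measures it. As one possible illustration only, the sketch below scores extracted text by the fraction of tokens that look like plain words or numbers, on the assumption that OCR noise tends to produce tokens mixing letters, digits, and symbols. The class name ExtractedTextQuality, the method names, and the 0.8 threshold are all hypothetical and are not taken from the session.

```java
// Hypothetical helper for illustration; not part of the framework described above.
final class ExtractedTextQuality {

    // Assumed cut-off below which a document might be flagged for re-extraction.
    static final double ASSUMED_THRESHOLD = 0.8;

    // Returns the fraction of whitespace-separated tokens that look like
    // plain words or numbers. Empty or null input scores 0.0.
    static double wordLikeRatio(String text) {
        if (text == null || text.trim().isEmpty()) {
            return 0.0;
        }
        String[] tokens = text.trim().split("\\s+");
        int wordLike = 0;
        for (String token : tokens) {
            if (isWordLike(token)) {
                wordLike++;
            }
        }
        return (double) wordLike / tokens.length;
    }

    // A token is word-like if, once leading and trailing ASCII punctuation is
    // stripped, it is either letters (with optional internal hyphens or
    // apostrophes) or digits (with optional internal '.' or ',' separators).
    static boolean isWordLike(String token) {
        String core = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
        return core.matches("\\p{L}+(?:['-]\\p{L}+)*|\\p{N}+(?:[.,]\\p{N}+)*");
    }

    public static void main(String[] args) {
        String clean = "The documents date back multiple decades.";
        String noisy = "Th3 d0c~um#nts d@te b4ck";
        System.out.println("clean: " + wordLikeRatio(clean));
        System.out.println("noisy: " + wordLikeRatio(noisy));
        System.out.println("noisy flagged: " + (wordLikeRatio(noisy) < ASSUMED_THRESHOLD));
    }
}
```

A score like this is only a rough proxy: domain vocabulary such as part numbers or chemical names would be penalized, so in practice it would likely need to be combined with a dictionary or domain term list.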
























