Saturday, October 21, 2006

Unstructured Information Management

Unstructured information makes up most of the content on the internet today; some estimates put it as high as 90% of what is available. With all the databases, portals, websites, repositories, hard drives and trillions of files that exist today, how do you harness this information? This is the technical challenge facing the field of Information Management: turning all this raw content into useful information and knowledge. Given sufficient time, computing power and storage, that is entirely conceivable. The challenge is making it happen in near real time.

To meet this challenge, DARPA funded IBM Research in 2005 to create UIMA, the Unstructured Information Management Architecture. It is an open, industrial-strength, scalable and extensible platform for creating, integrating and deploying unstructured information management solutions from combinations of semantic analysis and search components. IBM makes UIMA available as a free SDK (alpha) and makes the core Java framework available as open source software (UIMA at SourceForge). The aim is to provide a common foundation for industry and academia to collaborate and accelerate the worldwide development of technologies critical for discovering the vital knowledge present in the fastest-growing sources of information today. IBM developerWorks has a tutorial for using the UIMA SDK with Eclipse.
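To make the idea of building a solution from "combinations of semantic analysis and search components" a little more concrete, here is a rough sketch in plain Python. It does not use the UIMA SDK or its API at all; the names Annotation, Document, regex_annotator, run_pipeline and search are all made up for illustration, and the regular expressions stand in for real semantic analysis. The only point is the shape: independent analysis components each add annotations to a shared document, and a search step later queries those annotations.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    kind: str   # label assigned by an analysis component, e.g. "year"
    begin: int  # start offset into the document text
    end: int    # end offset (exclusive)
    text: str   # the covered text, kept for convenience


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# Hypothetical component type: anything that adds annotations to a document.
Annotator = Callable[[Document], None]


def regex_annotator(kind: str, pattern: str) -> Annotator:
    """Build a toy analysis component that labels every regex match."""
    compiled = re.compile(pattern)

    def annotate(doc: Document) -> None:
        for match in compiled.finditer(doc.text):
            doc.annotations.append(
                Annotation(kind, match.start(), match.end(), match.group())
            )

    return annotate


def run_pipeline(text: str, annotators: List[Annotator]) -> Document:
    """Run each component in order over one shared document."""
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc


def search(doc: Document, kind: str) -> List[str]:
    """Toy search component: return the text of all annotations of one kind."""
    return [a.text for a in doc.annotations if a.kind == kind]


# Combine two independent components into one small solution.
pipeline = [
    regex_annotator("year", r"\b(?:19|20)\d{2}\b"),
    regex_annotator("acronym", r"\b[A-Z]{2,}\b"),
]
doc = run_pipeline("DARPA funded IBM Research in 2005 to create UIMA.", pipeline)
print(search(doc, "year"))
print(search(doc, "acronym"))
```

Because each component only reads the text and appends annotations, components can be added, removed or reordered without touching the others, which is the property that makes this kind of architecture extensible.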

Since IBM released UIMA as open source in early 2006, it has been widely adopted. Open source projects such as GATE, OntoText, and many others have been using UIMA as a framework for unstructured information management research. As research into managing and harnessing unstructured information grows, more solutions to these problems will become available.
