Apache Tika

Apache Tika is a content detection and analysis framework, written in Java and stewarded at the Apache Software Foundation.[2] It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it provides server and command-line editions suitable for use from other programming languages.

The project originated as part of the Apache Nutch codebase, where it provided content identification and extraction during crawling. In 2007, it was separated out to make it more extensible and usable by content management systems, other Web crawlers, and information retrieval systems. The standalone Tika was founded by Jérôme Charron, Chris Mattmann and Jukka Zitting.[3] In 2011, Chris Mattmann and Jukka Zitting released the Manning book "Tika in Action", and the project released version 1.0.

Features

Tika can identify more than 1400 file types from the Internet Assigned Numbers Authority (IANA) taxonomy of MIME types. For most of the more common and popular formats,[4] it also provides content extraction, metadata extraction and language identification.

It can also extract text from images by using the Tesseract OCR software.[5]

While Tika is written in Java, it is widely used from other languages.[6] The RESTful server and the command-line tool give non-Java programs access to Tika's functionality.
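As an illustration of access from a non-Java program, the following is a minimal Python sketch that talks to the RESTful server using only the standard library. It assumes a Tika server is already running locally on its default port (9998) and uses the server's `/detect/stream`, `/meta` and `/tika` endpoints. The names `TIKA_SERVER`, `ENDPOINTS`, `build_request` and `call_tika` are hypothetical helpers introduced for this example and are not part of Tika.

```python
import urllib.request

# Assumption: a Tika server is listening locally on its default port.
TIKA_SERVER = "http://localhost:9998"

# Server endpoint paths for each capability described above.
ENDPOINTS = {
    "detect": "/detect/stream",  # MIME type identification
    "meta": "/meta",             # metadata extraction
    "text": "/tika",             # text content extraction
}


def build_request(action, payload, base_url=TIKA_SERVER):
    """Build (but do not send) a PUT request carrying the raw file bytes."""
    headers = {}
    if action == "text":
        headers["Accept"] = "text/plain"        # ask for plain text output
    elif action == "meta":
        headers["Accept"] = "application/json"  # ask for metadata as JSON
    return urllib.request.Request(
        base_url + ENDPOINTS[action], data=payload, method="PUT", headers=headers
    )


def call_tika(action, path, base_url=TIKA_SERVER):
    """Send a file to the server and return the decoded response body."""
    with open(path, "rb") as f:
        request = build_request(action, f.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, a call such as `call_tika("detect", "report.pdf")` (the file name is a placeholder) would be expected to return the detected MIME type as a string, while `"meta"` and `"text"` return the metadata and the extracted text respectively. Separating `build_request` from `call_tika` keeps the request construction inspectable without any network access.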

