steps
-
install
$ brew install tesseract --all-languages --with-serial-num-pack Installing dependencies for tesseract: autoconf, autoconf-archive, automake, leptonica ... Warning: tesseract: --all-languages was deprecated; using --with-all-languages instead! ...
-
using tika and tesseract
-
use tesseract
$ tesseract -psm 3 /path/to/tiff/file.tiff out.txt $ cat out.txt
-
use tika
$ tika -t /path/to/tiff/file.tiff
-
-
using tika server and tesseract
-
start tika server
$ java -jar /path/to/tika-server-1.7-SNAPSHOT.jar
-
curl request
$ curl -T /path/to/tiff/file.tiff http://localhost:9998/tika --header "Content-type: image/tiff"
-
using different language models
# single language $ curl -T /path/to/tiff/image.jpg http://localhost:998/tika --header "X-Tika-OCRLanguage: eng" # multiple languages $ curl -T /path/to/tiff/image.jpg http://localhost:998/tika --header "X-Tika-OCRLanguage: eng+fra"
-
-
docker-tikaserver Dockerfile on github
-
pull
$ docker pull logicalspark/docker-tikaserver
-
run
$ docker run --rm -d -p 9998:9998 logicalspark/docker-tikaserver
-
-
-
set
TESSDATA_PREFIX
$ cd /usr/local/Cellar/tesseract/3.05.00/share/tessdata $ export TESSDATA_PREFIX="`pwd`"
-
chi_sim
$ tesseract ./test.png ./test -l chi_sim Tesseract Open Source OCR Engine v3.05.00 with Leptonica $ cat text.txt ... orc output of test.png file ...
-
-
running the tika server
-
java -jar
$ java -jar tika-server-1.14.jar ...
-
change host name and port number
$ java -jar tika-server-1.14.jar --host=intranet.local --port=9999 Mar 09, 2017 9:35:55 AM org.apache.tika.server.TikaServerCli main INFO: Starting Apache Tika 1.14 server Mar 09, 2017 9:35:56 AM org.apache.cxf.endpoint.ServerImpl initDestination INFO: Setting the server's publish address to be http://localhost:9999/ Mar 09, 2017 9:35:56 AM org.slf4j.impl.JCLLoggerAdapter info INFO: jetty-8.y.z-SNAPSHOT Mar 09, 2017 9:35:56 AM org.slf4j.impl.JCLLoggerAdapter info INFO: Started SelectChannelConnector@localhost:9999 Mar 09, 2017 9:35:56 AM org.apache.tika.server.TikaServerCli main INFO: Started
-
visit http://localhost:9999/
Welcome to the Apache Tika 1.14 Server For endpoints, please see https://wiki.apache.org/tika/TikaJAXRS and http://tika.apache.org/1.14/miredot/index.html PUT /detect/stream Class: org.apache.tika.server.resource.DetectorResource Method: detect Produces: text/plain GET /detectors Class: org.apache.tika.server.resource.TikaDetectors Method: getDectorsHTML Produces: text/html
-