Manual

Virtualpaper can be run manually as a system daemon. Refer to the Dockerfile and docker-compose.yml for sample configurations.

Requirements:

PostgreSQL

Virtualpaper needs a running PostgreSQL database and a Meilisearch server. Make sure the database uses ‘utf8’ encoding. A new database can be created with the command:

psql > CREATE DATABASE virtualpaper WITH ENCODING='utf8' TEMPLATE template0;

Meilisearch

The Meilisearch server can be shared with other processes, but Virtualpaper uses one index per user, namespaced as ‘virtualpaper-{userid}’. Meilisearch has a hard limit of 200 indices, meaning that Virtualpaper can handle a maximum of 200 users.
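
The naming scheme and the index limit can be illustrated with a short Python sketch. It is not Virtualpaper's own code: the function names and the INDEX_LIMIT constant are hypothetical, and the figure of 200 is taken from the text above. The sketch assumes that indices created by other processes on a shared server count toward the same limit.

```python
from typing import Sequence

# Illustrative sketch only; these names are hypothetical, not Virtualpaper's API.
INDEX_LIMIT = 200  # hard limit on the number of indices, as stated above


def index_name(user_id: int) -> str:
    """Return the per-user index name, following 'virtualpaper-{userid}'."""
    return f"virtualpaper-{user_id}"


def remaining_user_slots(existing_indices: Sequence[str], limit: int = INDEX_LIMIT) -> int:
    """Return how many more per-user indices fit on the server.

    Assumption: every index on the server counts toward the limit,
    including indices that belong to other processes.
    """
    return max(limit - len(existing_indices), 0)
```

For example, on a server that already holds ‘virtualpaper-1’, ‘virtualpaper-2’ and one unrelated index, remaining_user_slots reports 197 free slots.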

Binaries

Virtualpaper relies on external libraries and binaries to process documents. Required and recommended binaries:

  • tesseract-ocr
  • tesseract training data for the languages to be recognized
  • imagemagick & imagemagick-dev
  • poppler-utils
  • pandoc

Depending on the distribution, ImageMagick likely requires modifying its policies in /etc/ImageMagick-7/policy.xml. The Virtualpaper repository contains a reference policy at docker/imagemagick-7-policy.xml.
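
Whether the binaries listed above are installed can be checked before starting the daemon. The Python sketch below looks each executable up on PATH with shutil.which. The executable names are assumptions about what the packages typically install (tesseract, pandoc, pdftotext from poppler-utils, and magick from ImageMagick 7); the executables Virtualpaper actually calls may differ.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names; adjust them to match the installed packages.
EXPECTED_BINARIES = ["tesseract", "magick", "pdftotext", "pandoc"]


def missing_binaries(
    names: Iterable[str] = EXPECTED_BINARIES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing binaries:", ", ".join(missing))
    else:
        print("All expected binaries were found.")
```

The lookup function is passed in as a parameter so that the check can be exercised without the binaries being present.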

Tesseract language packs

Tesseract is the program used for running OCR on PDF and image files. For optimal results it needs a language pack for each language the documents are primarily written in. Many distributions ship these as packages, e.g. ‘tesseract-ocr-fin’, and they are also available at https://github.com/tesseract-ocr/tessdata.
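
The installed language packs can be listed with ‘tesseract --list-langs’. The Python sketch below parses that output and reports which of the wanted languages are missing. The function names are hypothetical, and since the header line that tesseract prints varies between versions, the parser simply skips it.

```python
import subprocess
from typing import Iterable, List, Set


def parse_langs(output: str) -> Set[str]:
    """Extract language codes from the output of 'tesseract --list-langs'."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.startswith("List of available languages"):
            continue
        langs.add(line)
    return langs


def missing_langs(wanted: Iterable[str], output: str) -> List[str]:
    """Return the wanted language codes that are not installed, sorted."""
    return sorted(set(wanted) - parse_langs(output))


def installed_langs_output() -> str:
    """Run tesseract and return its language listing (needs tesseract on PATH)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some tesseract releases print the listing to stderr, so keep both streams.
    return result.stdout + result.stderr
```

Calling missing_langs(["eng", "fin"], installed_langs_output()) returns an empty list when both packs are installed.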