Architecture

What happens when a document is uploaded

graph TD;
    A([User uploads file]) --> B(Extract content)
    B --> C(Match metadata)
    C --> D(Run user's processing rules)
    D --> E(Index in full-text-search engine)
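The upload flow can be read as a fixed sequence of stages, each taking a document and handing it on to the next. The sketch below is only an illustration of that ordering: the `Document` class, the stage functions, and the keyword-based matching are hypothetical stand-ins, not Virtualpaper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical in-memory document; not Virtualpaper's actual model."""
    filename: str
    raw: bytes
    content: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)
    indexed: bool = False
    history: List[str] = field(default_factory=list)


def extract_content(doc: Document) -> Document:
    # Stand-in for OCR / text extraction: decode the bytes as UTF-8.
    doc.content = doc.raw.decode("utf-8", errors="replace")
    return doc


def match_metadata(doc: Document) -> Document:
    # Stand-in matcher: tag the document when a keyword appears in the text.
    if "invoice" in doc.content.lower():
        doc.metadata["class"] = "invoice"
    return doc


def run_user_rules(doc: Document) -> Document:
    # Stand-in rule: documents classified as invoices get a "finance" topic.
    if doc.metadata.get("class") == "invoice":
        doc.metadata["topic"] = "finance"
    return doc


def index_document(doc: Document) -> Document:
    # Stand-in for sending the document to the search engine.
    doc.indexed = True
    return doc


# The stages run in the same order as the diagram above.
PIPELINE: List[Callable[[Document], Document]] = [
    extract_content,
    match_metadata,
    run_user_rules,
    index_document,
]


def process_upload(doc: Document) -> Document:
    """Run every stage in order, recording each stage name as it finishes."""
    for stage in PIPELINE:
        doc = stage(doc)
        doc.history.append(stage.__name__)
    return doc
```

Calling `process_upload(Document("a.txt", b"Invoice 42"))` walks the document through all four stages. Because metadata matching runs before the user's rules, the rules can rely on the metadata that was just matched.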

The system

The diagram below shows a rough architecture of the system. All relational data is stored in a PostgreSQL database, and each rendered document is stored in Meilisearch. Meilisearch provides fast, typo-tolerant search across the documents.

Why use Meilisearch and not Elasticsearch? It’s a burden to keep Elasticsearch up and running. Meilisearch provides a really good search experience out of the box and doesn’t eat up all your resources.

graph TD;
    A[Virtualpaper server] --> B[(PostgreSQL)]
    A --> C[(Meilisearch)]
    A --> D[Libraries]
    D --> E([Tesseract])
    D --> F([ImageMagick])
    D --> G([Poppler-utils])
    D --> H([Pandoc])
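Since the server depends on several external tools, it can be handy to check that they are installed before starting it. The sketch below looks each one up on `PATH`. The executable names are assumptions based on what each package usually installs (for example, newer ImageMagick releases ship `magick` rather than `convert`); which binaries Virtualpaper actually invokes is not stated here.

```python
import shutil
from typing import Dict, List, Optional

# Executable names are assumptions, not taken from Virtualpaper itself.
REQUIRED_TOOLS: Dict[str, str] = {
    "Tesseract": "tesseract",
    "ImageMagick": "convert",
    "Poppler-utils": "pdftotext",
    "Pandoc": "pandoc",
}


def find_tools(tools: Dict[str, str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None when it is not on PATH."""
    return {name: shutil.which(binary) for name, binary in tools.items()}


def missing_tools(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of the tools that could not be found, sorted."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    missing = missing_tools(find_tools())
    print("Missing tools:", ", ".join(missing) if missing else "none")
```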

Known limits

Virtualpaper has not been optimized for scaling. The number of users is limited to 200, since that is the limit of indices a single Meilisearch instance can process and Virtualpaper reserves a single index for each user. A dedicated index lets each user customize the search experience to their needs, from setting stop words to creating a personal dictionary of synonyms.
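To make the one-index-per-user idea concrete, here is a minimal sketch of how per-user search settings could be assembled. `stopWords` and `synonyms` are Meilisearch's index settings keys. The `user-<id>` naming scheme and the helper functions are hypothetical, and the 200-user cap is simply the limit quoted above.

```python
from typing import Dict, Iterable, List

MAX_USERS = 200  # limit quoted in the text above


def index_uid(user_id: int) -> str:
    # Hypothetical naming scheme; Virtualpaper's real index names may differ.
    return f"user-{user_id}"


def build_settings(stop_words: Iterable[str], synonyms: Dict[str, List[str]]) -> dict:
    """Build a settings payload using Meilisearch's `stopWords` and `synonyms` keys."""
    return {
        "stopWords": sorted(set(stop_words)),
        "synonyms": {word: sorted(set(alts)) for word, alts in synonyms.items()},
    }


def can_add_user(current_user_count: int) -> bool:
    # One index per user, so the user count is bounded by the index limit.
    return current_user_count < MAX_USERS
```

A payload such as `build_settings(["the", "a"], {"invoice": ["bill", "receipt"]})` would then be applied to that user's index alone, leaving every other user's search behaviour untouched.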

Also, when many large documents are added at once, Meilisearch takes some time to index them. For instance, uploading thousands of research papers may require several hours of indexing, depending on the hardware. Searching is fast, but indexing is not when a lot of data is inserted in a short period of time.

Other than that, Meilisearch can handle millions of documents, so a small number of users should be fine.