Extract text and image content from PDF files

Due to browser memory limitations, the files are read in several chunks.
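As a rough illustration of reading in chunks, the sketch below splits a list of selected files into fixed-size batches so that only one batch needs to be held in memory at a time. The helper name `chunkItems` and the batch size are assumptions made for this example; the tool's actual chunking strategy is not described here.

```typescript
// Hypothetical helper (not taken from the tool itself): split a list
// into consecutive chunks of at most `size` items each.
function chunkItems<T>(items: T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    // slice() copies the range [i, i + size) and clamps at the array end
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Usage: process five file names in batches of two (batch size is an assumption).
const batches = chunkItems(["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"], 2);
for (const batch of batches) {
  console.log("Processing batch:", batch.join(", "));
}
```

A smaller batch size lowers peak memory use at the cost of more round trips to temporary storage.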

By Frode Eika Sandnes, OsloMet, March 2025

Select PDF files to process

Batch-convert PDF documents to JSON with file info, text content, page info, and figure info.

Image data (as data URLs) is stored in a separate JSON file.

For a large number of large PDF files, run the tool several times and join the parts using the separate JSON joiner tool.
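Joining the parts from several runs can be sketched as follows. This is a minimal sketch, not the joiner tool's actual implementation: it assumes each result file holds an array of per-document records with a `filename` field, which is a guess at the output format. The names `DocumentRecord` and `joinParts` are hypothetical.

```typescript
// Assumed record shape: one entry per processed PDF, identified by file name.
interface DocumentRecord {
  filename: string;
  [key: string]: unknown;
}

// Hypothetical joiner: concatenate parts in order, keeping the first
// record seen for each file name and skipping later duplicates.
function joinParts(parts: DocumentRecord[][]): DocumentRecord[] {
  const seen: Record<string, boolean> = {};
  const joined: DocumentRecord[] = [];
  for (const part of parts) {
    for (const record of part) {
      if (seen[record.filename]) {
        continue; // same file appeared in an earlier part
      }
      seen[record.filename] = true;
      joined.push(record);
    }
  }
  return joined;
}

// Usage: two runs that overlap on one file.
const run1: DocumentRecord[] = [{ filename: "a.pdf" }, { filename: "b.pdf" }];
const run2: DocumentRecord[] = [{ filename: "b.pdf" }, { filename: "c.pdf" }];
console.log(joinParts([run1, run2]).map((r) => r.filename).join(", "));
```

Skipping duplicates by file name is a design choice for this sketch; it guards against the same file being selected in two runs.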

Use Ctrl/Shift to select multiple files (Ctrl+A for all).


Reading files from disk...

Reading PDF files and placing their contents in temporary IndexedDB storage. This should be quite quick!

Setting up IndexedDB...

Setting up local IndexedDB storage for temporarily holding partial results. This may take a while...

Parsing pdf contents

Extracting the text and image contents.

Current file:

Status: / reports processed

Cleaning up... please wait!

Please wait while the IndexedDB storage in the browser is cleaned up. This could take a while...

Finished

Please find the two JSON files containing the text content and image content in your download folder.

Reload to convert more PDF documents.

Use the JSON joiner tool if you need to combine several results.