Historical OCR Pipeline

HOP (Historical OCR Pipeline)

HOP is a tool for processing, at scale, the legacy output of various OCR systems, such as Olive Software's products, and improving its accuracy using Transkribus. The pipeline transforms PRXML files via Transkribus into a research corpus, addressing the challenge of improving the OCR without losing the valuable earlier work of analyzing the layout and content structure of the newspapers. To this end, we created a workflow that converts the legacy format to an open format on which improved text recognition technologies can run, producing output that meets the requirements of text-analytical research.

The open-source code of the workflow is available at: https://github.com/omilab/2022_OCR_Pipeline

This project was initiated at OMILab, and the pipeline was created by Nurit Greidinger, Yanir Marmor, and Itay Zandbank under the supervision of Dr. Sinai Rusinek.

OMILab's project on Historical Newspaper Archive Research ran from 2018 to 2021 in collaboration with the Historical Jewish Press project of Tel Aviv University and the National Library of Israel. The National Library of Israel (https://web.nli.org.il/sites/nli/english/pages/default.aspx) provided access to selected image and OCR output files from the JPRESS back end.

In December 2022, OMILab won 10,000 credits for Transkribus OCR in the #MyTranskribusStory competition run by READ-COOP. See: https://readcoop.eu/a-trip-through-our-written-past/

Team

Dr. Sinai Rusinek, OMILab

OMILab team: Itay Zandbank, Nurit Greidinger, Yanir Marmor