This model was trained on a dataset of 19th-century Romanian documents obtained from the Central University Libraries (BCU) of Timișoara, Iași, and Cluj-Napoca, Romania.
The training dataset comprises 147 pages of Romanian texts written in the Romanian Transitional Script (RTS). RTS is a combination of Latin and Cyrillic characters that was used in the Romanian provinces during the 19th century. Its purpose was to ease the transition from the Romanian Cyrillic script to the modern Latin script.
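Because RTS mixes two alphabets, it can be useful to check how Latin and Cyrillic letters are distributed in a transcription or in OCR output. The snippet below is a minimal sketch of such a check, not part of the project's tooling: the function name `script_counts` and the sample string are hypothetical, and the classification relies only on Unicode character names from Python's standard `unicodedata` module.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters by script, based on their Unicode names.

    Hypothetical helper for illustration; non-letters are ignored.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["latin"] += 1
        elif name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        else:
            counts["other"] += 1
    return counts


# Made-up mixed-script string (not taken from the dataset):
# three Latin letters followed by two Cyrillic letters.
sample = "abc\u0434\u0436"
print(script_counts(sample))
```

A line that contains only one of the two scripts is not necessarily an error, but a sudden change in the ratio across a page may be worth a manual look.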
The images in the dataset span the period from 1833 to 1864 and capture the linguistic and typographic variation of that time. The selected texts cover a diverse range of genres, including poems, novels, dramas, stories, newspapers, and religious texts.
For more details about the project, visit our website:
The dataset is available for download on Kaggle: https://www.kaggle.com/datasets/mariuspenteliuc/rts-ocr
This work was supported by a grant of the Romanian Ministry of Research, Innovation and Digitization, CCCDI – UEFISCDI, project number PN-III-P2-2.1-PED-2021-0693, within PNCDI III.