PyLaia model created from ground-truth data obtained by transcribing and manually segmenting a sample of 193 pages of the 18th- and 19th-century Spanish press, in particular volumes of “Diario de Madrid 1788-1825” (https://hemerotecadigital.bne.es/hd/card?oid=0001510462).
This model was developed within the CLARA-HD project (https://clara-nlp.uned.es/home/dh), funded by the Spanish Ministry, and is suitable for automatically transcribing similar Spanish prints of the same period. Manual segmentation is recommended, since newspapers usually contain tables and columns. The model achieves a character error rate (CER) of 1% on the validation set.
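To make the CER figure concrete, here is a minimal sketch of how a character error rate is commonly computed: the Levenshtein (edit) distance between the reference transcription and the model output, divided by the number of reference characters. This is an assumption about the usual definition, not a description of how the figure above was actually calculated; the function names `levenshtein` and `cer` and the sample strings are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (assumed definition): edit distance
    divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a 16-character line.
reference = "Diario de Madrid"
hypothesis = "Diario de Madrld"
print(f"CER = {cer(reference, hypothesis):.4f}")
```

Under this definition, a CER of 1% corresponds to roughly one character error per 100 characters of reference text.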
For more information, please contact Eva Sánchez Salido at evasan@lsi.uned.es or Ana García Serrano at agarcia@lsi.uned.es.
Please cite this model as: Menta, A., Sánchez-Salido, E., & García-Serrano, A. (2022). Transcripción de periódicos históricos: Aproximación CLARA-HD. Proceedings of the Annual Conference of the Spanish Association for Natural Language Processing 2022: Projects and Demonstrations (SEPLN-PD 2022).