An Algorithm for sentence recovery from PDF files

Vesna Pajić University of Belgrade, Faculty of Agriculture
Staša Vujičić Stanković University of Belgrade, Faculty of Mathematics
Miloš Pajić University of Belgrade, Faculty of Agriculture

Abstract

The use of PDF documents in Natural Language Processing (NLP) has become an almost daily activity for researchers in computational linguistics and related fields. Extracting plain text from PDF documents with existing software tools severely distorts sentence and paragraph structure, which is a serious problem for linguistically oriented research. In this paper, we present a novel algorithm for recovering sentences and paragraphs from PDF documents, called the Sentence Recovery Algorithm, or SR algorithm. The algorithm takes plain text extracted from a PDF document as input and attempts to recover the sentences in it. It accounts for cases such as misinterpreted ends of lines, sentences interrupted by tables or figures, and problems caused by hyphenation. Besides describing and evaluating the algorithm, we present a use case for processing scientific articles originally provided in PDF format, implemented in the Java programming language.
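
To illustrate the kind of problem the abstract describes, the following is a minimal Java sketch, not the authors' SR algorithm. It covers only two of the cases mentioned (misinterpreted ends of lines and hyphenation) and ignores tables and figures. The class name `SentenceRecoverySketch`, the method names, and the rule that a line ending in `.`, `!` or `?` closes a sentence are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and heuristics are hypothetical,
// not taken from the paper.
final class SentenceRecoverySketch {

    // Assumed heuristic: a line closes a sentence if it ends with . ! or ?
    static boolean endsSentence(String line) {
        char last = line.charAt(line.length() - 1);
        return last == '.' || last == '!' || last == '?';
    }

    // Joins lines that were broken mid-sentence by PDF text extraction.
    // Each returned string ends at a line that closes a sentence
    // (or at the end of the input).
    static List<String> recover(String extracted) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (String raw : extracted.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines left by the extractor
            }
            int len = current.length();
            boolean hyphenBreak = len > 0
                    && current.charAt(len - 1) == '-'
                    && Character.isLowerCase(line.charAt(0));
            if (hyphenBreak) {
                // Treat "recov-" + "ery" as one word. Limitation: a genuine
                // compound such as "well-" + "known" loses its hyphen.
                current.deleteCharAt(len - 1);
            } else if (len > 0) {
                current.append(' ');
            }
            current.append(line);

            if (endsSentence(line)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // unterminated trailing text
        }
        return sentences;
    }
}
```

Calling `SentenceRecoverySketch.recover(text)` on extracted text returns a list of strings, one per recovered unit. A sentence that ends in the middle of a line is not split off by this sketch, and abbreviations at the end of a line are treated as sentence ends, so a practical implementation would need further rules.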

Published

2024-02-29

How to Cite

PAJIĆ, Vesna; VUJIČIĆ STANKOVIĆ, Staša; PAJIĆ, Miloš. An Algorithm for sentence recovery from PDF files. Infotheca - Journal for Digital Humanities, [S.l.], v. 15, n. 2, feb. 2024. ISSN 2217-9461. Available at: <https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/345>. Date accessed: 09 july 2026.

Issue

Vol 15 No 2 (2014): Infotheca - Journal for Digital Humanities

Section

Articles

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Publisher

Faculty of Philology, University of Belgrade
University Library „Svetozar Marković“
Association of Libraries of the Universities of Serbia