Vol. 15, No. 2, 2014.

Vesna Pajić
University of Belgrade, Faculty of Agriculture

Staša Vujičić Stanković
University of Belgrade, Faculty of Agriculture

Miloš Pajić
University of Belgrade, Faculty of Mathematics

 

AN ALGORITHM FOR SENTENCE RECOVERY FROM PDF FILES

UDC: 81'322.2:004.912
Keywords. Natural Language Processing, Language Resources, Java programming, PDF processing
Abstract. The use of PDF documents in Natural Language Processing (NLP) has become an almost daily activity for researchers in computational linguistics and related fields. Extracting plain text from PDF documents with existing software tools severely distorts sentence and paragraph structure, which is a serious problem for linguistically oriented research. In this paper, we present a novel algorithm for recovering sentences and paragraphs from PDF documents, called the Sentence Recovery Algorithm, or SR algorithm. The algorithm takes plain text extracted from a PDF document as input and attempts to recover the sentences from it. It takes into account cases such as misinterpreted line endings, interruption of a sentence by tables or figures, problems caused by hyphenation, and so on. Besides describing and evaluating the algorithm, we present a use case for processing scientific articles originally given in PDF format, implemented in the Java programming language.
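
To make two of the cases named in the abstract concrete (misinterpreted line endings and hyphenation), the following is a minimal Java sketch. It is not the SR algorithm itself, whose rules are not given here. The class name LineJoinSketch, the method joinLines, and both heuristics (a line ending in sentence-final punctuation closes a sentence; a trailing hyphen followed by a lowercase letter marks a split word) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; names and heuristics are hypothetical,
// not taken from the paper.
final class LineJoinSketch {

    // Assumed heuristic: text ending in '.', '!' or '?' (optionally followed
    // by closing quotes or brackets) is treated as a finished sentence.
    private static final Pattern SENTENCE_END =
            Pattern.compile(".*[.!?][\"')\\]]*");

    // Joins the lines of extracted text into sentence-like units.
    static List<String> joinLines(String extracted) {
        List<String> units = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (String raw : extracted.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank lines are simply skipped in this sketch
            }
            if (current.length() == 0) {
                current.append(line);
            } else if (current.charAt(current.length() - 1) == '-'
                    && Character.isLowerCase(line.charAt(0))) {
                // Assumed hyphenation case: drop the hyphen and glue the halves.
                // Note: this also merges genuine compounds such as "well-known".
                current.setLength(current.length() - 1);
                current.append(line);
            } else if (SENTENCE_END.matcher(current).matches()) {
                // The previous line closed a sentence: emit it, start a new one.
                units.add(current.toString());
                current.setLength(0);
                current.append(line);
            } else {
                // Line break inside a sentence: rejoin with a single space.
                current.append(' ').append(line);
            }
        }
        if (current.length() > 0) {
            units.add(current.toString());
        }
        return units;
    }
}
```

For example, joinLines("First sen-\ntence ends here.\nSecond one\ncontinues.") yields the two units "First sentence ends here." and "Second one continues.". A sketch this simple does not handle sentences interrupted by tables or figures, and it does not split several sentences that share one line.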
 

                                                                         


SCIENTIFIC PAPERS

Henri Broch
TEACHING “PARANORMAL VS. ZETETICS” AT THE UNIVERSITY USING PSEUDOSCIENCE TO TEACH THE SCIENTIFIC METHOD
 

Ramón Reichert
DIGITAL HUMANITIES
 

Soufiane Rouissi, Ana Štulić
JUDEO-SPANISH ON THE WEB: DESCRIPTION OF A SOCIAL BOOKMARKING EXPERIMENT


Vesna Pajić, Staša Vujičić Stanković, Miloš Pajić
AN ALGORITHM FOR SENTENCE RECOVERY FROM PDF FILES

PROFESSIONAL PAPERS

Katarina Perić, Ana Nikolić, Kristina Gogić
MAKING OF THE MULTIMEDIA DOCUMENT “AROUND THE WORLD IN 80 DAYS”
 

Aleksandra Adžić
DIGITIZING MATERIALS USING THE LIBRARY INFORMATION SYSTEM – NI BIS


REVIEWS

 

Marko Vitas
THE EUROPEAN SUMMER SCHOOL IN DIGITAL HUMANITIES, LEIPZIG 2014

Nataša Dakić, Dejana Kavaja Stanišić
OVERVIEW OF A FINAL MEETING OF A CONSORTIUM OF THE “EUROPEANA NEWSPAPERS” PROJECT