Garz A.,Vienna University of Technology | Fischer A.,Institute of Computer Science and Applied Mathematics | Sablatnig R.,Vienna University of Technology | Bunke H.,Institute of Computer Science and Applied Mathematics
Proceedings - 10th IAPR International Workshop on Document Analysis Systems, DAS 2012 | Year: 2012

Segmenting page images into text lines is a crucial pre-processing step for automated reading of historical documents. Challenges in this open research field include paper or parchment background noise, ink bleed-through, artifacts due to aging, stains, and touching text lines. In this paper, we present a novel binarization-free line segmentation method that is robust to noise and copes with overlapping and touching text lines. First, interest points representing parts of characters are extracted from gray-scale images. Next, word clusters are identified in high-density regions, and touching components such as ascenders and descenders are separated using seam carving. Finally, text lines are generated by concatenating neighboring word clusters, where neighborhood is defined by the prevailing orientation of the words in the document. An experimental evaluation on the Latin manuscript images of the Saint Gall database shows promising results for real-world applications in terms of both accuracy and efficiency. © 2012 IEEE.
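The seam carving step mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `min_horizontal_seam` and the idea of passing in a precomputed energy map (for example, high values on ink, low values on background) are assumptions made for illustration. The sketch finds the cheapest left-to-right path through such a map, which is the kind of path that could separate two touching text lines.

```python
def min_horizontal_seam(energy):
    """Return, for each column, the row index of the cheapest left-to-right seam.

    `energy` is a hypothetical 2-D list (rows x columns) of non-negative costs.
    """
    rows, cols = len(energy), len(energy[0])
    # cost[r][c] = minimal accumulated energy of a seam ending at (r, c)
    cost = [[energy[r][0]] + [0.0] * (cols - 1) for r in range(rows)]
    # back[r][c] = row of the predecessor in column c - 1
    back = [[0] * cols for _ in range(rows)]

    for c in range(1, cols):
        for r in range(rows):
            # A seam may stay in the same row or move one row up or down.
            candidates = [p for p in (r - 1, r, r + 1) if 0 <= p < rows]
            best = min(candidates, key=lambda p: cost[p][c - 1])
            cost[r][c] = energy[r][c] + cost[best][c - 1]
            back[r][c] = best

    # Start from the cheapest end point and walk the back-pointers.
    r = min(range(rows), key=lambda p: cost[p][cols - 1])
    seam = [r]
    for c in range(cols - 1, 0, -1):
        r = back[r][c]
        seam.append(r)
    seam.reverse()
    return seam
```

Given an energy map with a low-cost corridor between two ink regions, the returned seam follows that corridor column by column.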


Fischer A.,Institute of Computer Science and Applied Mathematics | Indermuhle E.,Institute of Computer Science and Applied Mathematics | Bunke H.,Institute of Computer Science and Applied Mathematics | Viehhauser G.,Institute For Germanistik | Stolz M.,Institute For Germanistik
ACM International Conference Proceeding Series | Year: 2010

Handwriting recognition in historical documents is vital for the creation of digital libraries. The creation of readily available ground truth data plays a central role in the development of new recognition technologies. For historical documents, ground truth creation is more difficult and time-consuming than for modern documents. In this paper, we present a semi-automatic ground truth creation procedure for historical documents that takes into account noisy background and transcription alignment. The proposed ground truth creation is demonstrated for the IAM Historical Handwriting Database (IAM-HistDB) that is currently under construction and will include several hundred Old German manuscripts. With a small set of algorithmic tools and few manual interactions, it is shown how laypersons can efficiently create a ground truth for handwriting recognition. Copyright 2010 ACM.


Fornes A.,Autonomous University of Barcelona | Frinken V.,Institute of Computer Science and Applied Mathematics | Fischer A.,Institute of Computer Science and Applied Mathematics | Almazan J.,Autonomous University of Barcelona | And 2 more authors.
ACM International Conference Proceeding Series | Year: 2011

The automatic processing of handwritten historical documents is considered a hard problem in pattern recognition. In addition to the challenges posed by modern handwritten data, a lack of training data as well as effects caused by the degradation of documents can be observed. In this scenario, keyword spotting emerges as a viable solution to make documents amenable to searching and browsing. For this task we propose the adaptation of shape descriptors used in symbol recognition. By treating each word image as a shape, it can be represented using the Blurred Shape Model and the Deformable Blurred Shape Model. Experiments on the George Washington database demonstrate that this approach is able to outperform the commonly used Dynamic Time Warping approach. © 2011 ACM.
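For readers unfamiliar with the Dynamic Time Warping baseline named in the abstract above, the following is a minimal sketch of the textbook algorithm. It assumes each word image has already been reduced to a sequence of scalar features (one value per pixel column, for instance); the function name `dtw_distance` and the absolute-difference local cost are illustrative choices, not details taken from the paper.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

In a keyword spotting setting, a query word would be compared against every candidate word in this way and the candidates ranked by distance. Because elements may be repeated during alignment, sequences that differ only by local stretching receive a distance of zero.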


Fischer A.,Institute of Computer Science and Applied Mathematics | Frinken V.,Institute of Computer Science and Applied Mathematics | Fornes A.,Autonomous University of Barcelona | Bunke H.,Institute of Computer Science and Applied Mathematics
ACM International Conference Proceeding Series | Year: 2011

Transcriptions of historical documents are a valuable source for extracting labeled handwriting images that can be used for training recognition systems. In this paper, we introduce the Saint Gall database that includes images as well as the transcription of a Latin manuscript from the 9th century written in Carolingian script. Although the available transcription is of high quality for a human reader, the spelling of the words is not accurate when compared with the handwriting image. Hence, the transcription poses several challenges for alignment regarding, e.g., line breaks, abbreviations, and capitalization. We propose an alignment system based on character Hidden Markov Models that can cope with these challenges and efficiently aligns complete document pages. On the Saint Gall database, we demonstrate that a considerable alignment accuracy can be achieved, even with weakly trained character models. © 2011 ACM.
