Moscow, Russia

ABBYY is an international software company that provides optical character recognition, document capture and language software for both PC and mobile devices. The majority of ABBYY products, such as ABBYY FineReader, are intended to simplify converting paper documents to digital data. ABBYY also provides language products and services. Wikipedia.


An algorithm for assigning priorities to tasks that users queue for processing, based on how heavily each task's user has used system resources in the past, including the number of tasks the user queued, the volume of those tasks, and the amount of processor time used. In the OCR context, the tasks are graphic files placed on servers and chosen for processing in accordance with the assigned priorities.
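A minimal sketch of how such a priority assignment could look. The abstract does not say how the three usage measures are combined or in which direction they shift priority, so the weights, the `UsageHistory` record, and the choice to serve lighter past users first are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class UsageHistory:
    """Hypothetical record of one user's past use of the system."""
    tasks_queued: int     # number of tasks the user queued in the past
    volume_bytes: int     # total volume of those tasks
    cpu_seconds: float    # processor time those tasks consumed


def usage_score(h, w_tasks=1.0, w_volume=1e-6, w_cpu=0.1):
    # Assumed weighted sum; the weights are illustrative, not from the source.
    return w_tasks * h.tasks_queued + w_volume * h.volume_bytes + w_cpu * h.cpu_seconds


def order_tasks(tasks, history):
    """Return task ids with tasks of lighter past users first.

    tasks   -- list of (task_id, user) pairs in queue order
    history -- dict mapping user to UsageHistory
    Users without history are treated as having used nothing.
    sorted() is stable, so tasks of the same user keep their queue order.
    """
    no_usage = UsageHistory(0, 0, 0.0)
    ranked = sorted(tasks, key=lambda t: usage_score(history.get(t[1], no_usage)))
    return [task_id for task_id, _ in ranked]


history = {
    "alice": UsageHistory(tasks_queued=100, volume_bytes=5_000_000, cpu_seconds=300.0),
    "bob": UsageHistory(tasks_queued=2, volume_bytes=10_000, cpu_seconds=1.5),
}
queue = [("a1.tif", "alice"), ("b1.tif", "bob"), ("a2.tif", "alice"), ("c1.tif", "carol")]
print(order_tasks(queue, history))
```

In this sketch a new user ("carol") is served first and the heaviest user last; a real scheduler would also need to decay old usage so that heavy users are not penalized indefinitely.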


The present invention is directed to a method of extracting data from fields in an image of a document. In one implementation, a text representation of the image of the document is obtained. A graph is constructed for storing features of the text fragments in the text representation and the links between them. A cascade classification is run to compute the features of the text fragments and their links. Hypotheses about which fields in the image of the document the text fragments belong to are generated. Combinations of the hypotheses are generated, and one combination is selected. Finally, data from the fields in the image of the document is extracted based on the selected combination of the hypotheses.


Systems and methods for extracting information from structured documents comprising natural language text. An example method comprises: receiving a table comprising a natural language text; identifying, within the table, a header and a plurality of cells organized into rows and columns; performing semantico-syntactic analysis of the natural language text to produce a plurality of semantic structures; interpreting the plurality of semantic structures using a first set of production rules to produce a data object representing the table; analyzing the header to identify a plurality of ontology classes associated with respective table columns; and modifying the data object representing the table using a second set of production rules associated with the ontology classes associated with the table columns.


Systems and methods for detecting near-duplicate images using triples of adjacent ranked features (TARFs). An example method may include: identifying a plurality of TARFs associated with a query image, wherein each TARF comprises a blob feature point and two corner feature points; identifying, using an index of a corpus of images, at least one candidate image having at least one TARF matching a TARF of the plurality of TARFs associated with the query image; and responsive to evaluating a filtering condition, identifying the candidate image as a near-duplicate of the query image.
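The index lookup and filtering steps can be sketched as follows. This is only an illustration: how TARFs are computed from an image is not described in the abstract, so each TARF is represented here as a plain tuple of three quantized feature ids (one blob, two corners), and the filtering condition is assumed to be a minimum number of matching TARFs.

```python
from collections import Counter, defaultdict


def build_index(corpus):
    """Map each TARF to the set of images that contain it.

    corpus -- dict mapping image id to an iterable of TARFs, where a TARF is a
              hashable (blob_id, corner_id, corner_id) tuple (assumed encoding).
    """
    index = defaultdict(set)
    for image_id, tarfs in corpus.items():
        for tarf in tarfs:
            index[tarf].add(image_id)
    return index


def near_duplicates(query_tarfs, index, min_matches=2):
    """Return ids of images sharing at least min_matches TARFs with the query.

    The threshold stands in for the unspecified filtering condition.
    """
    hits = Counter()
    for tarf in set(query_tarfs):
        for image_id in index.get(tarf, ()):
            hits[image_id] += 1
    return sorted(image_id for image_id, n in hits.items() if n >= min_matches)


corpus = {
    "scan_a": [(1, 7, 9), (2, 4, 5), (3, 8, 8)],
    "scan_b": [(1, 7, 9), (6, 6, 6)],
    "scan_c": [(9, 9, 9)],
}
index = build_index(corpus)
print(near_duplicates([(1, 7, 9), (2, 4, 5), (5, 5, 5)], index))
```

An inverted index keeps the candidate search proportional to the number of query TARFs rather than to the size of the corpus, which is the usual reason for indexing local features this way.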


Patent
Abbyy | Date: 2016-08-04

Disclosed are systems, computer-readable mediums, and methods for detecting glare in a frame of image data. A frame of image data is preprocessed. A set of connected components in the preprocessed frame is determined. A set of statistics is calculated for one or more connected components in the set of connected components. Using the calculated set of statistics, a decision is made for each of the one or more connected components as to whether the connected component is a light spot over text. Whether glare is present in the frame is then determined.
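The pipeline above (preprocess, find connected components, compute statistics, decide per component, decide per frame) can be sketched in plain Python. The abstract names neither the preprocessing nor the statistics, so the brightness threshold, the two statistics (area and bounding-box fill ratio), and the decision rule are assumptions; the sketch also does not check that a spot actually lies over text.

```python
def connected_components(mask):
    """Return 4-connected components of True cells as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen, components = set(), []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            stack, comp = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                comp.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            components.append(comp)
    return components


def component_stats(comp):
    """Area and bounding-box fill ratio of one component (assumed statistics)."""
    ys = [y for y, _ in comp]
    xs = [x for _, x in comp]
    box = (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)
    return len(comp), len(comp) / box


def has_glare(frame, bright=240, min_area=4, min_fill=0.5):
    """frame is a non-empty 2D list of grayscale values in 0..255."""
    mask = [[px >= bright for px in row] for row in frame]  # preprocessing: threshold
    for comp in connected_components(mask):
        area, fill = component_stats(comp)
        if area >= min_area and fill >= min_fill:  # assumed "light spot" rule
            return True
    return False


frame = [
    [90, 90, 90, 90, 90],
    [90, 255, 255, 90, 90],
    [90, 255, 255, 90, 90],
    [90, 90, 90, 90, 250],
]
print(has_glare(frame))
```

Here the 2x2 saturated block counts as glare, while the isolated bright pixel in the corner is too small to qualify.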


Disclosed are methods, systems, and computer-readable mediums for automatic training of a syntactic and semantic parser using a genetic algorithm. An initial population is created, where the initial population comprises a vector of parameters for elements of syntactic and semantic descriptions of a source sentence. A natural language compiler (NLC) system is used to translate the sentence from the source language into a target language based on the syntactic and semantic descriptions of the source sentence. A vector of quality ratings is generated, where each quality rating in the vector corresponds to a parameter in the vector of parameters. Quality ratings are evaluated according to specific criteria, which comprise parameters such as a BLEU score and a number of emergency sentences. A number of parameters in the vector of parameters are replaced with adjusted parameters.


Systems and methods for identifying word collocations in natural language texts. An example method comprises: performing, by a computing device, semantico-syntactic analysis of a natural language text to produce a plurality of semantic structures; generating, in view of relationships defined by the semantic structures, a raw list of word combinations; producing a list of collocations by applying a heuristic filter to the raw list of word combinations; and using the list of collocations to perform a natural language processing operation.
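The two list-building steps (a raw list of word combinations, then a heuristic filter) can be illustrated with a much simpler stand-in. The method described relies on semantico-syntactic analysis, which is not reproduced here; the sketch below substitutes adjacent word pairs for the relationships found by that analysis, and the stop-word list and minimum count are assumed heuristics.

```python
import re
from collections import Counter

# Illustrative stop-word list; not taken from the source.
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is"}


def raw_combinations(text):
    """Count adjacent word pairs (a stand-in for parser-derived relations)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(words, words[1:]))


def collocations(text, min_count=2):
    """Keep pairs that repeat and contain no stop word (assumed heuristic filter)."""
    return sorted(
        pair
        for pair, n in raw_combinations(text).items()
        if n >= min_count and not (set(pair) & STOP_WORDS)
    )


text = ("Optical character recognition is useful. "
        "The optical character recognition of the text is slow.")
print(collocations(text))
```

The resulting list could then feed a downstream natural language processing step, for example as dictionary entries for a tokenizer.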


Patent
Abbyy | Date: 2015-12-14

A data capture component of a mobile device receives information for an identification of a data field in a physical document. The data capture component receives a video stream comprising a plurality of frames, wherein each frame comprises a portion of the physical document. A frame is selected from the plurality of frames in the video stream. One or more text regions in the frame are identified. Each of the identified text region(s) in the frame is processed to identify data of each of the identified text region(s) and to select data of one of the identified text region(s) that corresponds to a set of attributes associated with the data field. The selected data is then compared with data of text regions of a subsequent frame. If the data of the text regions of the subsequent frame is a closer match to the set of attributes, the selected data is updated. A display field is then provided with the selected data for presentation in a user interface.
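The frame-by-frame update can be sketched as a running best match. In this sketch the "set of attributes" of the data field is reduced to a single regular expression, each frame is assumed to arrive as a list of already recognized text regions, and the scoring function is hypothetical; none of these details come from the abstract.

```python
import re


def match_score(text, pattern):
    """Hypothetical score of how well a text region fits the field pattern.

    1.0 for a full match; otherwise half the fraction of the text covered by
    the longest partial match; 0.0 when nothing matches.
    """
    if not text:
        return 0.0
    if re.fullmatch(pattern, text):
        return 1.0
    longest = max((len(m.group(0)) for m in re.finditer(pattern, text)), default=0)
    return 0.5 * longest / len(text)


def capture_field(frames, pattern):
    """Return the best-matching region text seen across all frames.

    frames -- iterable of frames, each a list of recognized region strings
    The selection is replaced only when a later region is a closer match.
    """
    best_text, best_score = None, -1.0
    for regions in frames:
        for text in regions:
            score = match_score(text, pattern)
            if score > best_score:
                best_text, best_score = text, score
    return best_text


date_pattern = r"\d{2}/\d{2}/\d{4}"
frames = [
    ["Invoice", "x 12/03/2016 x"],   # blurry frame: date recognized with noise
    ["Invoice", "12/03/2016"],       # sharper frame: clean date
]
print(capture_field(frames, date_pattern))
```

Because the selection only changes on a strictly better score, the value shown in the display field stays stable while the camera moves and improves when a clearer frame arrives.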


Patent
Abbyy | Date: 2016-06-26

Disclosed are systems, computer-readable mediums, and methods for determining that text contains Chinese, Japanese, or Korean characters. One method includes determining a language hypothesis for each text fragment in a plurality of text fragments identified from connected components in a document image. The method further includes selecting a first subset of text fragments from the plurality of text fragments based on ratings for the language hypothesis of each text fragment in the plurality of text fragments. The method further includes verifying, by a processor, the language hypothesis of one or more text fragments in the first subset of text fragments based on optical character recognition of the one or more text fragments. The method further includes determining, by the processor, that Chinese, Japanese, or Korean (CJK) characters are present in the document image based on the verification of the language hypothesis of each of the one or more text fragments.


There is disclosed a method of determining a document type associated with a digital document, the method executable by an electronic device. A processor of the electronic device is configured to execute a plurality of machine learning algorithm (MLA) classifiers, each of the plurality of MLA classifiers having been trained to identify a specific document type. The plurality of MLA classifiers is ranked in a hierarchical order of execution of the plurality of MLA classifiers. A method of training the plurality of MLA classifiers is also disclosed.
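A ranked sequence of per-type classifiers can be sketched as below. The abstract states only that the classifiers are executed in a hierarchical order; stopping at the first classifier that accepts the document is an assumption, and the keyword predicates are stand-ins for trained MLA classifiers.

```python
def classify(document_text, ranked_classifiers, default="unknown"):
    """Run classifiers in rank order and return the first accepted type.

    ranked_classifiers -- list of (document_type, accepts) pairs, where
    accepts is a callable returning True if the document is of that type.
    """
    for document_type, accepts in ranked_classifiers:
        if accepts(document_text):
            return document_type
    return default


def keyword_classifier(*keywords):
    """Stand-in for a trained classifier: accepts if all keywords occur."""
    def accepts(text):
        lowered = text.lower()
        return all(k in lowered for k in keywords)
    return accepts


# Hypothetical ranking: more specific classifiers run first.
ranked = [
    ("invoice", keyword_classifier("invoice", "total")),
    ("receipt", keyword_classifier("total")),
    ("letter", keyword_classifier("dear")),
]

print(classify("INVOICE no. 17, total due: 40.00", ranked))
print(classify("Dear customer, thank you.", ranked))
```

Ordering matters in this sketch: the "receipt" predicate would also accept the invoice text, so placing the more specific "invoice" classifier first is what yields the intended type.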
