Bhattacharya S.,University of Iowa |
Bhattacharya S.,Linguamatics |
Srinivasan P.,University of Iowa |
Polgreen P.,University of Iowa
PLoS ONE | Year: 2014
Objective: To investigate factors associated with engagement of U.S. Federal Health Agencies via Twitter. Our specific goals are to study factors related to a) numbers of retweets, b) time between the agency tweet and first retweet and c) time between the agency tweet and last retweet. Methods: We collect 164,104 tweets from 25 Federal Health Agencies and their 130 accounts. We use negative binomial hurdle regression models and Cox proportional hazards models to explore the influence of 26 factors on agency engagement. Account features include network centrality, tweet count, numbers of friends, followers, and favorites. Tweet features include age, the use of hashtags, user-mentions, URLs, sentiment measured using Sentistrength, and tweet content represented by fifteen semantic groups. Results: A third of the tweets (53,556) had zero retweets. Less than 1% (613) had more than 100 retweets (mean = 284). The hurdle analysis shows that hashtags, URLs and user-mentions are positively associated with retweets; sentiment has no association with retweets; and tweet count has a negative association with retweets. Almost all semantic groups, except for geographic areas, occupations and organizations, are positively associated with retweeting. The survival analyses indicate that engagement is positively associated with tweet age and the follower count. Conclusions: Some of the factors associated with higher levels of Twitter engagement cannot be changed by the agencies, but others can be modified (e.g., use of hashtags, URLs). Our findings provide the background for future controlled experiments to increase public health engagement via Twitter. © 2014 Bhattacharya et al.
Lewin I.,Linguamatics |
Clematide S.,University of Zurich
CEUR Workshop Proceedings | Year: 2013
We describe the automatic harmonization method used for building the English Silver Standard annotation supplied as a data source for the multilingual CLEF-ER named entity recognition challenge. The use of an automatic Silver Standard is designed to remove the need for costly and time-consuming expert annotation. The final voting threshold of 3 for the harmonization of 6 different annotations from the project partners kept 45% of all available concept centroids. On average, 19% (SD 14%) of the original annotations are removed. 97.8% of the partner annotations that go into the Silver Standard Corpus have exactly the same boundaries as their harmonized representations.
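The voting step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each partner annotation is a set of hypothetical (start, end, semantic group) tuples and counts votes on exact span matches only, whereas the paper harmonizes around concept centroids. The function name `harmonize` and the sample data are illustrative assumptions.

```python
from collections import Counter

def harmonize(annotation_sets, threshold=3):
    """Keep spans proposed by at least `threshold` annotation sources.

    Simplifying assumption: two annotations agree only when their
    (start, end, group) tuples are identical; no centroid computation.
    """
    votes = Counter()
    for spans in annotation_sets:
        votes.update(set(spans))  # one vote per source per span
    return sorted(span for span, n in votes.items() if n >= threshold)

# Six hypothetical partner annotations over the same sentence.
systems = [
    {(0, 7, "DISO"), (12, 20, "CHEM")},
    {(0, 7, "DISO")},
    {(0, 7, "DISO"), (12, 20, "CHEM")},
    {(12, 21, "CHEM")},
    set(),
    {(0, 7, "DISO")},
]

silver = harmonize(systems, threshold=3)
```

With these made-up inputs, only the span backed by four of the six sources survives the threshold of 3; lowering the threshold would admit the span that two sources agree on.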
Agency: Cordis | Branch: FP7 | Program: CSA | Phase: ICT-2007.4.4 | Award Amount: 2.20M | Year: 2009
This proposal defines a support action project that brings together researchers from international biomedical text-mining groups to address the difficult issue of annotating large text corpora with a large set of semantic types. We propose a collaborative approach to this annotation task in the form of an open challenge to the biomedical text-mining community. The task is the annotation of named entities in a large biomedical corpus, for a variety of semantic categories. The project delivers as its outcome a large, collaboratively annotated corpus, marked with the mentions of biomedical entities. The annotated corpus becomes a resource for the community, to be used as a reference for improving text-mining applications. The biomedical text-mining research community has a long tradition of organizing such challenges, as a way of evaluating techniques, sharing technical knowledge, and helping to improve the results from text-mining programs. However, such challenges have typically addressed relatively small corpora in a narrow sub-domain, in part because the evaluation of the results is extremely long and costly. As a result, the generated annotated corpora are too small and too narrowly annotated to be useful in a variety of text-mining applications. In contrast, we propose to create a broadly-scoped and large annotated corpus by integrating the annotations from different named entity recognition systems. Metadata will also be added to the corpus. The participating systems have different application scopes and annotation strategies, and therefore complement each other. As a consequence, the annotated corpus reflects these different scopes and strategies. A secondary goal of this project is to define a standardized format for representing the annotations contributed by the participants and comparing them effectively. Currently the lack of such a format hinders progress in the evaluation of named entity recognition systems.
Agency: Cordis | Branch: FP7 | Program: CP | Phase: ICT-2011.4.1 | Award Amount: 2.31M | Year: 2012
This project will provide multilingual terminologies and semantically annotated multilingual documents, e.g., patent texts, to improve the accessibility of scientific information from multilingual documents. The two SME partners will use these resources to improve the quality and functionality of their product offerings, viz. delivering multilingual search and text mining engines based on multilingual terminologies. Both SMEs will market these solutions to their customer base. The MANTRA project capitalizes on parallel document corpora from which translational correspondences will be computed by the use of different alignment methods. Fortunately, the biomedical domain, the application scenario of MANTRA, offers a rich variety of such parallel corpora. We will exploit these multilingual document sets to harvest terms and concept representations in different languages in order to augment currently available terminological resources such as the Medical Subject Headings (MeSH). The project partners will collaboratively build two types of resources: automatically enhanced multilingual terminologies and semantically annotated multilingual documents. The novelty of the latter resource derives from the fact that we solicit and orchestrate community efforts for building up these annotated resources, a procedure that has already been proven successful for the semantic enrichment of large-scale biomedical document corpora (CALBC project) which was executed by the project partners. The novelty of the first comes from a new combination of existing technologies in the area of statistical machine translation, named entity tagging and terminological resources. We start from statistically aligned, parallel documents on which named entity taggers are run to produce highly diverse semantic (named entity) annotations.
These annotations signal concept mentions in the text which can then be linked to corresponding entries in relevant biomedical ontologies (from the UMLS, OBO or BioPortal umbrellas), and, in addition, provide the corresponding concept identifiers. Parallel named entity occurrences lacking links to the chosen ontologies can be considered as putative translation equivalents. Validated putative translation equivalents can then be used to enhance existing monolingual terminological resources. Both types of resources will be made available to the public for translation purposes and for search in and text mining from multilingual documents.
Agency: GTR | Branch: Innovate UK | Program: | Phase: Collaborative Research & Development | Award Amount: 153.68K | Year: 2014
In the age of Big Data, knowledge workers - individuals, companies and organisations whose primary focus is knowledge and information extraction and usage - find it increasingly difficult to search for and identify accurate and relevant information. In the domain of scientific literature and IP search, where the underlying corpora are growing at a huge rate, this is a daunting task and human expertise and involvement remain critical. This project aims to develop a suite of tools that will enable users to search for and identify relevant information within a corpus more efficiently and effectively. The methods developed will deploy new search paradigms together with semantic-based analysis, domain and lexical linguistic ontologies in order to understand the user needs based on the underlying domain of application and subsequently enable accurate information retrieval through enhanced search and cross-reference of information. The project aims to offer tools for sharing of search strategies which will be identified by observing and understanding patterns in users' search behaviours.