Macskassy S.A., Fetch Technologies
Social Network Analysis and Mining | Year: 2011
The last decade has seen an explosion in blogging, and the blogosphere is continuing to grow, having a large global reach and many vibrant communities. Researchers have been poring over blog data with the goal of finding communities, tracking what people are saying, finding influencers, and using many social network analytic tools to analyze the underlying social networks embedded within the blogosphere. One of the key technical problems with analyzing large social networks such as those embedded in the blogosphere is that there are many links between individuals and we often do not know the context or meaning of those links. This is problematic because it makes it difficult if not impossible to tease out the true communities, their behavior, how information flows, and who the central players are (if any). This paper seeks to further our understanding of how to analyze large blog networks and what they can tell us. We analyze 1.13M blogs posted by 185K bloggers over a period of 3 weeks. These bloggers span private blog sites through large blog sites such as LiveJournal and Blogger. We show that we can, in fact, tag links in meaningful ways by leveraging topic detection over the blogs themselves. We use these topics to contextually tag links coming from a particular blog post. This enrichment enables us to create smaller topic-specific graphs which we can analyze in some depth. We show that these topic-specific graphs not only have a different topology from the general blog graph but also enable us to find central bloggers who were otherwise hard to find. We further show that a temporal analysis identifies behaviors in terms of how components form as well as how bloggers continue to link after components form. These behaviors come to light when doing an analysis on the topic-specific graphs but are hidden or not easily discernible when analyzing the general blog graph. © 2011, Springer-Verlag.
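The idea of tagging links with the topic of the post they come from, and then analyzing one graph per topic, can be illustrated with a small sketch. This is not the paper's implementation: the toy posts, the blogger names, the assumption that each post already carries a single topic label, and the use of in-degree as a stand-in for centrality are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (author, topic of the post, bloggers the post links to).
# In practice the topic label would come from a topic-detection step.
posts = [
    ("alice", "politics", ["bob", "carol"]),
    ("bob", "politics", ["carol"]),
    ("erin", "politics", ["carol"]),
    ("dave", "cooking", ["alice"]),
    ("carol", "cooking", ["alice", "dave"]),
]


def build_topic_graphs(posts):
    """Group directed links (source, target) by the topic of the source post."""
    graphs = defaultdict(set)
    for author, topic, targets in posts:
        for target in targets:
            graphs[topic].add((author, target))
    return graphs


def in_degree(edges):
    """Count incoming links per blogger (a crude centrality proxy)."""
    counts = defaultdict(int)
    for _source, target in edges:
        counts[target] += 1
    return dict(counts)


def most_linked(edges):
    """Return the blogger with the highest in-degree in the given edge set."""
    degrees = in_degree(edges)
    return max(degrees, key=degrees.get)


topic_graphs = build_topic_graphs(posts)
# The general graph ignores topics and simply merges every link.
general_graph = set().union(*topic_graphs.values())

print("general:", most_linked(general_graph))
for topic, edges in sorted(topic_graphs.items()):
    print(f"{topic}:", most_linked(edges))
```

In this toy data the most-linked blogger in the general graph also dominates the politics graph, while the cooking graph surfaces a different blogger who is overshadowed when all links are merged, which mirrors the abstract's claim in miniature.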
Agency: Department of Defense | Branch: Defense Advanced Research Projects Agency | Program: SBIR | Phase: Phase I | Award Amount: 148.91K | Year: 2010
The past decade has seen an explosion in online social media, such as blogs, forums, Twitter, and so forth. This online information can give us insights into groups and communities--what are their "hot button" issues, how are they responding to current events, and how are they likely to react in the future. With the advance and proliferation of technology, online groups and communities now include populations of high importance to U.S. military planners, such as students in Iran, social conservatives in Saudi Arabia, and housewives in Egypt -- all sorts of groups throughout the Middle East. Currently, there is a great deal of information in social media that could be exploited for the benefit of military planners, but it is infeasible to monitor social media channels and provide simple, realtime assessment, analysis and predictive capabilities. This is the problem which we address in this proposal.
Agency: Department of Defense | Branch: Defense Advanced Research Projects Agency | Program: SBIR | Phase: Phase II | Award Amount: 749.99K | Year: 2010
In this project, we are developing an approach for identifying and exposing the latent semantics within a folksonomy, which will enable a new class of data integration applications. We have previously developed software enabling non-programmers to create web feeds, and an “Intelligence Portal” system for displaying that data in an integrated view. The new application we are developing in this project will enable domain experts to automatically integrate web feeds into the portal without any programming being required. To achieve this, we will be investigating an approach that enables an expert to train the system to perform the integration task. The training process is very efficient, because the system automatically induces background concepts and relations based on a folksonomy, which in turn boosts its performance.
Agency: Department of Defense | Branch: Air Force | Program: SBIR | Phase: Phase I | Award Amount: 99.99K | Year: 2010
The ultimate aim of this project is to enable better entity-oriented situation awareness systems to be developed. Such systems should enable operators to rapidly “connect the dots” and allow them to track entities of interest. In this Phase I project we will design an approach for collecting information about entities from multiple heterogeneous sources, and for consolidating that information into entity profiles. We will also develop technology that will enable profiles to be monitored, so that alerts can be generated when significant changes occur. The project will explore the application of the technology, including an application to streamline the Market Research and Source Selection Phases of the Air Force’s acquisition cycle. BENEFIT: To achieve significant improvement in situation awareness applications, we need easy-to-use systems that enable information to be integrated and monitored, without necessitating a long, arduous, expensive programming project for each application that is created. The research described here will develop such an approach for collecting, integrating and monitoring information about entities. The work has a very targeted application for the military, which is to streamline the Market Research and Source Selection Phases of the Air Force’s acquisition cycle. In addition, there are important commercial markets for the technology. One market is the background screening industry. Currently, background checks on both companies and individuals tend to be done sporadically, but in many situations, monitoring relevant information sources would be highly preferred. The technology improvements that we propose to investigate will enable such applications to be developed.
Agency: Department of Defense | Branch: Air Force | Program: SBIR | Phase: Phase II | Award Amount: 749.99K | Year: 2010
In this project Fetch Technologies will implement and evaluate a new approach to transforming and normalizing data from multiple heterogeneous sources. In previous work, Fetch Technologies developed and successfully commercialized a system for creating “transformation pipelines”. In a transformation pipeline, a new source (with its own unique schema) can be “dropped” into the pipeline, and as long as the source’s data schema satisfies some very general constraints on the type of data present, then the pipeline will successfully normalize data from that source. Our objective is to design the next generation of this system, called AutoTrans, that will minimize the human effort necessary to build a robust transformation pipeline. In particular, through the use of machine learning techniques, the AutoTrans system will make it easier and more automatic to configure and modify a series of transformations. It will result in pipelines that are robust even when the sequence of transformations is potentially incomplete or there is uncertainty in the data. BENEFIT: The aim of this project is to create a transformation system that minimizes the human effort necessary to aggregate data from multiple heterogeneous systems. Currently, integrating information from multiple domains and applications is technically challenging. Using existing transformation design systems is difficult because the transformations generally have to be designed by knowledgeable programmers. They are often one-to-one mappings, which must be modified or redesigned when a new data source needs to be integrated. Our approach represents an advance for data aggregation problems, because it allows one to implement a data pipeline that can normalize data from a wide variety of sources without reprogramming. The new AutoTrans technology represents the next generation of this approach. It will markedly decrease the human time and the skill level required to develop and maintain these powerful pipelines. This in turn will produce a qualitative difference in how broadly this technology can be applied in commercial and military systems.
Fetch Technologies | Date: 2011-07-26
In accordance with an embodiment, data may be automatically extracted from semi-structured web sites. Unsupervised learning may be used to analyze web sites and discover their structure. One method utilizes a set of heterogeneous experts, each expert being capable of identifying certain types of generic structure. Each expert represents its discoveries as hints. Based on these hints, the system may cluster the pages and text segments and identify semi-structured data that can be extracted. To identify a good clustering, a probabilistic model of the hint-generation process may be used.
Agency: Department of Defense | Branch: Navy | Program: SBIR | Phase: Phase I | Award Amount: 99.98K | Year: 2011
Here we propose to develop technologies that automatically generate a social network by analyzing incoming data feeds, integrating the data around the entities, and (semi-)automatically generating the relations between the entities. Because the saliency of a network is critical for important intelligence products where people are on the line, we envision that the creation of a social network be done with a human-in-the-loop, where the computer does the hard analytic lifting and the human directs the kinds of relations to look for and verifies that the relations are indeed correct before sending the network on for network analysis. We propose to investigate two key technologies which directly relate to this SBIR topic: (1) automatic extraction of relations from multiple disparate information feeds, and (2) workflows for rapid definition, creation, modification and extraction of relations. Both of these are critical pieces which are needed in order to speed up the creation of social networks for network analytics in intelligence products.
Agency: Department of Defense | Branch: Air Force | Program: SBIR | Phase: Phase II | Award Amount: 743.62K | Year: 2011
ABSTRACT: The ultimate aim of this project is to enable better entity-oriented situation awareness systems to be developed. Such systems should enable operators to rapidly “connect the dots” and allow them to track entities of interest. In this Phase II project we will implement an approach for collecting information about entities from multiple heterogeneous sources, and for consolidating that information into entity profiles. The resulting system, EMonitor, will enable profiles to be monitored, so that alerts can be generated when significant changes occur. The project will explore the application of the technology, including an application to streamline the Market Research and Source Selection Phases of the Air Force’s acquisition cycle, as well as an application that will help intelligence analysts monitor open source data. BENEFIT: To achieve significant improvement in situation awareness applications, we need easy-to-use systems that enable information to be integrated and monitored, without necessitating a long, arduous, expensive programming project for each application that is created. The research described here will develop such an approach for collecting, integrating and monitoring information about entities. The work has multiple applications for the Air Force, such as streamlining the Market Research and Source Selection Phases of the Air Force’s acquisition cycle. In addition, there are important commercial markets for the technology. One market is the background search industry. Currently, background checks on both companies and individuals tend to be done sporadically, but in many situations, monitoring relevant information sources would be highly preferred. The technology prototyped in this project will enable such applications to be developed.
Agency: Department of Defense | Branch: Defense Advanced Research Projects Agency | Program: SBIR | Phase: Phase II | Award Amount: 1.50M | Year: 2011
The goal in this project is to develop GroupPulse, a system that enables decision-makers to gain insights from online communities. To do this, the system will monitor one or more social media streams, identify and track groups and sub-communities, and provide realtime analysis of group dynamics and topics. While current technologies can address some of the key capabilities needed for GroupPulse, they do not easily integrate nor do they individually provide the scalability needed for our vision. Specifically, in order to realize the GroupPulse vision, we need to integrate community detection algorithms, text mining methods to identify topics and sub-topics as they are mentioned in the social media streams, and models of group dynamics to monitor topics of interest in a group and how those shift over time.
News Article | December 1, 2014
Have you ever come across a piece of marketing advice about social media best uses and wondered, "How could you possibly know that will work?" To take the mystery out of your social media strategy, here are eight tips backed by science about the best and worst ways to retweet, and what you’re about to read may surprise you.
Analysts at the Palo Alto Research Center conducted a massive study on 74 million tweets. Their goal? To get to the bottom of what caused people to retweet. They analyzed both the content and the context of the tweets, and arrived at several interesting conclusions.
The most surprising conclusion? The number of tweets you’ve posted in the past doesn’t impact how retweetable you are. Tweet as often as you want. It doesn’t change the percentage of your tweets that will get retweeted. Even the average number of daily tweets has no impact on how retweetable you are. Obviously, if you tweet more often, you’ll probably get retweeted more often, simply because there’s more material available. But your retweet rate isn’t related at all to how often you tweet.
The Palo Alto study confirmed that you’re more likely to get retweeted if you include a link in your tweet. However, it doesn’t make as big a difference as you might think. In general, 21% of tweets contain a link. Twenty-eight percent of retweets, on the other hand, contained links. The difference is there, but it’s easy to overstate.
Twitter may be known for being short and to the point, but it seems that TwitLonger links, which allow you to write longer tweets, make a big difference in retweet rate. Tweets with TwitLonger links are 6.06 times more likely to get retweeted. By contrast, Formspring, YouTube, Twitcam, and Foursquare links were actually less likely to get retweeted.
Only 10% of tweets contain a hashtag, but 21% of retweets contain hashtags. However, it’s not as simple as using a hashtag. Some hashtags perform worse than average, and others perform very well. Popular hashtags were more likely to get retweeted than rare ones, but the popularity of a hashtag alone isn’t enough to predict how retweetable it is. For example, #nowplaying was the most popular hashtag, but its retweet rate was 25% lower than average. #ff, on the other hand, which was the second most popular hashtag, more than doubled retweet rates, raising them by 149%.
It shouldn’t be surprising that a tweeter with more followers is more likely to get retweeted. The relationship is pretty straightforward. More surprising is the relationship between how many people a tweeter follows and their retweet rate: if you follow more people, you’re more likely to get retweeted. It’s unlikely that simply following a bunch of people is going to actually cause you to get retweeted more often. More likely, this is simply an indicator that if you build relationships on Twitter, you are more likely to get retweeted.
The Palo Alto study found that the longer you have been on Twitter, the more likely you are to get retweeted. Specifically, if it’s been over 300 days, you get a boost. The rate doesn’t seem to increase after about 400 days. Interestingly, if you’re new to Twitter, you also get a slight boost.
According to a study conducted by Fetch Technologies and published in the Fifth International Conference on Weblogs and Social Media, most retweets are actually conducted by people who don’t normally tweet about the same topic. The authors of the paper monitored 30,000 tweeters for one month, gathering data on 768,000 tweets. To predict whether or not a user would retweet specific content, the researchers devised four different predictive mathematical models:
A general model, where the user would just retweet at random, weighted by the tweets they saw most recently.
A recent communication model, where the user would retweet based on how recently they had communicated with somebody.
An on-topic model, where the user would retweet based on how similar the content of the tweet was to their own profile of tweets.
A homophily model, where a user would retweet based on the similarity of their own profile to the other user’s profile.
Of the four models, the homophily model was the best fit for the entire data set, although all of the models in fact worked together. Now for the shocker. While it was true that users seemed to retweet based on how similar their profiles were, it was actually the least similar profiles that were the most likely to be retweeted. Amazingly, the more similar two people’s profiles were, and the more similar the tweet was to their own profile, the less likely they were to retweet it. In other words, it seems that most tweeters are actually looking to retweet something they haven’t tweeted about yet. Similarity, it turns out, isn’t what people want to retweet. Instead, they’re looking for something new to tweet about.
According to a study conducted at Wright State on weblogs and social media, causes have a tendency to get retweeted very frequently, but credit for the retweets is rarely present. After investigating popular tweets surrounding the health care reform debate, the Iran election, and the International Semantic Web Conference, they found that popular tweets could be divided into four categories. However, the first three, which all revolved around a cause of some kind, were much less likely to give credit to the source. They would get copied all the time, but without attribution. So, if your only goal is to make a difference, then a cause is the way to do it. But if you actually want people to give you credit, you’re better off just sharing information.
Of course, it’s worth realizing that a retweeter doesn’t necessarily need to give credit to your Twitter handle as long as they are referring people to something on one of your properties. It’s possible to support a cause and get credit if it involves a form, a page, or an application somewhere on your site.
While the science of retweets is still very new, the scientific community is already teaching us new and unexpected things about how and why people share social content. How about you? What have you noticed about what gets retweeted, and what gets ignored? How are you using data to support your strategy?
Carter Bowles is a strategist who leverages his technical knowledge of statistics (which he holds a bachelor's degree in) and his years of experience with SEO and content marketing to organize comprehensive marketing strategies. He works at Northcutt.com.
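As a rough check on the link and hashtag percentages quoted in the article above, the relative retweet likelihood of tweets with a given feature can be derived from Bayes' rule: P(retweet | feature) / P(retweet) = P(feature | retweet) / P(feature). The sketch below applies that identity to the article's figures. It assumes the "share of tweets" percentages describe all tweets in the sample, and the function name retweet_lift is introduced here purely for illustration.

```python
def retweet_lift(share_among_retweets: float, share_among_all_tweets: float) -> float:
    """Relative retweet likelihood for tweets that have a feature.

    By Bayes' rule, P(retweet | feature) / P(retweet) equals
    P(feature | retweet) / P(feature), so the two shares are enough.
    """
    if share_among_all_tweets <= 0:
        raise ValueError("share_among_all_tweets must be positive")
    return share_among_retweets / share_among_all_tweets


# Figures quoted in the article: 28% of retweets vs. 21% of tweets contain a
# link; 21% of retweets vs. 10% of tweets contain a hashtag.
link_lift = retweet_lift(0.28, 0.21)
hashtag_lift = retweet_lift(0.21, 0.10)

print(f"links: {link_lift:.2f}x, hashtags: {hashtag_lift:.2f}x")
```

Under these assumptions a link corresponds to roughly a one-third increase in retweet likelihood and a hashtag to roughly a doubling, both of which are correlations in the sample rather than causal effects.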