WA, United States

Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

In this chapter, we elaborate on the overall TMG framework for mining ordered subtrees, as described in Chapter 4 (Tan 2008). In an ordered tree, the order of the children of each internal node is fixed. Such trees have found many useful applications in areas such as vision, natural language processing, molecular biology and program compilation (Wang, Zhang, Jeong & Shasha 1994). In research on automatic natural language processing, dictionary definitions are represented syntactically as trees. Computational linguists extract semantic information about these definitions and, in the process, construct semantic taxonomies (Chodorow & Klavans 1990; Neff, Roy & Omneya 1998). In molecular biology, large numbers of analyzed RNA structures are collected and stored in the form of ordered labeled trees. When researchers want to acquire information about a new RNA structure, it is compared against those already in the database in order to detect structural similarities and thereby relate different RNA structures (Shapiro & Zhang 1990). Since the researchers maintain the RNA-related information in the same order, a comparison of ordered subtrees is sufficient. A general observation about applications where ordered subtree mining is suitable is that the left-to-right order among sibling nodes is fixed and known beforehand. For the same reason, ordered subtree mining is useful for querying a single database, where the order restriction can be placed on the query subtree. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

This chapter discusses some new research directions in the frequent subtree mining field, from both the application and technical perspectives. Since frequent subtree mining (FSM) is a relatively new field compared with frequent itemset/sequence mining, many lessons can be learned from the more mature research in frequent itemset/sequence mining. A drawback of frequent pattern mining in general is that, for a given support threshold, the number of frequent patterns often becomes quite large due to some characteristics of the database. This may cause not only algorithm complexity problems, but also significant delays in the analysis and interpretation of the results. Many of the patterns may be redundant, not useful for the application at hand, or not of interest to the user. Furthermore, it is not always clear what support threshold is satisfactory for obtaining reasonable results. These are all important research areas, with some significant achievements in complexity reduction from the algorithmic and application perspectives. Some of these or similar ideas can, to a certain extent, already be applied in the FSM field, but others will need refinements and extensions to be flexible enough to cope with the additional structural properties of the data. In Section 12.2, we highlight some of the work in frequent itemset/sequence mining where the same or a similar idea can be applied and prove useful in the FSM field. At the end of Section 12.2, we look at some work that has already been initiated in frequent pattern filtering and the incorporation of application-oriented constraints. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

In general, for frequent pattern mining problems, the candidate enumeration process exhaustively enumerates all possible combinations of items that occur in a given database. This process is known to be very expensive since, in many circumstances, the number of candidates to enumerate is quite large, and the frequent patterns present in real-world data can be fairly long (Bayardo 1998). Efficient techniques that address the different issues relevant to the enumeration problem are therefore highly sought after. In addition to the enumeration problem, another important problem of frequent pattern mining is to efficiently count candidates and prune away any itemsets discovered to be infrequent. Due to the large number of candidates that can be generated from the vast amount of data, an efficient and scalable counting approach is critically important. Another problem when extracting all frequent subtrees from a complex tree database is that the number of patterns presented to the user can be very large, which makes the results hard to analyze and gain insights from. © 2011 Springer-Verlag Berlin Heidelberg.
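The interplay of candidate enumeration, counting and pruning described in the abstract above can be sketched for plain itemsets as follows. This is a generic level-wise (Apriori-style) illustration, not the chapter's own algorithm; the function name and the absolute `min_support` count are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: count candidates, prune infrequent ones, then extend."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Counting step: number of transactions that contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Pruning step: discard candidates below the support threshold.
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Enumeration step: join frequent k-itemsets into (k+1)-candidates,
        # keeping only those whose k-subsets are all frequent.
        k += 1
        next_candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                next_candidates.add(union)
        candidates = list(next_candidates)
    return frequent

result = frequent_itemsets(
    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], min_support=2
)
```

In this toy database every single item and every pair is frequent, while the triple `{a, b, c}` occurs only once and is pruned. Even here the counting step scans all transactions for every candidate, which hints at why scalable counting matters on large data.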


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

So far, the book has focused on the mining of data whose underlying structure is characterized by special types of graphs in which cycles are not allowed, i.e. acyclic graphs or trees. The focus of this chapter is on the frequent pattern mining problem where the underlying structure of the data can be a general graph in which cycles are allowed. These kinds of representations allow one to model complex aspects of domains such as chemical compounds, networks, the Web and bioinformatics. Generally speaking, graphs have many undesirable theoretical properties with respect to algorithmic complexity. In the graph mining problem, the common requirement is the systematic enumeration of subgraphs from a given graph, known as the frequent subgraph mining problem. From the available graph analysis methods, we narrow our focus to this problem, as it is the prerequisite for the detection of interesting associations among graph-structured data objects and has many important applications. For an extensive overview of graph mining in a general context, including different laws, data generators and algorithms, please refer to (Chakrabarti & Faloutsos 2006; Washio & Motoda 2003; Han & Kamber 2006). Due to the existence of cycles in a graph, the frequent subgraph mining problem is much more complex than the frequent subtree mining problem. Even though it is theoretically an NP-complete problem, in practice a number of approaches are applicable to the analysis of real-world graph data. We will look at a number of different approaches to the frequent subgraph mining problem and a number of approaches for the analysis of graph data in general. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

For certain applications, the distance between the nodes in a hierarchical structure could be considered important, and two embedded subtrees with different distance relationships among the nodes need to be considered as separate entities. The embedded subtrees extracted using the traditional definition cannot be further distinguished based upon the node distances within the subtree. In this chapter, we describe the extension of the general TMG framework to enable the mining of distance-constrained embedded subtrees (Hadzic 2008; Tan 2008). In such subtrees, the distances of the nodes relative to the root of the subtree need to be taken into account during the candidate enumeration phase. The distances of nodes relative to the root (node depth) of a particular subtree need to be stored and used as an additional equality criterion for grouping the enumerated candidate subtrees. In Chapter 9, we will illustrate scenarios and applications where the mining of distance-constrained embedded subtrees would be preferable to the mining of traditional embedded subtrees, since the extracted subtree patterns will be more informative. We also highlight the importance of distance-constrained subtree mining in the context of web log mining, where the web logs are represented in tree-structured form. In what follows, we discuss the importance of distance-constrained embedded subtrees from a more general perspective and relate it to some previous work on extracting tree-structured queries. © 2011 Springer-Verlag Berlin Heidelberg.
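The effect of using node depth as an additional equality criterion can be illustrated on the smallest embedded subtrees, namely ancestor-descendant pairs. The sketch below is an assumption-laden toy, not the TMG implementation: the parent-array encoding and the names `ancestor_descendant_pairs` and `group_patterns` are hypothetical, and it counts occurrences within one tree rather than support over a database.

```python
from collections import Counter

def ancestor_descendant_pairs(labels, parent):
    """Yield (ancestor_label, descendant_label, distance) for every such pair."""
    for node in range(len(labels)):
        distance = 0
        current = parent[node]
        while current is not None:
            distance += 1
            yield labels[current], labels[node], distance
            current = parent[current]

def group_patterns(labels, parent, use_distance):
    """Group two-node embedded subtrees, optionally keyed by depth as well."""
    counts = Counter()
    for anc, desc, dist in ancestor_descendant_pairs(labels, parent):
        key = (anc, desc, dist) if use_distance else (anc, desc)
        counts[key] += 1
    return counts

# Toy tree: a is the root, b and one c are children of a, another c is under b.
labels = ["a", "b", "c", "c"]
parent = [None, 0, 1, 0]

traditional = group_patterns(labels, parent, use_distance=False)
constrained = group_patterns(labels, parent, use_distance=True)
```

Under the traditional grouping, both occurrences of `a` above `c` fall into one pattern. With the distance as part of the key, they separate into a depth-1 and a depth-2 pattern, which is the distinction the abstract describes.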
