183rd St. SE.

WA, United States

183rd St. SE.

WA, United States
SEARCH FILTERS
Time filter
Source Type

Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

The contents of the book have focused so far on the mining of data where the underlying structure is characterized by special types of graphs where cycles are not allowed, i.e. acyclic graphs or trees. The focus of this chapter is on the frequent pattern mining problem where the underlying structure of the data can be of general graph type where cycles are allowed. These kinds of representations allow one to model complex aspects of the domain such as chemical compounds, networks, the Web, bioinformatics, etc. Generally speaking, graphs have many undesirable theoretical properties with respect to algorithmic complexity. In the graph mining problem, the common requirement is the systematic enumeration of sub-graphs from a given graph, known as the frequent subgraph mining problem. From the available graph analysis methods, we will narrow our focus to this problem as it is the prerequisite for the detection of interesting associations among graph-structured data objects, and has many important applications. For an extensive overview of graph mining in a general context, including different laws, data generators and algorithms, please refer to (Chakrabati & Faloutsos 2006; Washio & Motoda 2003, Han & Kamber 2006). Due to the existence of cycles in a graph, the frequent subgraph mining problem is much more complex than the frequent subtree mining problem. Even though theoretically it is an NP complete problem, in practice, a number of approaches are very applicable to the analysis of real-world graph data. We will look at a number of different approaches to the frequent subgraph mining problem and a number of approaches for the analysis of graph data in general. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

In general, for frequent pattern mining problems, the candidate enumeration process exhaustively enumerates all possible combinations of itemsets that are a subset of a given database. This process is known to be very expensive since, in many circumstances, the number of candidates to enumerate is quite large, and also the frequent patterns present in real-world data can be fairly long (Bayardo 1998). Efficient techniques attacking different issues and problems relevant to the enumeration problem are therefore highly sought after. In addition to the enumeration problem, another important problem of frequent pattern mining is to efficiently count and prune away any itemsets discovered to be infrequent. Due to the large number of candidates that can be generated from the vast amount of data, an efficient and scalable counting approach is critically important. Another problem when extracting all frequent subtrees from a complex tree database, is that the number of patterns presented to the user can be very large, thereby making the results hard to analyze and gain insights from. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

For certain applications, the distance between the nodes in a hierarchical structure could be considered important and two embedded subtrees with different distance relationships among the nodes need to be considered as separate entities. The embedded subtrees extracted using the traditional definition are incapable of being further distinguished based upon the node distance within that subtree. In this chapter, we describe the extension of the general TMG framework, to enable the mining of distance-constrained embedded subtrees, (Hadzic 2008; Tan 2008). In such subtrees, the distances of the nodes relative to the root of the subtree need to be taken into account during the candidate enumeration phase. The distances of nodes relative to the root (node depth) of a particular subtree will need to be stored and used as an additional equality criterion for grouping the enumerated candidate subtrees. In Chapter 9, we will illustrate scenarios and applications where the mining of distance-constrained embedded subtrees would be preferable to mining of traditional embedded subtrees, since the extracted subtree patterns will be more informative. We also highlight the importance of distance-constrained subtree mining in the context of web log mining, where the web logs are represented in tree-structured form. In what follows, we will discuss the importance of distance-constrained embedded subtrees from a more general perspective and relate it to some previous work on extracting tree-structured queries. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

Data mining strives to find frequent patterns in data. The type of data, structures and patterns to be found characterizes a task in data mining. Advancements in data collection technologies have contributed to the high volumes of sales data. Sales data typically consists of transaction timestamps and the list of items bought by customers. If one is interested in inter-transactional patterns, sales data is a good example of a sequential pattern. In addition, sequential patterns exist in data such as DNA string databases, time series, occurrences of recurrent illness, etc. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

This chapter will discuss some new research directions in the frequent subtree mining field. This will be discussed from both the application and technical perspectives. Since frequent subtree mining (FSM) is a relatively new field compared with frequent itemset/sequence mining, many lessons can be learned form the more mature research in frequent itemset/sequence mining. A drawback of frequent pattern mining in general is that often, for a set support threshold, the number of frequent patterns becomes quite large due to some characteristics of the database. This may cause not only algorithm complexity problems, but also significant delays in the analysis and interpretation of the results. Many of the patterns may not be useful for the application at hand and/or are redundant, or not of interest to the user. Furthermore, it is also not always clear what support threshold is satisfactory for obtaining reasonable results. These are all important research areas, with some significant achievements in complexity reduction from the algorithmic and application perspectives. Some of these or similar ideas can, to a certain extent, already be applied in the FSM field, but others will need refinements and extensions to be flexible enough to cope with the additional structural properties of the data. In Section 12.2, we highlight some of the work in frequent itemset/sequence mining where the same or similar idea can be applied and prove useful in the FSM field. At the end of Section 12.2, we look at some work that has already been initiated in frequent pattern filtering and the incorporation of application-oriented constraints. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

This chapter describes the extension of the TMG framework for the mining of unordered induced/embedded subtrees. While in online tree-structured documents such as XML the information is presented in a particular order, in many applications the order among the sibling-nodes is considered unimportant or irrelevant to the task and is often not available. If one is interested in comparing different document structures, or the document is composed of data from several heterogeneous sources, it is very common for the order of sibling nodes to differ, although the information contained in the structure is essentially the same. In these cases, mining of unordered subtrees is much more suitable as a user can pose queries and does not have to worry about the order. All matching sub-structures will be returned with the difference being that the order of sibling nodes is not used as an additional candidate grouping criterion. Hence, the main difference when it comes to the mining of unordered subtrees is that the order of sibling nodes of a subtree can be exchanged and the resulting tree is still considered the same. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

The aim of this chapter is to discuss the applications of tree mining algorithms with respect to the general knowledge analysis task, as well as some specific applications. The implications of using different tree mining parameters (i.e. subtree types, support definitions, constraints) are discussed, and illustrative scenarios are used to indicate useful application areas for different parameters. This kind of overview will be particularly useful for those not so familiar with the area of tree mining as it can reveal useful applications within their domain of interest. It gives guidance as to which type of tree mining will be most useful for their particular application. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

As mentioned in Chapter 1, an important category of complex data is tree-structured data. It occurs in a variety of different domains and applications such as Web Intelligence applications, bioinformatics, natural language processing, programming compilation, scientific knowledge management and querying, etc. (Wang et al. 1994). Mining of tree-structured data introduces significant new challenges which are the subject of this chapter. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

In this chapter, we will elaborate on the overall TMG framework for mining ordered subtrees as described in Chapter 4 (Tan 2008). In an ordered tree, for each internal node, the order of its children is fixed. Such trees have found many useful applications in areas such as vision, natural language processing, molecular biology, programming compilation etc. (Wang, Zhang, Jeong & Shasha 1994). In the research on automatic natural language processing, the dictionary definitions are represented syntactically as trees. Computational linguists extract semantic information about these definitions and in the process construct semantic taxonomies (Chodorow & Klavans, 1990; Neff, Roy & Omneya 1998). In the molecular biology field, large amounts of analyzed RNA structures are collected and stored in the form of ordered labeled trees. When the researchers want to acquire information about a new RNA structure, it is compared against those already in the database in order to detect structural similarities and thereby relate different RNA structures (Shapiro & Zhang 1990). Since the researchers will maintain the RNA-related information in the same order, a comparison of ordered subtrees is sufficient. A general observation about applications where ordered subtree mining is suitable is that the left-to-right order among sibling nodes is commonly fixed and known beforehand. Ordered subtree mining is useful for querying a single database where the order restriction can be placed on the query subtree because it is known beforehand. © 2011 Springer-Verlag Berlin Heidelberg.


Hadzic F.,Curtin University Australia | Tan H.,183rd St. SE. | Dillon T.S.,Curtin University Australia
Studies in Computational Intelligence | Year: 2011

In this chapter we discuss the main issues that arise when developing tree mining algorithms. The way that an algorithm is implemented often greatly determines its efficiency. The main aspects which affect the overall performance of the algorithm are the way that the document structure is represented at the algorithm level, the way that candidate subtrees are enumerated and counted and, in the case of unordered subtree mining, the canonical form ordering schemes. © 2011 Springer-Verlag Berlin Heidelberg.

Loading 183rd St. SE. collaborators
Loading 183rd St. SE. collaborators