Frequent subtree mining

In computer science, frequent subtree mining is the problem of finding all patterns in a given database whose support (a metric related to its number of occurrences in other subtrees) is over a given threshold.[1] It is a more general form of the maximum agreement subtree problem.[2]

Definition

Frequent subtree mining is the problem of trying to find all of the patterns whose "support" is over a certain user-specified level, where "support" is calculated as the number of trees in a database which have at least one subtree isomorphic to a given pattern.[3]

Formal definition

The problem of frequent subtree mining has been formally defined as:[1]

Given a threshold minfreq, a class of trees \mathcal{C}, a transitive subtree relation P\preceq T between trees P, T \in\mathcal{C}, a finite set of trees \mathcal{D}\subseteq\mathcal{C}, the frequent subtree mining problem is the problem of finding all trees \mathcal{P}\subset\mathcal{C} such that no two trees in \mathcal{P} are isomorphic and
\forall P\in\mathcal{P} : \quad \mathrm{freq}(P, \mathcal{D}) = \sum\nolimits_{T \in \mathcal{D}} d(P,T)\geq \mathrm{minfreq},
where d is an anti-monotone function such that if P' \preceq P then
\forall T\in\mathcal{C} : \quad d(P',T)\geq d(P,T).

Algorithms

In 2002, Mohammed J. Zaki introduced TreeMiner, an efficient algorithm for solving the frequent subtree mining problem, which used a "scope list" to represent tree nodes and which was contrasted with PatternMatcher, an algorithm based on pattern matching.[4]

Applications

Domains in which frequent subtree mining is useful tend to involve complex relationships between data entities: for instance, the analysis of XML documents often requires frequent subtree mining.[1] Another domain where this is useful is the web usage mining problem: since the actions taken by users when visiting a web site can be recorded and categorized in many different ways, complex databases of trees need to be analyzed with frequent subtree mining.[4] Other domains in which frequent subtree mining is useful include computational biology,[5][6] RNA structure analysis,[6] pattern recognition,[7] bioinformatics,[8] and analysis of the KEGG GLYCAN database.[9]

Challenges

Checking whether a pattern (or a transaction) supports a given subgraph is an NP-complete problem, since it is an NP-complete instance of the subgraph isomorphism problem.[7] Furthermore, due to combinatorial explosion, according to Lei et al., "mining all frequent subtree patterns becomes infeasible for a large and dense tree database".[10]

References

  1. 1 2 3 Chi, Yun; Muntz, Richard R.; Nijssen, Siegfried; Kok, Joost N. (28 June 2005). "Frequent Subtree Mining - An Overview". Fundamenta Informaticae 66: 161–198. Retrieved 3 July 2014.
  2. Deepak, Akshay; Fernández-Baca, David; Tirthapura, Srikanta; Sanderson, Michael J.; McMahon, Michelle M. (July 2013). "EvoMiner: frequent subtree mining in phylogenetic databases". Knowledge and Information Systems. doi:10.1007/s10115-013-0676-0. Retrieved 4 July 2014.
  3. Dai, H., Srikant, R. and Zhang, C. (2004). "Advances in Knowledge Discovery and Data Mining." 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26–28, 2004, Proceedings. 1st ed. p. 65.
  4. 1 2 Zaki, Mohammed J. (2002). "Efficiently mining frequent trees in a forest". Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining: 71–80. Retrieved 16 June 2014.
  5. Deepak, Akshay, David Fernández-Baca, Srikanta Tirthapura, Michael J. Sanderson, and Michelle M. McMahon. "EvoMiner: frequent subtree mining in phylogenetic databases." Knowledge and Information Systems (2011): 1-32.
  6. 1 2 Chi, Yun, Yirong Yang, and Richard R. Muntz. "Canonical forms for labelled trees and their applications in frequent subtree mining." Knowledge and Information Systems 8, no. 2 (2005): 203–234.
  7. 1 2 Chi, Yun; Yang, Yirong; Muntz, Richard R. (2004). "Mining Frequent Rooted Trees and Free Trees Using Canonical Forms" (PDF). Knowledge and Information Systems. Retrieved 16 June 2014.
  8. Xiao, Yongqiao; Yao, Jenq-Foung; Li, Zhigang; Dunham, Margaret H. (2003). "Efficient data mining for maximal frequent subtrees". Third IEEE International Conference on Data Mining: 379–386. Retrieved 16 June 2014.
  9. Aoki-Kinoshita, Kiyoko F. (2009). Glycome Informatics: Methods and Applications. CRC Press. p. 141. ISBN 9781420083347.
  10. Zou, Lei, Yansheng Lu, Huaming Zhang, and Rong Hu. "Mining frequent induced subtree patterns with subtree-constraint." In Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on, pp. 3–7. IEEE, 2006.
This article is issued from Wikipedia - version of the Wednesday, September 02, 2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.