====== Sequential Pattern Mining ====== ===== Description ===== **Sequential pattern mining** is a popular [[:pattern_mining|pattern mining]] task. It is used to find subsequences that appear frequently appear in a set of sequences. **Sequential pattern mining** is used to analyze data that is represented as sequences of symbols. ===== An Example ===== The input data for sequential pattern mining is a **sequence database. **It is a set of sequences. Each **sequence **is an ordered list of** itemsets. **And an **itemset** is a set of symbols. For example, consider the sequence database below, which represents shopping data. This database contains four sequences (rows) called S1, S2, S3 and S4. |**Sequence** |**items ** | |S1|(a), (a b c), (a c), (d), (c f)| |S2|(a d), ( c ) , (b c), (a e)| |S3|(e f), (a b), (d f), ( c ), (b)| |S4|(e), (g), (a f), ( c ), (b), ( c )| The first sequence indicates that a customer has purchased some item (a). Then the customer has bought some items a, b and c together. Then the customer, has bought items a and c together. Then, he bought item d, and finally the customer bought items c and f together. The goal of sequential pattern mining is to find all the **frequent sequential patterns. A frequent sequential pattern **is a subsequence that appears in many input sequences. To find frequent sequential patterns, a user must have a **sequence database** (as the one presented above), and must also choose a value for a parameter called** //minsup//** , which means **minimum support threshold.** The **output of frequent sequential pattern mining **is all subsequences that appear at least //minsup// sequences of the database. For example, for the above database and //minsup// = 2, the three following subsequences are frequent sequential patterns: (b c) (a) because it appears in at least two sequences\\ (f) (b) because it appears in at least two sequences\\ (f) (b) ( c ) because it appears in at least two sequences There are also other frequent sequential patterns, but here only a few are listed for brevity. ===== Applications ===== There are many applications. A few examples are: * Find the frequent sequences of purchases made by customers in a store (each sequence is a list of purchased items by a customer). * Find the frequent sequence of locations visited by tourists in a city (each sequence is a list of locations visited by a tourist). * Find the frequent sequences of words that appear in a book (each sentence is a sequence of words) ===== Algorithms ===== Numerous algorithms have been designed for sequential pattern mining. But most of them are inspired by one of the following classic algorithms: **SPAM**, **SPADE**, **PrefixSpan**, **GSP**, **CM-SPAM **and **CM-SPADE**. * **GSP**: R. Agrawal, and R. Srikant, Mining sequential patterns, ICDE 1995, pp. 3–14, 1995. * **SPAM**: Ayres, J. Flannick, J. Gehrke, and T. Yiu, Sequential pattern mining using a bitmap representation, KDD 2002, pp. 429–435, 2002. * **SPADE**: M. J. Zaki, SPADE: An efficient algorithm for mining frequent sequences, Machine learning, vol. 42(1-2), pp. 31–60, 2001. * **PrefixSpan:** J. Pei, et al. Mining sequential patterns by pattern-growth: The prefixspan approach, IEEE Transactions on knowledge and data engineering, vol. 16(11), pp. 1424–1440, 2004. * **CM-SPAM and CM-SPADE**: P. Fournier-Viger, A. Gomariz, M. Campos, and R. Thomas, Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information, PAKDD 2014, pp. 40–52, 2014. ===== Survey papers ===== Here are a few survey papers that gives an overview of itemset mining. * Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). [[http://www.philippe-fournier-viger.com/dspr-paper5.pdf|A Survey of Sequential Pattern Mining]]. Data Science and Pattern Recognition (DSPR), vol. 1(1), pp. 54-77. * Gan, W., Lin, J. C. W., Fournier-Viger, P., Chao, H.-C., Yu, P. S. (2019) **A Survey of Parallel Sequential Pattern Mining**. ACM Transactions on Knowledge Discovery from Data (TKDD), 13(3): 25:1-25:34. ===== Tutorial videos ===== * … ===== Software and datasets ===== To apply** frequent itemset mining**, the SPMF software provides open-source efficient implementations of many algorithms and variations. The **SPMF software** can be downloaded from the website: [[http://www.philippe-fournier-viger.com/spmf/ |http://www.philippe-fournier-viger.com/spmf/ ]]. To install the software, you may follow the instructions on the **download page **of that website. Then, you may check the **documentation page **which provides examples of how to run various algorithms such as GSP, PrefixSpan, CM-SPAM, CM-SPADE and others for sequential pattern mining. Besides, you may check the **datasets ****page **of that website provides several benchmark datasets for testing the algorithms and comparing their performance.