This post is based on Stanford CS246 lectures by Jure Leskovec and Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman.

#### Random sampling algorithm

Recall that Apriori Algorithm requires passes to find frequent itemsets of size . In many applications, however, finding every frequent itemsets is not important. If it is enough to find most but not all frequent itemsets, there are algorithms that requires 2 or fewer passes for all itemset sizes.

The random sampling algorithm requires 2 passes for all sizes but may miss some frequent itemsets. The algorithm reads the entire file once and selects baskets with a fixed probability . If the file contains baskets, we can expect the algorithm to select baskets.

To avoid bias, baskets should not be selected from one part of the file. For example, the transaction data in a supermarket may be organized by the date of sale, which means the first part of the file is older than the rest and the frequent itemsets found by analyzing such data may be less relevant.

Selected baskets are loaded into the main memory. The algorithm then runs Apriori or one of its improvements in the main memory. Given a support threshold , an itemset is frequent in the sample dataset if its support is at least in the sample dataset.

Note that Apriori still requires passes to find itemsets of size , but there is no disk access since the baskets are in the main memory. Thus, the algorithm so far requires only a single pass over the data on disk.

The random sampling algorithm produces two kinds of errors:

*false negatives*: itemsets that are frequent in the full dataset but infrequent in the sample dataset.*false positives*: itemsets that are infrequent in the full dataset but frequent in the sample dataset.

We can eliminate false positives by making a second pass over the full dataset and making sure that all the itemsets that are frequent in the sample dataset are frequent in the full dataset. To do this in a single pass, we need to be able to count all frequent itemsets of all sizes at once in the main memory. This is possible since the support threshold is set to be high enough to make the number of candidate itemsets small.

We can’t eliminate false negatives. However, we can reduce their number by lowering the support threshold during the first pass of the algorithm to, say, . This requires more main memory and increases the number of false positives, but catches more truly frequent itemsets. The second pass of the algorithm will then eliminate the false positives, leaving us with, hopefully, very few false negatives.

#### SON algorithm

The SON (Savasere, Omiecinski, and Navathe) algorithm finds every frequent itemsets of all sizes in 2 passes. The first pass of the SON algorithm is very similar to the first pass of the random sampling algorithm. Instead of randomly sampling the dataset with a fixed probability , the SON algorithm divides the dataset into chunks of size . It then loads the baskets into the main memory and runs the Apriori or one of its improvements in the main memory. Given a support threshold , an itemset is frequent in a chunk if its support is at least in the chunk.

**Theorem.** If an itemset in infrequent in every chunk, is infrequent in the full dataset.

*Proof.* Since is infrequent in every chunk, in each chunk. Since there are chunks, in the full dataset.

The contrapositive of the theorem follows:

**Corollary.** If an itemset in frequent in the full dataset, is frequent in at least one chunk.

This guarantees that all truly frequent itemsets are in the set of itemsets that are frequent in at least one chunk. Therefore, if we store all itemsets that are frequent in at least one chunk, we know that we won’t have any false negative.

The second pass of the SON algorithm is exactly the same as the second pass of the random sampling algorithm: the algorithm goes over the full dataset and makes sure that all the itemsets that are frequent in at least one chunk are frequent in the full dataset.

#### SON algorithm with MapReduce

The SON algorithm lends itself to distributed data mining and can be implemented in two map-reduce steps.

The input to the first Map tasks is a fraction of the full dataset. If the full dataset consists of baskets, each chunk consists of baskets. The output of the first Reduce tasks is itemsets that are frequent in at least one chunk.

The first map-reduce step:

map(key, value): // value is a chunk of the full dataset Count occurrences of itemsets in the chunk. for itemset in itemsets: if supp(itemset) >= p*s emit(itemset, null) reduce(key, values): emit(key, null)

The input to the second Map tasks is the candidate itemsets and a fraction of the full dataset.

The second map-reduce step:

map(key, value): // value is the candidate itemsets and a chunk of the full dataset Count occurrences of itemsets in the chunk. for itemset in itemsets: emit(itemset, supp(itemset)) reduce(key, values): result = 0 for value in values: result += value if result >= s: emit(key, result)

#### Toivonen’s algorithm

Toivonen’s algorithm is a different way of using randomness to find frequent itemsets. Unlike the random sampling algorithm, Toivonen’s algorithm finds every frequent itemsets of all sizes in 2 passes on average. However, each round of Toivonen’s algorithm has a small probability of not producing any answer at all, in which case we need to repeat the algorithm until it produces an answer.

For details of Toivonen’s algorithm, please see Section 6.4.5 and 6.4.6 of Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman.

Biswarup Ghosh says

hello how Do i make a chunk of a file on the go in python in the first map algorithm