Data Compression has been one of the enabling technologies for the on-going digital multimedia revolution for decades which resulted in renowned algorithms like Huffman Encoding, LZ77, Gzip, RLE and JPEG etc. Researchers have looked into the character/word based approaches to Text and Image Compression missing out the larger aspect of pattern mining from large databases. The central theme of our compression research focuses on the Compression perspective of Data Mining as suggested by Naren Ramakrishnan et al. wherein efficient versions of seminal algorithms of Text/Image compression are developed using various Frequent Pattern Mining(FPM)/Clustering techniques. This paper proposes a cluster of novel and hybrid efficient text and image compression algorithms employing efficient data structures like Hash and Graphs. We have retrieved optimal set of patterns through pruning which is efficient in terms of database scan/storage space by reducing the code table size. Moreover, a detailed analysis of time and space complexity is performed for some of our approaches and various text structures are proposed. Simulation results over various spare/dense benchmark text corpora indicate 18% to 751% improvement in compression ratio over other state of the art techniques. In Image compression, our results showed up to 45% improvement in compression ratio and up to 40% in image quality efficiency.

Frequent Pattern Mining(FPM) is a non-trivial phase of Association Rule Mining(ARM) and is formally defined as follows: Let _{1}, _{2}, _{3}, …, _{n}_{1}, _{2}, _{3}, …, _{m}_{i}

Compression, Approximation, Induction, Search and Querying are the different perspectives of Data Mining identified by Naren Ramakrishnan et al. (

Lossy Compression is a class of data encoding method that uses inexact approximations. It finds extensive applications in the transfer and storage of multimedia data like Image, Audio and Video where slight data loss might not be noticed by the end user. It achieves better compression ratio compared to lossless compression. MPEG, MP3, JPEG, JBIG, PGF, WMA etc. are a few techniques based on this principle (

This work focuses on efficient Data Mining approaches to Text and Image compression. The process of Data Mining focuses on generating a reduced(smaller) set of patterns(knowledge) from the original database, which can be viewed as a compression technique. FPM is incorporated in Huffman Encoding to come up with an efficient text compression setup. Moreover, the concept of clustering is employed in our knowledge engineering based compression approaches where clustering works on an objective function which results in high intra similarity within the clusters and less inter similarity with other clusters. The rest of the paper is organized as follows. Related studies are given in Section 2, Formal presentation of the problem and proposed algorithms are given in Section 3. Section 4 explains a detailed theoretical analysis of various text characterization along with time and space complexity analysis. Details of the datasets, Results and Performance discussion are given in Section 5 and Section 6 concludes with further research directions.

Some of the major class of algorithms in statistical based encoding include Shannon-Fano Encoding, Huffman Encoding, Dynamic Huffman, Canonical Huffman, Adaptive Huffman, LDMC, Arithmetic Encoding(AE), PAQ, Context Tree Weighting, and RLE etc. (

Frequent Pattern Mining Algorithms(FPM) find applications in several Data Mining tasks such as ARM, Classification, Clustering etc. Apriori, one of the first algorithms for ARM is very commonly used which uses a

Our initial contribution started with Frequent Pattern Mining based text compression and gradually with graceful transitions towards efficient versions of them. Later the clustering approach to text compression evolved to observe the performance of compression with respect to frequent phrase based clustering. Finally our most time efficient version of text compression using an hybrid approach was developed.

In this work, a sequence (pattern) is considered to be the character/s in an ordered and contiguous manner w.r.t the index. Consider the text ‘_{abs}_{abs}_{mod}_{mod}_{abs}_{mod}_{abs}_{mod}_{abs}_{abs}_{mod}

A formal definition of Frequent Pattern based Data Compression(Data refers to Text or Image or Video) problem is, “Given an input file _{abs}

The algorithm mines the patterns in the text through the process of Frequent Pattern Mining and thereby encodes each of those patterns using Huffman Encoding (_{abs}

FPH-AS has better compression ratio than FPH-Initial because of the introduction of a novel pattern counting mechanism which is the modified frequency (_{mod}_{mod}

In this approach, through the concept of _{mod}_{abs}_{mod}_{mod}

According to a statistics, we need at most 20–21 bits if we want to denote each word with a unique binary code. Therefore if we consider a text with

Scan the input text and find the _{abs}_{1}.

Generate the Frequent sequences considering word as the atomic unit. Let the set of frequent sequences be

Construct the Huffman tree with the set _{1} for encoding. Decoding is done as per the conventional Huffman strategy.

_{R}_{L}_{R}_{L}

Even though, FPH algorithms does better on compression ratio, reduction of time to compress to a great extent is still a challenging research. A novel algorithm for mining sequential patterns(word/s) using a graph and hybridization of non-Huffman based compression algorithms is the main focus of this work (_{mod}_{1}, _{2}, …, _{x}_{i}, q_{j}_{i}, q_{j}_{j}

Flow Chart of GA78.

Graph of text

Reducing the computation time to mine patterns by switching over to data clustering yielded us higher compression rates compared to FPH methods in few datasets. Also the problem of handling the elimination of redundant patterns is avoided in this method. The seminal Hierarchical Clustering technique has been modified in such a way that optimal number of words (patterns which are sequence of characters with a space as suffix) are obtained. An important aspect of this approach is the employment of cosine similarity measure to find the similarity between the partitions to place them in the right cluster. This approach made the most similar words to get clustered into one cluster which helped in reducing the generation of new codes thus reducing the code table size. For the input file _{i}_{p}_{i}_{y}_{i}

It has been observed that to find the number of unique words it takes

The complexity of calculating the cosine similarity of each pair of partitions is similar to insertion of entries in the table with the newly formed cluster, which takes ^{2}). —(

Time taken to scan the table subsequently to find maximum cosine value is ^{2}) + ^{2}) + ^{2}) + … + ^{2}) = ^{3}). —(

The deletion of entries in the cosine similarity table with at least one element from the newly formed cluster is similar to computing the complexity in scanning through the table to find if such an entry exists. Let us assume updation and deletion of cluster frequency vectors is done in ^{2}) + ^{2}) + ^{2}) + … + ^{2}) = ^{3}). —(

To retain only the non zero elements in cluster frequency vectors which is equal to finding the complexity of scanning all

Huffman codes are generated, which takes

Encoding ^{2}). —(

For decoding, it takes ^{2}). —(

Finally, the time complexity of the Clustering based Huffman approach is (^{3}) = ^{3}).

The space consumed to maintain the count of each word in ^{2}) space is taken by Cosine similarity table in worst case and to maintain code tables _{1} and _{2}, ^{2}) + ^{2}) + ^{2}) = ^{2}).

The generic function with respect to pattern growth for variation in

Given _{1}: ∀_{i}, c_{i}_{i+1}_{i}, c_{i+1}_{1}, 1 ≤ _{1} =

Given _{2}: ∀_{i}_{j}_{i}_{i+1}_{j}_{j+1}_{i}_{i+1}_{j+1}_{i}, c_{i+1}_{j}, c_{i+1}_{2}, 1 ≤ _{2}.

Given _{3}: ∀_{i}, c_{i}_{i+1}_{i}, c_{i+1}_{3}, 1 ≤ _{3}| =

We have taken the FPH-HB approach for a detailed study on the time complexity impacted on various text structures. Since FPH-HB has an efficient mechanism in mining the patterns along with computing the frequencies among other FPH approaches, it sounds better to choose it. We have chosen the text structure _{1} which can be taken as the base for other text structures as well.

_{1}

^{2}).

_{1}| = _{1}, _{2}, _{3} and ^{2}, (^{2}, (^{2} and (^{2} respectively. _{r}

Since

The time complexity using _{2} and _{3} are ^{3}^{3}) which is contributed by the generation of the frequent sequences. The Space Complexity for all the text structures, it is given as O(^{2}) = O(^{2}^{2}) which is contributed by the frequent pattern generation process.

Our first approach was towards a data clustering based Image Compression and then on a hybrid data clustering with a frequent sequence mining approach towards image compression.

_{r}

Simulation is performed on an Intel core i5-3230M CPU 2.26 GHz with 4GB Main Memory and 500GB Hard disk on Ubuntu 13.04 OS Platform. Using C++ 11 standard, MatLab code, Python proposed algorithms are implemented. Calgary, Canterbury and Amazon Reviews etc. falls under text whereas SIPI for Images are the benchmark datasets which the algorithms have been tested over (_{r}

Table _{r}_{r}_{mod}

Simulation results of various Text Corpora.

Corpus/Size (kB) | Compression Ratio _{r} |
Time(sec) | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

FPH-AS/CH | FPH-HB/ | UHE | GA78 | LZ78 | AE | gzip | FPH-AS/CH | FPH-HB/ | GA78 | LZ78 | AE | gzip | |

alphabet/98 | 242.13 |
416.66 | – | 92.25 | – | 1.67 | 302.15 | 29485 |
18.01 | 0.06 | – | 0.06 | 12.87 |

bible/3953 | 2.83 |
2.86 | 2.65 | 2.65 | 2.06 | 1.84 | 1.91 | 24272 |
1123 | 29.05 | 6.23 | 0.78 | 24.21 |

Reviews/102,399 | 4.97 |
2.52 | 2.54 | 2.79 | 2.51 | 1.61 | 1.8 | 1442.6 |
3710 | 248.43 | 29.29 | 11.92 | 5.65 |

plrabn12/470 | 2.25 |
2.24 | 2.38 | 1.93 | 1.42 | 1.75 | 2.2 | 84.72 |
47.47 | 3.62 | 1.74 | 0.08 | 3.5 |

world192/2415 | 2.37 |
2.48 | 2.15 | 2.59 | 1.97 | 1.59 | 1.75 | 16509.5 |
613.89 | 0.36 | 4.81 | 0.72 | 16.32 |

The compression ratio of our approach is 416.66 in alphabet dataset, which is very high as compared to its competitors. This is due to the fact that even though the original count of patterns are more, _{r}_{mod}

Figure _{mod}^{3}. Even though Apriori has time overhead, we had a polynomial reduction of O(_{mod}

At some higher support values for few datasets |_{r}_{r}

Figure

Here ^{’}_{J}

It can be observed from Figure _{r}_{r}_{r}

This paper presented various novel and optimal text and image compression algorithms employing various concepts from FPM in the conventional encoding strategies. We addressed the problem of mining very large and optimal set of patterns using compressed pattern sets employing novel pattern pruning strategy which has contributed to improvements in time factor by an order of magnitude. Proposed text compression algorithms achieved high compression ratios, with a reduction in code table size whereas without compromising image quality. Our demonstration of various text structures along with time complexity have shown a wider aspect of looking at mining optimal patterns. For future work, we shall focus on the scope of applying various itemsets like Closed Frequent Itemsets, Maximal Frequent Itemsets, etc. Further research will also address the issue of time to reduce encoding in large size images.

The authors have no competing interests to declare.