History Of The Data Compression Information Technology Essay

With the increased use of computers in the business world and in personal computing, the volume of data stored on the Internet has grown significantly. This growth has led to a need for “data compression”. The passing of information via the Internet is critical to all types of business structures at all levels, and improved speed and accuracy of information exchange benefit the business function. As technology advances, so should the rate of data transfer. This can be accomplished by improving the compression application through which data is transferred and/or by changing the format of the data so that it can be transmitted at low cost and maximum speed. “Data compression is also useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth”. (Data Structures and Algorithms in C++, 3rd edition, Adam Drozdek 2005)

There are several different types of data compression: lossy, lossless, context dependent and non-context dependent. Claude E. Shannon laid the theoretical foundations of lossless data compression, which is used when the uncompressed data must match the original data exactly. One very common application of lossless compression is text files, because it is very important that a text file decompresses to exactly the text that was compressed. Imagine if your text were compressed and, when it was uncompressed, a few letters came out wrong and the recipient got an incorrect message. Lossless compression is also essential for keeping master sources of pictures, video and audio data. However, there are strict limits to the amount of compression that can be obtained with lossless compression. “Lossless compression ratios are generally in the range of 2:1 to 8:1”. (Lossless Compression Handbook) Lossy compression, by contrast, works on the assumption that the data doesn’t have to be stored perfectly, so its compression factors can be larger than those available from lossless methods. The speed at which data is transferred depends on which data compression option is used. “A judicious choice can improve the throughput of a transmission channel without changing the channel itself”. (Data Structures and Algorithms in C++, 3rd edition, Adam Drozdek 2005) There are many different methods of data compression that reduce the size of the representation without affecting the information itself. Run-length encoding (RLE), which should not be confused with the unrelated “run length limited” line coding, is one of the simplest forms of data compression.
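As a rough illustration of run-length encoding, the following Python sketch collapses each run of identical characters into a (character, count) pair and expands the pairs again. The function names `rle_encode` and `rle_decode` and the pair-based output format are assumptions made for this example; practical RLE formats pack runs into bytes rather than Python tuples.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)

sample = "aaaabbbcdddddd"
encoded = rle_encode(sample)
print(encoded)  # [('a', 4), ('b', 3), ('c', 1), ('d', 6)]
print(rle_decode(encoded) == sample)  # True
```

Note that the single `c` still costs a full pair, which hints at why data with few long runs gains little from this method.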
“Run-length encoding is very useful when applied to files that are almost guaranteed to have many runs of at least four characters.” (Data Structures and Algorithms in C++, 3rd edition, Adam Drozdek 2005, p. 581) Run-length encoding is only modestly efficient for text files, in which only the blank character has a tendency to be repeated. As another source puts it, RLE “is not very useful when compressing text files, because a normal text file doesn’t consist of long repeated character strings”. (A Concise Introduction to Data Compression 2008) “Huffman coding” takes the characters in a data file and converts them to binary codes, using a lossless compression technique in an efficient and sophisticated manner. “Huffman coding is the construction of an optimal code that was developed by David Huffman, who utilized a tree structure in this construction: a binary tree for a binary code”. (Data Structures and Algorithms in C++, 3rd edition, Adam Drozdek 2005) Huffman coding is a statistical data compression technique that reduces the average code length used to represent the symbols of an alphabet. It can be implemented in several ways, for example with a priority queue: the two smallest probabilities are removed and their sum is inserted as a new probability in the proper position. A singly linked list of pointers to trees is another way to implement the algorithm. As noted before, data compression takes a stream of symbols and transforms them into codes. If the compression is effective, the resulting stream of codes will be smaller than the input stream of symbols.
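The priority-queue construction described above can be sketched in Python as follows. This is a minimal illustration, not a production encoder: the function name `huffman_codes`, the use of raw character counts in place of probabilities, and the nested-tuple tree representation are all assumptions made for the example.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table (symbol -> bit string) for the symbols in text."""
    freq = Counter(text)
    if not freq:
        return {}
    # Heap entries are (weight, tie_breaker, tree); a tree is either a
    # single symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}  # a lone symbol still needs one bit
    next_id = len(heap)
    while len(heap) > 1:
        # Remove the two smallest weights and insert their sum.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

message = "abracadabra"
table = huffman_codes(message)
encoded = "".join(table[ch] for ch in message)
print(table)
print(len(encoded), "bits instead of", 8 * len(message))
```

The unique tie-breaker in each heap entry keeps Python from ever comparing a symbol with a tuple when two weights are equal.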
“The decision to output a certain code for a certain symbol or set of symbols is based on a model and the model is simply a collection of data and rules used to process input symbols and determine which code to output.” (The Data Compression Book 2nd edition, Mark Nelson, chapter 2) This leads us to understand that data compression is perhaps the fundamental expression of Information Theory. Data compression enters the field of Information Theory because of its concern with redundancy. “Information Theory uses the term entropy to measure how much information is encoded in a message.” The information content of a symbol is determined by its probability under the model, so if we change the model, the probability, and with it the entropy, will change.
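To make the link between model, probability and entropy concrete, here is a small Python sketch. It assumes the standard definition that a symbol with probability p carries -log2(p) bits of information; the function names and the two example models are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(message):
    """Shannon entropy under an order-0 model built from the message itself."""
    counts = Counter(message)
    total = len(message)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def cost_in_bits(message, model):
    """Total information content of message when model maps symbol -> probability."""
    return -sum(math.log2(model[sym]) for sym in message)

message = "aaab"
uniform_model = {"a": 0.5, "b": 0.5}    # assumes both symbols are equally likely
skewed_model = {"a": 0.75, "b": 0.25}   # matches the actual frequencies

print(cost_in_bits(message, uniform_model))  # 4.0
print(cost_in_bits(message, skewed_model))
print(entropy_bits_per_symbol(message) * len(message))
```

The same message costs fewer bits under the model whose probabilities match its actual symbol frequencies, which is the sense in which changing the model changes the result.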

Although Huffman coding is somewhat inefficient because it uses an integral number of bits per code, it is relatively easy to implement and very economical for both coding and decoding. Robert G. Gallager and Donald Knuth improved on Huffman coding with what became known as the adaptive Huffman encoding technique. This adaptive encoding algorithm is based on the following sibling property: if each node has a sibling (except for the root) and the breadth-first right-to-left tree traversal generates a list of nodes with non-increasing frequency counters, then it can be proven that the tree is a Huffman tree. Adaptive coding also lets us use higher-order modeling without paying any penalty for the added statistics. The sibling property is very important to adaptive Huffman coding because it shows what needs to be done to a Huffman tree when it is time to update the counts. Maintaining the sibling property during the update “assures that we have a Huffman tree before and after the counts are adjusted”. (Data Structures and Algorithms in C++, 3rd edition, Adam Drozdek 2005) Adaptive Huffman coding surpasses simple Huffman coding in two respects: first, it requires only one pass through the input; second, it adds only the alphabet to the output.
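The sibling property, in the formulation given above, can be checked mechanically. The Python sketch below is only an illustration: the `Node` class and the `has_sibling_property` function are hypothetical names, and the check covers the property itself, not the tree update that an adaptive Huffman coder performs.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    count: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def has_sibling_property(root):
    """Check the property with a breadth-first, right-to-left traversal."""
    counts = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        counts.append(node.count)
        # Enqueue the right child first so each level is visited right to left.
        children = [c for c in (node.right, node.left) if c is not None]
        if len(children) == 1:
            return False  # a non-root node without a sibling
        queue.extend(children)
    # The visited counters must never increase.
    return all(a >= b for a, b in zip(counts, counts[1:]))

ordered = Node(3, left=Node(1), right=Node(2))
swapped = Node(3, left=Node(2), right=Node(1))
print(has_sibling_property(ordered))  # True
print(has_sibling_property(swapped))  # False
```

When a count update would break this ordering, an adaptive coder swaps nodes so that the property holds again before the next symbol is processed.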

Huffman’s algorithm provided the first solution to the problem of constructing minimum-redundancy codes. Because Huffman’s algorithm was widely believed to give the best possible compression ratio, Huffman coding was considered the best coding to use and no improvements were thought necessary. But a different type of coding, arithmetic coding, takes a significantly different approach to data compression from that of Huffman’s algorithm. Arithmetic coding is capable of achieving compression results that are very close to the entropy of the source. Arithmetic coding replaces the source ensemble by a code string; it does not construct a code, in the sense of a mapping from source messages to code words. Unlike all of the other codes, it is not the concatenation of code words corresponding to individual source messages. One major drawback of arithmetic coding is that it requires more CPU power and is slower than older coding methods, but the savings in the cost of storing or sending information outweigh the loss of speed. If the coder doesn’t have a model feeding it good probabilities, it won’t compress data, regardless of the efficiency of the coder. This brings us to statistical modeling, which reads in and encodes a single symbol at a time using the probability of that character’s appearance. The simplest forms of statistical modeling use a static table of probabilities. Using a universal static model has limitations: if the input doesn’t match the previously accumulated statistics, the compression ratio degrades.

Between 1973 and 1985 Faller, Gallager and Knuth each studied adaptive Huffman coding independently. Knuth contributed improvements to the original algorithm, and the resulting algorithm is referred to as algorithm FGK. The algorithm is based on the sibling property, which states that if each node has a sibling (except for the root) and the breadth-first right-to-left tree traversal generates a list of nodes with non-increasing frequency counters, then it can be proven that the tree is a Huffman tree. (Data Structures and Algorithms in C++, 3rd edition, Adam Drozdek 2005, p. 576) In algorithm FGK, both sender and receiver maintain continuously changing Huffman code trees. The leaves of the code tree represent the original messages and the weights of the leaves represent frequency counts for the messages. (Data Compression, section 4) “The adaptive Huffman algorithm of Vitter (algorithm V) incorporates two improvements over algorithm FGK. First, the number of interchanges in which a node is moved upward in the tree during a re-computation is limited to one”. (Data Compression, section 4) The intuitive explanation of algorithm V’s advantage over algorithm FGK is as follows: as in algorithm FGK, the code tree constructed by algorithm V is the Huffman code tree for the prefix of the ensemble seen so far. “The adaptive methods do not assume that the relative frequencies of a prefix represent accurately the symbol probabilities over the entire message”. (Data Compression, section 4) The improvements made by Vitter do not change the complexity of the algorithm; algorithm V encodes and decodes in O(l) time, where l is the length of the current code word, as does algorithm FGK. (Data Compression, section 4)
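Returning to arithmetic coding as described earlier in this section, the idea of replacing a whole message by a single code string can be sketched as repeated narrowing of an interval. The Python example below is a simplified illustration under stated assumptions: it uses exact fractions and a fixed two-symbol model, the function names are hypothetical, and the decoder is told the message length. Practical arithmetic coders use fixed-width integer arithmetic and emit bits incrementally.

```python
from fractions import Fraction

def build_ranges(model):
    """Give each symbol a sub-interval of [0, 1) proportional to its probability."""
    ranges, low = {}, Fraction(0)
    for sym, p in model.items():
        ranges[sym] = (low, low + p)
        low += p
    return ranges

def arithmetic_encode(message, model):
    """Narrow [0, 1) once per symbol; any number in the result identifies the message."""
    ranges = build_ranges(model)
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        width = high - low
        sym_low, sym_high = ranges[sym]
        low, high = low + width * sym_low, low + width * sym_high
    return low, high

def arithmetic_decode(value, length, model):
    """Recover length symbols from a number inside the encoded interval."""
    ranges = build_ranges(model)
    out = []
    for _ in range(length):
        for sym, (sym_low, sym_high) in ranges.items():
            if sym_low <= value < sym_high:
                out.append(sym)
                # Rescale so the next symbol is again read from [0, 1).
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)

model = {"a": Fraction(3, 4), "b": Fraction(1, 4)}  # assumed static model
low, high = arithmetic_encode("aab", model)
print(low, high)  # 27/64 9/16
print(arithmetic_decode(low, 3, model))  # aab
```

The width of the final interval equals the product of the symbol probabilities, so likely messages end in wide intervals that can be identified with few bits. This is how the method can spend a fractional number of bits per symbol.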

Ziv and Lempel developed a different approach, which circumvents the need for prior knowledge of the source characteristics by building this knowledge in the course of data transmission. This method is called a universal coding scheme, and “Ziv-Lempel code is an example of a universal data compression code”. (Data Structures and Algorithms in C++, 3rd edition, Adam Drozdek 2005, p. 581) “The Ziv-Lempel coding gives us a different perspective from the original thinking of a code as a mapping from a fixed set of source messages (words, letters or symbols) to a fixed set of code words. Ziv-Lempel coding defines the set of source messages as it parses the ensemble. Ziv and Lempel’s first compression algorithm is commonly referred to as LZ77”. (The Data Compression Book 2nd Edition, chapter 2) LZ77 uses a dictionary that consists of all the strings in a window into the previously read input stream, which makes it a relatively simple compression scheme. While new groups of symbols are being read in, “the algorithm looks for matches with strings found in the previous 4k bytes of data already read in. Any matches are encoded as pointers sent to the output stream. Popular programs such as PKZIP and LHarc use variants of the LZ77 algorithm, and they have proven very popular”. (The Data Compression Book 2nd Edition, chapter 2) The LZ77 algorithm has one major efficiency problem: CPU cost. Although LZ77 achieves good compression, the encoder has to search the window for matching phrases, and it uses the most computing resources when no matching phrase is found in the dictionary. When this is the case, the compression program takes longer and needs more CPU time. Another compression algorithm developed by Ziv and Lempel is LZ78. Unlike in the LZ77 methods, strings in LZ78 can be extremely long, which allows for high compression ratios. (The Data Compression Book 2nd Edition, chapter 2) LZ78 takes a different approach to building and maintaining the dictionary. Instead of having a limited-size window into the previously seen text, a dictionary of strings is built a single character at a time. (The Data Compression Book 2nd Edition) This incremental procedure works very well at isolating frequently used strings and adding them to the table.
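The contrast between the sliding window of LZ77 and the incrementally built dictionary of LZ78 can be sketched in Python as follows. Both coders are deliberately naive illustrations: the function names, the (offset, length, next character) and (index, next character) token formats, and the brute-force window search are assumptions made for this example and do not reproduce the format of PKZIP, LHarc or any other program.

```python
def lz77_compress(data, window=4096):
    """Emit (offset, length, next_char) triples; offset 0 means no match."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        # Brute-force search of the window: this is the CPU cost noted above.
        for j in range(max(0, i - window), i):
            length = 0
            # Stop one character early so a literal next_char always remains.
            while i + length < len(data) - 1 and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])  # copy one character at a time (matches may overlap)
        out.append(ch)
    return "".join(out)

def lz78_compress(data):
    """Emit (dictionary_index, next_char) pairs; index 0 is the empty string."""
    dictionary = {"": 0}
    out, phrase = [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the current match
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)  # grow by one character
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit its prefix plus last character.
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_decompress(pairs):
    phrases, out = [""], []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

text = "abababab"
print(lz77_compress(text))  # [(0, 0, 'a'), (0, 0, 'b'), (2, 5, 'b')]
print(lz78_compress(text))  # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (0, 'b')]
```

In the LZ77 output the third token points back two characters and copies five, while the LZ78 output shows each dictionary entry extending an earlier one by a single character.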

Finally, another improvement in data compression is algorithm BSTW (Bentley, Sleator, Tarjan and Wei). It has the advantage that it requires only one pass over the data to be transmitted, yet its performance compares well with that of the static two-pass method in terms of the number of bits per word transmitted. This number of bits is never much larger than the number of bits transmitted by static Huffman coding, and can be significantly better. (Data Compression, section 4)
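The essay does not describe how algorithm BSTW works internally. As background, it is built on a self-organizing word list with a move-to-front rule, and the Python sketch below illustrates only that rule. The function names and token format are assumptions for this example, and the variable-length integer code that the real scheme applies to the positions is omitted.

```python
def mtf_encode(words):
    """Replace each repeated word by its current 1-based position in the list."""
    recent, out = [], []
    for word in words:
        if word in recent:
            pos = recent.index(word)
            out.append(pos + 1)  # word seen before: send only its position
            recent.pop(pos)
        else:
            out.append((len(recent) + 1, word))  # new word: position marker plus the word
        recent.insert(0, word)  # move (or add) the word to the front
    return out

def mtf_decode(tokens):
    recent, out = [], []
    for tok in tokens:
        if isinstance(tok, tuple):
            word = tok[1]
        else:
            word = recent.pop(tok - 1)
        recent.insert(0, word)
        out.append(word)
    return out

words = "the car on the left hit the car I left".split()
tokens = mtf_encode(words)
print(tokens)
print(mtf_decode(tokens) == words)  # True
```

Recently used words sit near the front of the list and are sent as small integers, which a variable-length code can represent in few bits. Only one pass is needed because the decoder rebuilds the same list as it goes.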

Data compression is a topic of great importance to many applications, and methods of data compression have been studied for 40 years. In this paper I have tried to provide an overview of data compression methods of general utility. While the algorithms have changed and improved in efficiency, they continue to play a major role in everyday business computing and in computer science as a whole. As we continue to use computers for almost everything we do, the need for data compression grows every day. In recent years companies have started to offer back-up services to personal computer users. I am sure that data compression is used to store the information from all the customers using such an on-line back-up service. I alone have about 400 GB of data stored on a back-up hard drive that I do not wish to lose. So, imagine 100 customers each wanting to back up and store 400 GB of information on-line. That is 40 TB of space in the standard, uncompressed form of storage. Computer data is compressed for everything we do, from banking on-line to sending a text message, and without data compression technology none of this would be practical. Because data compression has made such strong advances, these services are possible without using a lot of hard disk space. So, the next time you send a text message or an email, or use an on-line data back-up service, you are using one form of data compression.
