Statistical techniques for cryptanalysis

Introduction:

Cryptography is the art of writing messages in code or cipher, to disguise, and thereby secure the content of a particular stream of text. When encrypted, a plain text message can be revealed only through the use of the key used to encode the cipher. Cryptography does not mask the existence of the message, but does disguise its content [1]. In contrary, cryptanalysis is the art of recovering the plaintext of a message without access to the key. Successful cryptanalysis may recover the plaintext or the key for a specific ciphertext [2].

There are five general types of cryptanalytic attacks:-

1. Ciphertext-only attack: In this type of attack, the cryptanalyst has a series of cipher texts encrypted using the same encryption algorithm. Then, the cryptanalyst deduces the plain text of each of the cipher texts or identifies the key used to encrypt the cipher text

2. Known-plaintext attack: In this type of attack, the cryptanalyst has a series of ciphertext and their corresponding plaintext values encrypted using a specific key. The cryptanalyst then tries to deduce the key by forming a relationship between the ciphertext and plaintext entries.

3. Chosen-plaintext attack: In this type of attack, the cryptanalyst not only has access to the ciphertext and associated plaintext for several messages, but he also chooses the plaintext that gets encrypted. His job is to deduce the key used to encrypt the messages or an algorithm to decrypt any new messages encrypted with the same key.

4. Frequency analysis: It is the study of thefrequency of lettersor groups of letters in aciphertext. The method is used as an aid to breakingclassical ciphers. Frequency analysis is based on the fact that, in any given stretch of written language, certain letters and combinations of letters occur with varying frequencies.

5. Rubber-hose cryptanalysis: The cryptanalyst threatens, tortures or blackmails the person who has the key until they give it up.

Among the many cryptanalytic techniques, frequency analysis or frequency counting is the most basic technique applied to break substitution cipher based algorithms, among the varied list of attack techniques. The basic use of frequency analysis is to first count the frequency of ciphertext letters and then associate guessed plaintext letters with them. More complex use of statistics can be conceived, such as considering counts of pairs of letters digrams, trigrams, and so on. This is done to provide more information to the cryptanalyst.

It exploits the weakness in the substitution cipher algorithm to encrypt similar plaintext letters to similar ciphertext letters. Frequency analysis based cryptanalysis techniques were used to break ciphers based on the traditional cryptographic algorithms, but they do not work well with the modern block cipher based cryptographic algorithms.

Statistical properties of English:

Frequency analysis based cryptanalysis uses the fact that natural language is not random in nature and single alphabetic based substitution does not hide the statistical properties of the natural language. In the case of encryption using monoalphabetic substitution, to start deciphering the encryption it is useful to get a frequency count of all the letters. The most frequent letter may represent the most common letter in English, E followed by T, A, O and I whereas the least frequent are Q, Z and X [7]. Statistical patterns in a language can be detected by tracing the redundancy of the text in the language. It has been realized that various universal regularities characterize text from different domains and languages. The best-known is Zipf’s law on the distribution of word frequencies [5], according to which the frequency of terms in a collection decreases inversely to the rank of the terms. Zipf’s law has been found to apply to collections of written documents in virtually all languages [5].

English language characters have a very high redundancy rate when used for cryptographic substitutions. If we have a message encrypted using the substitution cipher that needs to be cracked, we can use frequency analysis. “In other words, if the sender has used an encryption scheme, that replaces one letter in the English to be another letter in English, we can still recognize the original plain text as, the frequency characteristics of the original plain text will be passed on the new cipher text characters “[4]. To apply frequency analysis, we will need to know the frequency of every letter in the English alphabet, or the frequency characteristics of the language used by the sender to encrypt the text.

Below is a list of average frequencies for letters in the English language. So, for example, the letter E accounts for 12.7% of all letters in English, whereas Z accounts for 0.1 %. All the frequencies are tabulated and plotted below:-

For example, let us consider the following sentence: “We study Cryptography as part of our course”. Using a simple substitution cipher, let us consider the following:

a->c , b-> d, c->e…………..w->y, x->z, y->a, z->b

So, the cipher text becomes: “yg uvwfa etarvqitcrja cu rctv qh qwt eqwtug”. A simple frequency analysis of the cipher text can be carried out and the results are as given below:

The above data can be used by a cryptanalyst to identify the key or the plaintext by using simple substitution to the cipher text till a suitable plaintext value is not identified.

Apart from the use of mono alphabetic frequency analysis, cryptanalysts also identify frequency of paired letters better known as digram frequency and that of three letter words, called as Trigram frequencies. These help the cryptanalyst to exploit the redundant features of English language to break the cipher.

The most common Digrams (in order):

th, he, in, en, nt, re, er, an, ti, es, on, at, se, nd, or, ar, al, te, co, de, to, ra, et, ed, it, sa, em, ro.

The most common Trigrams (in order):

the, and, tha, ent, ing, ion, tio, for, nde, has, nce, edt, tis, oft, sth, men

Table 1: Digram and Trigram Frequencies [6]

These help in identifying the most commonly used terms in English to break a cipher. The digram frequencies are used to break two letter words such as an, to, of etc and the trigram frequencies are used to break three letter words such as the, are, for etc. After breaking a significant two letter and three letter words, it is practically east to identify the key from the cracked values of plaintext by matching the corresponding values in the ciphertext. This huge weakness in English language is used to break cipher texts encrypted using simple algorithms that make use of English alphabets. In practice the use of frequency analysis consists of first counting the frequency of ciphertext letters and then assigning “guessed” plaintext letters to them. “Many letters will occur with roughly the same frequency, so a cipher with X’s may indeed map X onto R, but could also map X onto G or M. But some letters in every language using letters will occur more frequently; if there are more X’s in the ciphertext than anything else, it’s a good guess for English plaintext that X is a substitution for E. But T and A are also very common in English text, so X might be either of them also” [4]. Thus the cryptanalyst may need to try several combinations of mappings between ciphertext and plaintext letters. Once the common single letter frequencies have been resolved, then paired patterns and other patterns are solved. Finally, when sufficient characters have been cracked, then the rest of the text can be cracked using simple substitution. Frequency analysis is extremely effective against the simpler substitution ciphers and will break astonishingly short cipher texts with ease.

Attacks on Traditional algorithms

Encrypting using traditional algorithms have been defenseless against cryptanalytic attacks as they use bit by bit encryption, which can be easily broken using frequency analysis based attacks.

1. Caesar Cipher:

Considering the case of one of the oldest ciphers, the Caesar Cipher, this cipher replaces one letter of the plaintext with another to produce the ciphertext, and any particular letter in the plaintext will always, turn into the same letter in the cipher for all instance of the plaintext character. For instance, all B’s will turn into F’s. “Frequency analysis is based on the fact that certain letters, and combinations of letters, appear with characteristic frequency in essentially all texts in a particular language” [9]. For instance, in the English language, E is very common, while X is not. Likewise, ST, NG, TH, and QU are common combinations, while XT, NZ, and QJ are very uncommon, or even “impossible” to occur in English. This clearly shows how the Caesar cipher can be broken with ease by just identifying the frequency of each letter in the cipher text. A message encrypted using Caesar cipher is extremely insecure as an exhaustive cryptanalysis on the keys easily breaks the code.

2. Substitution Ciphers:

The Caesar cipher forms a subset of the entire set of substitution ciphers. Here, the key of the encryption process is the permutation of all the twenty six characters of the English alphabets. Rather than choosing a particular key for all encryption process, we use a different key for successive encryption processes. This technique increases the number of possible key to 26!, which is about 4 X 1026, which eliminates the exhaustive cryptanalysis attack on the keyspace [7]. To decrypt the cipher the, statistical frequency distribution of single letter occurrence in English language is analyzed. Then, the digram and trigram frequencies of standard English words are compared with the frequencies of the trigrams in the cipher to finally reconstruct the key and in turn decipher the text. This is an efficient method to break the substitution cipher as, each plaintext letter is represented by the same ciphertext letter in the message. So, all properties of plaintext are carried on to the cipher text.

3. Vigenere Cipher:

In a Vigenere cipher, there is greater security as, a given plaintext letter is not always represented by the same ciphertext letter. This is achieved by using a sequence of n different substitution ciphers to encrypt a message. This technique increases the possible number of keys from 26! to (26!)n. Although this was considered to be unbreakable, the Kasiski’s method of attacking a Vigenere cipher yielded successful results of decrypting the message. According to this method, the first step is to find the key length (n).

Find identical segments of plain text that get encrypted to the same ciphertext, when they are b positions apart, where b=0 mod n. According to Kasiski, the next step is to find all the identical segments of length greater than 3, and record the distance between them [7].

This can then be used to predict the length of the key (n). Once this is found the key is found by an exhaustive search of the keyspace for all possible combinations to identify the key. This is done by substituting all possible values for n to generate substrings. Once the substring is formed, the plaintext message can be automatically identified by using the back substitution of the key into the cipher [7]. This can be done for all possible values for n until finally arriving at the actual key, which reveals the plaintext that was encrypted. This method can take a long time to break the key to identify the plaintext incase the key length is very long, as the keyspace value would be large for larger keys.

Defeating frequency based attacks:

Frequency based attacks have been used for a long time to break traditional encryption algorithms. It uses the fact that, traditional encryption algorithms do not eliminate the statistical properties of the language upon encryption.

The first way to defeat frequency based attacks is to encrypt blocks of characters at a time rather than single letters [7]. This would ensure that, the same text in the plaintext is not encrypted to the same text in the ciphertext upon encryption. For e.g., if we use the Caesar cipher encryption scheme, the word “ADDITIONAL” will be encrypted to “CFFKVKQPCN”, we can see that the alphabets A, D and I are repeated more than once and at each instance, the encryption scheme used always encrypts A to C, D to F and I to K. This can clearly be used during frequency analysis to analyze the redundancy of the characters and in turn map them back to get the original plaintext character. Using a block encryption scheme, one can be satisfied that, this phenomenon does not occur as, in a block encryption scheme, the whole plaintext is broken into chunks or blocks of data, that is fed in as input to the encryption algorithm. The algorithm then, reads the input block along with the key and encrypts the complete block of plaintext, rather than individual characters, so there is a smaller chance that two blocks will produce the same chunk of ciphertext.

The second way of defeating frequency analysis is to make use of synonyms of words [7], rather than repeating the same word over and over again in a sentence. There are a lot of words in English, which have more than one synonym, thus providing with a set of words to be used as convenient in the particular context. To help in the selection of a synonym, grammar checking would have to be used to ensure that, the meaning expressed in the sentence is not altered by changing the words. Attacks against this technique could include creating a list of the best synonyms, but this would not help the attacker as different word could be used at each instance the same meaning needs to be expressed, defeating the benefit of this technique. This technique of using alternate words to represent common words to defeat cryptanalysis attacks is called “Homophones” [7] in cryptography.

A third technique that can effectively defeat cryptanalysis is Polyalphabetic substitution, that is, the use of “several alphabets to encrypt the message” [3], rather than using the same substitution technique again and again. The Vigenere Cipher is a form of Polyalphabetic cipher. This ensures that, no two characters are encrypted to the same ciphertext alphabet in the same message. This ensures that, direct frequency analysis of the cipher is not possible to successfully retrieve the original message. However, other techniques need to be used to identify the key length, if this is possible, then frequency analysis attack could be used to identify the original plaintext message successfully.

Finally, a possible technique that could be used to defeat frequency analysis is to “encrypt a single character of plaintext with two ciphertext characters” [3]. Upon encountering the same character twice, then different characters should be used to encrypt the message. This can be achieved by using a key size double that of the plaintext message and then encrypting the same plaintext with two values in the key and save them together for the same plaintext character. This would ensure that no two plaintext characters will have the same ciphertext character, defeating the frequency analysis method of breaking the cipher.

Modern encryption algorithms and cryptanalysis:

Modern cryptographic algorithms take a better approach in defeating frequency analysis based attacks. The cryptographic algorithms nowadays use block encryption, rather than encrypting characters bit by bit, thus eliminating the redundancy of ciphertext alphabets for similar plaintext alphabets. “Block ciphers are the central tool in the design of protocols for shared-key cryptography. A block cipher is a function E: {0, 1}k Ãƒ- {0, 1}n Ã¢Ã¢â‚¬Â ‘ {0, 1}n. This notation means that E takes two inputs, one being a k-bit string and the other an n-bit string, and returns an n-bit string” [2]. The first input is the key, which is used to encrypt the secret message. The second string is called the plaintext, and the output is called a ciphertext. The key-length k and the block-length n are parameters associated to a specific block cipher. They vary from block cipher to block cipher, and depend on the design of the algorithm itself. Some of the most trusted symmetric ciphers include AES, Triple-DES, Blowfish, CAST and IDEA. In public-key cryptography, the most commonly used cryptosystems are RSA and the Diffie-Hellman systems, which have not been found to have any vulnerabilities till date.

Preferably, the block cipher E is a public specified algorithm. “In typical usage, a random key K is chosen and kept secret between a pair of users. The function EK is used by the sender to encrypt the message, for a given key, before sending it to the intended receiver, who decrypts the message using the same key” [2]. Security relies on the secrecy of the key. So, at first, one might think of the cryptanalyst’s goal as recovering the key K given some ciphertext, intercepted during transmission. The block cipher should be designed to make this task computationally difficult. In order to achieve this, the algorithms that are used to encrypt the message must be designed with a high degree of mathematical complexity, which cannot be reversed to obtain the plaintext from a known ciphertext.

The length of the key used during encryption of a message plays an important role in deciding the effectiveness of an algorithm. Key length is conventionally measured in bits, and most of the well known strong ciphers have key lengths between 128 and 256 bits. A cipher is considered strong if, after years of attempts to find a weakness in the algorithm, there is no known effective cryptanalytic attack against it. This indicates that, the most efficient way of breaking an encrypted message without knowing the key used to encrypt it is to “brute force” it, i.e. trying all possible keys. “The effort required to break an encrypted message is determined by the number of possible keys, known as thekeyspace. Knowing the speed of the computer to break the key, it is easy to calculate how long it would take to search the keyspace to break a particular cipher” [2].

For example, considering a cipher that uses 128-bit keys, each bit can either be 0 or 1, so, there are 2128 or 3Ãƒ-1038 keys approximately. Suppose we imagine that about ten billion computers are assigned the task of breaking the code, each capable of testing ten billion keys per second, then, the task of running through the entire keyspace would take around 3Ãƒ-1018seconds, which is about 100 billion years. “But, in fact, it would be necessary to run through only half the keyspace to hit upon the correct key, which would take around 50 billion years. This is longer than the estimated age of the universe according to modern cosmology, which is about 15 billion years” [2]. This shows that, it is practically infeasible to crack modern cryptographic algorithms using Brute Force attacks. So, one can imagine the effectiveness of the modern cryptographic algorithms and their resistance towards cryptanalytic attacks.

Conclusions:

Cryptography has progressed in recent years and modern cryptographic algorithms have proved to be successful in defending against most forms of cryptanalytic attacks. Frequency analysis based attacks have proved to exploit the weaknesses in traditional encryption algorithms into revealing the plaintext message that was encrypted using them. The natural language used to encrypt messages is not considered to be random in nature, which is exploited by frequency counting based attacks. Based upon the frequency of letters that occur in the ciphertext, one can guess the plaintext characters due to their redundancy rate and the specific combination of letters in a word. This weakness can be repelled by using stream ciphers, which do not carry the redundancy in the plaintext to the ciphertext. Modern block cipher, encrypt a chunk of plaintext into ciphertext and vice versa, eliminating the redundancy of language used in encryption.

Although the algorithm plays an important part, it is the key length used in block ciphers that helps in repelling cryptanalysis. Modern ciphers use a key length starting from 128 bits, eliminating the possibility of a brute force attack to decrypt the message. The higher the key length, the more time it takes to break these ciphers. These advantages have made modern cryptographic algorithms more popular among the security community. No known weaknesses have been found in these algorithms yet, that may allow one to identify the plaintext message.

Bibliography:

[1] Stallings, W., Cryptography and Network Security, Chapter 1, Third Edition, Prentice Hall, 2003

[2] Schneier, B., Applied Cryptography, Chapter 1, Second Edition, John Wiley & Sons, New York City, New York, USA, 1996

[3] Hart, G.W., To Decode Short Cryptograms, Communications of the ACM 37(9), 1994, pp. 102-108

[4] Lee, K.W., Teh, C.E., Tan, Y.L., Decrypting English Text Using Enhanced Frequency Analysis, National Seminar on Science, Technology and Social Sciences (STSS 2006), Kuantan, Pahang, Malaysia

[5] Zipf, GK., Human Behaviour and the Principle of Least Effort, 1949, Cambridge: Addison Wesley Publications.

[6] Lewand, R.E., Cryptological Mathematics, The Mathematical Association of America, 2000, Pages 345-346

[7] Stamp, M and Low, R.M., Applied Cryptanalysis, 2007, Chapter 1 and 2, John Wiley & Sons, New York City, New York, USA

[8] http://www.simonsingh.net, Online internet frequency analysis tools

[9] http://www.textalyser.net, online text analysis and frequency analysis information

Order Now