Frequency Analysis, in R

In Software Engineering

I recently started reading Command Line Kung Fu, a blog focused on using command line tools to solve just about everything (especially all things text-related). At the time of writing, their latest post deals with frequency analysis. According to Wikipedia,

In cryptanalysis, frequency analysis is the study of the frequency of letters or groups of letters in a ciphertext… Frequency analysis is based on the fact that, in any given stretch of written language, certain letters and combinations of letters occur with varying frequencies. Moreover, there is a characteristic distribution of letters that is roughly the same for almost all samples of that language.

Ever since the epic reddit cryptography thread last week (and also just reading about the recently-cracked Copiale Cipher), I’ve become increasingly interested in cryptography. The example presented in CLKF’s post seemed like an easy puzzle for an absolute newbie to try.

The cipher is as follows:

“YETU HTPVI MOF UELCP MOF STC LCRVCU T DOOZ SLXEVW LK MOF ETRV CO VQXVWULIV LC UEV IFNAVSU? HTMNV MOF STC, NFU LU’I COU UVWWLNJM JLPVJM. LHTDLCV EOY MOF YOFJZ WVTSU LK MOFW ZOSUOW UOJZ MOF “MOF ETRV TXXVCZLSLULI, T ZLIVTIV UETU LI JLKV-UEWVTUVCLCD LK COU UWVTUVZ. YV ETRV T ULHV-UVIUVZ SFWV UETU SFWVI 99% OK TJJ XTULVCUI YLUE CO COULSVTNJV ILZV-VKKVSUI, NFU L’H COU DOLCD UO DLRV MOF UETU: L’H DOLCD UO DLRV MOF T CVY VQXVWLHVCUTJ UWVTUHVCU HM SOFILC ZWVTHVZ FX JTIU YVVP. CO, HM SOFILC ETI CO HVZLSTJ UWTLCLCD. CO, L ETRV CO VRLZVCSV UETU UEV CVY UWVTUHVCU YLJJ YOWP, TCZ LU’I CVRVW NVVC UVIUVZ OW TCTJMBVZ LC ZVXUE — NFU L’H DOLCD UO DLRV LU UO MOF TCMYTM NVSTFIV HM SOFILC UELCPI LU LI DOOZ IUFKK.” MOF’Z KLCZ TCOUEVW ZOSUOW, L EOXV. WTULOCTJ XVOXJV JVTRV HVZLSTJ STWV UO UEV HVZLSTJ VQXVWUI. UEV HVZLSTJ VQXVWUI ETRV T HFSE NVUUVW UWTSP WVSOWZ UETC UEV GFTSPI.”
— ZTRLZ YTDCVW XEZ, ISL.SWMXU, 19UE OSU 02.

The introduction to the post suggests that the cipher is using a simple monoalphabetic substitution. In other words, each letter in the ciphertext is mapped to a fixed cleartext letter for the entire passage. The only thing to do is figure out that mapping.

In CLKF’s post, PowerShell and *NIX command line tools are used to decrypt the message. But seeing that frequency analysis is a statistical analysis, I thought that decryption in R might be much easier and more straight-forward.

The strategy is thus:

  1. Count the number of times each letter appears in the ciphertext.
  2. Substitute the most frequent ciphertext letter with the most frequent cleartext letter, the second most frequent ciphertext letter with the second-most frequency cleartext letter, and so on.
  3. Tweak and adjust the substitution scheme to arrive at a comprehensible cleartext.

So, we first load read in the ciphertext (with backslashes to escape quotation marks).

cipher <- "YETU HTPVI MOF UELCP MOF STC LCRVCU T DOOZ SLXEVW LK MOF ETRV CO 
VQXVWULIV LC UEV IFNAVSU? HTMNV MOF STC, NFU LU'I COU UVWWLNJM JLPVJM. LHTDLCV 
EOY MOF YOFJZ WVTSU LK MOFW ZOSUOW UOJZ MOF \"MOF ETRV TXXVCZLSLULI, T ZLIVTIV 
UETU LI JLKV-UEWVTUVCLCD LK COU UWVTUVZ. YV ETRV T ULHV-UVIUVZ SFWV UETU SFWVI 
99% OK TJJ XTULVCUI YLUE CO COULSVTNJV ILZV-VKKVSUI, NFU L'H COU DOLCD UO DLRV 
MOF UETU: L'H DOLCD UO DLRV MOF T CVY VQXVWLHVCUTJ UWVTUHVCU HM SOFILC ZWVTHVZ 
FX JTIU YVVP. CO, HM SOFILC ETI CO HVZLSTJ UWTLCLCD. CO, L ETRV CO VRLZVCSV 
UETU UEV CVY UWVTUHVCU YLJJ YOWP, TCZ LU'I CVRVW NVVC UVIUVZ OW TCTJMBVZ LC 
ZVXUE -- NFU L'H DOLCD UO DLRV LU UO MOF TCMYTM NVSTFIV HM SOFILC UELCPI LU LI 
DOOZ IUFKK.\" MOF'Z KLCZ TCOUEVW ZOSUOW, L EOXV. WTULOCTJ XVOXJV JVTRV HVZLSTJ 
STWV UO UEV HVZLSTJ VQXVWUI. UEV HVZLSTJ VQXVWUI ETRV T HFSE NVUUVW UWTSP 
WVSOWZ UETC UEV GFTSPI.\" -- ZTRLZ YTDCVW XEZ, ISL.SWMXU, 19UE OSU 02."

And count the frequency of each letter.

library(stringr)
cipher <- tolower(cipher)
(count <- str_count(cipher, letters))
freq <- data.frame(letters, count)
(freq <- freq[order(-count), ])
(code <- paste(freq[,1], collapse=""))

The variable letters is built into R, so it doesn’t have to be defined. The stringr library is, I believe, a built-in library that gives us the function str_count. str_count(cipher, letters) essentially counts the number of times each letter appears in the ciphertext, and vectorizes these counts.

I converted the ciphertext to lowercase because the letters variable contains all lower-case variables.  The other way around (converting letters to upper-case letters) is fine too. A data frame is used to match each count to their respective letter. Finally, the paste command, with the collapse parameter equal to “”, collapses all the letters (sorted by frequency) into a single string. We ultimately get the string vultocwesizfmjhdxryknpqabg.

The most common letters in the English alphabet are

[table width=”60%”]
Letter[attr style=”width:20%; align:center”],Relative frequency in the English language[attr style=”width:30%; align:center”]
a,8.17%
b,1.49%
c,2.78%
d,4.25%
e,12.70%
f,2.29%
g,2.02%
h,6.09%
i,6.97%
j,0.15%
k,0.75%
l,4.03%
m,2.52%
n,6.75%
o,7.51%
p,1.93%
q,0.10%
r,5.99%
s,6.33%
t,9.06%
u,2.76%
v,1.04%
w,2.47%
x,0.15%
y,1.97%
z,0.07%
[/table]

We can read these into R, then substitute each letter in the ciphertext with its corresponding letter.

stdfreq <- c("e","t","a","o","i","n","s","h","r","d","l","c","u",
             "m","w","f","g","y","p","b","v","k","j","x","q","z")
key <- paste(stdfreq, collapse="")
chartr(code, key, cipher) # Substitute characters

We get this:

[1] "phot woked uic thank uic ron anyent o fiil raghes ab uic hoye ni \nejgestade an the dcvxert? wouve uic ron, vct at'd nit tessavmu makemu. awofane \nhip uic picml seort ab uics lirtis timl uic \"uic hoye oggenlaratad, o ladeode \nthot ad mabe-thseotenanf ab nit tseotel. pe hoye o tawe-tedtel rcse thot rcsed \n99% ib omm gotaentd path ni nitareovme dale-ebbertd, vct a'w nit fianf ti faye \nuic thot: a'w fianf ti faye uic o nep ejgesawentom tseotwent wu ricdan lseowel \ncg modt peek. ni, wu ricdan hod ni welarom tsoananf. ni, a hoye ni eyalenre \nthot the nep tseotwent pamm pisk, onl at'd neyes veen tedtel is onomuqel an \nlegth -- vct a'w fianf ti faye at ti uic onupou verocde wu ricdan thankd at ad \nfiil dtcbb.\" uic'l banl onithes lirtis, a hige. sotainom geigme meoye welarom \nrose ti the welarom ejgestd. the welarom ejgestd hoye o wcrh vettes tsork \nserisl thon the zcorkd.\" -- loyal pofnes ghl, dra.rsugt, 19th irt 02."

This doesn’t make any sense, but we do have a few clues to work with. Manually changing around letters in the key, we can arrive at

# Std Freq: "etaoinshrdlcumwfgypbvkjxqz"
key <- "etiaonrhcsduylmgpvwfbkxjzq"
[1] "what makes you think you can invent a good cipher if you have no \nexpertise in the subject? maybe you can, but it's not terribly likely. imagine \nhow you would react if your doctor told you \"you have appendicitis, a disease \nthat is life-threatening if not treated. we have a time-tested cure that cures \n99% of all patients with no noticeable side-effects, but i'm not going to give \nyou that: i'm going to give you a new experimental treatment my cousin dreamed \nup last week. no, my cousin has no medical training. no, i have no evidence \nthat the new treatment will work, and it's never been tested or analyzed in \ndepth -- but i'm going to give it to you anyway because my cousin thinks it is \ngood stuff.\" you'd find another doctor, i hope. rational people leave medical \ncare to the medical experts. the medical experts have a much better track \nrecord than the quacks.\" -- david wagner phd, sci.crypt, 19th oct 02."

Fiddling around with keys manually isn’t usually such a great idea (though it can work for simple ciphers like this). In CLKU, one suggestion to get around this is to try all possible permutations of the key. However, it is noted that there are about 4E26 permutations, which can take a lot of time. Instead of going through all possible permutations, an alternative method is to randomly sample letters, with weights assigned to each position.

letter.prob <- c(8.167, 1.492, 2.780, 4.253, 12.702, 2.288, 2.022, 6.094,
                 6.973, 0.153, 0.747, 4.025, 2.517, 6.749, 7.507, 1.929, 0.098,
                 5.987, 6.333, 9.056, 2.758, 1.037, 2.465, 0.150, 1.971, 0.074)
letter.sample <- sample(letters, 26, prob = letter.prob)
key <- paste(letter.sample, collapse="")

Alternatively, we can start with an initial solution (i.e., the standard frequency chart), and randomly swap letters.

swap <- function(x) {
  pos <- sample(length(x), 2, replace=FALSE)
  x[pos] <- x[ pos[2:1] ]
  x
}
letter.sample <- swap(stdfreq)
key <- paste(letter.sample, collapse="")

There is also the issue of determining when a key works. CLKU tested spelling. A similar method is to test whether words in the attempted cleartext can be found in a dictionary. I found that just testing just the first word, the first two words, or the first three words to come up with non-solutions. So here the first four words are tested.

words <- scan("http://docs.oracle.com/javase/tutorial/collections/interfaces/examples/dictionary.txt",
              what = character())
dictstatus <- logical(0)
for (i in 1:4) {
  word <- strsplit(cipher, " ")[[1]][i]
  attempt <- chartr(code, key, word)
  dictstatus[i] <- attempt %in% words
}

A repeat loop can be used to loop this process over and over. I was also interested in measuring the amount of time and iterations to arrive at a solution (albeit with this very inefficient method).

ptm <- proc.time()
counter <- 0
repeat {
  letter.sample <- swap(stdfreq)
  key <- paste(letter.sample, collapse="")
  dictstatus <- logical(0)
  for (i in 1:4) {
    word <- strsplit(cipher, " ")[[1]][i]
    attempt <- chartr(code, key, word)
    dictstatus[i] <- attempt %in% words
  }
  if (all(dictstatus)) {
    clear <- chartr(code, key, cipher)
    print(clear)
    print(key)
    print(paste("Loop iterations: ", counter, sep = ""))
    break
  }
  counter <- counter + 1
}
proc.time() - ptm

I found that my computer can try between 50 and 100 iterations per second. If all possible permutations are tried at my computer’s fastest speed, this means I’d have to wait 4E26 seconds, or 1E23 hours, for this thing to finish. Clearly not a very good solution.

Some improvements I would probably try at this point:

  1. using bigram and trigram frequency charts (i.e., making use of the most common two-letter and three-letter starting sequences, rather than just the first letter)
  2. implementing a better hill-climbing method.

Leave a Reply