sentence fragments will not save us

Thomas Baekdal’s post on using passphrases (from 2007) came up again two weeks back. In that post, Baekdal maintained the following thesis (I paraphrase):

Passphrases are better than passwords, because they are easier to remember and (because they are longer) they are “mathematically” harder to crack.

A series of security articles last week pointed to his post, and it received a round of retweets, including an approving one from William Gibson. The security articles that raised this post from the gloomy depths of 2007 were critiques, though, and Baekdal took the time to respond to those critiques.

Unfortunately, Baekdal is still badly misled (and misleading!) about his “mathematical” evidence regarding multiword expressions and the use of dictionaries to attack these.  The short form of the problem is:

The suggestions Baekdal proposes for better passphrases are themselves information leaks: they give clever crackers more (not less) information about the structure of your secret.

I address two of these leaks after the jump:

Dictionaries are not uniformly distributed

Baekdal’s mathematics assumes that crackers would use dictionary attacks that systematically progress through the entire dictionary, in an order that assumes no richer knowledge of the cracking space. For example, he asserts that “orange” requires 3 minutes to crack using a dictionary of 20,000 common words, but that is the time required to exhaustively explore every dictionary entry in the scenario he provides (20,000 words at 100 queries per second is 200 seconds).

In fact, a smart cracker (they do exist) could find a reasonably good way to sort the dictionary entries based on their likelihood of being a password choice. A noisy but not-bad approximation is to sort one’s cracking dictionary by each word’s frequency in English text. For a 20,000-word dictionary, one might choose the Corpus of Contemporary American English frequency-sorted wordlist, a sample of which is available from WordFrequency.info. There, I find that “orange” is in fact fairly high-frequency:

3164	orange	j	9755	0.94

If I had attempted to crack “orange” by using the frequency sort, I’d have found it in just over 3,000 queries (it sits at rank 3164): about 30 seconds (not 3 minutes!), and that’s assuming I hadn’t done the obvious and thrown out the first 500-1000 most frequent words (the, of, to, etc.) as “too obvious for passwords”. Although frequency is a lousy approximation of likelihood of password use, it’s good enough to highlight the error here: Baekdal should really be examining not the time required to exhaustively search the dictionary space, but the expected time to find the answer. Constraining yourself to dictionaries also ties you (however loosely) to a frequency distribution that itself bears information about your password choice.
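To make the gap between worst-case time and expected time concrete, here is a minimal sketch. The 100 queries per second, the 20,000-word dictionary, and the rank of 3164 come from the discussion above; the Zipf-like weighting (password likelihood proportional to 1/rank) is purely my assumption for illustration, not a measured distribution of real password choices, and the function names are hypothetical.

```python
# Sketch: worst-case versus expected cost of a dictionary attack.
# ASSUMPTION: the Zipf-like weights below (likelihood proportional to 1/rank)
# are an illustrative model only, not real password statistics.

QUERIES_PER_SECOND = 100   # rate from Baekdal's scenario
DICTIONARY_SIZE = 20_000   # dictionary size from Baekdal's scenario


def seconds_to_reach(rank, rate=QUERIES_PER_SECOND):
    """Seconds needed to reach the word at 1-based position `rank`."""
    return rank / rate


def expected_guesses(weights):
    """Expected number of guesses when guessing in the listed order.

    weights[i] is the (unnormalised) likelihood that guess i+1 is the secret.
    """
    total = sum(weights)
    return sum((i + 1) * w for i, w in enumerate(weights)) / total


# Exhaustive search of the whole dictionary versus reaching rank 3164.
exhaustive_seconds = seconds_to_reach(DICTIONARY_SIZE)
orange_seconds = seconds_to_reach(3164)

# Expected guesses if every word is equally likely, versus the assumed
# Zipf-like model with the dictionary sorted by descending likelihood.
uniform_expected = expected_guesses([1.0] * DICTIONARY_SIZE)
zipf_expected = expected_guesses(
    [1.0 / rank for rank in range(1, DICTIONARY_SIZE + 1)]
)

print(f"exhaustive: {exhaustive_seconds:.0f} s, orange: {orange_seconds:.1f} s")
print(f"expected guesses, uniform: {uniform_expected:.1f}")
print(f"expected guesses, Zipf-like: {zipf_expected:.1f}")
```

Even under the uniform assumption the expected cost is about half of the exhaustive figure; any skew in how people actually pick words pulls it down further, provided the cracker guesses in roughly the right order.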

Phrases are also constraints

Baekdal suggests that you use a phrase, because it’s longer and thus (“mathematically”) more secure.  But using a phrase — like using a word — is itself leaking information. Let’s look at some of the example passphrases he offers:

this is fun
alpine fun
fluffy is puffy
yummy salted peanuts

Things to note here:

  • each of these is a syntactically legal word-sequence
  • each is also a syntactic constituent (S, NP, S, NP respectively)

Baekdal asserts, furthermore, that one need not fool about with case: “The lower-case password is already secure enough for online use.” He also says:

Even if we used a much smaller dictionary, with say 1,000 words, that is still 1 billion combinations, with 100 requests per second, it would take 115 days of continued hacking.

This assertion occupies the same naive peninsula as the one above: it assumes the cracker must exhaustively search the entire space. Instead, a cracker should explore three-word sequences in descending order of likelihood, and all of the examples that Baekdal suggests are reasonably plausible, syntactically valid chunks of English. Again, encouraging users to choose their passwords from a particularly well-structured territory (legal English grammatical constituents!) allows blackhats to constrain their search.
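Exploring sequences in descending order of likelihood does not require building all billion combinations first. The sketch below enumerates phrases lazily, best first, with a priority queue. It assumes the words are chosen independently (a unigram model), which is the crudest possible choice; a serious attacker would presumably swap in an n-gram model or a grammar so that constituents like the examples above float to the top. The function name and the toy probabilities are made up for illustration.

```python
import heapq
from itertools import islice


def phrases_by_likelihood(ranked_words, length=3):
    """Yield (probability, phrase) pairs, most probable phrase first.

    ranked_words: list of (word, probability) sorted from most to least
    probable. ASSUMPTION: words are independent, so a phrase's probability
    is the product of its word probabilities.
    """
    words = [w for w, _ in ranked_words]
    probs = [p for _, p in ranked_words]

    def score(indices):
        result = 1.0
        for i in indices:
            result *= probs[i]
        return result

    start = (0,) * length
    heap = [(-score(start), start)]  # negated scores turn heapq into a max-heap
    seen = {start}
    while heap:
        negated, indices = heapq.heappop(heap)
        yield -negated, " ".join(words[i] for i in indices)
        # Each successor swaps one position for the next less likely word,
        # so no successor can be more probable than the phrase just yielded.
        for pos in range(length):
            if indices[pos] + 1 < len(words):
                nxt = indices[:pos] + (indices[pos] + 1,) + indices[pos + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-score(nxt), nxt))


# Toy, made-up probabilities purely for illustration.
toy = [("this", 0.4), ("is", 0.3), ("fun", 0.2), ("alpine", 0.1)]
first_guesses = list(islice(phrases_by_likelihood(toy), 5))
for probability, phrase in first_guesses:
    print(f"{probability:.3f}  {phrase}")
```

Because the generator is lazy, the cracker pays only for the guesses actually made, and the better the language model behind the ordering, the earlier a phrase like “this is fun” turns up.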

Pass what?

Perhaps the real mind-virus is that the words “passphrase” and “password” both suggest that you should use a “word” or a “phrase”, which binds a user’s secret to the structural distributions of their language. Can we start saying “passstring” instead? (Short answer: no. But I’m open to better suggestions that don’t involve three s characters in a row!)
