Let’s look at our new word lists carefully. I have used Lexos for today’s numbers, so we will move to the official Lexos count of words:
- In the Shire text: 11,073 words
- Distinct words within the Shire: 3,013 words
- Words which occur only once in the Shire: 2,029
- Words which occur only in the Shire: 559 (that’s from me, not Lexos)
That rate of 27% distinct words (3,013/11,073) is crazy high! And 18% unique words (words which occur only once: 2,029/11,073)? In a non-technical work? Pretty much unheard of in contemporary work. I would be excited to compare this to Lewis Carroll, C.S. Lewis, and perhaps Patrick Rothfuss! (Oh, why is Lexos’ count different from mine? Lexos counted the word “Chapter” nineteen times, as well as the Roman numerals for chapters and any numerals in the text.)
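For anyone who wants to check this kind of arithmetic at home, here is a minimal sketch of the three counts and the two rates. It is not how Lexos does it: the `word_stats` and `rate` helpers are made up for illustration, and the assumption that a "word" is a run of letters (with optional internal apostrophes) is mine, so numerals are skipped and the totals will not match Lexos exactly.

```python
import re
from collections import Counter


def word_stats(text):
    """Return (total words, distinct words, words occurring exactly once)."""
    # Assumption: a word is a run of letters, optionally with internal
    # apostrophes. Lexos tokenizes differently (it also counts numerals).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(words)
    once = sum(1 for n in counts.values() if n == 1)
    return len(words), len(counts), once


def rate(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


sample = "In a hole in the ground there lived a hobbit"
total, distinct, once = word_stats(sample)
print(total, distinct, once)

# The Shire numbers quoted above, taken straight from the post.
print(rate(3013, 11073), rate(2029, 11073))
```

Lowercasing everything first means "In" and "in" count as one distinct word, which is a choice, not a law; a different choice would shift the distinct count a little.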
But before I hand-type The Slow Regard of Silent Things (about half the length of The Hobbit), let’s compare The Hobbit to The Hobbit. I used our new random-text-grab script to create a same-size file using words from the whole work. (Dear Dave Kale and other number fans: the random grab works with replacement.)
- In the Random text: 11,073 words
- Distinct words: 2,050 words
- Words which occur only once in the grabbed text: 1,109
So that’s a rate of 18.5% distinct words and 10% unique words. Drat it all. You know me, now I have to see those rates for the whole work.
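The random-text-grab script itself is not shown here, so the following is only a guess at its general shape: a sketch of drawing a same-size sample of words with replacement. The `random_grab` name and the `seed` parameter are hypothetical, not taken from the real script.

```python
import random


def random_grab(words, size, seed=None):
    """Draw `size` words from `words` at random, with replacement."""
    # A private Random instance keeps the grab repeatable when a seed is given.
    rng = random.Random(seed)
    # random.choices samples WITH replacement: the same position in the
    # source text can be picked more than once.
    return rng.choices(words, k=size)


whole_work = "in a hole in the ground there lived a hobbit".split()
grab = random_grab(whole_work, 25, seed=1)
print(len(grab), len(set(grab)))
```

Because the draw is with replacement, the grab can be longer than the source list, as it is in this toy example. Frequent words are picked in proportion to how often they appear, but word order is thrown away.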
- In the whole text: 97,436 words
- Distinct words: 12,325
- Words which occur only once in the total text: 7,091
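In the total text we have a rate of 12.6% distinct words and 7.3% unique words. The randomly grabbed text is not the same: at 18.5% and 10%, it sits between the whole work and the Shire text. Rates like these tend to fall as a text gets longer, which is why a same-size grab is the fair comparison.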
… and Tech Support just texted me that they would be able to moosh on the code so that the floating-point math definitely doesn’t get in the way. For total transparency, this is where Tech Support goes far and away beyond me. I’ve heard of floating points, and I cheered in the 90s when they were available because I saw the numbers behave better… but I wouldn’t know a floating point from a non-floating point to save my soul.
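I do not know exactly what Tech Support has in mind, so take this as a generic illustration rather than their fix. It shows the classic floating-point surprise, and one common way around it: keep the raw counts as whole numbers and only turn them into decimals at the moment of display. The `exact_rate` helper is invented for this example.

```python
from fractions import Fraction

# Floating-point numbers store most decimals only approximately,
# so sums that look obvious on paper can miss by a hair.
print(0.1 + 0.2 == 0.3)


def exact_rate(part, whole):
    """Keep a rate as an exact fraction built from two integer counts."""
    return Fraction(part, whole)


shire = exact_rate(3013, 11073)   # distinct words in the Shire text
grabbed = exact_rate(2050, 11073)  # distinct words in the random grab

# Comparisons on fractions are exact; rounding happens only when printing.
print(shire > grabbed)
print(f"{float(shire * 100):.1f}% vs {float(grabbed * 100):.1f}%")
```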