The Shire and Mirkwood compared to random text grabs.

From earlier this week: The Shire text uses 11,119 words, of which 1,484 do not appear in Mirkwood.  This is counting every word each time it is used – “yes” counts as six words.  That’s 13.3% Shire words.

What we learned today: The Shire text compared to a random word grab of the same sample size – 1,339 Shire words do not match my random text.  That is 12%, basically indistinguishable from the 13.3% Mirkwood difference.  Hmm, fascinating!  Yet most of our Lexos graphs which show both regions paint them as very different from one another at the word level.  Hold on…

Oho!  The Mirkwood text has more words – 16,400 – and only 1,265 are different from a random grab of 16,400 words from the whole novel.  That’s 7.7%.  Very different, my friends!

Let’s clean that up a bit:

  • Shire text: 11,119 words
  • Shire words not appearing in Mirkwood: 13.3%
  • Shire words not appearing in Random text: 12%
  • Mirkwood text: 16,400 words
  • Mirkwood words not appearing in the Shire text: 14.6%
  • Mirkwood words not appearing in Random text: 7.7%
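
For Word Fans who want to check percentages like these at home, here is a minimal sketch in Python. It is not how Lexos or Tech Support's scripts do it; the function names `tokenize` and `percent_not_in` are made up for illustration, and the rule for what counts as a word (lowercased letters, with apostrophes and hyphens allowed inside) is an assumption. It counts every occurrence, so a word used six times counts six times.

```python
import re

def tokenize(text):
    """Split text into lowercase words (assumed rule: letters, with
    apostrophes or hyphens allowed between letters)."""
    return re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())

def percent_not_in(sample_text, other_text):
    """Percent of word occurrences in sample_text whose word never
    appears anywhere in other_text."""
    sample = tokenize(sample_text)
    other_vocabulary = set(tokenize(other_text))
    if not sample:
        return 0.0
    missing = sum(1 for word in sample if word not in other_vocabulary)
    return 100.0 * missing / len(sample)
```

Calling `percent_not_in(shire_text, mirkwood_text)` and then swapping the two arguments would give the two directions of the comparison; the same function works against a random grab.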

Well, well, well.  Time to poke at Mirkwood a bit, friends.  Also, it’s time to use the newly-discovered Lexos feature “how many of these words are unique”!  See you soon!

 

 

Thank you, Tech Support

My very dear Tech Support has added a new tool to the Digital Humanities Toolkit (which is also linked on our About page).  It is random-choice.py.  It will grab as many words as you choose from a given text file, as randomly as a computer can grab, and present them in your Terminal window along with the ordinal number of each word in the text.  It will grab numbers as though they are words, but it will not grab things inside of square brackets (like our paragraph references) or double-x (like our phrase separator).  It’s a short little bit of code, so I simply copy/pasted it from GitHub to a text file and named it random-choice.py.  Seems to have worked.
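
To give a flavor of what such a script does, here is a rough sketch in Python. This is not Tech Support's actual code, which lives in the Toolkit. The function name `random_words` is invented, and several details are assumptions: that the phrase separator is the literal token "xx", that words are numbered after the bracketed references and separators are removed, and that no word position is picked twice.

```python
import random
import re

def random_words(text, n, seed=None):
    """Pick n words at random; return (position, word) pairs in text order."""
    # Drop anything inside square brackets, such as [02.028] references.
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Split on whitespace (so numbers count as words) and skip the
    # assumed "xx" phrase separator.
    words = [w for w in text.split() if w.lower() != "xx"]
    # Number the remaining words starting from 1.
    numbered = list(enumerate(words, start=1))
    rng = random.Random(seed)
    # Sample without replacement, never asking for more than exist.
    picks = rng.sample(numbered, min(n, len(numbered)))
    return sorted(picks)

if __name__ == "__main__":
    demo = "[02.028] At first they had passed xx through hobbit-lands"
    for position, word in random_words(demo, 3):
        print(position, word)
```

Passing a `seed` makes the grab repeatable, which is handy when you want to compare the same random sample against two different texts.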

Thank you to Daroc Alden, who always has the time to write a little script for their Mama, even during mid-terms.

Then they came to lands where people spoke strangely

I am following a little rabbit-trail, Word Fans, about dialogue and narration in the Shire.  What are the characteristics of these bits which distinguish it from all the other bits?  Won’t this be fun!

[02.028] At first they had passed through hobbit-lands, a wide respectable country inhabited by decent folk, with good roads, an inn or two, and now and then a dwarf or a farmer ambling by on business. Then they came to lands where people spoke strangely, and sang songs Bilbo had never heard before.

It would be luxurious to include all the prose about the Shire as well, but my current project has made me stare at a deadline and hmph at it.  For our purposes, then, I am counting “In the Shire” as from [01.001] to [02.028], up to but not including the words in the title of this post, plus [19.028] to the end, [19.048], inclusive.

To pass on a tantalizing bit of my thought, I’m calling “In Mirkwood” from [07.154] through [09.069], inclusive.

The plan is to use the Mirkwood text as the stopwords to look at the Shire text and vice-versa…  I wonder if I need to do this for all regions and chart their differences from the Shire?  I may have to.  If I don’t come up for air in a few days, please send chocolate.

The Shire text uses 11,119 words, of which 1,484 do not appear in Mirkwood.  This is counting every word each time it is used – “yes” counts as six words.  That’s 13.3% Shire words.  There are 562 words used in the Shire which are not used anywhere else in the book – 5%.  And yes, I see the logical error there and am going to – soon! – compare the Shire Text with a similarly sized sample.  If I’m lucky, Tech Support can create a “grab a random sample of text from here of size N” script.

The Mirkwood text uses 16,400 words, of which 2,400 do not appear in the Shire, and variations on “spider” account for about 60 of these.  That’s 14.6%.  Nearly identical.  I do find it odd that the Mirkwood text numbers come out on an even “400” – I will chase that for a while with your indulgence, Word Fans.

What shall we do with Mountain-king?

In my mission to identify which hyphenated words are Tolkien original compositions, I have used the Oxford English Dictionary’s word on whether something like “Moss-green” is only ever found as “mossgreen” or “moss green”.  If the hyphenated form is not attested, I’ve given it the “JRRT” tag.

Further, if the hyphenated form is found in OED, but the only example is from Tolkien’s work, I’m giving him credit for putting together this form as his own intentional style.

“Mountain-king”, however, has three examples, one of which is Tolkien’s and one of which comes earlier.

I would love to hear from you, Word Fans!  This is the type of art-work that has crept into what I thought would be the cut-and-dried list-making of this project.

Thanks for your notes, Word Fans – I have reached clarity.  Since the other examples of “Mountain king” do not have the hyphen (unbehyphenated?), I am giving JRRT credit for an original-ish spelling.

First Pass for the Food Words!

Word Fans, I have done it!  All the way from “cellar” to “tobacco-jar”, I have scanned for all the food words, common and uncommon, and entered them into the concordance.  I’m certain to have missed some, and I am humbly ready to call this my First Pass.  Alert Readers who put me wise to food words I have missed will have a verse written in their honor in the style of the Tra-la-la-lalley Elves.

Let it be noted that I have already had a good argument with myself over “supplies”, and have decided that it’s not a food word.  It is used in “food-supplies”, which is counted separately, and in all other instances can indicate “bandages” as well as it stands for “food”.

Next I will make some lovely graphs of food words.  I’m interested in their frequency and location in the text; I also have an idea in the back of my mind to do a deeper analysis including a negative valence for those times that food words indicate a lack of food.

As I made this first pass, I also took the chance to improve my file of the text.  I’ve eliminated many of the phrase-breaks which left only one-word phrases, fussed with punctuation breaks, and started keeping an eye out for use or non-use of a marked subjunctive.

Concordance-reversed!

Great news!  On the About page, you can always find listed the tools which have been created for this project.  As of yesterday, Tech Support added Concordance-reversed.py to the Digital Humanities Toolkit.

Concordance.py will take your text and strip out a set of Stop Words.  It’s what I used to strip out the Ten Thousand most common words from The Hobbit.

Concordance-reversed.py will strip out everything but a set of Go Words!  It’s what I’m using now to find specific lemmas, such as Fish, Fishes, Fish’s, Fishes’, Fished, Fishing.  Put all the different forms of your lemma into a text file of Go Words, and you’re on your way!
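
The difference between the two tools can be sketched in a few lines of Python. These are not the actual scripts from the Toolkit; the function names `strip_stop_words` and `keep_go_words` are invented for illustration, and matching without regard to capitalization is an assumption.

```python
def strip_stop_words(words, stop_words):
    """Keep every word that is NOT in the stop list (the Concordance idea)."""
    stop = {w.lower() for w in stop_words}
    return [w for w in words if w.lower() not in stop]

def keep_go_words(words, go_words):
    """Keep ONLY the words that are in the go list (the reversed idea)."""
    go = {w.lower() for w in go_words}
    return [w for w in words if w.lower() in go]

if __name__ == "__main__":
    words = ["He", "fished", "for", "fish", "while", "fishing"]
    go_list = ["fish", "fishes", "fish's", "fishes'", "fished", "fishing"]
    print(keep_go_words(words, go_list))
    print(strip_stop_words(words, ["he", "for", "while"]))
```

Both calls in the little demo leave the same three fish words standing, one by naming what to keep and the other by naming what to throw away.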

I love you, Tech Support, from your grateful Mama.

Rescuing lovely uncommon forms of common words

In the beginning of this project, I needed to simplify our list and I bid farewell to such beauties as “shod” and “unbeknown”.  My current occupation, now that I am free to expand our concordance in any manner that is useful to us, is to find those delicious words and record them properly in our project.  I have begun with those which I noted in this blog as I had to wave them goodbye.

10K tag complete

It took a while, but all the Concordance entries so far have been tagged “10K” so that we can make some new entries of common words this weekend.  A good handful of the entries to date will also get the new “common” tag, of course, as they are common words spelled in a gollumesque way.

Lately

What’s happening lately in the project is a boatload of behind-the-scenes work.  I am not surprised (but still chagrined) to learn that repetitive tasks I could have completed in a day when I was 100% focused on this project take many, many weeks when I am returned to the workaday world of family and profession.

I have been proofreading, correcting entries in my spreadsheet, double-checking for words which slipped through the cracks, searching for more onomatopoeia, and making judgement calls on a few more food words (like dining-room).  Today I am still marking the uncommon words with the 10K tag so that I’ll be ready to add more common words very soon.  Already the post of Feminine Pronouns is crying out to become the official entry on “She”.