Double-checking, Project Gutenberg, and The Ten Thousand

As I approach the end of the month, I am double-checking details, spelling, and formatting.  I checked my stopwords file – the list we have called The Ten Thousand.  The source list contains only words that are used at least 20 times per million.  I set up my filter to take the first ten thousand of those words in order of use, and thus was born my stopwords file.  This morning, I checked not only that there were no more than ten thousand words, but also that there were in fact that many.

Oh, dear.  There were just over five thousand.

In searching for my next resource, I discovered that Project Gutenberg’s corpus is being used to create just such a list!  It’s a work in progress, updated as more texts enter the project, and it runs to 100,000 words.

Thank you, Project Gutenberg!

I’ve created my new stopwords file without any trouble.  Let’s see how things turn out, shall we?
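For anyone who wants to avoid the same surprise, here is a minimal sketch of the kind of filter described above, with the count check built in so a short list fails loudly instead of silently. It is not the exact script used for this project: the function name `build_stopwords`, the `(word, count)` input format, and the choice to lowercase words are all assumptions made for illustration.

```python
def build_stopwords(entries, limit=10_000):
    """Return the `limit` most frequent distinct words, most frequent first.

    `entries` is an iterable of (word, count) pairs. This input shape is an
    assumption; adapt the parsing to whatever the real frequency list uses.
    """
    # Sort by count, highest first. sorted() is stable, so ties keep
    # their original order.
    ranked = sorted(entries, key=lambda pair: pair[1], reverse=True)

    stopwords = []
    seen = set()
    for word, _count in ranked:
        key = word.lower()  # assumption: treat "The" and "the" as one word
        if key in seen:
            continue
        seen.add(key)
        stopwords.append(key)
        if len(stopwords) == limit:
            break

    # The check that was missing the first time around: make sure the
    # source actually supplied as many words as were asked for.
    if len(stopwords) < limit:
        raise ValueError(
            f"only {len(stopwords)} words available, wanted {limit}"
        )
    return stopwords
```

Called with a source list that holds only about five thousand words and the default `limit`, this raises a `ValueError` rather than quietly writing a half-sized file.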
