As I approach the end of the month, I am double-checking details, spelling, formatting. I checked my stopwords file – the list we have called The Ten Thousand. In its original format, the source lists words that are used at least 20 times per million. I set up my filter to take the first ten thousand of those words in order of use, and thus was born my stopwords file. This morning, I checked not only that there were no more than ten thousand words, but that there were indeed at least that many.
Oh, dear. There were just over five thousand.
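For the curious, a filter and count check of this kind might look roughly like the Python below. It is a minimal sketch, not my actual filter: the names `take_top_words` and `check_count` are made up for illustration, and the lowercasing and de-duplicating steps are assumptions about how such a list might be cleaned. The point is that taking "the first ten thousand" quietly returns fewer when the source runs out early, so the count has to be checked in both directions.

```python
def take_top_words(ranked_words, limit=10_000):
    """Return at most `limit` unique words, keeping the source order.

    Assumes `ranked_words` is already sorted from most to least used.
    Lowercasing and de-duplication are illustrative assumptions.
    """
    seen = set()
    top = []
    for word in ranked_words:
        word = word.strip().lower()
        if not word or word in seen:
            continue  # skip blank lines and repeated words
        seen.add(word)
        top.append(word)
        if len(top) == limit:
            break  # stop once the limit is reached
    return top


def check_count(words, expected=10_000):
    """Describe whether the list has too many, too few, or the expected number of words."""
    actual = len(words)
    if actual > expected:
        return f"too many: {actual} words (expected {expected})"
    if actual < expected:
        return f"too few: {actual} words (expected {expected})"
    return f"ok: {actual} words"


# A tiny stand-in for a frequency-ordered source list.
source = ["the", "of", "and", "The", "", "to"]
stopwords = take_top_words(source, limit=10_000)
print(check_count(stopwords))  # too few: 4 words (expected 10000)
```

A source that only goes down to 20 uses per million simply stops before the limit is reached, which is why the slice alone never raises an alarm.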
In searching for my next resource, I discovered that Project Gutenberg’s corpus is being used to create just such a list! It’s a work in progress, updated as more works enter the project, and it goes out to 100,000 words.
Thank you, Project Gutenberg!
I’ve easily created my new stopwords file. Let’s see how things turn out, shall we?