Part of Speech tagging

Discussions > Literary Computing



1 Petroglyph
Edited: May 16, 2022, 1:40 pm

For this Lunch Break Experiment (tm), I'll share some of the tinkering I've been doing with Part of Speech (POS) tagging in R and Stylo.

Stylo comes with a function called parse.pos.tags. It takes a text (or an entire corpus) that has been tagged for Parts of Speech and separates the tags from the words. Here are the first sentences of Wuthering Heights as tagged by the Stanford POS tagger:

"I_PRP have_VBP just_RB returned_VBN from_IN a_DT visit_NN to_TO my_PRP$ landlord_NN -_: the_DT solitary_JJ neighbor_NN that_IN I_PRP shall_MD be_VB troubled_VBN with_IN ._. This_DT is_VBZ certainly_RB a_DT beautiful_JJ country_NN !_. In_IN all_DT England_NNP ,_, I_PRP do_VBP not_RB believe_VB that_IN I_PRP could_MD have_VB fixed_VBN on_IN a_DT situation_NN so_RB completely_RB removed_VBN from_IN the_DT stir_VB of_IN society_NN ._."

PRP is "pronoun, personal", VBP is "verb, present, non-3rd person", RB is "adverb", VBN is "verb, past participle", and so on.

From a text like this, you can use parse.pos.tags to extract just the POS tags, which look like this:

PRP VBP RB VBN IN DT NN TO PRP$ NN : DT JJ NN IN PRP MD VB VBN IN . DT VBZ RB DT JJ NN . IN DT NNP , PRP VBP RB VB IN PRP MD VB VBN IN DT NN RB RB VBN IN DT VB IN NN .

(Or, you can use the same function to strip away the tags and get this: "I have just returned from a visit to my landlord: the solitary neighbor that I shall be troubled with...")
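To show the idea behind this separation, here is a minimal base-R sketch of what parse.pos.tags does with Stanford-style "word_TAG" input. The split_tagged helper is hypothetical (my own, not stylo's implementation):

```r
# Hypothetical helper: split Stanford-style "word_TAG" tokens into
# a word vector and a tag vector (parse.pos.tags does this for you)
split_tagged <- function(tagged) {
  tokens <- strsplit(tagged, "\\s+")[[1]]
  # Everything after the last underscore is the tag;
  # everything before it is the word (handles tags like PRP$)
  words <- sub("_[^_]*$", "", tokens)
  tags  <- sub("^.*_",    "", tokens)
  list(words = words, tags = tags)
}

parsed <- split_tagged("I_PRP have_VBP just_RB returned_VBN")
parsed$words  # "I" "have" "just" "returned"
parsed$tags   # "PRP" "VBP" "RB" "VBN"
```

Keeping the tag as "everything after the last underscore" matters because some tags, like PRP$ for possessive pronouns, contain non-letter characters.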

Stylo can apply this function to a single text, or to an entire corpus. If you do the latter you'll end up with a corpus consisting of nothing but POS tags.

You can then use the Stylo applications I have demonstrated in this group to, say, look at groups of three POS tags and use them for authorship attribution.
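The "groups of three POS" feature is just overlapping n-grams over the tag sequence. Stylo builds these for you (it has a make.ngrams function), but the idea can be sketched in a few lines of base R; pos_ngrams is a hypothetical helper of my own:

```r
# Hypothetical helper: turn a POS sequence into overlapping n-grams,
# the kind of feature counted when you ask Stylo for 3-grams
pos_ngrams <- function(tags, n = 3) {
  if (length(tags) < n) return(character(0))
  sapply(seq_len(length(tags) - n + 1),
         function(i) paste(tags[i:(i + n - 1)], collapse = " "))
}

pos_ngrams(c("PRP", "VBP", "RB", "VBN", "IN"))
# "PRP VBP RB" "VBP RB VBN" "RB VBN IN"
```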

Of course, tagging a text with POS like this requires following the instructions here, with advanced-looking code that, while copy/pasteable, looks forbidding and unfamiliar at first glance.



Tagging a few novels

So instead of copying/pasting code, I ran a few novels through TreeTagger, which at least has something of a GUI in addition to a command-line interface. Initially, I just tagged a few novels by Jane Austen (6), Anne Bronte (2), and Joseph Conrad (5), but later I added a few by Charlotte Bronte (4), Marie Corelli (3), Emily Bronte (1), Ronald Firbank (2), Dorothy Richardson (3) and Virginia Woolf (3). That's 29 in total, by nine different authors.

Here is the TreeTagger output for that famous first line of Pride and Prejudice. I've added the green column to explain what the abbreviations mean. (Right-click to embiggen)



All tags are correct, except "acknowledged", which should be VVN "past participle." (It's still a verb, though, so the superordinate class is not wrong -- it hasn't switched to noun or something else egregious). Note, in particular, line 22, where want is not characterized as a verb but as a noun. That is entirely correct.

Here is what that sentence looks like when Analyzemywriting.com does the tagging:



Several words are simply not assigned a POS (e.g. determiners) and are therefore completely ignored, and some are assigned the entirely wrong word class (I've marked the errors in red): that is a complementizer, and AMW should ignore it as it does all the others; must is an auxiliary; want is a noun. Some of AMW's output is correct, but the kinds of errors AMW commits don't inspire much confidence for larger quantities of data.

Anyway. I imported the TreeTagged texts into R. There are actually multiple ways of getting Stylo to recognize them.

  1. Using the tidyverse packages dplyr and stringr, it would have been easy to extract the combinations of word and POS and convert them to a string where the word and the POS are connected with an underscore: "It_PP is_VBZ a_DT truth_NN universally_RB...". Once you write that to a .txt file, you can run Stylo > parse.pos.tags and select the Stanford tagging format.

  2. But why the long way round? You could just put your TreeTagger-tagged texts in a folder, set your Working Directory in R as that folder, and then use Stylo commands like these (the hash-tagged lines are comments explaining what the next line of code does) to directly use the TreeTagger style of output:

    # Load a corpus that consists of all the files in the working directory and call it "temp_corpus"
    temp_corpus = load.corpus(files = "all", encoding = "UTF-8")
    # Apply the parse.pos.tags function to temp_corpus and put the extracted POS tags in an object called "processed_corpus_pos"
    processed_corpus_pos = parse.pos.tags(temp_corpus, tagger = "treetagger", feature = "pos")

    And boom, just like that, with two lines of code, you've converted your POS-tagged texts into a corpus ready for analysis.

    (There has to be a way to batch-tag an entire corpus in TreeTagger. Probably on the command line. But that is something for a future Lunch Break Experiment (TM).)

  3. Or you can do some table-wrangling to extract only the POS column (again using dplyr and stringr) and arrange it as a single string. The novels then appear like this:

    NN NN PP VBZ DT NN RB VVD IN that DT JJ NN IN NN IN DT JJ NN MD VB IN NN IN DT NN RB RB VVN DT NNS CC NNS IN PDT DT NN MD VB IN PP JJ VVG DT NN DT NN VBZ RB RB VVN IN DT NNS IN DT VVG NNS WDT PP VBZ VVN IN DT JJ NN IN DT CD CC JJ IN PP NNS NA PP JJ NP NP NA VVD PP NN TO PP
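The extraction step in option 3 can also be sketched without the tidyverse. TreeTagger's output is tab-separated, one "word, POS, lemma" triple per line, so pulling out the POS column is a one-liner in base R (the toy treetagger_lines vector below stands in for a real file read with readLines):

```r
# Toy stand-in for readLines() on a TreeTagger output file:
# each line is "word<TAB>POS<TAB>lemma"
treetagger_lines <- c("It\tPP\tit",
                      "is\tVBZ\tbe",
                      "a\tDT\ta",
                      "truth\tNN\ttruth")

# Split each line on tabs and keep the second field (the POS tag)
pos_column <- sapply(strsplit(treetagger_lines, "\t"), `[`, 2)
# Flatten the tags into one space-separated string, as in option 3
pos_string <- paste(pos_column, collapse = " ")
pos_string  # "PP VBZ DT NN"
```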

Anyway, I did not go the automated way and instead went with the third option: I extracted the POS tags myself and arranged them in 29 .txt files, one for every novel. These files I fed to Stylo and told it to look at groups of three POS: we're essentially asking whether the various texts can be distinguished based on sequences of Parts of Speech.

A few novels in, I decided to run a quick sanity check to see if it worked. Here is a graph of the 100 most frequent sequences of three POS in two novels each by Anne Bronte and Jane Austen:



Cool! Looks like it works!

And here is the graph for the novels by Austen, Anne Bronte and Joseph Conrad, measuring the 300 most frequent sequences of POS:



All three authors are correctly separated. And what's more, the two authors closest to each other in time (Austen & Bronte) form an initial subcluster before being joined by Conrad, who a) wrote much later, and b) spoke and wrote English as a second language.

Here is a graph for all 29 novels, measuring the 300 most frequent combinations of three POS:



Only Charlotte Bronte's Shirley is on the wrong branch: it's placed with Corelli's books. But all the others are correctly grouped together. What is more, there are two large groupings that are quite distinct from one another: the oldest authors, who wrote from the late 1700s to the mid-1800s (Austen, the Brontes) or who emulated that style (Corelli), and the more modernist 20th-century authors (Conrad, Woolf, Richardson and Firbank).

Finally, I also ran the software on single Parts of Speech -- basically a most-frequent-words search on a corpus with only 49 unique "words"/POS. I did so for the interim corpus Austen + ABronte + Conrad as well as for all 29 novels:



The twelve novels in the left-hand picture are quite alright, even when there are only 49 Parts of Speech to run stats on. But the cracks are beginning to show in the right-hand picture. There still is this big separation between the "older" and the "more recent" groups, and many authors are correctly separated. But the Bronte novels are much more mixed; two of Charlotte's are judged closest to Firbank's books (now there's an unexpected combination!); and one of Conrad's novels is placed among Corelli's books.

Clearly there are limits to what the ranking of single POS can do (the proportion of nouns vs verbs vs adverbs etc): the reliability decreases as the number of authors involved increases. Ngrams of three Parts of Speech are still quite successful, though.



But I only want percentages!

Of course, if you don't want to do anything with these POS and all you want to know is how many there are of each, it's easy to call up a list. Here's one for Conrad's Under Western Eyes, in descending order. I added a column called "Proportional" that contains the percentages:



If all you want is "main" POS (but why would you??), you can filter on all the POS that start with N (i.e. all the nouns), or filter on all the rows that start with VV (i.e. all the verbs that aren't be, have, do):



And so on. And here's the sum of all the types of Verb in Under Western Eyes (including be, have, do):



Doing all of this manually for each individual text is relatively quick and easy, especially if you rename each text imported into R as, say, "tagged_text": this will allow you to run all the operations on that object, and you'll just be able to reuse the same code for each text. Essentially, you'll use a line of code importing the text; a line of code renaming it to "tagged_text"; and a few operations to extract the various Parts of Speech, sum their numbers, and calculate their percentage. Once you've written and tested all the operations, you can extract these figures from each text in under thirty seconds, since you're just copy/pasting the same chunk of code for each novel.
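The reusable chunk of code described above boils down to a frequency table, a percentage column, and a couple of filters. Here is a minimal base-R sketch, run on a toy tag vector in place of a full novel (the tagged_text name follows the renaming convention suggested above):

```r
# Toy stand-in for the POS tags of one imported, renamed text
tagged_text <- c("NN", "VVD", "NN", "JJ", "NNS", "VB", "VVN", "NN")

# Frequency table in descending order, plus a "Proportional" column
counts <- sort(table(tagged_text), decreasing = TRUE)
proportional <- round(100 * counts / sum(counts), 2)

# All nouns: tags starting with N
noun_total <- sum(counts[grepl("^N", names(counts))])
# All lexical verbs: tags starting with VV (excludes be, have, do)
verb_total <- sum(counts[grepl("^VV", names(counts))])
noun_total  # 4
verb_total  # 2
```

Once this is written and tested, re-running it for another novel really is just a matter of re-importing, renaming to tagged_text, and re-executing the same lines.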

Of course, if you're serious about this, you'll point R at a folder full of .txt files, run a for-loop and apply the same calculations to each text in turn, which you then write to a file. But that's programming, and we can't have that. So I didn't and just extracted these numbers for a single text.
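For the curious, that loop really is short. A sketch, assuming a folder of whitespace-separated POS-only .txt files as produced above (count_pos and the output filenames are my own, hypothetical choices):

```r
# Hypothetical helper: POS percentages for one text, descending
count_pos <- function(tags) {
  round(100 * sort(table(tags), decreasing = TRUE) / length(tags), 2)
}

# Loop over every .txt file in the working directory and write
# each text's POS percentages to a matching _pos.csv file
files <- list.files(pattern = "\\.txt$")
for (f in files) {
  tags <- scan(f, what = character(), quiet = TRUE)
  write.csv(as.data.frame(count_pos(tags)),
            sub("\\.txt$", "_pos.csv", f))
}
```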



Fine-grained POS vs only 7 POS

How well do the seven POS that Analyzemywriting.com uses perform?

Starting from the POS-only files I produced in the previous section, I again used R (dplyr and stringr) to replace all the NN (singular noun), NP (proper noun), NNS (plural noun), NPS (proper noun, plural) with "NOUN"; I replaced all the JJ (adjective), JJR (adjective, comparative) and JJS (adjective, superlative) with "ADJECTIVE", and so on:



I did this for nouns, adjectives, verbs, adverbs, prepositions, pronouns, auxiliaries. Words like and, but, or and the, a(n) are completely ignored by AMW, so I removed those from consideration. Words like "three" and "hundred" are counted as Nouns by AMW, so I did the same thing.
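The collapsing step (done in the original with dplyr and stringr) can be sketched in base R with a simple lookup table. The mapping below covers only the noun and adjective tags mentioned above, as an illustration; collapse_tags is a hypothetical helper:

```r
# Partial lookup: fine-grained TreeTagger tags -> coarse categories
coarse <- c(NN = "NOUN", NNS = "NOUN", NP = "NOUN", NPS = "NOUN",
            JJ = "ADJECTIVE", JJR = "ADJECTIVE", JJS = "ADJECTIVE")

# Hypothetical helper: map each tag through the lookup,
# leaving any unmapped tag unchanged
collapse_tags <- function(tags) {
  out <- coarse[tags]
  out[is.na(out)] <- tags[is.na(out)]
  unname(out)
}

collapse_tags(c("NN", "JJR", "VBZ", "NPS"))
# "NOUN" "ADJECTIVE" "VBZ" "NOUN"
```

Extending the lookup to verbs, adverbs, prepositions, pronouns and auxiliaries, and dropping the ignored categories, reproduces the seven-way division.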

Note that, because the 49 fine-grained POS are much more precise and accurate than AMW's tagging, these seven collapsed categories are much better organized and correspond better to the actual texts than AMW's own output would (cf. must be in want of a wife). However, since Analyzemywriting does not provide the option of exporting a tagged text, I had to do it this way.

For the interim corpus of Austen, ABronte and Conrad, things look alright:



Neat: even 7 Parts of Speech can be used to differentiate the texts of three fairly distinct authors.

However, does this scale? This is what the 29 novels look like when separated by only 7 crude POS:



In short: no. Apart from Austen (in green), not a single author's books are all grouped together. The two books each by Anne Bronte (in red) and Ronald Firbank (in grey) cluster together, but other works are grouped with them, so that doesn't really count for all that much. There is some structure in this dendrogram -- often, two novels by the same author are grouped together -- so it isn't quite pandemonium. But the point of diminishing returns has clearly been passed somewhere between the twelve Austen-ABronte-Conrad novels and these 29 by nine authors.

The 49 more fine-grained POS used above were still mostly fine at this level, but even they were reaching the point of diminishing returns here. So while 49 fine-grained POS are clearly better than 7 big buckets, even they do not scale well to a many-author corpus. (Keep in mind that these 7 POS, based as they are on larger groupings of fine-grained TreeTagger POS, are already performing more accurately than AMW would.)

Does this mean that only distinguishing 7 POS is entirely useless? Well, if all you do is look at their relative proportions, yes. But what if you could look at, say, groups of 3 POS? AMW won't let you do that. But R and Stylo do:



This is pretty good! Corelli and Charlotte Bronte are not quite correct, but the rest is! It appears we're reaching the point of diminishing returns here, too, though. Alright, let's try sequences of four POS:



Everything is correct again! But we can add only so many texts and authors before even sequences of 4 won't be reliable any more.



Conclusion

POS are a solid tool to help with authorship attribution. But they work best when a few conditions are met:

  • The more fine-grained the POS division used, the more accurate the results.
  • As the number of texts and the number of different authors grows, an analysis based on POS becomes increasingly unreliable.
  • These quickly diminishing returns can be counteracted by looking at sequences of two, three, four... POS.
  • But for corpora of dozens or even hundreds of texts, it would be better not to include POS at all, since they appear to be unreliable under those circumstances.