Linguistic Inquiry and Word Count


1Petroglyph
Apr 1, 2023, 8:50 am

I thought I might post the writeup of another Lunch Break Experiment (TM) -- just a few notes and exploratory graphs. This time, I've played around a little with sentiment analysis, which has most successfully been formalized in an approach called Linguistic Inquiry and Word Count.



What is LIWC?

Linguistic Inquiry and Word Count, or LIWC for short, is a computer-assisted approach to language analysis that attempts to say meaningful things about the psychology of the author based on the texts they produce. Not only individuals but also groups can be studied this way.

For example, this software has been used to analyze the different thinking and arguing styles of Trump and Clinton during the 2016 US presidential election, and the prevalent emotional responses in tweets about the two candidates; it's been used to estimate the effect (positive or negative) that a depression diagnosis has by comparing patients' Reddit posts before and after their diagnosis; and it's been used to track customer satisfaction and identify what kinds of experiences lead people to leave better reviews (which can then be exploited in future marketing campaigns).

In short: this methodology has applications in social media analyses, clinical psychology, and customer research. There's only one study (to my knowledge) that uses LIWC for authorial attribution, and it's this one:

Boyd, Ryan L., and James W. Pennebaker. 2015. ‘Did Shakespeare Write Double Falsehood? Identifying Individuals by Creating Psychological Signatures With Text Analysis’. _Psychological Science_ 26 (5): 570–82. https://doi.org/10.1177/0956797614566658


If you really get down to it, the way this works is very simple in principle: the software compares the words in the target texts to a series of built-in dictionaries, in which words and multi-word phrases have been marked as "positive" or "negative", or as expressing anger, authenticity, and so on. At the end of that process, the software spits out scores: how much of the target text, percentage-wise, expresses positive emotions? Or authenticity?
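
To make the general idea concrete, here's a toy sketch in R -- emphatically not the actual LIWC engine (its dictionaries are proprietary and far larger); the word lists and the sample sentence are made up for illustration:

library(dplyr)
library(tidytext)

positive_words <- c("love", "wonderful", "delight")
negative_words <- c("hate", "terrible", "gloomy")

tibble(text = "What a wonderful, wonderful day; I love it, despite the gloomy weather.") %>%
  unnest_tokens(word, text) %>%    # one lower-cased word per row, punctuation stripped
  summarise(
    total        = n(),
    pct_positive = 100 * sum(word %in% positive_words) / total,
    pct_negative = 100 * sum(word %in% negative_words) / total
  )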

This official LIWC help page explains a few of the metrics they implement: "clout" is a measure of how confidently, assuredly, or authoritatively people write. It can be used to assess, e.g., degree of expertise in a subject. "Analytical thinking" looks at several kinds of function words to assess "formal, logical, and hierarchical thinking patterns" -- think "therefore", "whereas". A low score on this metric implies "more intuitive and personal" language.

More sophisticated analyses of this kind can also handle more complex negations, like "I'm not saying I dislike your product" or "I don't hate dill-flavoured crisps", which a simple look-up table would categorize as plainly negative. Or cases like "Thankfully, this latest splatter-fest is rife with gore and bloody guts", where superficially negative words actually convey a very positive message. Jokes and sarcasm are further problems. I wonder how "And Brutus is an Honourable man" is categorized by LIWC -- superficially, it's positive, but in context, it's biting sarcasm.

(So I just tried that last one: in isolation, that sentence is judged to be positive, of course -- there's no context to say otherwise. But the entire "Friends, Romans, countrymen" monologue is judged to be twice as negative as positive, because there are plenty of other negative words in there to contextualize the sarcasm. So that works out.)
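
For the curious: here is one crude way to catch simple negations with R and tidytext -- just a sketch, not something the graphs below make use of. It tokenizes the text into two-word sequences and flips the polarity of a sentiment word whenever the word right before it is a negator; the list of negators is an ad-hoc choice of mine:

library(dplyr)
library(tidyr)
library(tidytext)

negators <- c("not", "no", "never", "don't", "doesn't", "isn't")

tibble(text = "I don't hate dill-flavoured crisps") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%           # two-word sequences
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  inner_join(get_sentiments("bing"), by = c("word2" = "word")) %>%   # word2 is a sentiment word
  mutate(sentiment = if_else(word1 %in% negators,
                             if_else(sentiment == "negative", "positive", "negative"),
                             sentiment))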

But of course, this is a Lunch Break Experiment (tm), so I'm just going to poke at this for a bit until it starts feeling like work.




Trying it out

While I would like to play with the actual LIWC software, the free online demo of LIWC is limited to 5,000 characters (roughly 1,000 words), so I'm going to have to use an alternative.

Fortunately, there exist packages for R that perform this kind of analysis -- sort of, in some limited and very specific ways. I picked the package "tidytext", which does positive/negative emotions and seems like an accessible first step into this area.

Here is a snippet from the built-in dictionary that input texts are compared to:

[image: excerpt from the sentiment dictionary]

Even though the complete word list runs to 6,786 items, this is a fairly basic dictionary, in that it scores words as merely positive or negative; words not included in this look-up table are considered "neutral" in emotional tone and are not counted. More sophisticated dictionaries have gradations of these emotional values (words are scored as, say, +3 or -2), which would be more appropriate for a serious dive. But for a quick Lunch Break Experiment (tm), this will do.
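
If you want to peek at these dictionaries yourself, tidytext exposes them directly. A quick sketch (the graded AFINN lexicon is fetched through the textdata package, which may ask for a one-time download confirmation):

library(tidytext)

get_sentiments("bing")    # the 6,786-item list used below: word + "positive"/"negative"
get_sentiments("afinn")   # a graded alternative: each word scored from -5 to +5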

Anyway. Here is a sentiment analysis graph of six Jane Austen novels:



[graphs: sentiment trajectories of six Jane Austen novels]

These graphs show to what extent positive or negative words outweigh each other in equally-sized segments of the novels: if there are roughly equal numbers of positive/negative words, the bar will be near zero; but if one significantly outweighs the other, the bar will be proportionally large in the relevant direction.

Plenty can be said about these graphs, I'm sure. Just one thing that jumps out at me -- and this is probably a spoiler for most of these novels -- is that, in general, the happy endings are preceded by a gloomier segment, which itself is often preceded by a peak (or a buildup) of positivity. So that's a neat visualisation of storytelling techniques. It also looks like Sense and Sensibility and Pride and Prejudice (and perhaps even Emma and Persuasion) have very similar emotional arcs -- the highs and lows seem to come in similar orders.

I do want to stress, however, that this graph was extremely easy to produce. All the effort it took was basically following the instructions on this page. I'm not exaggerating: all I did was copy/paste the code there into my R console and press enter.
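
For reference, the pipeline on that page boils down to roughly the following (my lightly commented paraphrase; variable names and details may differ from the original):

library(janeaustenr)
library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)

austen_sentiment <- austen_books() %>%
  group_by(book) %>%
  mutate(line = row_number()) %>%                        # number the lines within each book
  ungroup() %>%
  unnest_tokens(word, text) %>%                          # one word per row, line number kept
  inner_join(get_sentiments("bing"), by = "word") %>%    # keep only dictionary words
  count(book, index = line %/% 80, sentiment) %>%        # tally per 80-line slice
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)                # net sentiment = bar height

ggplot(austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")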

However, if you read that page carefully, understand what each line of code does, and have a little experience with R, then it's easy-peasy to apply this code to other texts. I threw a few random books from my corpora into a folder and ran the code.
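
The adaptation to an arbitrary text is minimal. Here's a sketch for a single plain-text file (the file name is only a placeholder; in practice you'd loop this over every file in the folder):

library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)

raw <- readLines("wuthering_heights.txt", encoding = "UTF-8")

tibble(line = seq_along(raw), text = raw) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(index = line %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative) %>%
  ggplot(aes(index, net)) +                              # plot net sentiment per slice
  geom_col(show.legend = FALSE)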

Wanna see the positive/negative emotions for Jane Eyre? Here you go:



[graph: sentiment trajectory of Jane Eyre]

People more familiar with this novel than I am (I read it once, many years ago) can probably chime in on how accurate this is. But it looks like it tracks the ups and downs of the novel quite well: there's the unhappy childhood at the start, and I think I can see the sections that concern the friendship at school, the madwoman in the attic, and that whole missionary business.

What about Wuthering Heights? I remember that as a pretty bleak book, full of family feuds and people who are mean to each other. The visualization turned out even bleaker than I expected:



[graph: sentiment trajectory of Wuthering Heights]

There are only a few segments in this novel where the positive words outweigh the negative ones, and those hover around a mere +10; a single positivity peak around +28 is an outlier. The negativity is much more sustained, outweighing the positive words across almost all sections, and regularly reaching beyond -20; there are many, many dips into the high -30s and a few beyond -40.

Another famously bleak book is Heart of Darkness, and here's what that looks like:



[graph: sentiment trajectory of Heart of Darkness]

The negative words outweigh the positive ones throughout: occasional positivity peaks of around 10 are more than offset by the consistent negativity of around -10, regularly sinking to almost -20.

Compared to that, The Wonderful Wizard of Oz should be a much happier book. And indeed, so it is:



[graph: sentiment trajectory of The Wonderful Wizard of Oz]

The positivity rates go up to roughly +30, and peak negativity is only -16 or thereabouts; for Heart of Darkness, positivity peaked around +12 and negativity around -22. They're almost polar opposites!

A few more for the road? Here are graphs for Virginia Woolf's Mrs. Dalloway, Iain Banks' The Wasp Factory and Ruth Plumly Thompson's Ozoplaning with the Wizard of Oz:


[graphs: sentiment trajectories of Mrs. Dalloway, The Wasp Factory, and Ozoplaning with the Wizard of Oz]



I'm not going to pretend that this is the full range of LIWC analyses -- it so obviously isn't. All I've done is produce a few graphs visualizing positive/negative emotions in a handful of novels. But even this first foray into sentiment analysis goes far beyond a simple "% of positive/negative words in an entire text": it shows you exactly where in a novel things switch from one to the other, and the general shape of the graphs -- how the peaks and troughs alternate, or whether they do at all -- is useful information.

And this view can be zoomed in or out. Here are three versions of the Wuthering Heights graph, at different levels of magnification:


[graphs: Wuthering Heights at three levels of magnification]

The bars in these graphs cover successively larger stretches of the text, meaning that you can adjust the granularity depending on what you want to use these graphs for (chapter-by-chapter, scene-by-scene).
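
Presumably all that changes between those three graphs is the divisor in the index calculation. A sketch, assuming a per-word data frame like the one built in the earlier snippets (here called tidy_words, with a line column and the joined sentiment column):

library(dplyr)
library(tidyr)

net_per_slice <- function(tidy_words, slice_size) {
  tidy_words %>%
    count(index = line %/% slice_size, sentiment) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
    mutate(net = positive - negative)
}

fine   <- net_per_slice(tidy_words, 20)    # many short slices: closer to scene-by-scene
medium <- net_per_slice(tidy_words, 80)    # the slice size used in the graphs above
coarse <- net_per_slice(tidy_words, 320)   # a handful of long slices: broad story arcs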


But that's gonna be it for today, because taking this further would involve actual work.

A more systematic analysis of a curated corpus might reveal genre trends, for instance, or changes over time (Victorian novels vs. modernist novels vs. post-modern novels). Are second instalments in a trilogy really the bleakest? It would also be interesting, I think, to look at what might unfavourably be called "assembly line" works -- books produced according to a pre-arranged formula. Think Goosebumps or Fear Street (speed-written in the 90s to come out at a rate of one a month), James Patterson thrillers (Patterson has ghostwriters work from a scene-by-scene framework), A.E. van Vogt's books (he wrote on the principle that something should shake up the plot every 800 words), or Harlequin books (which target well-established and explicitly formulated comfort zones).

But those are ideas for another Lunch Break Experiment (tm).

2Keeline
Apr 3, 2023, 9:51 am

I have not had a chance to look at the detailed instructions for this method. So I can get a better understanding of the graphs, what is the unit of each bar — sentence, paragraph, or ???

How does word count fit into this analysis? Perhaps I missed where this is stated in my brief time with the summary.

It is generally said that Jules Verne’s works have a less optimistic view of technology and its ability to make life better in his later works. A theory is that the ones written after he was shot in the leg by a nephew who suffered temporary insanity might be a turning point. Analysis of this has potential if one knew the dates of writing. It might be necessary to work with French texts and word lists to minimize issues with rewriting alterations in translation.

James

3Petroglyph
Apr 4, 2023, 5:45 am

>2 Keeline:

what is the unit of each bar — sentence, paragraph, or ???

For most of these graphs, the unit of each bar corresponds to 80 lines of the novel, counting only the words that occur in the get_sentiments("bing") dictionary (so function words and other neutral words are ignored).

In more detail: the code places every word in the novel on its own row, with the original line number next to it. For Wuthering Heights, that is 10,202 lines. Then it applies the sentiment dictionary via an inner_join, which only keeps the rows where the novel's words and the dictionary's words are identical; all other rows are dropped. Finally, the index = line %/% 80 part in this bit of code:

count(book, index = line %/% 80, sentiment)

creates a column called "index" that assigns each word to an 80-line slice by integer-dividing its line number by 80. For Wuthering Heights, 10,202 / 80 is 127.525, but the %/% operator drops the fractional part, so the index runs from 0 up to 127. This "index" column is then used to produce the graph, with one bar per slice (128 in this case, counting slice 0).
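
For anyone unfamiliar with the operator:

10202 / 80     # ordinary division: 127.525
10202 %/% 80   # integer division: 127 (the fractional part is dropped)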

How does word count fit into this analysis?

The words that are counted are those words in the novel that happen to match the 6,786 words in the sentiment dictionary. For each slice of 80 lines, you then subtract the number of negative words from the number of positive ones, and that difference is the height of the bar: a slice with, say, 45 positive and 62 negative matches gets a bar of -17.

Jules Verne’s works have a less optimistic view of technology and its ability to make life better in his later works

I did not know that! And it sounds like an excellent idea for a future experiment. Thanks!

It might be necessary to work with French texts and word lists to minimize issues with rewriting alterations in translation

Oh, definitely. I know there are boatloads of issues with the various translations (translators liberally cutting / adding / rewriting). The tricky part is going to be to find a reasonably-complete French-language dictionary tagged for sentiment.

shot in the leg by a nephew who suffered temporary insanity

I did not know that either! A permanent limp, it says on la wiki. Poor fellow. The nephew was never again released from the mental asylum: an even poorer fellow.

4prosfilaes
Apr 4, 2023, 10:13 am

>3 Petroglyph: A permanent limp, it says on la wiki.

I actually had to check to make sure you didn't mean la.wikipedia.org; unfortunately, their article on Iulius Verne doesn't include that fact.