  1. Creating the sets of tests
  2. Producing data from the content
  3. Results
  4. Analysis of results

Creating the sets of tests

I have two test sets, taken from the corpus of Paul d’Estournelles de Constant. After reading the content, I separated the letters according to what I deduced was their theme: one set is mostly about war, the other deals with other topics. There are nine letters in each test set; the war set is about 31 pages long, the other about 76 pages long (for more information about the dataset, see here).

To analyse those sets, I first needed to concatenate each of them into one file, which is done with the groundtruth.py script. I then have two text files: one with all the content from the war set and one with all the content from the other set.
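As a rough illustration, here is a minimal sketch of what such a concatenation step can look like; the file names and folder layout are assumptions, not the actual organisation of the corpus.

```python
from pathlib import Path

# Hypothetical sketch of the concatenation step in groundtruth.py:
# join the letters of one set into a single text file.
def concatenate(letter_folder, output_file):
    letters = sorted(Path(letter_folder).glob("*.txt"))
    text = "\n".join(p.read_text(encoding="utf-8") for p in letters)
    Path(output_file).write_text(text, encoding="utf-8")

concatenate("letters/war", "war_set.txt")      # assumed paths
concatenate("letters/other", "other_set.txt")
```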

Once this is done, I have to transform this content further to obtain only lists of words. I remove all the punctuation, numbers, etc., put all the words in lowercase, and then use the NLP tool spaCy to tokenize my text and obtain two lists of results: one with all the tokens in their original text form and one with those tokens lemmatized, that is, in their dictionary form (e.g., “aidera” stays “aidera” in the first list but becomes “aider” in the second). To do so, I use the producing_tokens_and_lemmas.py script. I will then retrieve information and statistics from my sets, but in two series. The first series is the one I just produced. The second is a corrected version of these sets, obtained by removing all the words that do not exist in the dictionary. For this, I use a script working in two steps: the first step extracts the incorrect words from the lists and the second removes them. The goal is to see, in the end, whether the presence of incorrect words distorts the results.
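To make the pipeline concrete, here is a minimal sketch of these steps, assuming spaCy’s small French model (fr_core_news_sm) and a plain word list as the reference dictionary; both are assumptions, not necessarily what the actual scripts use.

```python
import re
import spacy

nlp = spacy.load("fr_core_news_sm")  # assumed French model

def tokens_and_lemmas(path):
    # Lowercase the text and strip punctuation and digits before tokenizing.
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    text = re.sub(r"[^a-zàâäçéèêëîïôöùûüÿœæ\s'-]", " ", text)
    doc = nlp(text)
    tokens = [t.text for t in doc if not t.is_space]
    lemmas = [t.lemma_ for t in doc if not t.is_space]
    return tokens, lemmas

def remove_unknown(words, dictionary):
    # Second series of sets: keep only words found in a reference word list.
    return [w for w in words if w in dictionary]

war_tokens, war_lemmas = tokens_and_lemmas("war_set.txt")
other_tokens, other_lemmas = tokens_and_lemmas("other_set.txt")
```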

Producing data from the content

To analyse the vocabulary of those sets, I create three new vocabulary lists: the vocabulary only found in the war set, the vocabulary only found in the other set, and the common words. Furthermore, I look at their sizes to see whether there are many similarities or not. Finally, I also take an interest in the part of speech of the words, to see if some types are more present in certain lists. For this analysis, I have two scripts: one for the first two processes I mentioned and one for the last (both are sketched below). The first script, creating_list_of_tokens.py, uses two methods, difference() and intersection(), and a function, counting(). The first method, difference(), produces two lists containing only the words unique to each set. The second, intersection(), produces the words common to both sets. The function counting() produces dictionaries counting the occurrences of each token present in the lists given as input. This function creates several dictionaries: first, dictionaries counting the words of the war set and of the other set; then, from those, the dictionaries for the unique and common results, which cannot be built from the lists created previously, because set() produces a collection with no duplicates. To produce them, I simply take the complete dictionaries and remove either the unique words or the common words to obtain my three new dictionaries. The second script, producing_part_of_speech_tagging.py, again uses the NLP tool spaCy to retrieve the part-of-speech tag of each token from the given lists (e.g., ‘abdication’ = NOUN, ‘ignorer’ = VERB, ‘et’ = CCONJ).
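Here is a minimal sketch of both scripts, assuming the token lists and the nlp pipeline from the previous sketch; implementing counting() with collections.Counter is an assumption about the actual code.

```python
from collections import Counter

war_vocab, other_vocab = set(war_tokens), set(other_tokens)

unique_war = war_vocab.difference(other_vocab)    # words only in the war set
unique_other = other_vocab.difference(war_vocab)  # words only in the other set
common = war_vocab.intersection(other_vocab)      # words shared by both sets

def counting(words):
    # Occurrences of each token in a list.
    return Counter(words)

war_counts = counting(war_tokens)
other_counts = counting(other_tokens)

# The unique/common frequencies are derived from the complete dictionaries,
# since the sets above have lost the duplicates.
unique_war_counts = {w: war_counts[w] for w in unique_war}
unique_other_counts = {w: other_counts[w] for w in unique_other}
common_counts = {w: war_counts[w] + other_counts[w] for w in common}

# POS tagging (producing_part_of_speech_tagging.py): spaCy labels each token,
# reusing the nlp pipeline loaded in the earlier sketch.
pos_tags = {tok.text: tok.pos_ for tok in nlp(" ".join(sorted(common)))}
```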

Results

The different results of this analysis can be found at the following links:

Analysis of results

First of all, although I already knew this from choosing my two test sets, we can observe a big difference between the vocabulary of the war set and that of the other set. I knew that one was much longer than the other, but I thought it might not be clearly visible in the results. It is, however, quite noticeable, first by the sheer numbers: the other set contains about twice as many tokens and lemmas as the war set. Then, when we separate unique and common words, the difference remains noticeable: the common words represent half of the war set but only a quarter of the other set, which suggests that the latter contains a lot of unique words that will not really help our study.
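These proportions are simple ratios over the vocabularies built above; a quick check could look like this (the figures in the comments are the ones reported in this paragraph):

```python
# Assuming war_vocab, other_vocab and common from the previous sketch.
print(len(other_vocab) / len(war_vocab))   # roughly 2: twice as many words
print(len(common) / len(war_vocab))        # about 0.5: half of the war set is shared
print(len(common) / len(other_vocab))      # about 0.25: a quarter of the other set
```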

One of the interesting points of this study comes from the choice of keeping not only the tokens but also their lemmatized version. With it, we can observe a rather big difference in size between the two sets: the war set, the smaller one, loses about one fifth (20%) of its vocabulary after lemmatization, and the other set loses about a quarter (25%). After this reduction, the common vocabulary barely moved, only about 5% (about 45 words), which probably corresponds to plurals and conjugated forms collapsing into a single lemma.
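Measured on the vocabularies, this shrinkage is the share of token forms that merge into an existing lemma; a sketch, assuming the lists from the tokenisation step:

```python
# Assuming war_tokens/war_lemmas and other_tokens/other_lemmas from earlier.
war_shrinkage = 1 - len(set(war_lemmas)) / len(set(war_tokens))        # ~0.20
other_shrinkage = 1 - len(set(other_lemmas)) / len(set(other_tokens))  # ~0.25
```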

Nevertheless, if we look only at the part-of-speech tagging, represented through a pie chart, we can observe that the lists are pretty similar in their distribution, with usually only one point of difference. However, when we go into the details of the separation between common words, words unique to “war” and words unique to “other”, we start to see some differences. For example, “common” logically gets most of the determiners, adverbs, conjunctions, auxiliaries and other tags that help build a proper sentence. “Other” has far more adjectives and verbs than the other two (32% and 27%, compared with 29% and 22% for “war” and 23% and 20% for “common”) but fewer nouns (29%, against 31% for both “common” and “war”). Yet this was with the tokens, and the results look rather different with the lemmas: there, the proportions of verbs and nouns are pretty much the same across lists (26% and 34-35%). The share of adjectives differs, though: 19% (common), 22% (other) and 25% (war).
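The pie-chart percentages can be reproduced from the tags; a small sketch, assuming pos_tags from the POS sketch above:

```python
from collections import Counter

# Share of each part-of-speech tag in a list of words.
pos_share = Counter(pos_tags.values())
total = sum(pos_share.values())
for tag, n in pos_share.most_common():
    print(f"{tag}: {100 * n / total:.0f}%")
```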

If we observe the three main word clouds, the first thing we notice is the scale of the results: “common” has words in really big fonts in the centre; “other” has some big words (though much smaller than in the previous cloud), although they are drowned behind a series of unique words; the same can be said of “war”, except that it is rather poor in big words: some appear more clearly than the rest, but only faintly. However, all three are easily readable, which makes it easier to look for correlations or patterns (a sketch of how such clouds can be generated follows the list below).

  • Common: We mostly see determiners, conjunctions, adverbs and auxiliaries, which shows that even though their percentage was not that big in the POS tagging, they still predominate in this study. We can also clearly see, among the cloud, some elements from the openers and closers of the letters (butler, affectueusement, président).
  • War: This word cloud has some words that appear more clearly than others, and I can mostly discern nouns, which is consistent with the POS analysis (a third of the words are nouns). Moreover, the lexicon that emerges is clearly military (désarmement, superdreadnought, colonie, amiral, marine, capitaine, etc.).
  • Other: This word cloud seems to draw on a more political lexicon (chambre, républicain, assemblée, électeur, canton, etc.).
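For reference, word clouds like these can be produced directly from the frequency dictionaries; a minimal sketch using the wordcloud package (an assumption about the tooling, not necessarily what was used here):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Assuming common_counts from the counting sketch; the same works for the
# unique_war_counts and unique_other_counts dictionaries.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(common_counts)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.savefig("common_wordcloud.png")
```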