1. Introduction
  2. A few definitions to start with
  3. Obtaining data about our n-grams
    1. First step: Retrieving tokens from the test sets
    2. Second step: Obtaining lists of n-grams from the tokens
    3. Third step: Producing dictionaries of occurrences from the n-grams
    4. Fourth step: Cleaning the dictionaries
    5. Fifth (and last) step: Obtaining more specific data
    6. A new variable added: the ground truth of the collection studied
  4. Results
  5. Observations
    1. Observations from the table of results
      1. Comparison of results from the general tables between the three sets
      2. Comparison of results from the various data between the three sets
    2. Observations for the most and least popular tokens
      1. Most popular tokens
      2. Least popular tokens
    3. Observations of the stopwords
  6. Conclusion

Introduction

As presented previously, I already performed a lexical analysis of my test corpus (the War corpus and the Other (subjects) corpus; dataset detailed here), and it was not very conclusive, since the lexicon did not seem to be responsible for the poor results of the models. Given this result, I decided to dig deeper into the source of my tests and move to a sublexical level by studying tokens. Continuing my research on what is recognized by text recognition software, I want to know whether the patterns recognized in handwritten text recognition are those of recurring tokens, i.e. recurrent combinations of characters. This would also mean that the mistakes in the transcription are caused by patterns that are less well known from the ground truth.

A few definitions to start with

During this experiment, I will use some terms that are important to define beforehand, in order to easily understand what I am about to present. I work here with tokens and n-grams.
A token, as used here, can be defined as a string of characters. I mostly use this word to describe my experiment because it starts with a tokenization, a process in which long character sequences are split into much smaller units.
The term used for those smaller units is n-gram. An n-gram is a “contiguous sequence of n items from a given sample of text or speech” (https://en.wikipedia.org/wiki/N-gram).
Since n does not specify a length for the sequence, I will use it when I want to remain general, but during the experiments I will also use the terms created to name sequences of two characters (2-grams or bigrams), three characters (3-grams or trigrams) and four characters (4-grams or tetragrams).

Obtaining data about our n-grams

Before I can observe and deduce anything, it is essential that I obtain not only my n-grams, but also information on their composition, their occurrences, the similarities between the two test sets (War and Other), etc. This will be done in several steps.

First step: Retrieving tokens from the test sets

This is a pretty simple step, mostly because it is very close to what I already did in a script from the content analysis experiment to create my list of tokens. However, it is not the same script, because I am not cleaning the file beforehand as much as I did then: this time, some punctuation marks are important for the tokenization.
With creating_list_of_tokens_from_texts.py, I am using spaCy again, with its French model, to produce tokens. The tokenization works as follows: spaCy’s tokenizer splits at every space in the text, but also on some punctuation marks. In French, when a word is linked to its pronoun with an apostrophe, the apostrophe is kept with the pronoun as a single token. Once all the tokens from the input file have been retrieved, they are placed in a Python list written to an output file. Those two lists of tokens (War and Other) can be found in the file of results at lines 12 and 55.
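
As an illustration, here is a minimal sketch of this tokenization step, assuming a plain-text transcription as input; the file names and the exact output format are placeholders, not the actual content of creating_list_of_tokens_from_texts.py.

```python
# Minimal sketch of the tokenization step (assumed file names and output format).
import spacy

nlp = spacy.load("fr_core_news_sm")  # French model (assumed variant)

with open("transcriptions_war.txt", encoding="utf-8") as f:  # hypothetical input file
    text = f.read()

doc = nlp(text)
tokens = [token.text for token in doc if not token.is_space]

# Write the tokens as a Python list that later scripts can import
with open("list_of_tokens_war.py", "w", encoding="utf-8") as out:  # hypothetical output file
    out.write(f"tokens_war = {tokens}\n")
```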

Second step: Obtaining lists of n-grams from the tokens

Now that I have the lists of tokens, the goal is to produce different n-grams from those lists. To do so, I use the Python module textwrap and its wrap() function. It takes two arguments: a string and a maximum width in characters. The string is then divided according to the given width, and the function returns a list of the resulting chunks. In my script producing_list_of_ngrams.py, it is used as follows (a short sketch follows the list below):

  • Empty lists for each n-gram size (bigrams, trigrams and tetragrams) are created
  • A for loop iterates over every element of the chosen list of tokens (War or Other), so each token is considered individually
  • In the loop, the wrap() method is called; its first argument is the current item from the list and its second argument is either 2, 3 or 4. The idea is that, if an element of the list of tokens contains more than 2, 3 or 4 characters, it is divided into smaller units of characters to fit the n-gram size considered at the time. Then, I transform the list of outputs into a string, in order to add it to the lists of n-grams created earlier.
  • Once the outputs have been collected for each n-gram size, they are joined in dedicated n-gram lists (with some “cleaning” done to delete the punctuation from the wrap() output and to turn it into a string), then written to an output file.
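
To make the wrapping concrete, here is a minimal sketch of the logic described in the list above; the variable names and the sample tokens are assumptions, not the actual code of producing_list_of_ngrams.py (which also joins the outputs into strings and cleans them).

```python
# Minimal sketch of splitting tokens into n-grams with textwrap.wrap().
from textwrap import wrap

tokens = ["soldat", "la", "tranchée"]  # hypothetical token list

bigrams, trigrams, tetragrams = [], [], []
for token in tokens:
    bigrams.extend(wrap(token, 2))     # "soldat" -> ["so", "ld", "at"]
    trigrams.extend(wrap(token, 3))    # "soldat" -> ["sol", "dat"]
    tetragrams.extend(wrap(token, 4))  # "soldat" -> ["sold", "at"]

print(trigrams)  # ['sol', 'dat', 'la', 'tra', 'nch', 'ée']
```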

From this script, I obtain six lists, three for each set, which can be found in the file of results at lines 18 (4-grams), 30 (3-grams) and 42 (2-grams) for the set ‘Other’, and lines 61 (4-grams), 73 (3-grams) and 85 (2-grams) for the set ‘War’.

Third step: Producing dictionaries of occurrences from the n-grams

Even though obtaining the lists of n-grams is a good step in itself, it does not teach me anything as it is. To really start my study and understand the patterns in the recognition, it is essential to add a new step: counting the occurrences of each n-gram, to find those that appear on numerous occasions and those that appear only once.
To carry out this step, I used a function created for the content analysis experiment, counting.py. The function itself was used as is; the only changing elements were the input and output.
The script producing_dictionary_of_ngrams_occurences.py has a lot of lines, but the last one is the one that really matters. After creating an output file, the script reads the input, which is one of the lists from the file of results. The counting() function reads the list, counts the number of occurrences of each element present in it, and produces a dictionary. Moreover, I added some options to the output of this dictionary: according to the arguments I added, the dictionary is sorted by the number of occurrences (values), from the most frequent to the least. From this script, I obtain eight dictionaries, four for each set, which can be found in the file of results at lines 15 (tokens), 21 (4-grams), 33 (3-grams) and 45 (2-grams) for the set ‘Other’, and lines 58 (tokens), 64 (4-grams), 76 (3-grams) and 88 (2-grams) for the set ‘War’.
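
For illustration, here is a minimal sketch of what such a counting function can look like, using collections.Counter; the actual counting.py may differ in its signature and arguments, so this is only an assumed equivalent.

```python
# Minimal sketch of an occurrence-counting function (assumed equivalent of counting.py).
from collections import Counter

def counting(elements, descending=True):
    """Count occurrences and return a dict sorted by frequency."""
    counts = Counter(elements)
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=descending))

trigrams_war = ["sol", "dat", "sol", "la"]  # hypothetical input list
print(counting(trigrams_war))  # {'sol': 2, 'dat': 1, 'la': 1}
```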

Fourth step: Cleaning the dictionaries

I now have all the dictionaries that I need for my experiment and my observations, but when I look at them, I realize that they contain a lot of irrelevant data. Indeed, as presented previously, the wrap() method creates smaller units of 2, 3 or 4 characters when the string is longer than that, but there are many elements where the token was shorter than that, where punctuation is involved, or even cases where, for example, a token of 4 characters wrapped with a width of 3 produced one unit of 3 characters and one of 1 character. In that case, the first unit is interesting for us but not the second. Therefore, the goal of this step is to get rid of all this useless data and only keep strings of 2 characters for bigrams, 3 characters for trigrams and 4 characters for tetragrams.
This cleaning can be done with a text editor and regular expressions; here is an example for a dictionary of 3-grams:

  • In a text editor, copy/paste the chosen list
  • Do a find/replace with , (a comma followed by a space) in “Find” and \n (a newline) in “Replace”
  • Once each key from the dictionary is on its own line, search for the trigrams: ^('|").{3}('|"):.+
  • Then, click on “Find all”, which selects all the lines matching the regular expression
  • Copy them and paste them into another file
  • Do some manual cleaning by also deleting trigrams where one of the characters is a punctuation mark such as a dash, a quotation mark, a dot, etc.; for that, use the regex .+(-|"|\.|,|€|/|£).+\n in “Find” and do a “Replace All” with an empty field to delete those lines. Trigrams containing digits can also be deleted, because I only want letters: '[0-9]{3}':.+\n in “Find” and a “Replace All” with an empty field to delete those lines.
  • From there, there are two options:
    1. I transform it back into a list, by reversing the above command (\n in “Find” and a comma followed by a space in “Replace”) and adding the brackets and the name of the dictionary again
    2. I transform my file into a CSV by using the following replacements: ^' –> "; $ –> "; ': –> "," and then adding a header in the first line ("tokens","occurrences")

Here is a detailed explanation of the regular expressions I used:

  • ^('|").{3}('|"):.+ –> This means that I search lines with a single or double quotation marks ('|") at the begining of the line ^, then 3 characters .{3} immediatly followed by the closing quotation marks ('|") then the colon and the number of occurrences :.+. This regex ensures to dismiss all non-trigrams lines. This can also be easily adapted to bigrams or tetragrams, by changing the number in the curly brackets: .{2} for bigrams and .{4} for tetragrams.
  • .+(-|"|\.|,|€|/|£).+\n –> This means that I searched, in the line, cases where one of the characters mentioned in the parenthesis is presented between the opening of the line and the occurrences number. The newline \n is there to be sure that the whole line disappear after I clicked on the empty “Replace All”.
  • '[0-9]{3}':.+\n –> This means that I searched, in the line, cases where between the quotation marks there are only numbers. This can be adapted to other n-grams by changing the number in the curly brackets, like above.
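
For readers who prefer to script this cleaning instead of doing it in a text editor, here is a minimal sketch of an equivalent filter in Python; it is an assumed alternative, not a script from the repository, and the sample dictionary is made up.

```python
# Assumed scripted equivalent of the text-editor cleaning: keep only keys
# made of exactly n alphabetic characters (no digits or punctuation).
import csv

def clean_ngram_dict(occurrences, n):
    """Keep only keys that are exactly n letters long."""
    return {k: v for k, v in occurrences.items() if len(k) == n and k.isalpha()}

trigrams = {"sol": 12, "l'a": 4, "ée": 3, "190": 2}  # hypothetical dictionary
cleaned = clean_ngram_dict(trigrams, 3)              # {'sol': 12}

# Optional CSV export matching the ("tokens","occurrences") header
with open("trigrams_cleaned.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["tokens", "occurrences"])
    writer.writerows(cleaned.items())
```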

The cleaned versions of my dictionaries of n-grams can be found as CSV files in a folder on the repository. Each file contains six columns, because I chose to separate the results into three categories: all capitals, initials and lowercase. Otherwise, the table is ranked as follows: from the most to the least popular occurrences, and when several tokens have the same number of occurrences, they are sorted alphabetically, from A to Z.

Fifth (and last) step: Obtaining more specific data

The dictionaries are now cleaned, which means that I can properly conduct my observations. However, even though the data I currently have could be enough for my work, I also wanted some more precise elements and results, so I decided to do a few more transformations and to gather specific data from what I already had.

To be more precise in my observations, I decided to add two other types of lists: the most and least popular tokens, based on the number of occurrences. Considering the results I got, I chose a threshold of 11 occurrences or more for the most popular and only one occurrence for the least popular. I collected them for every n-gram size and added them to the file of results (a small sketch of this filtering follows the list below):

  • The lists for the most popular tokens can be found at the lines 24 (4-grams), 36 (3-grams) and 48 (2-grams) for the set ‘Other’, and the lines 67 (4-grams), 79 (3-grams) and 91 (2-grams) for the set ‘War’.
  • The lists for the least popular tokens can be found at the lines 27 (4-grams), 39 (3-grams) and 51 (2-grams) for the set ‘Other’, and the lines 70 (4-grams), 82 (3-grams) and 94 (2-grams) for the set ‘War’.
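
As a sketch of how those two lists can be derived from a cleaned dictionary, the snippet below applies the thresholds mentioned above (11 or more occurrences, exactly one occurrence); the names and the sample dictionary are hypothetical.

```python
# Minimal sketch of extracting the most and least popular tokens by threshold.
occurrences = {"sol": 25, "dat": 11, "la": 3, "ée": 1}  # hypothetical cleaned dictionary

most_popular = [ngram for ngram, count in occurrences.items() if count >= 11]
least_popular = [ngram for ngram, count in occurrences.items() if count == 1]

print(most_popular)   # ['sol', 'dat']
print(least_popular)  # ['ée']
```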

From those lists, I went even deeper, to search for the common and unique elements between the sets. For that, I used two methods that had already proven effective during the content analysis experiment: difference() and intersection(). This is done in the script creating_list_of_tokens_from_ngrams.py, which operates in three steps:

  1. In the preamble, the lists that will be exploited are imported from the file of results mentioned many times in this document;
  2. New lists of common and unique tokens are created and stored in variables;
  3. An output file is created and each new list is written in it, with its variable name as its list name.

There are two versions of this script; the one not wanted can be disabled by enclosing it in triple quotation marks. One version inputs and outputs the lists of the most popular tokens and the other those of the least popular tokens.
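
Here is a minimal sketch of the comparison logic based on difference() and intersection(); the inline lists are placeholders, since creating_list_of_tokens_from_ngrams.py actually imports its lists from the file of results.

```python
# Minimal sketch of the common/unique comparison between two sets of tokens.
most_popular_war = ["sol", "dat", "la"]    # hypothetical imported list
most_popular_other = ["sol", "la", "vie"]  # hypothetical imported list

common_tokens = set(most_popular_war).intersection(most_popular_other)
unique_to_war = set(most_popular_war).difference(most_popular_other)
unique_to_other = set(most_popular_other).difference(most_popular_war)

with open("common_and_unique_tokens.py", "w", encoding="utf-8") as out:  # hypothetical output
    out.write(f"common_tokens = {sorted(common_tokens)}\n")
    out.write(f"unique_to_war = {sorted(unique_to_war)}\n")
    out.write(f"unique_to_other = {sorted(unique_to_other)}\n")
```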

Then, I extracted the information from the lists of the most popular tokens, to give a concrete idea of what was most popular in terms of tokens. Those popular tokens can be found in the following document. In it, there are tables with the alphabetically sorted lists of popular tokens, with three columns, one for each n-gram size. There are also two versions of those tables: a plain one with the lists and one where the stopwords are highlighted in each list.

I now have a large number of lists (38) and dictionaries (8), and I thought it might be easier to convert some of that data into tables of numbers, to get statistics about my results. Thus, I created a table of results, divided into three parts: set Other, set War and comparison. For the comparison, I used the results from the lists obtained with the script mentioned a few paragraphs above. There are six tables: three that give the length of each list (all tokens, most popular and least popular tokens), one that gives the percentages of all caps, initials and lowercase for each n-gram, and two that give the percentages of tokens unique to each set compared to all the tokens.
For each set, there are six tables. Three present the quantity and percentage of most and least popular tokens, along with the presence of stopwords among the popular tokens. The other three separate the results of occurrences into three formats: tokens in all caps, with an uppercase initial, and all lowercase. The tables themselves contain several rows that correspond to ranges of occurrences: only 1, 2 to 5, 6 to 10, 11 to 50. When they appeared in the results, I also added 51 to 100 and 101 to 500. In one table, the ranges “501 to 1000” and “more than 1000” were also added.
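
As an illustration of how those ranges can be computed from a cleaned dictionary, here is a minimal sketch; the bucket boundaries come from the text above, while the names and sample values are assumptions.

```python
# Minimal sketch of bucketing occurrence counts into the ranges used in the tables.
occurrences = {"sol": 250, "dat": 11, "la": 3, "ée": 1}  # hypothetical dictionary

ranges = {"1": 0, "2-5": 0, "6-10": 0, "11-50": 0, "51-100": 0,
          "101-500": 0, "501-1000": 0, ">1000": 0}

for count in occurrences.values():
    if count == 1:
        ranges["1"] += 1
    elif count <= 5:
        ranges["2-5"] += 1
    elif count <= 10:
        ranges["6-10"] += 1
    elif count <= 50:
        ranges["11-50"] += 1
    elif count <= 100:
        ranges["51-100"] += 1
    elif count <= 500:
        ranges["101-500"] += 1
    elif count <= 1000:
        ranges["501-1000"] += 1
    else:
        ranges[">1000"] += 1

print(ranges)  # {'1': 1, '2-5': 1, '6-10': 0, '11-50': 1, '51-100': 0, '101-500': 1, ...}
```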

Finally, with those tables, I also decided to add a graphical representation of the data, in the form of bar charts. The graphics can be found here. The vertical axis represents the quantity of tokens. The horizontal axis is divided into parts, one for each range of occurrences. In each range, there are three bars, one for each n-gram size, to compare them more clearly: bigrams in green, trigrams in blue and tetragrams in yellow.
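
For reference, a grouped bar chart of that kind could be produced with matplotlib as in the sketch below; the counts are placeholders, and only the colour scheme and the grouping by occurrence range come from the description above.

```python
# Minimal sketch of a grouped bar chart per range of occurrences (placeholder data).
import numpy as np
import matplotlib.pyplot as plt

ranges = ["1", "2-5", "6-10", "11-50"]
bigram_counts = [69, 40, 25, 101]      # hypothetical values
trigram_counts = [300, 150, 60, 80]    # hypothetical values
tetragram_counts = [800, 200, 50, 30]  # hypothetical values

x = np.arange(len(ranges))
width = 0.25

plt.bar(x - width, bigram_counts, width, color="green", label="bigrams")
plt.bar(x, trigram_counts, width, color="blue", label="trigrams")
plt.bar(x + width, tetragram_counts, width, color="yellow", label="tetragrams")

plt.xticks(x, ranges)
plt.xlabel("Range of occurrences")
plt.ylabel("Quantity of tokens")
plt.legend()
plt.show()
```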

A new variable added: the ground truth of the collection studied

The collection studied here through the test sets was already used to produce an efficient text recognition model, trained on about 250 transcribed pages, which gave a model with 97.98% accuracy. This model and its ground truth represent a huge gap in terms of numbers, and I thought it would be interesting to add it to the study, for comparison and to see whether it can help answer the initial question of the effect of tokens on the recognition.
Therefore, all of the results files that I mentioned earlier and that I summarize below also contain data about that ground truth.

Results

Here is a summary of the different results produced from the whole experiment.

Observations

Observations from the table of results

Comparison of results from the general tables between the three sets

In total, the set Other (SO) has far more tokens than the set War (SW): almost 3 times more for tetragrams and about 2.5 times more for trigrams and bigrams. The difference in all caps is not very big, usually only about 100 more for SO; it also remains small for the rest, barely 1-3% of all the tokens.
In terms of distribution (percentages), both sets are pretty much the same; the numbers do not really change from one to the other. The number of lowercase tokens is huge, notably in bigrams (more than 32k for SO), which makes sense given the quantity of text.
Between the table of all tokens and the unique counts, I can see a huge difference in numbers. For SO, the number of tokens was divided by 4.2 for tetragrams, 9.3 for trigrams and 48.8 for bigrams. For SW, the number of tokens was divided by 2.7 for tetragrams, 5.2 for trigrams and 23.9 for bigrams. This is logical, because there is much less text in SW than in SO. Moreover, those high divisors, in both sets, suggest that many tokens have a lot of occurrences, far more than those with only one, which could explain why the models are working.
In terms of the ground truth, those numbers are even bigger, with totals in the hundreds of thousands, which seems logical considering how many pages and lines were used to train the model, as I explained when presenting the dataset. Compared to the other sets, the set Ground truth (SGT) is about 14 times bigger than SW and 5.5 times bigger than SO for tetragrams, trigrams and bigrams. The distribution in percentages is also pretty much the same as in the other two sets; the only slight difference is with the all capitals, whose percentage is usually 2 or 3 points higher than in the other two (which could be explained by the fact that some lines of the ground truth were composed only of capital letters, so that the model could learn them better). Between the table of all tokens and the unique counts, the difference in numbers is even bigger: the number of tokens was divided by 10.6 for tetragrams, 27.02 for trigrams and 185.6 for bigrams. This could be a reason for the great recognition of this model (about 98% accuracy), because it has far more occurrences while also having a great diversity of single occurrences, as we will observe afterwards.

Comparison of results from the various data between the three sets

All caps

In terms of unique tokens, SO and SW do not have many in all caps, which could explain the difficulty in recognizing those parts of the text, because the model does not have enough examples to learn how to recognize them. This can also be observed in the occurrence tables: for SO, it is mostly 1 to 5 occurrences in the text and not many with more than 10 occurrences, and for SW, it is pretty much the same, with many single occurrences but barely more. On the other hand, the SGT should be much better at recognizing words in all caps, not only because it has far more unique all-caps tokens, about five to ten times more for trigrams and tetragrams, but also because their occurrences are quite high, reaching the 51-to-100 range and even more than 100, while the others usually topped out at 50 occurrences. Moreover, for the bigrams, for example, there are not many more unique tokens than in the other sets (281 for SGT, 143 for SO and 105 for SW), but the occurrences are much higher, with 101 unique tokens having 11 to 50 occurrences, 25 having 51 to 100 and even 15 having more than 100.

Initials

For the initials, SO and SW sometimes have far more than for all caps, with the tetragrams and trigrams, but not for the bigrams, where the number of initials is similar to all caps, whether in SO or SW. The SGT has even fewer unique initials for bigrams and trigrams than for all caps, but far more for tetragrams. In the occurrence tables, there are a lot of tokens with a single occurrence or 2 to 5 occurrences, which is the case for all three sets. For SO, there is also still a good number (30 to 40) with 6 to 50 occurrences for all n-grams and even a few with more than 50 for bigrams. The numbers are much lower for SW, with barely more than 10 tokens having more than 6 occurrences, except for 20 that have 11 to 50 in bigrams. That could be explained by the fact that a lot of initials could be pronouns or conjunctions, hence no more than two letters, which would mean that they exist in the bigrams but cannot be found in the other n-grams (which is indeed partly confirmed by the stopwords). At its own scale, the SGT has a pretty similar distribution, with 100 to 200 tokens having 6 to 50 occurrences. There are only 10 to 15 trigrams and tetragrams with 51 to 100 or more than 100 occurrences, but the numbers are, just like in the other sets, a little higher for the bigrams, with 20 to 30 in the high occurrence ranges.

Lowercases

The lowercase tokens are on another level, because their numbers are bigger: in the thousands for trigrams and tetragrams, and in the hundreds for bigrams. This is logical, considering that the number of all tokens was much higher for lowercase, especially compared to all caps and initials, usually by one or two orders of magnitude. One might expect the unique counts to be higher as well, but the occurrence tables show that if there are not that many unique tokens compared to the total, it is because there are many recurrent tokens with a lot of occurrences, as shown by the new rows added to the tables (2 or 3 depending on the set). The all caps and initials tables went up to 50, 100 and sometimes a little over in terms of occurrences. With the lowercase tokens, there are tokens with occurrences in the range of 100, 500, 1000 and even more than 1000 in the case of two bigrams in SO, and 52 bigrams and seven trigrams in SGT. This explains the low number of bigrams: there are fewer unique bigrams, but they are far more present. On the contrary, there are many unique lowercase tetragrams (SW: 818; SO: 1005; SGT: 1349), while there are only 69 (SW), 61 (SO) and 76 (SGT) unique bigrams. This could be explained by the fact that there are many unique words or long combinations of words, but they are made of bigrams that appear recurrently.

For SO, the number and distribution are very different from one n-gram to another. For the tetragrams, as we have seen before, there are far more unique tokens than recurring ones (46% vs 9%). For the trigrams, there are more unique tokens, but the numbers are not very different (32% vs 21%), so about a third are unique tokens and a fifth are recurring ones. The numbers are reversed for bigrams, with almost half of them being recurring tokens and a fifth being unique tokens. Among those recurring tokens, there are not a lot of stopwords, only about a tenth for all of them.
For SW, the numbers are a little different. The difference for tetragrams between single and recurrent tokens is greater: more than half (56%) are singles and fewer than 1/25 are recurrent. However, within this low number of recurrent tokens, almost a third are stopwords. For the trigrams, the recurrent tokens are 10 points below SO, while the single tokens are 10 points above, and in the tenth that is recurrent, a sixth are stopwords. For the bigrams, there are fewer recurring ones than in SO, only a third of them, with a sixth of stopwords among those. For the single tokens, a quarter of all unique tokens are single occurrences.
Finally, for the SGT, the percentages are really different from the other two: while the percentages of most popular tokens are higher, those of least popular tokens are lower. The difference between single and recurrent tokens for tetragrams and trigrams is not that big (roughly a third for the trigrams, a fifth for the recurrent tetragrams and less than two fifths for the single tetragrams). In both cases, among the recurrent tokens, there are barely any stopwords (only 3% for trigrams and tetragrams). It is with the bigrams that we can really observe the difference of the SGT: there are not many unique tokens (12%), but more than half are recurrent tokens (57%) and only a small portion are stopwords (13%), which means that the model should be skilled on more varied words.

SW seems to have the same recurrent tokens as SO and SGT, because the numbers of tokens unique to SW are either really low, or even nonexistent for bigrams compared with SO and for every n-gram compared with SGT. Compared to the SGT, the same goes for SO, which has barely any tokens unique to itself. Between SO and SW, there are a lot of common tokens in bigrams, but fewer in trigrams and tetragrams, where SO has a lot of unique ones (more than 60%). The numbers are all pretty low for SW, which means that it does not have many recurrent tokens unique to itself (less than 10% for all), which makes me think that those n-grams should be well recognized in SW whether we use MO, MW or MGT, because the models know them. On the other hand, in SO, many are unique to this set, and even though, for bigrams, the tokens unique to the set represent only 30% of all recurring occurrences, for trigrams and tetragrams it is 3/5 and 3/4 of all tokens, so those are occurrences that might not be recognized by MW. Finally, MGT should not meet any difficulty on SW, because there is no element of SW that is not also in SGT; for SO, the percentage of uniqueness to the set compared to SGT is even lower than that of SW compared to SO, with less than 4% for every n-gram, so the recognition should not be hindered.

Those numbers are very different from those of the most popular tokens, because there are many more single tokens. There are not that many common single tokens between SO and SW, SO and SGT or SW and SGT, and the percentages of unique single tokens are pretty much the same whether it is SO or SW and bigrams, trigrams or tetragrams: about 80%, so the majority of all of them. Between the SGT and the other sets, it is even in the 90-95% range. This might be a problem during the recognition no matter the model, because the models will have trouble recognizing n-grams that they have never seen, but a model might also have a problem on its own set, because it does not learn some n-grams well enough.

Observations of the stopwords

For the bigrams and the trigrams, we can observe that the differences between the sets lie mostly in the type of the stopwords: SW contains mainly lowercase stopwords and a few initials, SO contains initial and lowercase versions of the stopwords, and SGT contains all caps, initial and lowercase versions.
For the tetragrams, SW contains 20 stopwords which are all unique, except one that is an initial version of one already present. SO contains, just like before, the initial versions of most of the stopwords in SW, but also new ones that did not reach 11 or more occurrences in SW. SGT contains no new stopwords compared to SO, but has many more initial versions, or even all-caps versions in some cases.

Conclusion

If the n-grams are at the center of the model training, as I am trying to demonstrate, then, since there are many bigrams in SO with a lot of occurrences, this would explain why the recognition still works. On the other hand, the numbers are much lower for SW, which could explain the poorer recognition with MW, since it learns fewer patterns. The all caps and sometimes the initials are also much less present, which might hinder the recognition. Regarding those results, MGT should have no real difficulty in recognizing both sets, because it had a lot of recurrent tokens, but it could make some mistakes, since it has learned more character combinations with varying numbers of examples, which could lead to confusion and wrong recognition.