1. Introduction
  2. How to do a comparative analysis of transcription?
    1. Some definitions
    2. The tool: KaMI
  3. Table of results
    1. General results
    2. Results by page
      1. Set Other
        1. Letter 1358 Page 4
        2. Letter 607 Page 3
        3. Letter 607 Page 17
        4. Letter 722 Page 1
        5. Letter 1170 Page 3
        6. Conclusion/General observations
      2. Set War
        1. Letter 678 Page 1
        2. Letter 1000 Page 3
        3. Letter 1367 Page 1
        4. Letter 844 Page 1
        5. Letter 948 Page 1
        6. Conclusion/General observations
    3. New results: Model War Retrained
      1. General
      2. Results by letters
  4. Conclusion

Introduction

As I previously mentioned, my thesis aims to determine whether the lexicon of the ground truth has an impact on the efficiency of the model, especially if it is specific. To prove or refute this theory, I need to do different kinds of tests. In the last entry of my logbook, I presented the content analysis I did on two test sets I developed from the corpus of Paul d’Estournelles de Constant. The idea was to obtain a thorough knowledge of the content of those test sets.
Those test sets were selected by reading the letters, and each is supposed to have its own specific subject: one is about the war and the other is not about war but about “everything else” (for more information about the dataset, see here). After analysing the content, and notably the unique tokens from each set, the theme selection appears more clearly. The war set is full of military terms, while the ‘other’ test set seems to be more about politics and administrative business. This seems logical, as war and politics are the two main topics of discussion between d’Estournelles and Butler.
Now that I know more about my sets, I will try to demonstrate my theory by using them for transcription. With each set, I did a training on eScriptorium and produced a model. Then, I applied to each set the model developed from the opposite set, but also the model developed from its own set. The idea is to see how good the transcription can be and whether the model has problems recognizing some parts of the text because they are not in the vocabulary it was trained on; moreover, thanks to the word clouds, we know the kind of words it should have problems with.
Right from the start, I must point out that the efficiency of one of the models might be better than the other simply because of the quantity of data given for the training. Indeed, the “war” set is made of about 30 pages, while the “other” set has double that or even more, which gave the training more occurrences of characters and words to learn from, and that will be an advantage.

How to do a comparative analysis of transcription?

Before starting to check the transcriptions, it is important to ask how we will be able to evaluate the quality of a transcription and to determine how and why it did or did not work. This is achievable with the help of some metrics created to evaluate exactly this sort of result and, luckily, a specific tool has been developed to do that for us, by simply providing a reference text and a prediction.

Some definitions

There are a few metrics to know in order to understand how to evaluate the quality of a transcription. First of all, the evaluation is based on the Levenshtein distance, which is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. For example, the Levenshtein distance between “complete” and “complet” is 1, because there is one deletion (‘e’), but between “extraordinary” and “ektraodinnary” it is 3, because there is one insertion (‘n’), one deletion (‘r’) and one substitution (‘x’ -> ‘k’). Then, from that, the Word Error Rate (WER) and Character Error Rate (CER) are calculated. The WER is a way to evaluate how many words were incorrectly transcribed by a model, relative to the reference. It is obtained as follows:

WER = (word substitutions + word insertions + word deletions) / number of words in the reference

From the WER, we can also deduce the word accuracy (Wacc) of the transcription, with the following calculation:

Word Accuracy = 1 - WER

The CER works in much the same way, but at the level of characters (counting spaces, punctuation, etc.):

CER = (substitutions + insertions + deletions) / number of characters in the reference

A high CER does not necessarily mean a high WER, because the character errors could be concentrated on a few words; conversely, the CER could be quite low while the WER seems high, which would mean that the errors are spread across the transcription. This second case is usually the one most encountered, the CER being lower than the WER most of the time.

For the subsequent analysis, our metrics will be:

  • Levenshtein distance, in characters and in words
  • WER, CER and Wacc
  • Hits (number of characters correctly guessed)
  • Substitutions/Insertions/Deletions
  • Length of the reference and of the prediction
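
To make these definitions concrete, here is a minimal sketch in plain Python (not KaMI itself; the example strings are invented for illustration) that computes the Levenshtein distance at character and word level and derives the CER, WER and Wacc from it:

```python
# Illustrative sketch only: KaMI computes these metrics for us,
# but the underlying calculations look like this.

def levenshtein(reference, prediction):
    """Minimum number of insertions, deletions and substitutions
    needed to turn the prediction into the reference.
    Works on strings (characters) or on lists of words."""
    previous = list(range(len(prediction) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, pred_item in enumerate(prediction, start=1):
            cost = 0 if ref_item == pred_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Invented example: one character error concentrated in a single word.
reference = "Monsieur le Professeur, je vous remercie"
prediction = "Monsieur le Frofesseur, je vous remercie"

char_distance = levenshtein(reference, prediction)
word_distance = levenshtein(reference.split(), prediction.split())

cer = char_distance / len(reference)
wer = word_distance / len(reference.split())
wacc = 1 - wer

print(f"Levenshtein (char): {char_distance}, CER: {cer:.2%}")
print(f"Levenshtein (word): {word_distance}, WER: {wer:.2%}, Wacc: {wacc:.2%}")
```

On this toy example, the single substitution (‘P’ -> ‘F’) gives a CER of 2.5% but a WER of about 16.7% (one word out of six), which illustrates why the two rates can diverge.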

The tool: KaMI

For my comparative analysis, I want to obtain some of the metrics mentioned above. In order to do so, I will use KaMI, which stands for Kraken Model Inspector, a tool built for the evaluation of models and based on the Kraken transcription system.

Functionalities

This tool evaluates the success of a transcription task on one or several images, by comparing a correct transcription - the reference - with a prediction produced by transcribing with a chosen model. The results are a series of metrics, notably the Levenshtein distance between the reference and the prediction, the Word Error Rate (WER), the Character Error Rate (CER) and the Word Accuracy (Wacc), as well as some other statistics taken from the speech recognition domain. The lengths of the reference and the prediction are also given, which is already a quick way to spot a difference.
With the web application, it is also possible to get a ‘versus text’, which shows where the differences between the reference and the prediction are. This is a good way to locate more easily where the model had a problem, which can then help to understand how to better train and improve it. However, the web application limits the number of characters that can be submitted in the reference/prediction (7,000 characters), so it is only possible to test a little bit of the transcription at a time.
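
The ‘versus text’ is essentially an aligned comparison of the reference and the prediction. As a rough stand-in outside the web application (this is not KaMI’s actual output, and the example strings are invented), Python’s standard difflib module can show the same kind of differences:

```python
import difflib

# Invented snippets standing in for a reference transcription and a prediction.
reference = "Monsieur le Professeur, je vous remercie de votre lettre."
prediction = "Monsieur le Frofesseur, je vons remercie de votre lettre."

# SequenceMatcher aligns the two strings and reports which spans
# were kept, replaced, inserted or deleted.
matcher = difflib.SequenceMatcher(None, reference, prediction)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(f"{tag}: '{reference[i1:i2]}' -> '{prediction[j1:j2]}'")
```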

How to have access to KaMI

Table of results

General results

Results

| Metric | Set War / Model War | Set War / Model Other | Set Other / Model Other | Set Other / Model War |
| Levenshtein distance (char) | 372 | 770 | 451 | 4090 |
| Levenshtein distance (words) | 322 | 540 | 346 | 2734 |
| WER | 4.92% | 8.25% | 2.08% | 16.48% |
| CER | 0.95% | 1.97% | 0.45% | 4.08% |
| Word Accuracy | 95.08% | 91.74% | 97.91% | 83.51% |
| Hits | 38658 | 38299 | 99797 | 96425 |
| Substitutions | 279 | 557 | 274 | 3301 |
| Deletions | 67 | 148 | 118 | 463 |
| Insertions | 26 | 65 | 59 | 327 |
| Length (reference) | 39004 | 39004 | 100189 | 100189 |
| Length (prediction) | 38963 | 38921 | 100130 | 100053 |

Observations

First of all, the most striking thing we can observe is the Levenshtein distance in characters, where the gap between the model applied to the set it was trained on and the model trained on the other set is really high. For the war set, the number more than doubles, and for the other set it is multiplied by almost 10. Then, we can see from the lengths of the reference and the prediction that all predictions are missing characters compared to the reference (41 SW/MW; 83 SW/MO; 59 SO/MO; 136 SO/MW). This can be partly explained by the fact that there are a lot of deletions in every model application, while the insertions are not that high. We can also observe that the substitutions are really high, the smallest number being 274 for the SO/MO, while the SO/MW is at more than ten times that, with 3301. Overall, the word accuracy percentages are not that bad: for the models applied to the set they were trained on, we have 95% (SW) and 97% (SO), which is pretty good; for the models applied to the opposite set, we have 91% (SW) and 83% (SO). With those numbers, we can see that the MO did really well on its own set and was not so bad on the other set either. On the other hand, the MW did pretty badly on the SO and was not as good as it could have been on its own set. This tends to support the idea that the SW should have had the same number of pages in its training set as the SO, because the problem of the model comes from the lack of content rather than the content itself.
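
As a quick sanity check against the formulas given earlier, the first column (Set War / Model War) is internally consistent: the character-level Levenshtein distance equals substitutions + deletions + insertions = 279 + 67 + 26 = 372, the CER is 372 / 39004 ≈ 0.95%, and the word accuracy is 1 - 4.92% = 95.08%, all matching the values reported by KaMI.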

Results by page

Set Other

Letter 1358 Page 4
Results
| Metric | Model Other | Model War |
| Levenshtein Distance (Char.) | 7 | 127 |
| Levenshtein Distance (Words) | 6 | 50 |
| Word Error Rate (WER in %) | 8.955 | 74.626 |
| Char. Error Rate (CER in %) | 1.369 | 24.853 |
| Word Accuracy (Wacc in %) | 91.044 | 25.373 |
| Hits | 504 | 386 |
| Substitutions | 4 | 116 |
| Deletions | 3 | 9 |
| Insertions | 0 | 2 |
| Total char. in reference | 511 | 511 |
| Total char. in prediction | 508 | 504 |
Observations

The length difference between the predictions and the reference is not that high, but the metrics are not so good. For the MO, there are a few substitutions and deletions, which bring the accuracy down to only 91%. On the other hand, the MW is really bad, with a WER of 75% and a CER of 25%, due mostly to a high number of substitutions. When we look at the versus text, we can see that this page is mostly in uppercase, which can explain the bad transcription. The MO had some problems with a few similar-looking letters, while the MW struggled with most of them, rendering the transcription unintelligible.

Letter 607 Page 3
Results
| Metric | Model Other | Model War |
| Levenshtein Distance (Char.) | 1 | 21 |
| Levenshtein Distance (Words) | 1 | 21 |
| Word Error Rate (WER in %) | 0.403 | 8.467 |
| Char. Error Rate (CER in %) | 0.063 | 1.331 |
| Word Accuracy (Wacc in %) | 99.596 | 91.532 |
| Hits | 1576 | 1557 |
| Substitutions | 1 | 19 |
| Deletions | 0 | 1 |
| Insertions | 0 | 1 |
| Total char. in reference | 1577 | 1577 |
| Total char. in prediction | 1577 | 1577 |
Observations

The number of characters in the prediction is the same for both models, and also matches the reference. Only one substitution was made in the transcription with the MO, while many (19) were made with the other. The MO is almost perfect (more than 99%) while the MW is pretty accurate, even though it is not perfect (about 91%). With the versus, we can see that the MO replaced an ‘l’ with an ‘m’ inside a word, which is understandable because there was a correction in the text itself, with another character typed behind the ‘l’. The MW had the same problem with this odd letter and also had a few problems with uppercase letters, numbers and some similarly shaped letters (p/b, m/n, o/e). Overall, we can say that the MW did rather well on this image from the SO.

Letter 607 Page 17
Results
| Metric | Model Other | Model War |
| Levenshtein Distance (Char.) | 6 | 86 |
| Levenshtein Distance (Words) | 6 | 69 |
| Word Error Rate (WER in %) | 1.452 | 16.707 |
| Char. Error Rate (CER in %) | 0.247 | 3.549 |
| Word Accuracy (Wacc in %) | 98.547 | 83.292 |
| Hits | 2418 | 2335 |
| Substitutions | 3 | 66 |
| Deletions | 2 | 19 |
| Insertions | 1 | 1 |
| Total char. in reference | 2423 | 2423 |
| Total char. in prediction | 2422 | 2405 |
Observations

The MO did pretty well and differed from the reference by only one character. It has a pretty good accuracy and a rather low CER. The MW, on the other hand, is pretty bad: there are a lot of deletions and even more substitutions, and this seems to have impacted a lot of words, because the WER is pretty high for this model (17%). The MO had problems with punctuation, confusing exclamation marks or commas with characters. The MW had a lot of difficulties at the beginning of the page, incorrectly transcribing half of the first two sentences. It then had a lot of difficulty recognizing uppercase letters correctly and, finally, it made a lot of substitutions that do not really make any sense, because the original and substituted characters bear no resemblance at all. Overall, the MW was a big miss on that page.

Letter 722 Page 1
Results
| Metric | Model Other | Model War |
| Levenshtein Distance (Char.) | 9 | 65 |
| Levenshtein Distance (Words) | 6 | 37 |
| Word Error Rate (WER in %) | 3.947 | 24.342 |
| Char. Error Rate (CER in %) | 1.032 | 7.454 |
| Word Accuracy (Wacc in %) | 96.052 | 75.657 |
| Hits | 866 | 811 |
| Substitutions | 5 | 61 |
| Deletions | 1 | 0 |
| Insertions | 3 | 4 |
| Total char. in reference | 872 | 872 |
| Total char. in prediction | 874 | 876 |
Observations

The MO’s prediction has two extra characters and the MW’s has four. The deletions and insertions are pretty low for both. However, the difference in substitutions is high (5 for the MO and 61 for the MW). The MO therefore has a good word accuracy (96%) and a low CER (1%), while the MW’s CER is not too high (7.5%) but its WER is, at 24%. With the versus, we can see that the MO had some difficulties with the uppercase letters of the title and on the number of the letter (it forgot one), while the MW got the whole title wrong and had a lot of difficulties on various words throughout the letter, without any pattern appearing (i.e., specific words or characters).

Letter 1170 Page 3
Results
| Metric | Model Other | Model War |
| Levenshtein Distance (Char.) | 11 | 56 |
| Levenshtein Distance (Words) | 7 | 44 |
| Word Error Rate (WER in %) | 2.661 | 16.73 |
| Char. Error Rate (CER in %) | 0.684 | 3.484 |
| Word Accuracy (Wacc in %) | 97.338 | 83.269 |
| Hits | 1597 | 1553 |
| Substitutions | 7 | 43 |
| Deletions | 3 | 11 |
| Insertions | 1 | 2 |
| Total char. in reference | 1607 | 1607 |
| Total char. in prediction | 1605 | 1598 |
Observations

While the MO was only missing two characters in its prediction, the MW was missing almost ten. The CER of the MW is not too high, but it still impacts a lot of words, because the WER is at 16%. The MO did pretty well, with an accuracy of 97%. The problems in the MO transcription seem to come from the facsimile itself, because some characters were difficult to transcribe since they were hard to decipher in the image. The MW struggled in the same places but also on some random characters, with no real pattern.

Conclusion/General observations

In conclusion, for this test set, we can say that the MO did rather well on the set it was trained on. On the other hand, the MW, which was trained on a different text and with less content, did well on only one of the letters: the only one that was almost all ‘normal’ text, without any specific characters. As soon as we move to more ‘complicated’ letters, with uppercase passages or other particularities, it struggles more, even with characters that gave it no difficulty before.

Set War

Letter 678 Page 1
Results
| Metric | Model War | Model Other |
| Levenshtein Distance (Char.) | 11 | 33 |
| Levenshtein Distance (Words) | 9 | 23 |
| Word Error Rate (WER in %) | 5.202 | 13.294 |
| Char. Error Rate (CER in %) | 1.038 | 3.116 |
| Word Accuracy (Wacc in %) | 94.797 | 86.705 |
| Hits | 1048 | 1028 |
| Substitutions | 10 | 24 |
| Deletions | 1 | 7 |
| Insertions | 0 | 2 |
| Total char. in reference | 1059 | 1059 |
| Total char. in prediction | 1058 | 1054 |
Observations

The MW misses the reference length by only one character, while the MO misses it by five. The MW has few substitutions, which results in a WER of 5%. The MO’s is higher, at 13%, due to more substitutions, but also to deletions and insertions that were not present in the MW’s result. The errors of the MO come from the uppercase letters at the beginning of the page, as well as a few others in the rest of the page. The MW also struggled with the uppercase letters at the beginning of the letter, as well as with some double letters, where it got one of the two right but not the other.

Letter 1000 Page 3
Results
| Metric | Model War | Model Other |
| Levenshtein Distance (Char.) | 8 | 6 |
| Levenshtein Distance (Words) | 8 | 4 |
| Word Error Rate (WER in %) | 2.807 | 1.403 |
| Char. Error Rate (CER in %) | 0.477 | 0.357 |
| Word Accuracy (Wacc in %) | 97.192 | 98.596 |
| Hits | 1669 | 1671 |
| Substitutions | 7 | 6 |
| Deletions | 1 | 0 |
| Insertions | 0 | 0 |
| Total char. in reference | 1677 | 1677 |
| Total char. in prediction | 1676 | 1677 |
Observations

Both the MW and the MO did very well on this transcription and, here, the MO was better than the MW, even though the latter was trained on this image. The CER is pretty low for both and the WER does not go above 3% for the MW and 2% for the MO. The problems come mostly from substitutions. The MW completely messed up the various ‘v’s at the beginning of words at the start of the letter, as well as some uppercase letters. The MO got the page number wrong and then messed up the ‘interior’ characters of two words.

Letter 1367 Page 1
Results
| Metric | Model War | Model Other |
| Levenshtein Distance (Char.) | 19 | 50 |
| Levenshtein Distance (Words) | 18 | 38 |
| Word Error Rate (WER in %) | 6.545 | 13.818 |
| Char. Error Rate (CER in %) | 1.172 | 3.086 |
| Word Accuracy (Wacc in %) | 93.454 | 86.181 |
| Hits | 1603 | 1574 |
| Substitutions | 14 | 41 |
| Deletions | 3 | 5 |
| Insertions | 2 | 4 |
| Total char. in reference | 1620 | 1620 |
| Total char. in prediction | 1619 | 1619 |
Observations

The MO and the MW were both close to the reference in their prediction lengths (off by one character), but they both made a lot of substitutions and a few deletions and insertions. The CER is not too high, but the errors must have been pretty widespread, because the WER is at 6.5% and 14%, which is high for these models. The versus indeed shows that almost every character error, whether from the MO or the MW, falls on a different word. For the MW, the errors came from uppercase letters but also from pretty random characters, where the substitutions do not really make any sense. The MO had big trouble in the opener (header, dateline, title and salutation) and it completely missed the quotation marks at the beginning of some lines, which had not been a problem for the MW. It also struggled with some uppercase letters.

Letter 844 Page 1
Results
| Metric | Model War | Model Other |
| Levenshtein Distance (Char.) | 21 | 40 |
| Levenshtein Distance (Words) | 20 | 32 |
| Word Error Rate (WER in %) | 6.734 | 10.774 |
| Char. Error Rate (CER in %) | 1.213 | 2.312 |
| Word Accuracy (Wacc in %) | 93.265 | 89.225 |
| Hits | 1709 | 1700 |
| Substitutions | 17 | 25 |
| Deletions | 4 | 5 |
| Insertions | 0 | 10 |
| Total char. in reference | 1730 | 1730 |
| Total char. in prediction | 1726 | 1735 |
Observations

Both transcriptions are off compared to their usual results, with WERs of 6.7% and 10.8%. However, the prediction lengths show that while the MW predicted fewer characters (minus 4), the MO predicted more (plus 5). Indeed, they both have almost the same number of deletions (4/5), but the MO had 10 insertions while the MW had none. The MW had trouble with uppercase letters and similar-looking characters (b/p, n/u, c/o). The MO had trouble again in the opener (header, letterhead, dateline and title); it mixed up some similar-looking characters and also randomly added some ghost punctuation (commas or hyphens).

Letter 948 Page 1
Results
| Metric | Model War | Model Other |
| Levenshtein Distance (Char.) | 27 | 59 |
| Levenshtein Distance (Words) | 20 | 35 |
| Word Error Rate (WER in %) | 10.928 | 19.125 |
| Char. Error Rate (CER in %) | 2.423 | 5.296 |
| Word Accuracy (Wacc in %) | 89.071 | 80.874 |
| Hits | 1089 | 1059 |
| Substitutions | 16 | 45 |
| Deletions | 9 | 10 |
| Insertions | 2 | 4 |
| Total char. in reference | 1114 | 1114 |
| Total char. in prediction | 1107 | 1108 |
Observations

The prediction of the MW is not really good (less than 90% accuracy) and the prediction of the MO is pretty bad (barely 80% accuracy). The CER of the MO is also one of the highest of all the predictions (5%; the third highest). It has a lot of substitutions (45) and a few deletions and insertions. The MW also has some big numbers for a model applied to the set it was trained on, with 16 substitutions and 9 deletions. This seems to be due to the title in the opener, which is pretty long and contains a lot of uppercase letters, with which both models struggled, the MO more than the MW: the MW’s version is comprehensible except for the last names, while the MO’s is just gibberish. The MW deleted some characters and punctuation marks and struggled with a few uppercase letters. The MO mixed up some similar characters and completely missed the ‘K’ present on the page.

Conclusion/General observations

In conclusion, with this test set, we can observe that the problems of the MW are not limited to the other set. This reinforces the idea that the problem comes from the quantity gap between the two sets. Trained with more content, the MW might be more effective, on its own set as well as on others. The MO was not too bad, considering that it was trained on a different set with a different lexicon. The main observation is that the training data might lack openers and other kinds of content containing specific characters such as uppercase letters and numbers.

New results: Model War Retrained

The MW has been retrained on its own data, in order to double the input and to see if it becomes more effective or if it simply overfits on the SW and does not improve on the SO.

General

Results
| Metric | Set Other / Model War Retrained | Set War / Model War Retrained |
| Levenshtein Distance (Char.) | 3768 | 197 |
| Levenshtein Distance (Words) | 2512 | 174 |
| Word Error Rate (WER in %) | 15.15 | 2.66 |
| Char. Error Rate (CER in %) | 3.76 | 0.51 |
| Word Accuracy (Wacc in %) | 84.85 | 97.34 |
| Hits | 96697 | 38819 |
| Substitutions | 3031 | 144 |
| Deletions | 461 | 41 |
| Insertions | 276 | 12 |
| Total char. in reference | 100189 | 39004 |
| Total char. in prediction | 100004 | 38975 |
Observations

As far as the prediction length is concerned, the MWR did a little better on its own set, by a dozen characters, but for the SO the prediction shrank by about fifty characters. For the SO, there were a lot more hits (+272) and fewer substitutions (-270). The deletions did not really vary, but there were fewer insertions (-51), which explains the difference in prediction length. The accuracy only improves by one point, which is not a significant change. For the SW, there were more hits (+161) and fewer substitutions (-135), deletions (-26) and insertions (-14), which is consistent with the length of the prediction. The accuracy improved by two points and the CER is very low.

Results by letters

Results
| Metric | 1358_4 | 607_3 | 607_17 | 722_1 | 1170_3 | 678_1 | 1000_3 | 1367_1 | 844_1 | 948_1 |
| Levenshtein Distance (Char.) | 121 | 19 | 84 | 61 | 55 | 8 | 9 | 13 | 6 | 9 |
| Levenshtein Distance (Words) | 43 | 16 | 62 | 37 | 46 | 7 | 8 | 13 | 7 | 8 |
| Word Error Rate (WER in %) | 64.179 | 6.451 | 15.012 | 24.342 | 17.49 | 4.046 | 2.807 | 4.727 | 2.356 | 4.371 |
| Char. Error Rate (CER in %) | 23.679 | 1.204 | 3.466 | 6.995 | 3.422 | 0.755 | 0.536 | 0.802 | 0.346 | 0.807 |
| Word Accuracy (Wacc in %) | 35.82 | 93.548 | 84.987 | 75.657 | 82.509 | 95.953 | 97.192 | 95.272 | 97.643 | 95.628 |
| Hits | 393 | 1558 | 2344 | 814 | 1553 | 1051 | 1668 | 1607 | 1724 | 1105 |
| Substitutions | 109 | 18 | 63 | 54 | 43 | 7 | 9 | 11 | 4 | 7 |
| Deletions | 9 | 1 | 16 | 4 | 11 | 1 | 0 | 2 | 2 | 2 |
| Insertions | 3 | 0 | 5 | 3 | 1 | 0 | 0 | 0 | 0 | 0 |
| Total char. in reference | 511 | 1577 | 2423 | 872 | 1607 | 1059 | 1677 | 1620 | 1730 | 1114 |
| Total char. in prediction | 505 | 1576 | 2412 | 871 | 1597 | 1058 | 1677 | 1618 | 1728 | 1112 |
Observations
  • Set Other: For the letter 1358, the MWR has fewer substitutions and insertions, which lowers the WER by 10 points but still keeps the accuracy below 50%. The versus shows that, while it still did not manage to get any surname right, it did manage to correctly transcribe more instances of ‘Professeur’ than before, which would explain the better metrics. For the two pages from the letter 607, the accuracy improved by two points for both of them and the CER barely moved. With the versus, we can see that it usually struggled with the same parts of the pages as the MW did (uppercase letters, numbers, specific signs, similar-looking letters). For the page 722, the CER went down a little but there was no change at all in the WER. The model did not improve on the opener and still had problems with some characters, but it sometimes made different substitutions than before. Finally, the letter 1170 has a slightly lower CER than with the MW, which can be explained by the fact that the hits, substitutions and deletions are identical while there is one fewer insertion in the MWR’s transcription; the WER, however, is higher, which could be explained by one or two more words being affected in the MWR’s transcription.
  • Set War: For the letter 678, there were fewer substitutions (-3), which improved the CER and the WER. It notably did better with the opener, which is perfectly transcribed this time. For the letter 1000, the WER did not move, even though there was no deletion this time and more substitutions than with the MW. The model still struggles with the uppercase ‘V’, although not in the same words as before, and it got different words wrong than previously. For the letter 1367, the results are better, with a lower CER and WER and fewer substitutions, insertions and deletions. With the versus, we see that the model has recurring problems (the same errors as before) but also made new errors (with numbers, for example) that it did not make last time. For the letter 844, the result is far better than with the MW: the CER went from 1.2% to 0.3% and the WER from 6.7% to 2.4%, due mostly to the substitutions going from 17 to 4. It recognized more words and had fewer problems in the opener, even though it got the letterhead wrong, which it did not with the MW. Finally, the results were also better for the letter 948, which improved greatly, with an accuracy going from 89% to 95.6% and a CER going from 2.4% to 0.8%, thanks also to fewer substitutions (-9), deletions (-7) and insertions (-2). The model still struggles with some elements of the opener but got more of them right than before. It corrected all the errors in the body text that it made before, except for two words where it made the same errors as with the MW.

Conclusion

In conclusion, we can say that the difficulties met by the trained models were due to the lack of certain content more than to a specific vocabulary. There was no pattern of errors linked to the absence of certain words in the transcription. Where the models were wrong, it was mostly due to the lack of uppercase letters in the training data (a problem I had already encountered with my ground truth). Whether with the MO or the MW, the unique political or military words from the opposite set were not particularly badly recognized. Looking at the errors other than those on special characters, I realized that they came mostly from similar-looking characters (b/p, n/u, c/o, etc.). Since those characters are not always misrecognized, the errors might be due to the context, i.e., the characters next to them. In view of that, it might be interesting to do an n-gram analysis and see if we can find a correlation between the misrecognitions and the lack of certain n-gram sequences in the training data.
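
As a first step towards that analysis, here is a minimal sketch of what such a comparison could look like (the file names are hypothetical and stand for plain-text exports of the training ground truth and of a test-set reference):

```python
from collections import Counter
from pathlib import Path

def char_ngrams(text, n=3):
    """Count the character n-grams (trigrams by default) of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Hypothetical plain-text exports of the transcriptions.
train_counts = char_ngrams(Path("ground_truth_war.txt").read_text(encoding="utf8"))
test_counts = char_ngrams(Path("reference_other.txt").read_text(encoding="utf8"))

# N-grams that are frequent in the test set but absent from the training set
# are candidates for explaining context-dependent misrecognitions.
absent_from_training = {ngram: count
                        for ngram, count in test_counts.most_common(200)
                        if train_counts[ngram] == 0}
print(absent_from_training)
```

Cross-checking these missing n-grams against the positions of the errors in the versus texts would show whether the misrecognized characters indeed fall in character sequences that the model never saw during training.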