Model testing
First batch of tests, to check whether my hypothesis holds: that vocabulary plays a part in the efficiency of the transcription model.
The test sample consists of 18 letters, dated from 1920 to 1923, for a total of 110 pages.
It is split between letters that discuss the events of the war or something related to it, and letters on any other subject, such as family life or politics.
War | Other subjects |
---|---|
Letter 617 | Letter 607 |
Letter 678 | Letter 722 |
Letter 844 | Letter 753 |
Letter 927 | Letter 846 |
Letter 948 | Letter 1029 |
Letter 957 | Letter 1103 |
Letter 1000 | Letter 1170 |
Letter 1364 | Letter 1217 |
Letter 1367 | Letter 1358 |
Results from the KaMi App
The automatic transcription was produced with a model trained on more than 300 pages of ground truth from the PEC corpus; the model has an accuracy of 93,78%.
Letter | Number of pages | Number of lines | Levenshtein Distance (Char.) | Levenshtein Distance (Char.)* | Levenshtein Distance (Words) | Levenshtein Distance (Words)* | WER (%) | WER (%)* | CER (%) | CER (%)* | Word Accuracy (%) | Word Accuracy (%)* |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Letter 607 | – | – | – | – | – | – | – | – | – | – | – | – |
Letter 617 | 3 | 65 | 115 | 73 | 94 | 49 | 15,93 | 8,57 | 3,32 | 2,24 | 84,1 | 91,43 |
Letter 678 | 2 | 36 | 46 | 36 | 44 | 23 | 17,05 | 9,16 | 2,85 | 2,33 | 82,95 | 90,84 |
Letter 722 | 2 | 35 | 53 | 41 | 36 | 29 | 14,06 | 11,65 | 3,41 | 2,75 | 85,94 | 88,35 |
Letter 753 | 3 | 53 | 107 | 81 | 87 | 65 | 18,12 | 13,8 | 3,7 | 2,92 | 81,88 | 86,2 |
Letter 844 | – | – | – | – | – | – | – | – | – | – | – | – |
Letter 846 | 3 | 57 | 98 | 69 | 76 | 50 | 15,32 | 10,18 | 3,21 | 2,36 | 84,68 | 89,82 |
Letter 927 | – | – | – | – | – | – | – | – | – | – | – | – |
Letter 948 | 4 | 76 | 184 | 157 | 131 | 103 | 20,5 | 16,35 | 4,76 | 4,23 | 79,5 | 83,65 |
Letter 957 | 4 | 107 | 171 | 131 | 140 | 89 | 15,01 | 9,75 | 2,92 | 2,35 | 84,99 | 90,25 |
Letter 1000 | 5 | 127 | 122 | 77 | 119 | 56 | 10,24 | 4,93 | 1,74 | 1,16 | 89,76 | 95,07 |
Letter 1029 | 3 | 62 | 78 | 58 | 60 | 42 | 11,65 | 8,32 | 2,41 | 1,86 | 88,35 | 91,68 |
Letter 1103 | 2 | 39 | 69 | 54 | 55 | 41 | 16,77 | 12,81 | 3,47 | 2,87 | 83,23 | 87,19 |
Letter 1170 | 4 | 97 | 131 | 112 | 123 | 85 | 14,32 | 10,04 | 2,44 | 2,19 | 85,69 | 89,97 |
Letter 1217 | 4 | 102 | 173 | 124 | 140 | 81 | 15,52 | 9,35 | 3,13 | 2,39 | 84,48 | 90,65 |
Letter 1358 | – | – | – | – | – | – | – | – | – | – | – | – |
Letter 1364 | 2 | 45 | 50 | 40 | 44 | 31 | 11,83 | 8,61 | 2,22 | 1,87 | 88,17 | 91,39 |
Letter 1367 | 2 | 54 | 161 | 138 | 84 | 72 | 18,97 | 16,67 | 5,97 | 5,39 | 81,04 | 83,33 |
*Diacritics, digits, case and punctuation have been ignored.
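To make the columns above easier to read, here is a minimal sketch of how such scores can be derived from a reference and a predicted transcription. It is not KaMi's own code: the function names (`levenshtein`, `normalize`, `scores`) are hypothetical, and the normalization rules are an assumption about what "ignoring diacritics, digits, case and punctuation" means. The table's figures are consistent with word accuracy being 100 minus the WER, so the sketch uses that relation.

```python
import unicodedata


def levenshtein(ref, hyp):
    """Edit distance between two sequences (a string, or a list of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    """Assumed normalization: drop diacritics, digits, punctuation; lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [c for c in decomposed
            if not unicodedata.combining(c)
            and not c.isdigit()
            and not unicodedata.category(c).startswith("P")]
    # Collapse any whitespace left behind by removed characters.
    return " ".join("".join(kept).lower().split())


def scores(reference, prediction):
    """WER, CER and word accuracy, all in percent of the reference length."""
    ref_words, hyp_words = reference.split(), prediction.split()
    wer = 100 * levenshtein(ref_words, hyp_words) / len(ref_words)
    cer = 100 * levenshtein(reference, prediction) / len(reference)
    return {"WER": round(wer, 2), "CER": round(cer, 2),
            "WordAccuracy": round(100 - wer, 2)}


reference = "L'armée a quitté Paris."
prediction = "L'armee a quitte Paris"
print(scores(reference, prediction))
print(scores(normalize(reference), normalize(prediction)))
```

In this toy example, three small character errors spoil three of the four words, so the raw WER is far higher than the raw CER; once both texts are normalized, the errors disappear. The same mechanism explains why the starred columns are always better than the unstarred ones.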
Results of the experiments
First of all, two letters appear unsuited to the test: their results are markedly worse than those of the rest of the sample. After examining the versus text and the pages on eScriptorium, the cause seems to be the quality of the images and of their segmentation, which were messy and prevented good text recognition. This is why, even though their topic is war, those letters will not be taken into consideration for the study, as they would skew the results.

The best example of a good transcription with strong results is letter 1000. It is the longest letter in the sample, and yet it has the best WER and CER, with or without diacritics, punctuation, case and digits.

In general, the CER is much better than the WER, with a gap of roughly ten points between the two. The CER is usually around 2-3%, which is a fairly good result and suggests that the model's inventory of characters is almost exhaustive, so it is unlikely to misrecognize a character.

Here are the five best letters for each metric:
Best WER
- Letter 1000 (5 pages)
- Letter 1029 (3 pages)
- Letter 1364 (2 pages)
- Letter 722 (2 pages)
- Letter 1170 (4 pages)
Best WER ignoring diacritics, digits, case and punctuation
- Letter 1000 (5 pages)
- Letter 1029 (3 pages)
- Letter 617 (3 pages)
- Letter 1364 (2 pages)
- Letter 678 (2 pages)
Best CER
- Letter 1000 (5 pages)
- Letter 1364 (2 pages)
- Letter 1029 (3 pages)
- Letter 1170 (4 pages)
- Letter 678 (2 pages)
Best CER ignoring diacritics, digits, case and punctuation
- Letter 1000 (5 pages)
- Letter 1029 (3 pages)
- Letter 1364 (2 pages)
- Letter 1170 (4 pages)
- Letter 617 (3 pages)
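The four rankings above can be reproduced from the results table with a simple sort. The sketch below is only an illustration: the `RESULTS` dictionary and the `best` helper are hypothetical names, the values are copied from the table with decimal commas converted to points, and the letters without results are left out.

```python
# Letter number -> (WER, WER*, CER, CER*), in percent, copied from the table.
RESULTS = {
    617:  (15.93,  8.57, 3.32, 2.24),
    678:  (17.05,  9.16, 2.85, 2.33),
    722:  (14.06, 11.65, 3.41, 2.75),
    753:  (18.12, 13.80, 3.70, 2.92),
    846:  (15.32, 10.18, 3.21, 2.36),
    948:  (20.50, 16.35, 4.76, 4.23),
    957:  (15.01,  9.75, 2.92, 2.35),
    1000: (10.24,  4.93, 1.74, 1.16),
    1029: (11.65,  8.32, 2.41, 1.86),
    1103: (16.77, 12.81, 3.47, 2.87),
    1170: (14.32, 10.04, 2.44, 2.19),
    1217: (15.52,  9.35, 3.13, 2.39),
    1364: (11.83,  8.61, 2.22, 1.87),
    1367: (18.97, 16.67, 5.97, 5.39),
}

# Position of each metric inside the tuples above.
METRICS = {"WER": 0, "WER*": 1, "CER": 2, "CER*": 3}


def best(metric, n=5):
    """Return the n letters with the lowest (best) error rate for a metric."""
    column = METRICS[metric]
    return sorted(RESULTS, key=lambda letter: RESULTS[letter][column])[:n]


for metric in METRICS:
    print(metric, best(metric))
```

Sorting in ascending order is the right direction here because WER and CER are error rates: the lower the value, the better the transcription.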