CITlab Keyword Spotting | API

CITlab Keyword Spotting (KWS) can be used on all documents that have been processed with CITlab HTR(+). Searching is currently restricted to a single document per search request.

  • Send a search request:

POST https://transkribus.eu/TrpServerTesting/rest/kws/queries?collId={myCollectionId}&id={myDocId}&query={searchTerm1}&query={searchTerm2}&query=...

  • List KWS search processes. A search may take some time to finish, so watch the “status” field of a process until its value is “Completed”:

GET https://transkribus.eu/TrpServerTesting/rest/kws/queries

  • Retrieve the hits of a completed search process:

GET https://transkribus.eu/TrpServerTesting/rest/kws/queries/{myKwsJobId}/hits

The GET requests support paging. For example, append the query parameters “?index=0&nValues=2” to get the first two elements.
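The three requests above can be sketched as small URL-building helpers plus a polling loop. This is a minimal sketch, not an official client: the function names are hypothetical, the caller supplies `fetch_status` (a hypothetical callable that returns the current “status” value of one search process), and authentication is left out because it is not covered here.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://transkribus.eu/TrpServerTesting/rest/kws/queries"


def build_search_url(coll_id, doc_id, terms):
    """URL for the POST request; every search term becomes its own 'query' parameter."""
    params = [("collId", coll_id), ("id", doc_id)]
    params += [("query", term) for term in terms]
    return BASE_URL + "?" + urlencode(params)


def build_hits_url(job_id, index=None, n_values=None):
    """URL for the GET request that retrieves hits, with optional paging."""
    paging = [(key, value)
              for key, value in (("index", index), ("nValues", n_values))
              if value is not None]
    url = f"{BASE_URL}/{job_id}/hits"
    return url + ("?" + urlencode(paging) if paging else "")


def wait_until_completed(fetch_status, poll_seconds=5.0, max_polls=120):
    """Poll until the status is "Completed"; return False if max_polls is exhausted.

    fetch_status is a hypothetical callable supplied by the caller.
    """
    for _ in range(max_polls):
        if fetch_status() == "Completed":
            return True
        time.sleep(poll_seconds)
    return False
```

Passing the terms as a list of pairs lets `urlencode` repeat the `query` parameter and percent-encode special characters in each term.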

Expert syntax

Extended query syntax can be enabled by sending a parameter map in the POST request that starts the search:

{
   "type" : "kwsParameters",
   "entry" : [ {
      "key" : "caseSensitive",
      "value" : "true"
   }, {
      "key" : "expert",
      "value" : "true"
   } ]
}
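
A parameter map of this shape can be generated rather than written by hand. The following is a minimal sketch: the helper name `kws_parameters` is hypothetical, and it assumes that a flag which is left out of the map stays disabled, which is not confirmed here.

```python
import json


def kws_parameters(case_sensitive=False, expert=False):
    """Build the parameter map; only enabled flags are emitted (assumption)."""
    flags = {"caseSensitive": case_sensitive, "expert": expert}
    return {
        "type": "kwsParameters",
        "entry": [{"key": key, "value": "true"}
                  for key, enabled in flags.items() if enabled],
    }


# Serialized form, ready to be sent with the POST request.
body = json.dumps(kws_parameters(case_sensitive=True, expert=True), indent=3)
```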

Instead of words, the search patterns are now defined by regular expressions.

To define the part of interest, one must define a group “KW”. As a result, the part of the line that matches this group is returned, e.g.

  • date: .*(?<KW>[0-3][0-9]\.[0-1][0-9]\.[0-9]{4}).* matches any line containing a date of the form DD.MM.YYYY
  • abbreviations: .*(?<KW>Dr\.|Doctor).* matches any line containing Doctor or its abbreviation Dr.
  • uncertainties: .*(?<KW>(k|c|che|chh)rist?).* matches any line containing Old High German spellings for Christ: e.g. kris, krist, crist, cherist, chhrist

In contrast to standard usage of regular expressions, a search pattern has to match the whole line. For example, .*[0-9]{4,6} matches only lines that end with at least 4 digits. To allow arbitrary characters after the digits, add .* at the end: .*[0-9]{4,6}.* Analogously, [0-9]{4,6}.* matches only lines that begin with at least 4 digits.
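
The whole-line rule and the “KW” group can be tried out locally before sending a search. The sketch below emulates them with Python's `re` module; it is only an approximation of the KWS engine, and the helper names are hypothetical. Python spells named groups `(?P<KW>...)`, so the pattern is converted first, and `re.fullmatch` stands in for the requirement that the pattern covers the entire line.

```python
import re


def to_python_regex(kws_pattern):
    """Convert the (?<KW>...) group syntax to Python's (?P<KW>...)."""
    return kws_pattern.replace("(?<KW>", "(?P<KW>")


def match_line(kws_pattern, line):
    """Return the text matched by the KW group, or None if the line does not match.

    If the pattern defines no KW group, the whole line is returned instead.
    """
    match = re.fullmatch(to_python_regex(kws_pattern), line)
    if match is None:
        return None
    return match.groupdict().get("KW", match.group(0))


date_pattern = r".*(?<KW>[0-3][0-9]\.[0-1][0-9]\.[0-9]{4}).*"
found = match_line(date_pattern, "Born on 12.03.1845 in Rostock")

# Without a trailing .* the digits must be the last characters of the line.
ends_with_digits = match_line(r".*[0-9]{4,6}", "anno 1845")
not_at_end = match_line(r".*[0-9]{4,6}", "anno 1845 AD")
```

Here `found` holds the date, `ends_with_digits` holds the whole line, and `not_at_end` is `None` because text follows the digits.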

Standard regular expression features which are supported:

.           any character
+           one or more repetitions of the previous literal
*           zero or more repetitions of the previous literal
[]          class of characters, e.g. [0-9] any digit between 0 and 9; [aeiou] any vowel; [A-Z] any capital letter
?           the previous literal is optional
{X}         repeat previous literal X times
{X,Y}       repeat previous literal between X and Y times
|           a|b means either a or b
()          (a|b)c matches ac or bc while a|bc matches a or bc
\           escape operator: to match e.g. a + or . one needs to escape it by \+ or \.

Standard regular expression features which are not supported:

^           beginning of line is not supported
[^....]     negation in character classes is not supported
{,Y} {X,}   open repetitions are not supported (instead of {,Y}, write {0,Y})
$           end of line is not supported
[:alpha:]   predefined character classes like this alphabetical class are not supported
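
A pattern can be screened for the constructs listed above before it is submitted. The following is a heuristic sketch with hypothetical helper names, not a full regular expression parser: it ignores escaped characters, looks for the unsupported constructs, and rewrites {,Y} as {0,Y} as suggested above.

```python
import re


def find_unsupported(pattern):
    """Return descriptions of unsupported constructs found in the pattern."""
    bare = re.sub(r"\\.", "", pattern)  # drop escaped characters such as \. or \^
    problems = []
    if re.search(r"\[:\w+:\]", bare):
        problems.append("predefined character class")
    if re.search(r"\[\^", bare):
        problems.append("negated character class")
    if re.search(r"\{,\d+\}|\{\d+,\}", bare):
        problems.append("open repetition")
    # Remove character classes so that only anchors outside of them are reported.
    outside = re.sub(r"\[[^\]]*\]", "", bare)
    if "^" in outside:
        problems.append("^ anchor")
    if "$" in outside:
        problems.append("$ anchor")
    return problems


def close_open_lower_bound(pattern):
    """Rewrite {,Y} as {0,Y}; does not handle a partially escaped brace pair."""
    return re.sub(r"\{,(\d+)\}", r"{0,\1}", pattern)
```

Because patterns always have to match the whole line, the ^ and $ anchors can simply be removed from a pattern without changing what it is meant to express.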