CITlab Keyword Spotting (KWS) can be used on all documents that have been processed with CITlab HTR(+). Searching is currently restricted to a single document per search request.
- Sending a search request:
- List KWS search processes. As those might take some time to finish, watch the “status” field of a process until its value is “Completed”:
GET https://transkribus.eu/TrpServerTesting/rest/kws/queries
- Retrieve the hits of a completed search process:
GET https://transkribus.eu/TrpServerTesting/rest/kws/queries/{myKwsJobId}/hits
The GET requests support paging, e.g. use the query parameters “?index=0&nValues=2” to get the first two elements.
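The URL handling described above can be sketched in Python. This is a minimal illustration, not part of the Transkribus client: the helper names and the job ID "12345" are hypothetical, and the sketch assumes a listed process has already been parsed into a dictionary with a "status" key. It performs no network calls and leaves out authentication.

```python
from urllib.parse import urlencode

# Base URL taken from the requests listed above.
BASE_URL = "https://transkribus.eu/TrpServerTesting/rest/kws/queries"


def hits_url(job_id):
    """URL for retrieving the hits of one KWS search process."""
    return f"{BASE_URL}/{job_id}/hits"


def paged_url(url, index=0, n_values=10):
    """Append the paging parameters 'index' and 'nValues' to a GET URL."""
    return f"{url}?{urlencode({'index': index, 'nValues': n_values})}"


def is_completed(process):
    """True once the "status" field of a process reads "Completed".

    Assumption: 'process' is a dict parsed from the server response.
    """
    return process.get("status") == "Completed"


# "12345" is a made-up job ID used only for illustration.
print(paged_url(BASE_URL, index=0, n_values=2))
# https://transkribus.eu/TrpServerTesting/rest/kws/queries?index=0&nValues=2
print(paged_url(hits_url("12345"), index=0, n_values=2))
# https://transkribus.eu/TrpServerTesting/rest/kws/queries/12345/hits?index=0&nValues=2
```

A client would request the paged list URL repeatedly until is_completed() returns True for the process of interest, and only then request the hits URL.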
Expert syntax
Extended query syntax can be enabled by sending the following parameter map in the POST request that starts the search:
{ "type" : "kwsParameters", "entry" : [ { "key" : "caseSensitive", "value" : "true" }, { "key" : "expert", "value" : "true" } ] }
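For clients that build the request body programmatically, the parameter map can be generated as in the following Python sketch. The function name is hypothetical. The example above only shows the value "true"; that a disabled flag is sent as "false" is an assumption.

```python
import json


def kws_parameters(case_sensitive=False, expert=False):
    """Build the kwsParameters map in the shape shown above.

    Assumption: flags are sent as the strings "true" / "false";
    only "true" is confirmed by the documented example.
    """
    def entry(key, flag):
        return {"key": key, "value": "true" if flag else "false"}

    return {
        "type": "kwsParameters",
        "entry": [
            entry("caseSensitive", case_sensitive),
            entry("expert", expert),
        ],
    }


# Serialise the map for use as the body of the POST request.
print(json.dumps(kws_parameters(case_sensitive=True, expert=True)))
```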
Instead of words, the search patterns are now defined by regular expressions.
To define the part of interest, one must define a named group “KW”. The part of the line matched by this group is returned as the result, e.g.
- date:
.*(?<KW>[0-3][0-9]\.[0-1][0-9]\.[0-9]{4}).*
matches any line containing a date of the form DD.MM.YYYY
- abbreviations:
.*(?<KW>Dr\.|Doctor).*
matches any line containing Doctor or its abbreviation Dr.
- uncertainties:
.*(?<KW>(k|c|che|chh)rist?).*
matches any line containing Old High German spellings of Christ, e.g. kris, krist, crist, cherist, chhrist
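The three patterns above can be tried out locally before sending a search. The sketch below uses Python's re module as a stand-in for the server-side matcher, which is an assumption: the actual engine may behave differently. Python writes named groups as (?P<KW>...), so the helper converts the group syntax first. The function names and sample lines are made up for illustration.

```python
import re


def to_python(pattern):
    """Convert the (?<KW>...) group syntax to Python's (?P<KW>...)."""
    return pattern.replace("(?<KW>", "(?P<KW>")


def keyword(pattern, line):
    """Return the text captured by the KW group, or None if the
    pattern does not match the whole line."""
    match = re.fullmatch(to_python(pattern), line)
    return match.group("KW") if match else None


DATE = r".*(?<KW>[0-3][0-9]\.[0-1][0-9]\.[0-9]{4}).*"
ABBREVIATION = r".*(?<KW>Dr\.|Doctor).*"
UNCERTAIN = r".*(?<KW>(k|c|che|chh)rist?).*"

print(keyword(DATE, "Wien, den 03.11.1848 abends"))  # 03.11.1848
print(keyword(ABBREVIATION, "Herr Dr. Faust"))       # Dr.
print(keyword(UNCERTAIN, "in cherist namen"))        # cherist
print(keyword(DATE, "no date here"))                 # None
```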
In contrast to standard usage of regular expressions, the search patterns have to match the whole line, e.g.
.*[0-9]{4,6}
will match only lines which end with a number of at least 4 digits. To allow arbitrary characters after the digits, one has to add .* at the end:
.*[0-9]{4,6}.*
Analogously,
[0-9]{4,6}.*
matches only lines which begin with at least 4 digits.
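The whole-line rule corresponds to what Python calls re.fullmatch, as opposed to the more familiar re.search. The following sketch illustrates the difference on three made-up lines. It assumes that Python's re module is a reasonable approximation of the server-side matcher for these simple patterns.

```python
import re


def matches_line(pattern, line):
    """True if the pattern matches the entire line (whole-line semantics)."""
    return re.fullmatch(pattern, line) is not None


lines = ["anno 1848", "1848 anno", "im Jahr 1848 geschah es"]

for pattern in (r".*[0-9]{4,6}", r"[0-9]{4,6}.*", r".*[0-9]{4,6}.*"):
    print(pattern, [line for line in lines if matches_line(pattern, line)])
# .*[0-9]{4,6} ['anno 1848']
# [0-9]{4,6}.* ['1848 anno']
# .*[0-9]{4,6}.* ['anno 1848', '1848 anno', 'im Jahr 1848 geschah es']

# With re.search, the bare pattern would find the digits anywhere in the line.
print(re.search(r"[0-9]{4,6}", "im Jahr 1848 geschah es") is not None)  # True
print(matches_line(r"[0-9]{4,6}", "im Jahr 1848 geschah es"))           # False
```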
Standard regular expression features which are supported:
. | any character |
+ | one or more repetitions of the previous literal |
* | zero or more repetitions of the previous literal |
[] | class of characters, e.g. [0-9] any digit between 0 and 9; [aeiou] any vowel; [A-Z] any capital letter |
? | the previous literal is optional |
{X} | repeat previous literal X times |
{X,Y} | repeat previous literal between X and Y times |
| | alternation: a|b means either a or b |
() | grouping: (a|b)c matches ac or bc while a|bc matches a or bc |
\ | escape operator: to match e.g. a + or . one needs to escape it by \+ or \. |
Standard regular expression features which are not supported:
^ | the beginning-of-line anchor is not supported |
[^....] | negated character classes are not supported |
{,Y} {X,} | open repetitions are not supported (instead of {,Y} write {0,Y}) |
$ | the end-of-line anchor is not supported |
[:alpha:] | predefined character classes such as this alphabetic class are not supported |
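A pattern can be checked against the list of unsupported features before it is submitted. The following Python sketch is a rough, client-side heuristic and not part of the KWS service; the function names are hypothetical. It ignores escaped characters and is deliberately conservative: a literal ^ or $ inside a character class is flagged as well.

```python
import re

# (description, detector) pairs for the unsupported constructs listed above.
CHECKS = [
    ("negated character class [^...]", re.compile(r"\[\^")),
    ("predefined character class such as [:alpha:]", re.compile(r"\[:[a-z]+:\]")),
    ("open repetition {,Y} or {X,}", re.compile(r"\{,[0-9]+\}|\{[0-9]+,\}")),
    ("beginning-of-line anchor ^", re.compile(r"(?<!\[)\^")),
    ("end-of-line anchor $", re.compile(r"\$")),
]


def unsupported_features(pattern):
    """Return descriptions of unsupported constructs found in a pattern."""
    bare = re.sub(r"\\.", "", pattern)  # drop escaped characters such as \^ or \$
    return [name for name, detector in CHECKS if detector.search(bare)]


def close_lower_bound(pattern):
    """Rewrite {,Y} as {0,Y}, as recommended in the table above."""
    return re.sub(r"\{,([0-9]+)\}", r"{0,\1}", pattern)


print(unsupported_features(r".*(?<KW>Dr\.|Doctor).*"))  # []
print(unsupported_features(r"^[0-9]{4,}$"))
print(close_lower_bound(r".*[0-9]{,4}.*"))              # .*[0-9]{0,4}.*
```

The second call reports the open repetition and both anchors; dropping the anchors and writing the repetition with explicit bounds yields a pattern that only uses supported features.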