CST's keyword extractor extracts 20 keywords characterising an input text. It does so by comparing the words of the input text with 1,500 articles (approximately 900,000 words) from the Danish newspaper Berlingske Tidende from 1999. A word is assumed to characterise the text if it occurs in relatively few of the Berlingske Tidende articles, i.e. if it is not an ordinary high-frequency word.
First the input text is POS-tagged and lemmatised. Then the relative frequency of the nouns is calculated with the well-known weighting algorithm TF*IDF, which combines the term frequency (TF) and the inverse document frequency (IDF). The function is capable of discriminating terms that characterise the individual text from terms that characterise the whole document collection or general language:
TF*IDF = log10((n/df)*tf)
- where n is the number of documents (in this case the 1,500 articles from Berlingske Tidende), df is the number of documents in which the term occurs, and tf is the term frequency in the input text.
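The scoring and ranking described above can be sketched in Python. This is a minimal illustration of the stated formula, not CST's actual implementation; the function names and the `doc_freqs` lookup table (lemma to number of reference articles containing it) are hypothetical, and POS tagging and lemmatisation are assumed to have happened already.

```python
import math
from collections import Counter

def tf_idf(tf, df, n=1500):
    """TF*IDF as given in the text: log10((n/df) * tf).

    n  -- number of reference documents (1500 Berlingske Tidende articles)
    df -- number of reference documents containing the term
    tf -- term frequency in the input text
    """
    if tf == 0 or df == 0:
        # A term absent from the reference collection has no defined score
        # under this formula; returning 0 here is a simplifying assumption.
        return 0.0
    return math.log10((n / df) * tf)

def extract_keywords(noun_lemmas, doc_freqs, n=1500, top_k=20):
    """Rank lemmatised nouns by TF*IDF and return the top_k as keywords.

    noun_lemmas -- lemmatised nouns from the (already POS-tagged) input text
    doc_freqs   -- hypothetical dict: lemma -> document frequency in the
                   reference collection
    """
    tf = Counter(noun_lemmas)
    scores = {w: tf_idf(c, doc_freqs.get(w, 0), n) for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

For example, a term occurring twice in the input text and in 10 of the 1,500 reference articles scores log10((1500/10) * 2) = log10(300), roughly 2.48, while a term appearing in most articles scores close to zero and is therefore not selected.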