CST Keyword Extractor – META-SHARE

Last view: 2024-07-09

CST Keyword Extractor

CstKeyExt

http://cst.dk/online/keywords/uk/index.html

CST's keyword extractor extracts 20 keywords that characterise an input text. It does so by comparing the words of the input text with 1,500 articles (approx. 900,000 words) published in 1999 in the Danish newspaper Berlingske Tidende. A word is assumed to characterise the text if it occurs in only relatively few of the Berlingske Tidende articles, that is, if it is not an ordinary high-frequency word.

First, the input text is POS-tagged and lemmatised. Then the relative frequency of the nouns is calculated with the well-known weighting algorithm TFIDF, which combines the term frequency (TF) and the inverse document frequency (IDF). The function is capable of discriminating terms that characterise the individual text from terms that characterise the whole document collection or general language:

TFIDF = log10((n/df)*tf)

- where n is the number of documents (in this case 1500 articles from Berlingske Tidende), df is the number of documents in which the term occurs, and tf is the term frequency in the input text.
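
The formula above can be illustrated with a minimal Python sketch. This is not the CstKeyExt implementation: the function names are hypothetical, POS tagging and lemmatisation are assumed to have been done already (each document is a plain list of noun lemmas), and treating a term that never occurs in the reference collection as having df = 1 is an assumption made here only to avoid division by zero.

```python
import math
from collections import Counter


def document_frequencies(reference_docs):
    """For each lemma, count the reference documents that contain it (df)."""
    df = Counter()
    for doc in reference_docs:
        df.update(set(doc))  # set(): count each lemma once per document
    return df


def tfidf_scores(input_lemmas, df, n_docs):
    """Score each lemma of the input text as log10((n / df) * tf)."""
    tf = Counter(input_lemmas)
    scores = {}
    for term, freq in tf.items():
        # Assumption: terms unseen in the reference collection get df = 1.
        doc_freq = max(df.get(term, 0), 1)
        scores[term] = math.log10((n_docs / doc_freq) * freq)
    return scores


def top_keywords(input_lemmas, reference_docs, k=20):
    """Return the k highest-scoring lemmas (ties broken alphabetically)."""
    df = document_frequencies(reference_docs)
    scores = tfidf_scores(input_lemmas, df, len(reference_docs))
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    # Toy reference collection standing in for the 1,500 newspaper articles.
    reference = [["hus", "bil"], ["hus", "vej"], ["hus", "bil", "by"], ["hus"]]
    text = ["hus", "fjord", "fjord", "bil"]
    print(top_keywords(text, reference, k=3))
```

In this toy example "hus" occurs in every reference document, so n/df = 1 and its score is log10(1) = 0, while "fjord" is rare in the reference collection and frequent in the input, so it ranks first.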

Distribution Availability

Available - Restricted Use

Licence

CLARIN ACA - NC

Restrictions: Academic - Non Commercial Use

IPR Holder

University of Copenhagen

Contact Person

Dorte Haltrup Hansen

toolService

Tool

Language Dependent

Input

Media type: Text

Resource type: Corpus

Modality: Written Language

Language: Danish

Resource Creation

Resource Creator

Dorte Haltrup Hansen

Metadata

Created: 12/13/2012

Last Updated: 12/14/2012

Metadata Creator

Dorte Haltrup Hansen

Documentation

Samples Location: http://cst.dk/online...
