Croatian n-grams


hrNgrams

ID:

317

This resource contains sets of n-grams of sizes 1 to 3 computed from the Croatian National Corpus v2.5. N-grams were computed both from lowercased text and from text in its original character case. For every n above one (i.e. for bigrams and trigrams), n-grams were computed in two ways: counting only those that appear within a sentence, and also counting those that span sentence boundaries. For tokenization, a token is a continuous sequence of non-whitespace characters. Punctuation marks are treated as separate tokens, and complex punctuation is tokenized as a sequence of simple punctuation marks. The resource consists of 10 text files, each computed with a different combination of parameters (n-gram length, character case, sentence boundaries). Each line in a file holds one unique n-gram and its absolute frequency in the corpus, separated by a tab character. N-grams are ordered by frequency, from highest to lowest. The n-gram lists were produced using the methodology and tools developed by the CESAR Polish partner IPIPAN.
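The description above can be illustrated with a minimal Python sketch. It is not the IPIPAN tooling, only an approximation under stated assumptions: the function names (`tokenize`, `count_ngrams`, `format_lines`, `parse_line`) are hypothetical, sentences are assumed to be already split, the tokens of an n-gram are assumed to be joined by a single space inside each line, and ties in frequency are broken alphabetically. None of these details are confirmed by the resource description.

```python
import re
from collections import Counter

# Assumption: a run of word characters is one token and every other
# non-whitespace character is a token of its own, so "?!" yields "?" and "!".
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text, lowercase=False):
    """Split text into word tokens and single-character punctuation tokens."""
    if lowercase:
        text = text.lower()
    return TOKEN_RE.findall(text)


def count_ngrams(sentences, n, cross_boundaries=False, lowercase=False):
    """Count n-grams within each sentence or over the whole token stream."""
    tokenized = [tokenize(s, lowercase) for s in sentences]
    if cross_boundaries:
        # Merge all sentences so that n-grams may span sentence boundaries.
        tokenized = [[tok for sent in tokenized for tok in sent]]
    counts = Counter()
    for tokens in tokenized:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def format_lines(counts):
    """Render 'ngram<TAB>frequency' lines, highest frequency first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{ngram}\t{freq}" for ngram, freq in ordered]


def parse_line(line):
    """Read one line back into (list of tokens, absolute frequency)."""
    ngram, freq = line.rstrip("\n").rsplit("\t", 1)
    return ngram.split(" "), int(freq)


if __name__ == "__main__":
    sample = ["Dobar dan.", "Dobar dan svima!"]
    for line in format_lines(count_ngrams(sample, 2)):
        print(line)
```

Calling `count_ngrams` with `cross_boundaries=True` on the same sample additionally counts the bigram formed by the final period of the first sentence and the first word of the second, which is the difference between the two bigram and trigram variants described above.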


Distribution

Availability

Available - Restricted Use

Licence

CC-BY-SA

Restrictions: Attribution, Share Alike

Execution location: hidden

Distribution Access/Medium: Downloadable

Distribution rights holders:

University of Zagreb, Faculty of Humanities and Social Sciences

IPR Holder

University of Zagreb, Faculty of Humanities and Social Sciences

Marko Tadić

Contact Person

Marko Tadić

text

Lexical Conceptual Resource General Information

Machine Readable Dictionary

Encoding

Encoding level: Syntax

Linguistic information: Usage - Collocations, Usage - Frequency

Creation

Creation mode: Mixed

Original Sources

N-gram list generated using the methodology developed by CESAR Polish partners IPIPAN.

Creation Tools

N-gram generating scripts

Monolingual text lexical conceptual resource

Languages

Croatian

Language Script: Latn

Linguality

Linguality type: Monolingual

Size

8,681,475 Entries

Character encoding

UTF-8

Resource Creation

Resource Creator

University of Zagreb, Faculty of Humanities and Social Sciences

Creation started: 10/01/2012

Funding Project

Central and South-East European Resources (CESAR)

URL: http://www.cesar-pro...

Funding Types: EU Funds, National Funds, Own Funds

Funders: European Commission (50%), University of Zagreb, Faculty of Humanities and Social Sciences (50%)

Project duration: 02/01/2011 - 01/31/2013

Metadata

Created: 01/30/2013

Last Updated: 04/08/2013

Metadata Creator

Marko Tadić

Version

Version: 1.0
