LX-Tokenizer


http://lxcenter.di.fc.ul.pt/tools/en/LXTokenizerEN.html

This tool was built to deal with Portuguese-specific issues in a few non-trivial cases involving strings that are ambiguous with respect to tokenization. It segments text into lexically relevant tokens, using whitespace as the separator. In the examples below, the | (vertical bar) symbol is used to mark token boundaries more clearly:

um exemplo → |um|exemplo|

Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:

do → |de_|o|
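
The expansion step can be pictured as a table lookup. The following is a minimal sketch under stated assumptions, not LX-Tokenizer's implementation: the `CONTRACTIONS` table and the function names are hypothetical, and the sketch ignores ambiguous strings and does not restore capitalization.

```python
# Hypothetical, deliberately tiny lookup table; a real lexicon would be far larger.
CONTRACTIONS = {
    "do": ("de", "o"),
    "da": ("de", "a"),
    "no": ("em", "o"),
    "na": ("em", "a"),
}

def expand_contractions(tokens):
    """Replace each known contraction by its two parts.

    The first part gets a trailing underscore, mirroring the notation above.
    Lookup is case-insensitive, so the expanded parts come out in lower case.
    """
    out = []
    for tok in tokens:
        parts = CONTRACTIONS.get(tok.lower())
        if parts is None:
            out.append(tok)
        else:
            first, second = parts
            out.extend([first + "_", second])
    return out

def show_tokens(tokens):
    """Render tokens with | marking the boundaries."""
    return "|" + "|".join(tokens) + "|"

print(show_tokens(expand_contractions("do".split())))            # |de_|o|
print(show_tokens(expand_contractions("casa do João".split())))  # |casa|de_|o|João|
```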

Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:

um, dois e três → |um|,*/|dois|e|três|
5.3 → |5|.|3|
1. 2 → |1|.*/|2|
8 . 6 → |8|\*.*/|6|
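
The spacing marks can be derived by looking at the characters on either side of each symbol. Below is a rough sketch of that idea, assuming the left marker is `\*` and the right marker is `*/`; the regular expression and the names `mark_punctuation`, `LEFT_SPACE` and `RIGHT_SPACE` are illustrative and are not taken from the tool.

```python
import re

LEFT_SPACE = "\\*"   # assumed marker: whitespace to the left of the symbol
RIGHT_SPACE = "*/"   # assumed marker: whitespace to the right of the symbol

# A token is either a run of word characters or one non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def mark_punctuation(text):
    """Split text into tokens and record the spacing around each symbol."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        if re.fullmatch(r"\w+", tok):
            tokens.append(tok)
            continue
        start, end = match.span()
        left = LEFT_SPACE if start > 0 and text[start - 1].isspace() else ""
        right = RIGHT_SPACE if end < len(text) and text[end].isspace() else ""
        tokens.append(left + tok + right)
    return tokens

def show_tokens(tokens):
    """Render tokens with | marking the boundaries."""
    return "|" + "|".join(tokens) + "|"

for sample in ["um, dois e três", "5.3", "1. 2", "8 . 6"]:
    print(sample, "→", show_tokens(mark_punctuation(sample)))
```

Keeping the spacing information on the symbol itself means the original text can be rebuilt from the token list, which a plain whitespace split would not allow.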

Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:

dá-se-lho → |dá|-se|-lhe|-o|
afirmar-se-ia → |afirmar-CL-ia|-se|
vê-las → |vê#|-las|
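
The three clitic cases above can be approximated by splitting on hyphens and checking the pieces against small word lists. This is only a sketch: the lists are hypothetical and incomplete, the rule that an l-form clitic (lo, la, los, las) triggers the # mark is a heuristic inferred from the single example above, and nothing here is claimed to match the tool's actual rules.

```python
# Hypothetical, incomplete word lists used only for illustration.
CLITICS = {"me", "te", "se", "nos", "vos", "lhe", "lhes",
           "o", "a", "os", "as", "lo", "la", "los", "las"}
CONTRACTED_CLITICS = {"lho": ("lhe", "o"), "lha": ("lhe", "a")}
L_FORMS = {"lo", "la", "los", "las"}
# Future and conditional endings that can follow a clitic in mesoclisis.
MESOCLISIS_ENDINGS = {"ei", "ás", "á", "emos", "eis", "ão",
                      "ia", "ias", "íamos", "íeis", "iam"}

def expand_clitics(parts):
    """Return the clitics in parts, or None if some part is not a clitic."""
    out = []
    for part in parts:
        if part in CONTRACTED_CLITICS:
            out.extend(CONTRACTED_CLITICS[part])
        elif part in CLITICS:
            out.append(part)
        else:
            return None
    return out

def detach_clitics(word):
    """Detach hyphenated clitics from a verb form, using the marks shown above."""
    parts = word.split("-")
    if len(parts) < 2:
        return [word]
    # Mesoclisis: stem-clitic(s)-ending, e.g. afirmar-se-ia.
    if len(parts) >= 3 and parts[-1] in MESOCLISIS_ENDINGS:
        clitics = expand_clitics(parts[1:-1])
        if clitics:
            return [parts[0] + "-CL-" + parts[-1]] + ["-" + c for c in clitics]
    # Enclisis: verb-clitic(s), e.g. dá-se-lho.
    clitics = expand_clitics(parts[1:])
    if not clitics:
        return [word]  # hyphenated word with no clitic, left untouched
    verb = parts[0]
    if any(c in L_FORMS for c in clitics):
        verb += "#"  # assumed heuristic for an altered verb form
    return [verb] + ["-" + c for c in clitics]

for sample in ["dá-se-lho", "afirmar-se-ia", "vê-las", "guarda-chuva"]:
    print(sample, "→", "|" + "|".join(detach_clitics(sample)) + "|")
```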

This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:

deste → |deste| when occurring as a Verb
deste → |de_|este| when occurring as a contraction (Preposition + Demonstrative)

This tool achieves an F-score of 99.72%.
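
The two readings of deste can be represented as alternative tokenizations selected by a label. The sketch below assumes that the reading ("verb" or "contraction") is supplied by some earlier decision step; how LX-Tokenizer makes that decision is not described here, and the `AMBIGUOUS` table and function name are hypothetical. The expanded form follows the underscore convention stated above for contractions.

```python
# Hypothetical table: each ambiguous string maps a reading to its tokens.
AMBIGUOUS = {
    "deste": {
        "verb": ["deste"],
        "contraction": ["de_", "este"],
    },
}

def tokenize_ambiguous(word, reading):
    """Return the tokens for word under the given reading.

    Words that are not in the table are returned unchanged.
    An unknown reading for a listed word raises KeyError.
    """
    readings = AMBIGUOUS.get(word.lower())
    if readings is None:
        return [word]
    return list(readings[reading])

print(tokenize_ambiguous("deste", "verb"))         # ['deste']
print(tokenize_ambiguous("deste", "contraction"))  # ['de_', 'este']
```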

LX-Tokenizer was developed and is maintained at the University of Lisbon by NLX, the Natural Language and Speech Group of the Department of Informatics.

Distribution

Availability

Available - Restricted Use

Licence

Proprietary

Restrictions: Academic - Non Commercial Use

User Nature: Academic

Download location: hidden

Distribution Access/Medium: Downloadable

Licensors:

António Branco

Distribution rights holders:

António Branco

IPR Holder

António Branco

Contact Person

António Branco

toolService

Tool

Language Dependent

Input

Media type: Text

Resource type: Corpus

Modality: Written Language

Output

Media type: Text

Resource type: Corpus

Modality: Written Language

Segmentation level: Word

Operation

Operating system: Linux

Evaluation

Evaluated: True

Evaluation measure: Automatic

Resource Creation

Resource Creator

João Ricardo Silva

Metadata

Created: 11/23/2012

Last Updated: 11/23/2012

Source: METANET4U

META-SHARE

Metadata Language: English (en)

Metadata Creator

Catarina Carvalheiro

Version

Version: 1.0

Last Updated: 11/23/2012

Usage

Foreseen Use - Nlp Applications

Use NLP Specific: Parsing, Pos Tagging

Actual Use - Nlp Applications

Use NLP Specific: Parsing, Pos Tagging

Documentation

Tool Documentation: Online

Document Type: Other

Catarina Carvalheiro, LX-Tokenizer Narrative Description, http://194.117.45.19...

Document Type: Masters Thesis

João Silva, Shallow Processing of Portuguese: From Sentence Chunking to Nominal Lemmatization, http://docs.di.fc.ul... , 2007

Document Language: English
