Parallel sense-annotated corpus ELEXIS-WSD 1.0
Description
“ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.
The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and sentences with fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.”
(Text transcribed from the source record)
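The length and polysemy filter mentioned in the description can be illustrated with a minimal sketch. It is not the procedure the corpus authors used: the function names, the `sense_counts` mapping (lemma to number of senses in some sense inventory), and the lookup by lowercased token are all assumptions made for illustration, and the thresholds are simply the ones quoted in the description.

```python
def count_polysemous(tokens, sense_counts):
    """Count tokens whose entry in the (hypothetical) sense inventory has more than one sense."""
    return sum(1 for token in tokens if sense_counts.get(token.lower(), 0) > 1)


def keep_sentence(tokens, sense_counts, min_words=5, min_polysemous=2):
    """Keep a sentence only if it has enough words and enough polysemous words."""
    if len(tokens) < min_words:
        return False
    return count_polysemous(tokens, sense_counts) >= min_polysemous


# Toy sense inventory, invented for this example.
sense_counts = {"bank": 2, "run": 3, "river": 1}

candidates = [
    ["The", "bank", "will", "run", "today"],   # 5 words, 2 polysemous words
    ["Bank", "run"],                           # too short
    ["The", "river", "is", "very", "long"],    # no polysemous words
]

kept = [tokens for tokens in candidates if keep_sentence(tokens, sense_counts)]
```

With this toy inventory only the first candidate sentence is kept. A real pipeline would need lemmatization and a proper sense inventory per language, which this sketch leaves out.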
Identifier
http://hdl.handle.net/11356/1674
Project pages
https://elex.is/
https://clunl.fcsh.unl.pt/en/projetos/projetos-curso/elexis-european-lexicographic-infrastructure/