Corpora

CAL2 – L2 Acquisition Corpus

Description: CAL2 compiles the spontaneous production data (written and oral) collected under the Morphology and Syntax in L2 Acquisition project.
Link: http://cal2.clunl.fcsh.unl.pt

CIPM – Digital Corpus of Medieval Portuguese

Description: CIPM consists of texts dating from the 12th to the 16th centuries. It includes prose texts, both literary (hagiographic, historical and travel narratives, doctrinal prose, philosophical treatises, texts of a moralistic and religious nature) and non-literary (private notarial documents, royal documents, wills, charters, i.e., primarily legal documents).
Link: http://cipm.fcsh.unl.pt

CORPORART – PT/IT specialized comparable corpora of Public Art

Description: CORPORART – PT/IT is a bilingual comparable corpus of the Public Art domain. It comprises subcorpora for contemporary European Portuguese and Italian, from 2000 to 2018, covering text types and subdomains representative of the production of specialized texts in this highly interdisciplinary domain.
Link: https://clunl.fcsh.unl.pt/en/online-resources/corpora/corporart-corpus-comparavel-pt-it-de-especialidade-no-dominio-da-arte-publica/

Corpus of Written Narratives PIPALE

Description: The Corpus of Written Narratives is a corpus of texts produced by primary school children (2nd and 3rd grades) obtained within the PIPALE project.
Link: https://pipale.fcsh.unl.pt/corpus-de-narrativas-escritas/

Portuguese Literature Corpus for Distant Reading

Description: The Portuguese Literature Corpus for Distant Reading is a literary corpus of non-canonical novels by Portuguese authors from the period 1840-1920.
Link: https://github.com/COST-ELTeC/ELTeC-por

G&T.Comenta

Description: The G&T.Comenta corpus was created within the G&T.Comenta project for the study and categorization of commentary as a language activity and textual practice. The corpus results from a collection of texts circulating in different media and drawn from different sources.
Link: https://projetos.dhlab.fcsh.unl.pt/s/GTComenta/item

G&T_COMMENTARY_TD

Description: G&T_COMMENTARY_TD is a FAIR-compliant, manually annotated corpus that comprises 82 commentary texts published in Portuguese newspapers and magazines between 2005 and 2016, segmented into 373 Discourse Type (TD) units, following the theoretical framework of Sociodiscursive Interactionism (SDI).
Link: https://zenodo.org/records/18084593

HEREDITermCorpus_en (V0.1)

Description: The HEREDITermCorpus_en_V0.1 compiles a curated selection of texts dedicated to the microbiota-gut-brain axis (MGBA) and its emerging role in neurodegenerative disorders. The dataset comprises 1,060 documents, 234,215 sentences, 4,132,486 words and 6,029,603 tokens.
Link: https://doi.org/10.5281/zenodo.16968962

HEREDITermCorpus_pt (V0.1)

Description: The HEREDITermCorpus_pt_V0.1 compiles a curated selection of texts dedicated to the microbiota-gut-brain axis (MGBA) and its emerging role in neurodegenerative disorders. The dataset comprises 126 documents, 100,610 sentences, 1,999,301 words and 2,665,436 tokens.
Link: https://doi.org/10.5281/zenodo.16969241

MIGRANTE.PT

Description: A product of the EXPRIMI project, MIGRANTE.PT is a European Portuguese corpus for specific purposes of around 1.5 million tokens. It comprises institutional texts that concern the integration of migrants in Portugal and are addressed to those migrants, collected from websites and materials freely available online.
Link: https://clunl.fcsh.unl.pt/en/online-resources/corpora/migrante-pt/

Parallel sense-annotated corpus ELEXIS-WSD 1.0

Description: ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.
Link: http://hdl.handle.net/11356/1674

CLUNL