
We are grateful to all organisations and individuals who have provided/licensed corpus texts for use at the Institute of Language and Communication (ISK) at the University of Southern Denmark.
Credits: For your publications or other references, please use the text and provider details listed below. For annotation and site credits, see also our work credits page.
Please note that corpus search engines are meant to provide researchers with language data and statistics, not running text. Thus, ordinary copyright still holds. This implies for instance that you mustn't try to extract larger, contiguous text portions from any of the corpora.
Danish corpus-sources:
LOKE, an online news and literature magazine, copyright Arne Herløv Petersen
Parliamentary debates, from the Danish Folketing, kindly provided by webmaster Benny Høyer
Udklipsbureauet, prose fiction by Ole Dalgaard
Bar el Gazel, prose fiction by Ole Dalgaard
Litteraturvidenskaben siden nykritikken, by Ole Sauerberg (2000)
Ret og pligt i det 17. århundrede, by Knud E. Korff (1996)
Skalk is a Danish journal of archaeology (ISSN 0560-1894)
Munk-korpus, by Ulrik Petersen (2006)
Europarl is the Danish part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and is available without copyright at his website.
Wikipedia is the Danish part of a nine-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, but statistics from grammatically annotated sentence concordances.
Korpus 90/2000, a mixed-genre "quote corpus", has been compiled by DSL (Det Danske Sprog- og Litteraturselskab) and grammatically annotated with VISL tools in a joint venture framework. The text corpus, Den Danske Ordbogs Citatkorpus (DDOC-korpus), is a subset of Den Danske Ordbog (DDO) korpus. The DDOC corpus was constructed in the following steps: 1. automatic orthographic sentence chunking (though with some errors, due in particular to ambiguous full stops), 2. removal of a random third of the sentences, 3. randomised ordering of the remaining sentences. Another subset of the DDO corpus is Korpus 90, which contains all DDO texts from 1988-92 (25 million running word forms). Korpus 90/2000 is accessible at the website of the Korpus 2000 project.
The Leipzig corpus is the Danish section of the Leipzig Corpora Collection, compiled from Internet sources at the University of Leipzig.
Information is a newspaper corpus, consisting of 14,780 articles from the publicly searchable archive of the Danish daily newspaper Information (1996-2008).
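The three construction steps listed for the DDOC corpus in the Korpus 90/2000 entry above (sentence chunking, removal of a random third, randomised ordering) can be illustrated with a minimal Python sketch. This is not the DSL's actual procedure: the function name `build_quote_corpus`, the `seed` parameter, and the regex-based sentence splitter are all assumptions made for illustration only.

```python
import random
import re


def build_quote_corpus(text, seed=0):
    """Hypothetical sketch: chunk into sentences, drop a random third, shuffle."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable

    # Step 1: naive orthographic sentence chunking. Splitting after every
    # '.', '!' or '?' is an assumption; it will mis-split at abbreviations,
    # which is the kind of full-stop ambiguity the entry mentions.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Step 2: remove a random third of the sentences.
    n_drop = len(sentences) // 3
    dropped = set(rng.sample(range(len(sentences)), n_drop))
    kept = [s for i, s in enumerate(sentences) if i not in dropped]

    # Step 3: randomise the order of the remaining sentences.
    rng.shuffle(kept)
    return kept
```

Under these assumptions, the result is a list of isolated sentences from which the original running text cannot be reassembled, which is in line with the copyright note at the top of this page.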
Portuguese corpus-sources:
Europarl is the Portuguese part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and is available without copyright at his website.
Wikipedia is the Portuguese part of a nine-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, but statistics from grammatically annotated sentence concordances.
CETEMPúblico: A large corpus of European Portuguese (1991-1998), containing articles and other material from the Público newspaper (180 million words). The corpus was compiled by Linguateca and is freely available online. It was morphosyntactically annotated with the PALAVRAS parser as part of the AC/DC project, a joint venture between VISL and the Processamento computacional do português initiative.
CETENFolha: A corpus of Brazilian Portuguese, containing one year's collection of the Folha de São Paulo newspaper (1994), about 25 million words. Like CETEMPúblico, this is a Linguateca corpus, PALAVRAS-annotated within the AC/DC project, and available online.
Some other Portuguese texts at this site are corpus samples that have been tagged with the VISL-tools for testing and evaluation purposes, in cooperation with the following research teams:
speech data: The CORDIAL-SIN project
historical texts: The TYCHO BRAHE Corpus of Historical Portuguese
modern texts: The NILC project