In recent years there has been a growing demand for teaching materials that present authentic English. However, for many aspects of English grammar we still know relatively little about how speakers and writers actually use grammatical features.
At the same time, conferences on applied linguistics and teacher development, as well as published material such as books, articles and newsletters, frequently refer to developments and findings in the field of corpus linguistics: the study of language as expressed in samples (corpora) of "real world" text. An increasing number of materials and resources for use in language teaching and learning now boast that they are “corpus-based” or “corpus-informed”.
Put simply, a corpus (plural corpora or corpuses) is a collection of texts, written or spoken, which is stored on a computer. In the past the term was more often associated with a body of work, for example all of the writings of one author. However, since the advent of computers, large amounts of text can be stored and analysed using analytical software.
A corpus-based approach to English grammar means that the grammatical descriptions are based on the patterns of structure and use found in a corpus (a large collection of spoken and written texts, stored electronically and searchable by computer).
Corpora make it possible to describe the ways in which speakers and writers actually use the grammatical resources available, to document the frequency and discourse function of each grammatical feature, and to identify the most important lexico-grammatical associations in English, that is, how a grammatical feature tends to co-occur with a particular set of words. They also enable us to see that there are important grammatical differences among the various types of speech and writing.
In other words, we can look at grammatical constructions, at words and meanings and how they are changing, and at how we use phrases and groups of words. We can look at the frequency of words and see which words are used most commonly in different contexts. We can compare spoken and written English in a corpus to find out whether a particular word or phrase is used more commonly in speech or in writing.
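The idea of lexico-grammatical association, that is, which words tend to co-occur with a given word, can be illustrated with a minimal Python sketch. The sample sentences are invented for illustration and do not come from any real corpus; the function names `tokenize` and `collocates` and the window size are assumptions, and real corpus software uses far larger texts and statistical measures rather than raw counts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens of each hit for `node`.

    Note: this simple version ignores sentence boundaries.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# Invented sample text (an assumption, not real corpus data).
sample = (
    "She made a decision quickly. He made a mistake yesterday. "
    "They made a decision together. We took a decision at last."
)

tokens = tokenize(sample)
result = collocates(tokens, "decision", window=2)

# Show the words that most often appear near "decision".
for word, count in result.most_common(3):
    print(word, count)
```

Even in this tiny sample, "made" turns up near "decision" more often than "took" does, which is the kind of pattern a corpus-based grammar would report on a much larger scale.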
To summarise, a corpus (plural corpora) is a collection of texts, usually stored in computer-readable form. It is processed by computer so as to serve as a basis for linguistic analysis and description. Many of the examples (if not all of them) in modern grammar reference books are taken from multimillion-word corpora of spoken and written English. More and more grammars are now corpus-based.
There are many corpora: some can be bought, some are free and some are not publicly available. Examples of various corpora:
The Cambridge International Corpus
The Longman Spoken and Written English Corpus
Collins Cobuild Bank of English
The British National Corpus
Take, for example, the Cambridge International Corpus. The corpus is international in that it draws on different national varieties of English. It has been put together over many years and is composed of real texts taken from everyday written and spoken English. It contains a wide variety of texts, with examples drawn from contexts as varied as newspapers, popular journalism, advertising, letters, literary texts, debates and discussions, service encounters, university tutorials, formal speeches, friends talking in restaurants, families talking at home, etc.
The benefit of corpus data for teachers and learners is that it provides them with easily accessible information about real language use, frequency and collocation. Before the advent of corpora, teachers had to rely largely
Corpus information is typically presented in the form of concordances. A concordance displays the result of a word search as individual lines of text, with the targeted words aligned in the centre. Information on the minimal context is also presented.
Concordance: a computer technique that allows searches to be conducted in a corpus for specific target words or phrases in their original contextual environments. The most common concordance type is called the KWIC display (KWIC = Key Word in Context), which highlights the chosen keyword in the centre of a line of words with its surrounding context on each side.
Concordancing is a core tool in corpus linguistics.
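A KWIC display of the kind described above can be sketched in a few lines of Python. This is only an illustration: the function name `kwic`, the context width and the sample text are assumptions made for the example, and dedicated concordancing software offers sorting, filtering and much more.

```python
import re

def kwic(text, keyword, width=20):
    """Return one concordance line per whole-word match of `keyword`.

    Each line shows up to `width` characters of context on either side,
    padded so that the keyword is aligned in the same column.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context and left-align the right context.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Invented sample text (an assumption, not real corpus data).
sample = (
    "A fire broke out at the factory on Monday. "
    "Crews fought the blaze for hours. The fire was out by noon."
)

lines = kwic(sample, "fire", width=20)
for line in lines:
    print(line)
```

Because the left context is padded to a fixed width, the bracketed keyword starts in the same column on every line, which is what makes patterns to its left and right easy to scan by eye.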
We can look at a language feature in a corpus in different ways. For example, using a corpus of newspapers, we could examine how many times the words “fire” and “blaze” occur. This will give us quantitative results, that is, the number of occurrences, which we can then compare with frequencies in other corpora, such as casual conversation or general written English. This might lead us to conclude that the word “blaze” is more frequently used in newspaper articles than in general English conversation or writing when talking about destructive outbreaks of fire. However, another approach is to look more qualitatively at how a word or phrase is used across a corpus. To do this, we need to look beyond the frequency of the word's occurrence. Looking at concordance lines can help us do this and see qualitative patterns of use beyond frequency.
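The quantitative comparison of “fire” and “blaze” can be sketched as follows. Because corpora differ in size, raw counts are usually converted to a rate (here, occurrences per thousand words) before they are compared. The two sample texts are invented for illustration and stand in for a newspaper corpus and a conversation corpus; the function name `per_thousand` is likewise an assumption, and the numbers say nothing about real English usage.

```python
import re
from collections import Counter

def per_thousand(text, words):
    """Return the frequency of each word in `words` per 1,000 tokens of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {w: 1000 * counts[w] / total for w in words}

# Invented samples (assumptions, not real corpus data).
news = (
    "Firefighters tackled a huge blaze at a warehouse last night. "
    "The blaze spread quickly before crews brought the fire under control."
)
conv = (
    "We lit a fire in the garden last night. "
    "The fire was lovely and we sat by the fire for hours."
)

targets = ["fire", "blaze"]
news_freq = per_thousand(news, targets)
conv_freq = per_thousand(conv, targets)

# Print the normalised frequencies side by side for comparison.
for word in targets:
    print(f"{word}: news {news_freq[word]:.1f}, conversation {conv_freq[word]:.1f}")
```

In this toy data “blaze” appears only in the newspaper sample. With real corpora the same calculation would be run over millions of words, and the concordance lines would then be read to see how each word is actually used.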