Lancaster University Corpus Linguistics Summer School 2017 – Day 3

You know how some people go to music festivals like Glastonbury and they don’t know which stage to go to first? Well, today is like that for me.

Dr Paul Baker


It is nothing short of seeing Rammstein (or insert your favourite band here): Dr Paul Baker lectures on corpus-based discourse analysis. The lecture hall is full and I manage to grab a good seat in the middle.

Dr Baker’s research focuses on analysing discourse in corpora, developing new methods and trying to be critical and reflective in evaluating research methods. He looks at health leaflets and at various social groups, particularly disempowered groups such as gay people and feminists, as well as at religion and ethnicity.

What is discourse? Dr Baker begins by defining this slippery term, used by many disciplines almost to oblivion. Starting from Stubbs’ definition, “Language above the sentence or above the clause” (1983: 1), Dr Baker expands it with that of Burr, who said that: “One event can have different discourses each with a different story to tell about the world” (1995: 48).

Part of the work is naming discourses: spotting them and discussing which ones are minority and which are dominant, for example those relating to gender and gender relations. To illustrate discourse naming, Baker gives an example from a spoken conversation in the BNC:

David: what would John Major have said?

Patricia: Who’s John Major?

David: The Prime Minister, you dope! <pause 8 secs> Typical woman!

What we can say about this exchange is that it draws on a discourse in which women don’t know about politics but men do, which presents gender relations as unequal. There is also a dominance discourse at work, in which the man dominates the woman.

Critical Discourse Analysis (CDA)

CDA treats language as a social practice and is interested in how ideologies and power relations are expressed through language and through different types of texts. It works on different levels of a text: at word level, but also through intertextuality and interdiscursivity, looking at who created the text and how, at the reception of the text, and at its wider political, historical, social and economic context.

Because there is so much to take into account, CDA analysts must choose, or ‘cherry pick’, texts that illustrate the point the researcher is trying to make. This attracts a lot of criticism for not being neutral in text selection. Corpus linguistics (CL) is useful because it offers a response to this criticism.

Rationale for using CL in CDA

Having a large corpus can help to spot multiple discourses because we can see minority discourses as well as big ones. Additionally, a large size means patterns are representative and this avoids cherry picking texts. Also, procedures are unbiased and we can pay attention to unpredicted patterns.

Research questions that can be asked using CL can encompass the following:

  • How is a particular group or identity represented in the data? What discourses surround that group?
  • How are discourses legitimated (i.e. how do they become dominant)?

Who benefits from the analysis? Is it good or bad for a certain group? Baker says it is not strictly necessary to answer these questions.

There are different ways to approach CL, chiefly the corpus-driven and the corpus-based approach.

  1. Corpus driven approach

We don’t know what we’re hoping to find. We use frequency lists to give us a focus, going into the patterns that come up. You ask questions you weren’t thinking to ask.

  2. Corpus based approach

You have a theory and a set of hypotheses, and you may have certain words to look at. In other words, you run targeted searches to explore your hypotheses.

  3. A combination of both

Each of these two approaches is difficult to achieve on its own. Dr Baker suggests that the best approach is to combine both.

A model for going about research: 

Dr Baker suggests a skeleton by which to approach research, enumerating several steps:

  • Description
  • Interpretation
  • Explanation
  • Evaluation


Examples of research

Example 1: Collocates

Collocates are words which associate with others and “show the associations and connotations they have, and therefore the assumptions which they embody” (Stubbs, 1996: 172). In other words, we can draw on the concept of lexical priming: we are primed, or triggered, to think of the word that collocates. If we see immigrant by itself, we may think of illegal, as in illegal immigrant.
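To make the idea of a collocate concrete, here is a minimal sketch in Python that counts the words appearing within a small window around a node word. The function name `collocates`, the window size and the sample sentences are all my own inventions for illustration; they are not taken from the lecture or from the BNC, and real tools such as BNCweb rank collocates with association statistics rather than raw counts.

```python
import re
from collections import Counter

def collocates(text, node, window=4):
    """Count words that occur within `window` tokens either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# Invented toy sentences, not real corpus data.
sample = (
    "He was an eligible bachelor. "
    "The old bachelor lived alone. "
    "An eligible bachelor is hard to find."
)

# Show the most frequent immediate neighbours of the node word.
for word, freq in collocates(sample, "bachelor", window=1).most_common(3):
    print(word, freq)
```

A raw count like this is dominated by very frequent words such as "the" once the window grows, which is one reason the statistics matter in practice.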

Collocates can reveal ideological assumptions in society. For example, the collocation working mother has no common counterpart in ‘working dad’, and this carries an implication: what mothers do is not work. We are downgrading childcare in society, implying it is not work. As Stubbs says, “if collocations and fixed phrases are repeatedly used as unanalysed units in media discussion and elsewhere, then it is very plausible that people will come to think about things in such terms” (1996: 195). This brings to mind the Sapir-Whorf hypothesis that language determines the way we see the world, but of course that is a different discussion.

Looking at discourses on marriage, particularly about people who do not get married, Baker examines the collocates of the word ‘bachelor’. The collocate eligible brings up a quite positive meaning: the kind of man people would like to marry.

However, bachelor also collocates with old and lonely. So the important thing to look at is how these interact with each other. Baker suggests that collocates are a good start but then looking at concordance lines is better.

Looking at the concordance lines, the first couple suggest that bachelors can’t take care of themselves and can’t cook. We can also see reasons given for why a bachelor is one, such as eccentric habits or shyness. There is also a consequence or judgement attached: bachelors must be lonely. In the concordance line "Falconer was a bachelor but a man in love with life", the word but alone reveals an interesting discourse about how bachelors are viewed.
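A concordance line is simply the node word shown with a stretch of context on either side. The sketch below is a hypothetical keyword-in-context (KWIC) display, not the BNCweb implementation; the function name `kwic` and the `width` parameter are my own, and the only data is the single concordance line quoted above.

```python
import re

def kwic(text, node, width=30):
    """Return one keyword-in-context line per occurrence of `node` in `text`."""
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()].rstrip()
        right = text[match.end():match.end() + width].lstrip()
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}}  [{match.group(0)}]  {right}")
    return lines

sample_line = "Falconer was a bachelor but a man in love with life"

for line in kwic(sample_line, "bachelor"):
    print(line)
```

Lining the node words up in a column is what makes patterns such as the recurring "but" easy to spot when there are dozens of lines rather than one.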

The conclusion is that the contradictory discourses are working together. To be a bachelor when you are young is good. However, when you are a bachelor at an older age, society does not accept this state and paints bachelorhood as a negative state.


Example 2: Keywords

A word is key if it occurs statistically significantly more often in one corpus than in another corpus it is compared against. In his study of fox hunting, set against the background of parliamentary discussion of banning fox hunting, Baker looked at political debates and compared pro- and anti-hunting speeches. He compiled frequency lists for these two corpora and compared them to get the keywords.

For example:

Keep hunting keywords: criminal, fellow, Mr, people.

Baker found that these words connected to create a larger discourse. In the concordance lines we can see that the speakers’ strategy was to speak on behalf of the majority, talking about Britain and its people and constructing Britain as a good place: a nationalistic discourse. A chain of keywords builds this picture of a good place, and banning hunting was presented as going against people’s freedoms.

Ban hunting keywords: barbaric, I, clause, bill, house (parliament), issue, dogs.

Barbaric may seem an obvious keyword, but the concordance lines show that speakers use it together with other words to emphasise their stance. The word ‘sport’ likewise frames hunting as something that is not a necessity. What does barbaric modify? The practice, not the people. The people are missing from the concordance lines, and Baker suggests this is a way of seeming less accusatory. I was used a lot when speakers conveyed their stance on the situation, which connected the person to the argument.
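The post does not record which statistic Baker used to decide that a word is key, so the sketch below assumes log-likelihood, a common choice in corpus tools. The function names and the two toy "corpora" are invented for illustration and have nothing to do with the real parliamentary data.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word seen a times in corpus 1 (size c)
    and b times in corpus 2 (size d)."""
    expected1 = c * (a + b) / (c + d)
    expected2 = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected1)
    if b > 0:
        total += b * math.log(b / expected2)
    return 2 * total

def keywords(target_tokens, reference_tokens, top=5):
    """Rank words that are relatively more frequent in the target corpus."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    n_target, n_reference = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference[word]  # Counter returns 0 for unseen words
        if a / n_target > b / n_reference:
            scored.append((log_likelihood(a, b, n_target, n_reference), word))
    return [word for _, word in sorted(scored, reverse=True)[:top]]

# Invented toy data, far too small for real significance testing.
target = "hunting is barbaric and this barbaric sport must end".split()
reference = "people in this country enjoy hunting and people value freedom".split()

print(keywords(target, reference))
```

With corpora this small the scores mean very little; the point is only the shape of the comparison, where each word's observed frequency is set against what we would expect if both corpora used it at the same rate.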

Example 3: Incorporating techniques from CDA

Intertextuality (mentioning a text within a text, parodying it or quoting from it) is difficult to spot in a corpus, but corpus techniques may help to identify cases. Baker looks at collocates of Muslims + offence (Baker et al. 2013).

Looking at the concordance lines, we can see that they are about Muslims being, or potentially being, offended, rather than offending other people. Baker suggests this is still a negative representation, one that makes Muslims seem to take offence easily. He focuses on this line:

Change r way of life just 2 stop offending muslims. They ain neva gonna change theirs.

This was surprising given that it comes from a newspaper, but looking closely we can see that it comes from a readers’ text message opinion section in the Star paper (25 Oct 2005). The readers seem to be misdirecting their anger. Why was this misrepresentation published?


Corpus-based discourse analysis combines two fields, CL and CDA. It is still a developing area, creating new methods for how things should be done. However, CL does not replace human analysis: human input is crucial to supplement and interpret the results.

After the lecture portion, we headed over to the computer lab for the corpus-based discourse analysis workshop. In this workshop we used the 100-million-word British National Corpus (BNC) to answer this question: How are refugees represented in British English? We looked at collocates and concordances. This was a very useful exercise. Dr Baker and some of the organisers of the summer school supported us throughout and answered my questions, which left me feeling empowered for having mastered the basic use of BNCweb!

A lunch break of fresh vegetables and salads awaited us, and while I was still excited from the earlier session, I managed to learn a little about the NLP (Natural Language Processing) summer school that was taking place at the same time.


Heading over to the next lecture, I’m a little worried because I know nothing about the subject matter. But I need not worry, as Dr Michael Barlow is patient and supportive and his presentation skills keep you interested and alert. The topic is Translation and Parallel Corpora, which deals with corpora of translated texts, such as translated newspapers or books.

From a translation studies point of view, you can take a source text and, say, three translations of it to see how different translators dealt with certain issues in the source text. This helps in studying translation style.

For Barlow, the interest lies in the language focus, where large corpora, for example European Parliament output, aid comparison.

  1. Translation focus asks questions such as:

What are the properties of translated texts? What difficulties are associated with translation for particular language pairs, for example you (pl) versus you (sg)? ParaConc, the software created by Dr Barlow, can help show how different translators dealt with such problems.

A key question is how translated texts differ from non-translated texts. The translation can be influenced by the source language.

  2. Language/culture focus

Larger corpora even out the style of the individual translator, and can be accompanied by contrastive analysis, including grammatical or functional analysis.

Multilingual Concordancing vs Bilingual dictionary

One advantage of using corpora is that you can create a specialized corpus, e.g. an architectural corpus, for a field where there may not be a bilingual dictionary. A corpus offers many examples, collocations and co-text, which brings both advantages and disadvantages: a dictionary is easier to use, while a corpus is messier. Another difficulty is finding the translations and aligning the texts sentence by sentence, a process that is semi-manual in the program and therefore takes time.

A parallel corpus gives a summation of many individual decisions about what is equivalent. Each translator considers all the particular factors associated with any individual word. Using the corpora helps to look for congruence and non-congruence in particular language features such as passives, prepositions and spatial adjectives.

Barlow, a tool designer, demonstrated in the practical session how his ParaConc provides a window on translation data. Good software design makes the tool invisible, but any tool highlights some views of the data and obscures others.
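The idea of a parallel corpus as a sum of many translators' decisions can be sketched in a few lines of Python. This is not how ParaConc works internally; it is a hypothetical illustration that assumes the texts are already aligned sentence by sentence, and the English-Spanish pairs, the function names and the candidate list are all invented for the example.

```python
import re
from collections import Counter

def parallel_search(pairs, term):
    """Return the aligned (source, target) pairs whose source contains `term`."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]

def tally_equivalents(hits, candidates):
    """Count in how many target sentences each candidate translation appears."""
    counts = Counter()
    for _, tgt in hits:
        words = set(re.findall(r"\w+", tgt.lower()))
        for candidate in candidates:
            if candidate in words:
                counts[candidate] += 1
    return counts

# Invented, already-aligned sentence pairs (not real European Parliament data).
pairs = [
    ("Do you agree?", "¿Está usted de acuerdo?"),
    ("I ask you all to vote.", "Les pido a ustedes que voten."),
    ("The session is closed.", "Se cierra la sesión."),
    ("Can you hear me?", "¿Puede usted oírme?"),
]

hits = parallel_search(pairs, "you")
counts = tally_equivalents(hits, ["usted", "ustedes", "tú", "vosotros"])
print(counts.most_common())
```

Searching the English side for "you" and tallying the Spanish side shows how the singular and plural readings were resolved in each sentence, which is the kind of congruence question described above.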

We then head on to the computer lab to try out ParaConc! We work on a small sample of English-Spanish data taken from the European Parliament.


After a short break, in which I get a chance to hear a little bit more from Dr Barlow about ParaConc, I head over to the plenary, the last session of the day. Ian Gregory, working with Chris Donaldson and Joanna Taylor, gives a compelling lecture about a fascinating area that is completely new to me: Corpus Linguistics and GIS (Geographical Information Systems): Landscapes in Lake District writing.

A GIS can be thought of as a database that is able to cope with geographical information. You can create maps based on underlying data in a table. With historical census data, for example, GIS is good with numbers, but until they started this work, it didn’t cope with texts.

In this session, Ian Gregory discusses writing about the Lake District. They digitized 80 texts from 1622 to 1900, about 1.5 million words in total, by a variety of writers such as Daniel Defoe and Wordsworth. The question they set out to answer was how to turn a text into something that can be put in a GIS. Previously, this was done by hand: scanning the text, noting where a place was mentioned and coding the names in XML.
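To give a feel for what coding place names in XML might look like when automated, here is a minimal sketch that looks names up in a small gazetteer and wraps each mention in a tag carrying coordinates. Everything in it is an assumption for illustration: the `place` element and its attributes are hypothetical rather than the project's actual markup, the coordinates are rough values rounded to one decimal place, and the sample sentence is invented.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mini-gazetteer: place name -> (latitude, longitude), rough values.
GAZETTEER = {
    "Keswick": (54.6, -3.1),
    "Grasmere": (54.5, -3.0),
}

def tag_places(text, gazetteer):
    """Wrap each gazetteer name found in `text` in a <place> element."""
    # Try longer names first so a short name never pre-empts a longer one.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    parts, position = [], 0
    for match in pattern.finditer(text):
        lat, lon = gazetteer[match.group(1)]
        parts.append(escape(text[position:match.start()]))
        parts.append(
            f'<place lat="{lat}" lon="{lon}">{escape(match.group(1))}</place>'
        )
        position = match.end()
    parts.append(escape(text[position:]))
    return "".join(parts)

print(tag_places("We walked from Keswick to Grasmere.", GAZETTEER))
```

Once every mention carries coordinates, the tagged text can be loaded as points in a GIS, which is what allows the text to be mapped alongside other geographical data.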

What is the point of putting a text in GIS?

One aim is to develop place-centered reading, that is, close reading of texts associated with particular places. We can also ask what the corpus says about these places, and find out why a theme identified in the corpus is connected with the places in the text.

They searched for three terms: majestic, sublime, and beautiful. They found that each is used in a different way to talk about different places in the Lake District.

You can learn about this more here:


This was one of the longest but most enriching days so far. I can’t wait to see what’s in store for tomorrow. Stay tuned.


