Lancaster University Corpus Linguistics Summer School Day 4

Good morning at Lancaster University is not just a set greeting phrase, although today it’s bittersweet: it is the last day. Going down to the Market Place where we have our breakfast, I’m handed my trusty soya milk, which the staff have kindly prepared for me, and sit down with fellow summer schoolers to hear about their experience this week. Most participants I spoke to said how much they had learnt this week and how useful it had been for them. (Although, Dr Gries, I will not be able to generalise from this claim until I have mastered my statistics!)


The first session (available online here), given by the brilliant Vaclav Brezina, is titled: Collocation networks in discourse. One of the cornerstones of corpus linguistics (CL) is collocation. Collocation is really about relationships between words.

What is discourse? For Brezina, when we talk about discourse from the linguistic perspective, we talk about communicating meaning, not form. We also focus on connections between words, not single words. We focus on repeated patterns that we can see in corpus and concordance lines, not one-off occurrences. Brezina goes on to specify that in order to see the larger picture, and to connect the patterns we see to society, we need both quantitative and qualitative approaches.

So the definition Brezina uses is that “discourse is a network of meaning (linguistic, social, scientific, religious)”. Blommaert (2005:3) sees discourse as a diachronic social, cultural and historical pattern and development of use.

An approach that has become really popular is culturomics, which analyses discourse through the lens of ‘big data’, for example with the Google Books Ngram Viewer. The curve produced by the n-gram viewer presents several problems. The first is that we cannot see examples in detail. What about the term ‘immigrant’ itself: how do we refer to immigrants, what naming strategies are used? The graph does not provide options to explore this further. Brezina notes that we need to go beyond the graph in order to interpret discourse in society.

Brezina’s approach, in response to the n-gram graph, is based on Guardian readers’ comments about waves of immigrants in the UK. It maps collocation networks of the associations that readers make with the word ‘immigrant’ across the corpus, e.g. illegal, influx, large, blame, Eastern European, flocking.

From Brezina’s perspective this is much more interesting than looking at graphs that show words in isolation, because co-occurrence is where meaning is created. Meanings are not something that you can simply look up in the dictionary; they happen in real life, in real time.

One discourse that comes up in relation to immigrant is that rarely is the distinction made between asylum seekers, immigrants and illegal immigrants.

The word ‘blaming’, which collocates with immigrant, makes you think that people blame immigrants for problems. But when you look more closely at the sentences, you can see that readers are challenging this.

Overall, Brezina found that 65% of the portrayals of immigrants in the Guardian data were positive or neutral.

How do we arrive at this? What methodology do we use?

What is going on in the text?

In Wikipedia we look at the word ‘love’ (the node), the social construct of love and the historical construct of love. This word appears many times and we are interested in the associations, the collocations, of love. We need the collocation ‘window’: we go through the texts, looking at the words that appear around the word ‘love’, 2 words to the right and 2 words to the left. In the BNC, love has 22,265 hits in 1,983 different texts in 100M words. Ranked by raw frequency, the top collocates are mostly function words, and from the discourse perspective these offer no information beyond the workings of the language.
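To make the idea of a collocation window concrete, here is a minimal sketch in Python. It is not how any particular corpus tool does it: the function name `window_collocates` and the toy sentence are invented for illustration, and a real analysis would run over properly tokenised corpus text.

```python
from collections import Counter


def window_collocates(tokens, node, span=2):
    """Count the words found within `span` tokens left or right of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` words to the left
        right = tokens[i + 1:i + 1 + span]     # up to `span` words to the right
        counts.update(left + right)
    return counts


# Toy text invented for this sketch (not BNC data).
text = "they fell in love and the love affair was a true love story"
tokens = text.split()

collocates = window_collocates(tokens, "love", span=2)
print(collocates.most_common(3))
```

Even in this tiny example the most frequent neighbours of love are function words (and, the), which is the problem that raw frequency ranking runs into.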

Using a statistical association measure called the MI score gives more meaningful results. Some collocates we can see: affair, fallen, falling, god, fall, making, story, true. This is more socially interesting. Looking at the first of these, affair, we think about what’s behind it. To calculate the MI value for affair we need:


  1. Tokens in corpus: 100M
  2. Node (love): 22k
  3. Collocate (affair): 3k
  4. Node (love)+ collocate (affair): 200

We need to calculate the random co-occurrence baseline: node x collocate / tokens in the corpus. This is how often love and affair would co-occur purely by chance, and it is far less than the 200 co-occurrences we actually get.
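The arithmetic can be sketched with the rounded figures from the list above. This assumes the common definition of MI as the base-2 logarithm of observed over expected co-occurrences; the function names are my own, and some tools also adjust the baseline for the size of the collocation window, which this sketch leaves out.

```python
import math


def expected_cooccurrence(node_freq, collocate_freq, corpus_size):
    """Random co-occurrence baseline: node x collocate / tokens in the corpus."""
    return node_freq * collocate_freq / corpus_size


def mi_score(observed, node_freq, collocate_freq, corpus_size):
    """MI as log2(observed / expected); assumes no window-size correction."""
    expected = expected_cooccurrence(node_freq, collocate_freq, corpus_size)
    return math.log2(observed / expected)


# Rounded figures from the session notes.
corpus_size = 100_000_000   # tokens in corpus
node_freq = 22_000          # love
collocate_freq = 3_000      # affair
observed = 200              # love + affair

expected = expected_cooccurrence(node_freq, collocate_freq, corpus_size)
mi = mi_score(observed, node_freq, collocate_freq, corpus_size)

print(f"expected by chance: {expected:.2f}")
print(f"MI score: {mi:.2f}")
```

With these numbers the chance baseline is well under one co-occurrence, against 200 observed, so the MI score comes out a little above 8.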

The definition of collocation is ‘actual words in habitual company’ (Firth, 1957:14). Criteria for collocations:

  1. distance (span),
  2. frequency and exclusivity (statistics: MI, LL, Log Dice). Words behave in the same way as people in relationships: love and affair are exclusive, they are closely attached. There are therefore different aspects to the relations between collocates, depending on what you focus on.
  3. Dispersion
  4. Directionality: Delta P (asymmetrical collocation). One word may not ‘love’ the other as much as the other way around. E.g. in red herring, red can go with many words, so red does not have a very strong relationship with any one of them. But the other way around, herring predicts red in British English more strongly than red predicts herring.
  5. Type-token distribution – what’s the competition between candidates for collocations
  6. Connectivity – how individual collocations are connected in language and discourse.
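The directionality criterion can be illustrated with a small sketch of Delta P, assuming the usual definition: the probability of the outcome word given the cue word, minus the probability of the outcome word when the cue is absent. The frequencies below are invented purely for illustration and are not taken from the BNC or any other corpus.

```python
def delta_p(cooccurrences, cue_freq, outcome_freq, corpus_size):
    """Delta P(outcome | cue) = P(outcome | cue) - P(outcome | not cue)."""
    p_with_cue = cooccurrences / cue_freq
    p_without_cue = (outcome_freq - cooccurrences) / (corpus_size - cue_freq)
    return p_with_cue - p_without_cue


# Hypothetical counts, made up for this sketch.
corpus_size = 1_000_000
red_freq = 10_000
herring_freq = 500
red_herring = 200   # times the two words occur together

# How strongly does "red" predict "herring"?
red_predicts_herring = delta_p(red_herring, red_freq, herring_freq, corpus_size)
# How strongly does "herring" predict "red"?
herring_predicts_red = delta_p(red_herring, herring_freq, red_freq, corpus_size)

print(f"red -> herring: {red_predicts_herring:.3f}")
print(f"herring -> red: {herring_predicts_red:.3f}")
```

Because red occurs with many other words, its score for predicting herring is low, while herring, which is much rarer, predicts red far more strongly. That asymmetry is what the measure is designed to show.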

How do we make sense of the individual association measures? Which association measures should we use?

Well, it depends, says Dr Brezina. If we imagine one scale of frequency and one of exclusivity, we can place the individual association measures that are typically used in corpus research along these scales.


Back to the idea of connectivity. Our entry point is the node. We then produce more and more collocates. How do we visualise the patterns in the data? GraphColl, part of LancsBox, can produce the graphs. It is a tool for the systematic and transparent study of collocation. You can analyse multiple corpora and see an overview of the uploaded files, and any language is supported through UTF-8. You select your statistical measure of interest. It is also an experimental tool that can help push the boundaries of the field.
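The idea of starting from a node and adding collocates of collocates can be sketched as a simple graph walk. This is only an illustration of the concept, not GraphColl’s implementation: `build_network` and the toy collocate lists are assumptions (the first-order collocates of immigrant come from the examples above, the second-order ones are invented), and in practice each list would come from an association measure with a chosen threshold.

```python
from collections import deque


def build_network(start, collocates_of, max_depth=2):
    """Collect (word, collocate) edges reachable from `start` within `max_depth` steps."""
    edges = set()
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand words at the outer edge of the network
        for collocate in collocates_of.get(word, []):
            edges.add((word, collocate))
            if collocate not in seen:
                seen.add(collocate)
                queue.append((collocate, depth + 1))
    return edges


# Toy collocate lists; second-order collocates are made up for this sketch.
toy_collocates = {
    "immigrant": ["illegal", "influx", "blame"],
    "illegal": ["immigrant", "worker"],
    "influx": ["large"],
}

network = build_network("immigrant", toy_collocates, max_depth=2)
for source, target in sorted(network):
    print(f"{source} -> {target}")
```

Keeping the edges directed means the graph can later show asymmetrical relationships, such as one word predicting another more strongly than the reverse.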

Applications of the tool:

If you are interested in a real-life application you can look at the paper: Brezina, V., McEnery, T., and Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139-173. It looks at pamphlets of the Society for the Reformation of Manners.

After this part of the session we head over to the lab where we try out GraphColl. GraphColl is a free multi-platform tool for the analysis of language.

I tear myself away from the hypnotizing collocation generator (you should try it – it’s addictive!) to have some tea and coffee and head over to the second session.

Professor Elena Semino discusses corpus approaches to metaphor in discourse in the second lecture. She is energetic and passionate and I am fully absorbed.

Starting with definitions, Prof Semino gives her own, which says that metaphor involves “talking, and potentially thinking, about one thing in terms of another. The two ‘things’ are different but we can perceive some similarities or sets of correspondences between them.”

An example from Semino’s own research comes from cancer patients and how they socialise in their treatment. One patient says: “I am fast becoming a chemo veteran”. Here a word from the domain of war is used in relation to cancer support. It is a metaphorical statement that pushes us to look at the similarities. This also relates to a larger pattern in which being ill is perceived as being ‘at war’, especially prevalent in the US and the UK.

Metaphors are important because each metaphor frames the topic we talk about in a different way: some aspects are foregrounded and others are backgrounded, and evaluations become important, as topics may be evaluated differently. In the chemo example, something that is supposed to be good for you, i.e. the treatment, is framed as something negative involving fighting.

An example from communication about science: In 2010 a team of scientists at Newcastle University produced an embryo using material from the fertilised eggs of two different women: the mitochondria in the egg of a woman who carries a mitochondrial genetic disease was replaced with the mitochondria from the egg of a healthy donor who does not carry the genetic disease. This controversial technique led to alarmed reports in the media about embryos having three parents. The press office at Newcastle University produced its own account of the research, in which a particular metaphor is used to help defuse the potential moral panic.

The university, Professor Semino argues, did a good job of arguing for the ‘neutrality’ of the endeavour: “What we’ve done is like changing the battery on a laptop. The energy supply now works properly…”. Using a simile, Dr Turnbull explained the process in a more understandable way. However, the way this is framed undermines the ethical and moral arguments. This is a clear example of a metaphor being chosen to frame a discourse.

What do we use metaphors for? We use them for a whole range of things:

  • Express abstract ideas
  • Explain complex things
  • Express subjective experience
  • Persuade

Metaphor theory

Over the last 30 years the focus has been on pervasive conventional patterns of metaphorical expressions in language use generally, and on their implications for the role of metaphor. Most people had looked at metaphors in literature or political discourses. In 1980, George Lakoff and Mark Johnson argued in their famous book Metaphors We Live By that metaphors are a matter of thinking or cognition as well as a matter of language.

The example is of LIFE IS A JOURNEY metaphors located in various examples:

He’s without direction in his life

I’m at a crossroads in my life

These are conceptual metaphors, whereby one domain of life is partly structured by another domain. In the example above, Journey is the source domain and Life is the target domain. (Watch Lakoff talking about this in the video here.)

From this point of view we can explain the framing power of metaphors. If you see life as a journey, moving forward is positive and standing still is negative because in a journey you want to remain moving. This metaphor foregrounds achieving our aims, change, constant driving forward.

The relevance of corpus methods to the study of metaphors

The amount of attention that has recently been devoted to metaphor is based on the claim that metaphorical expressions are pervasive in language.

The problem is that Lakoff’s examples belong to a tradition where data does not come from naturally occurring language but is made up. Methodologically, this approach has been criticised because it doesn’t provide evidence of frequency. This is where corpus methods are helpful, to test Lakoff’s claims and arrive at generalisations.


In 1993 Lakoff wrote a chapter where he talks about several metaphors he says are conventionally used in English to talk about life. One of them is A PURPOSEFUL LIFE IS A BUSINESS. Here is a quote: “He has a rich life. It’s an enriching experience. I want to get a lot out of life. He’s going about the business of everyday life. It’s time to take stock of my life.” (Lakoff, 1993: 227).

This is a made-up example. Even a cursory glance raises a few points: when we read this, we may impose the metaphor on the example. If we focus on ‘he has a rich life’, we can say that rich is polysemous and can be extended to different areas and domains.

Prof. Semino was doubtful. She does see rich as a metaphor, but is rich related to business? Looking at the concordance for the node ‘rich’ in the BNC, we see several uses relating to quality and quantity. How could you come up with a conceptual metaphor for this? Prof. Semino didn’t find life in the frequency list of collocates of rich. Her conclusion is that the concordance of rich in the BNC reveals that it is conventionally used metaphorically to express abundance, intensity and variety, not just one domain: ABUNDANT/INTENSE/VARIED IS RICH

So, conceptual metaphor theory bases its claims on linguistic evidence and corpus methods provide large amounts of linguistic evidence that can be used to confirm, challenge and refine the claims made by metaphor theory.

Corpus methods and metaphor in health communication: metaphors for cancer

Now we are interested to see metaphors in relation to health. Illness is an individual, personal state of being, which is associated with physical discomfort or pain and feelings of anxiety, fear and shame.

The use of metaphor in communication about illness can both help and hinder well-being. It is well known that in English illness is talked about as an enemy we need to fight and defeat.

Looking at three collocates, battle, war and fight, we can compare collective versus individual enterprise. By looking at corpus data we can begin to see that there isn’t just one military metaphor: within the war/military rhetoric there are different metaphors to talk about illness. For example, war has a collective societal meaning, but battle is almost always an individual experience that tends to go with losing. Fight can be both an individual enterprise and a collective one, it tends to suggest that there is a chance to win, and it isn’t as negative as battle.

A few years ago the project ‘Metaphor in End of Life Care’ (MELC), a project on metaphors for cancer and the end of life, was funded by the ESRC. The team looked at the language of patients, carers and healthcare professionals (HCPs), using semi-structured interviews and online forum posts to compile a corpus of about 1 million words.

They created a 92,000-word sample corpus and manually highlighted the metaphors using eMargin. The method also used Wmatrix to look at how patients talk about HCPs metaphorically. They also concordanced signals of figurativeness, e.g. ‘like’ as a preposition introducing a simile. Wmatrix automatically tags for semantic fields, so it is useful for looking at how a particular topic is talked about metaphorically, for example how bereaved carers express grief metaphorically.
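Concordancing a signal word such as ‘like’ can be sketched as a tiny keyword-in-context (KWIC) search. This is a simplification and not the project’s method: the function `kwic` is invented for illustration, it matches every occurrence of the word rather than only prepositional uses (which would need part-of-speech tagging), and the example sentence is the laptop battery simile quoted earlier.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of `keyword`."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# The simile quoted earlier in this post, split on whitespace.
sentence = "What we've done is like changing the battery on a laptop"
tokens = sentence.split()

hits = kwic(tokens, "like", width=3)
for left, word, right in hits:
    print(f"{left:>20} | {word} | {right}")
```

Each hit still has to be read by a person to decide whether it really introduces a simile, which is why the manual analysis of a sample corpus matters.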


Violence metaphors for cancer have been widely criticised (e.g. Sontag, 1979) for their associations with aggression, the implication that the enemy is inside, and the idea that not getting better is defeat. Journey metaphors have been proposed as a suitable alternative. This has affected policy documents in the UK, which discuss cancer in terms of a journey metaphor as a model of care.

Can we find evidence to support the preference for one metaphor over another?

Using semantic concordances, they identified relevant metaphorical expressions within the semantic domains they studied systematically in the whole corpus.

For the journey metaphors they used lexical concordances, identifying relevant metaphorically used instances of the following lexical items: journey, route, road, path, go/come through, way, step, run, move, etc.

Violence metaphors are used most by patients and least by HCPs. Journey metaphors are used most by patients and least by carers. How do they use them? The researchers were particularly interested in agency. Violence metaphors can be disempowering for patients. This validates the idea that HCPs shouldn’t use this metaphor.

However, some of these metaphors are used in an empowering way: e.g. ‘I am such a fighter’. It is focusing on the person rather than the enemy.

Journey metaphors can be empowering: ‘This river is my path for now but I’m quite excited about the next one. My journey may not be smooth but it certainly makes me look up and take notice of the scenery!’.  This metaphor does more than fighting as it takes into account companions that accompany you on the journey.

However, not all journey metaphors are positive. On an online forum, one writer talks about patients as passengers on a journey whose route they do not choose.

Both metaphors can be used in positive and negative, empowering and disempowering ways.




The third session is about CLARIN: Infrastructural support for the study and use of language as social and cultural data, where Darja Fiser shows us the resources and tools that are out there to use for research.

What’s CLARIN?

It is the Common Language Resources and Technology Infrastructure that provides easy and sustainable access for scholars in the humanities and social sciences to digital language data in written, spoken, video or multimodal form.

CLARIN collects resources of different kinds: newspaper archives, literary texts, parliamentary records and social media data, in many fields.

Services offered are:

CLARIN portal, depositing service, Virtual Language Observatory (for accessing data), easy access to protected resources



Sadly, we arrive at the last session: the panel challenge, consisting of the following distinguished academics:

Karin Aijmer (University of Gothenburg, Sweden)

John Flowerdew (visiting professor at Lancaster University)

Yukio Tono (Tokyo University of Foreign Studies)

Each panel member prepared three discussion topics:

  1. Challenges
  2. Opportunities
  3. Recommendations


  1. Topic: Challenges: what are some of the challenges when using corpus linguistics in research?

JF: One of the challenges is that if you don’t use corpus tools all the time you may forget how to use them. He recommends practising from time to time. Regarding the corpus, most of his work deals with small corpora, which can be adequate depending on your task. One advantage is that you can read the whole corpus rather than only looking at concordance lines; it’s important to know the corpus. It has become more important to include metadata, such as who is speaking, and you can incorporate this into the corpus. Triangulation with ethnography and other context-related investigations is possible.

KA: Has been working mostly on spoken corpora, which present their own challenges. Whether you are interested in prosody will affect the transcription and the corpus, what kind of software you use, and whether you can listen to the texts in the corpus, although KA says this is a difficult thing to do. Generally speaking, the challenge when you work on the basis of spoken corpora is methodological. Take, for example, speech acts, which are function based: with spoken corpora you need to start with form. However, you may then miss out on more indirect ways of expressing, for example, asking someone to do something. The interest in using spoken corpora is in moving away from looking at isolated markers to studying function, but that means that you start from a list of elements you are interested in.

YT: Quotes Stefan Gries’ comments: lack of data or drowning in data, sometimes not enough and sometimes too much. Too many factors determine the results. It is difficult to make good annotation decisions.

(YT) You have to keep learning about statistics and NLP. It’s important to learn how to do things in a corpus-based way, and your research may need to be reshuffled or reconsidered. Continuous learning will be involved.

  2. Opportunities: How can CL contribute to other disciplines?

KA: We have contributed to applied linguistics in general terms, with its advantages and disadvantages. The interdisciplinary approach has been positively accepted in terms of publishing in various journals. The first advice is to read an introductory book and leave time to meet people to discuss methodological problems.

YT: I work in English language teaching in Japan and have given talks about using corpora in teaching, and people working in the educational sector are keen to know what corpora can provide for educational materials. Now it’s not so unusual, but if you work in different areas of the humanities there may be niche areas where corpus information is innovative and new and can shed new light on aspects of your field.

JF: I’m an applied linguist, and most of my work has been trying to solve educational problems for second language users of English. Also with regard to spoken data, I work with a corpus of lectures, studying how lecturers give definitions in lectures, which can feed into lecturer training. I’ve also been interested in signalling nouns and their particular function in academic language. The other side of my work is in CDA: my own approach is that the discourse analyst can provide a first interpretation of socio-political events as they happen, e.g. the politics of Hong Kong.

  3. Recommendations: What are your recommendations to (novice) researchers using CL?

KA: I’d like to come back to triangulation. CL need not be the only method: it can be fruitfully combined with other methods such as interviews, eye-tracking and ethnographic methods. As a young researcher, have fun and be innovative.

JF: Karin told me that in the culture in Sweden you should only work 9-5, you shouldn’t work at weekends and you should take long holidays. More seriously though, as Stefan Gries says: go and study programming and statistics. My approach to statistics is to find someone who knows about it to help me. However, you should be careful, because they’re often not linguists and they often don’t understand the question. Generally, be prepared for rejections and learn from reviews, which help you to improve your work. Look at published work, but don’t replicate what others have done. Originality is very important, and you should also familiarise yourself with the journal.

YT: I’m from Japan, so *laughs* I’m said to be hard working, and I’m really enjoying hearing how Karin works in Sweden. Stefan also said something like: you have to be prepared to annotate 10,000 lines. People who work in chemistry or physics spend a lot of time in the lab, and we have to do the same. I think we need to take his point in order to tackle some of the problems in our areas. We need to learn to apply certain technologies to our data, and a knowledge of statistics can help. But sometimes statistics can be beyond my understanding. Some collaboration with colleagues helps, and to collaborate efficiently you have to understand what you want to do. You have to have a basic knowledge of how text processing should be done to communicate with NLP people.

And at that, the wonderful experience of the Corpus Linguistics Summer School concluded. I would like to thank all the organisers and lecturers for making this such an enriching experience. I hope to be back next year, trying a different strand!

What’s next?

The free MOOC Corpus Linguistics: Method, Analysis, Interpretation offers a practical introduction to the methodology of corpus linguistics for researchers in the social sciences and humanities.

It starts in September and you can register your interest now (CL MOOC).


