27 June 2017
This week, I have been privileged enough to attend the Lancaster University Corpus Linguistics Summer School for Language Studies. What does that mean? Well, here’s what LU has to say:
Corpus Linguistics for Language Studies is aimed at students studying Linguistics or Language studies, who want to develop skills in corpus methods. Sponsored by the ESRC Centre for Corpus Approaches to Social Science (CASS) and UCREL at Lancaster University – one of the world’s leading and longest-established centres for corpus-based research – this summer school is the seventh event in a highly successful series that began in 2011.
The programme consists of a series of intensive two-hour sessions, involving lectures, discussion-oriented sessions and practical sessions in computer labs where participants are trained in the use of corpus tools.
However, I don’t think this description does the programme justice, because it is so much more. This post (and the next three) reports on each day of the summer school, the accommodation, the food and the incredibly interesting and dense schedule carefully put together by the team (thanks to Abi Hawtin and Dr Gablasova, amongst many dedicated organisers).
The Linguini arrived the day before, after a four-hour train journey through sunny and lush northern countryside. In Lancaster, I walked to the town centre to get the bus to the university, where I was warmly greeted by the helpful and smiley accommodation receptionists. My room on campus deserves a compliment: I opted for the single-bed room, which was clean and comfortable, with an excellent large desk, lots of light, and fluffy towels. It gave me the opportunity to relax before the start of the first session.
The next day, breakfast (included in the accommodation package) was served in the Market Place, the campus cafeteria. Of particular interest for me as a vegan linguist was the abundance of fresh fruit and the vegan options for a full English breakfast; the staff had even prepared personalized soya milk for me!
Off to registration and the welcome talk, given by Dr Dana Gablasova. The organisers welcomed us with warm smiles, giving out name tags and programmes, and gave us the opportunity to mingle and get to know the other students. The turnout is universally international (see what I did there? 🙂 ) and multidisciplinary. Over the juice, coffee and fresh fruit provided while we wait, we talk, compare backgrounds and swap arrival train-plane-bus stories, and a supportive atmosphere of academic camaraderie forms.
The first session, held in one of the beautiful lecture halls of the LU management school, is an introduction to understanding statistics for corpus analysis, given by the acclaimed Dr Vaclav Brezina.
In his session, Dr Brezina discusses why statistics is useful for CL, and says that:
“Statistics guides us through the analytic process, it is a discipline that makes sense of quantitative information”. (Brezina, 2017, forthcoming).
“You can think of statistics as learning from experience”. (Tufte)
Asking ‘What do we do with statistics?’, Dr Brezina enumerates a few areas of use:
- Generalise – see the overall picture
- e.g. the use of adjectives by fiction writers – the type of language produced. What does the sample mean?
- Getting the mean: sum of the samples / number of samples = mean
- Median – the middle number in an ordered set of data, this number is claimed to represent the whole set. This is also a useful tool when the sample is skewed.
- We can also find relationships in the data: e.g. fiction writers’ use of adjectives and verbs. What is the relationship between the number sets? We can draw a regression line that shows a tendency: we can see that the more descriptive the style, the fewer verbs are used, and vice versa.
- Building models: factoring in different variables, e.g. what’s the area of GB? We try to find the model that represents the area as well as possible. The model is never perfect: deviation can always be found in the form of an error, which can be captured mathematically and applied to the data.
- Other things we can do: describe and infer.
- Descriptive statistics: talk about data sets, frequencies, distributions, collocations, graphs
- Inferential: statistical tests, p-value, null hypothesis, confidence intervals.
- 1) The first step in analysis is data exploration: what are the main tendencies in the data? We use graphs, means and standard deviations as the basic tools.
- 2) Inferential statistics, where the second question is asked: do we have enough evidence to reject the null hypothesis? Is the effect that we see in the sample due to chance (sampling error), or does it reflect something true about the population? We need to collect more samples in order to see the larger picture; the sample needs to be large enough to say something meaningful about the use of language.
- 3) Effect sizes: an area where we try to capture the strength of the effect. How large is the effect in the sample (a standardised measure)? How strongly are the variables related? There is a shift towards focusing on effect sizes as opposed to simply looking at the p-value.
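To make the descriptive measures above concrete, here is a minimal sketch of my own (not material from the session), using Python’s standard library on a made-up sample of adjective counts:

```python
# Toy sample (invented for illustration): adjectives per 1,000 words
# in seven hypothetical fiction texts, with one skewed outlier.
import statistics

adj_per_1000 = [42, 55, 38, 61, 47, 120, 44]

mean = statistics.mean(adj_per_1000)      # sum of samples / number of samples
median = statistics.median(adj_per_1000)  # middle value of the ordered set
sd = statistics.stdev(adj_per_1000)       # spread around the mean

print(f"mean={mean:.1f}, median={median}, sd={sd:.1f}")
```

Note how the outlier (120) pulls the mean well above the median – exactly why the median is the safer summary when the sample is skewed.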
Corpora and research design: a GIGO warning (Garbage In, Garbage Out)
If we don’t start with good data, then no matter how sophisticated our analysis is, the results will be garbage too – so we need to start by thinking about research design. With that, Dr Brezina continues to discuss other aspects of statistics, such as statistical terminology, frequency, variables and dispersion.
Dr Brezina has engaging and stimulating presentation skills, and I am riveted.
Next is lunch, which is laid out for us on white tablecloth-covered tables. This is another excellent opportunity to get to know each other and hear about the exciting research everyone is doing. The other summer school participants come from disciplines ranging from applied linguistics, phonetics and pragmatics to history, geography and environmental science.
The following session is a practical one given by Dr. Dana Gablasova on corpus methods in language learning and teaching.
Dr Gablasova discusses two corpus methods:
- Frequency information
- Contextualising use of language
Indeed, what information can we get from a corpus?
- Description of language use, detailed and richer
- Description is flexible and provides the opportunity to revisit earlier findings with new data
Picking up a theme mentioned in Dr Brezina’s session, Dr Gablasova asks: what is frequency and why is it important?
- It is important to know which items are frequent because of implications for:
- Processing: what sort of words are encountered more often by language users
- Production: what sort of words are produced more often by language users
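As a toy illustration of my own (not from the session), frequency information of this kind is easy to extract from raw text:

```python
# Count word frequencies in a tiny made-up corpus.
from collections import Counter
import re

corpus = "What a lovely day . It was a lovely , truly lovely wedding ."
tokens = re.findall(r"[a-z']+", corpus.lower())  # crude tokenisation

freq = Counter(tokens)
print(freq.most_common(3))  # the most frequent items come first
```

Real corpus tools do far more (lemmatisation, part-of-speech tagging, normalised frequencies per million words), but a raw frequency list like this is the starting point.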
The example given by Dr Gablasova looks at the lexical item ‘lovely’ and asks: What does the dictionary definition not tell us? Who uses the word? Which gender uses it more? (hint: it is used more by women) Who is the word used about? (you guessed it – used most about women, and about boys up to the age of 10)
We also discussed types of frequency, annotation and pragmatic tagging.
We then moved on to the practical portion of the session in the computer lab, where we tried out LancsBox (which you can try for free!). We created our own corpus and looked at the concordance lines. It was very useful to try out this tool with the support of Dr Gablasova and have the opportunity to ask questions.
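LancsBox itself is a GUI tool, but the concordancing idea it implements can be sketched in a few lines (a rough, hand-rolled illustration of mine, not how LancsBox works internally):

```python
# Produce simple KWIC (keyword-in-context) concordance lines:
# each occurrence of a node word with a window of context on each side.
def kwic(tokens, node, span=3):
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left} [{node}] {right}")
    return lines

tokens = "it was a lovely day and a lovely quiet evening".split()
for line in kwic(tokens, "lovely"):
    print(line)
# it was a [lovely] day and a
# day and a [lovely] quiet evening
```

Reading down the bracketed column is what lets an analyst spot patterns in who and what a word is used about.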
This exciting session was followed by a tea and coffee break, with fresh fruit and pastry to munch on while sharing our newly acquired skills with others.
The plenary lecture, presented by Jonathan Culpeper, attempted to debunk myths about Shakespeare’s language with corpus methods – particularly the myth that Shakespeare coined more words than any other writer. Culpeper’s project aims to produce the first systematic and comprehensive account of Shake’s language using methods derived from CL.
This involves creating an encyclopaedia, comparing Shake’s words with the words of his contemporaries, and identifying semantic patterns in Shake’s writings that create distinct linguistic thumbprints.
In their book Language Myths, Laurie Bauer and Peter Trudgill discuss myths and misconceptions, defined by Culpeper as “beliefs about language that are produced and reproduced within particular communities and become part of a cultural ideology that is used to evaluate language and account for how it is”. Culpeper argues that they are inconsistent with empirically observed linguistic research, and goes on to demonstrate, or ‘debunk’, four myths about Shakespeare’s language. (I will only discuss two here.)
Myth 1: Shake’s language is (wholly) Shake’s language.
Problems with this myth:
- He had no authorial oversight of his work; he didn’t control what was printed
- 18 of the plays had been previously published as Quartos
- Early play texts were fragmented
- In those days plays were collaborated on
- Plagiarism and re-using parts of others’ work were seen as a compliment
- Therefore, Shake’s language is only what survived in literary work, not everyday talk with friends and family
Myth 2: Shake had a larger vocabulary than any other writer
It is claimed that an educated monolingual adult today knows about 9,000–18,000 words (Treffers-Daller, J. and Milton, J. 2013, Applied Linguistics Review 4 (1): 151–172). But… Shakespeare is claimed to have used over 25,000 words (Greenblatt), while Crystal goes for around 20,000 (Crystal, D., 2008).
The problem, Culpeper suggests, lies with counting – what do you count as a word? Hugh Craig claims that the number of different words an author uses is relative to the size of their surviving output: more publications produce more different words.
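The counting problem shows up even in a toy example (my own, with arbitrary tokenisation choices): the ‘vocabulary size’ of the same passage changes depending on what you decide counts as a word.

```python
# Three different "word counts" for the same text, depending on choices.
import re

text = "Love's not Time's fool. Love alters not."
tokens = re.findall(r"[A-Za-z']+", text)

print(len(tokens))       # running words (tokens): 7
print(len(set(tokens)))  # distinct forms, case-sensitive (types): 6

# Fold case and strip the possessive clitic, and the count drops again.
lemmas = {re.sub(r"'s$", "", t.lower()) for t in tokens}
print(len(lemmas))       # 5
```

Scale those choices up across an entire canon and you get headline vocabulary figures that differ by thousands, which is Culpeper’s point.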
Finally, Culpeper concludes by suggesting that while Shakespeare was and is an influential literary icon, he may not be quite the language innovator many take him to be.
The first day of the programme left me thinking about all the wonderful things CL can add as an additional tool to my qualitative projects. I also identified areas (such as statistics for CL) that I need to learn about in order to perform a meaningful study using CL. All that’s left to do is go out and gather some data!
(Day 2 in the next post!)