On Saturday, 12th May, I had the excellent opportunity to join the CASS team and a group of dedicated volunteers in helping to build the new #BNC2014 corpus, which will sample British English as it is used today. This corpus will be an updated version of the first BNC, built in 1994 in collaboration between Oxford University, Lancaster University and the British Library.
Dr. Carmen Dayrell and Dr. Vaclav Brezina, with Prof. Elena Semino
What is the BNC2014?
The British National Corpus 2014 is a project led by Lancaster University, the home of corpus linguistics. The corpus will be open to all and will be used over the next 20 years by researchers and anyone else interested in describing how ‘real-life’ language is used and how it changes over time.
The corpus, modelled after the BNC 1994, will maintain the same size of 100 million words. Written sources such as fiction, academic prose, verse, biographies, newspapers and unpublished materials will make up 90% of the corpus, and spoken materials such as conversations will make up the remaining 10%.
Book entries are a priority at the moment, so we began the training day by scanning fiction books. We went to the Lancaster University library and looked for books that matched the new BNC criteria:
- Books written by British authors of any origin (Pakistani British, Indian British or any other ethnic origin is acceptable), as long as they have received their education in English and can be considered ‘native’ speakers.
- Fiction: both poetry and prose, spanning any subgenre: children's, teen, fantasy, romance, crime, thriller – any!
- Non-fiction – any!
- Date of publication: This is important! For fiction, look for books first published in or after 2010. For non-fiction, books first published before 2010 are acceptable as long as the selected edition was published in or after 2010.
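For anyone who likes to see rules written down as code, the criteria above can be sketched as a small check. This is only my own illustration, not part of the CASS upload system: the `Book` record, its field names and the `is_eligible` function are all hypothetical, and the author criterion is reduced to a single yes/no flag.

```python
from dataclasses import dataclass

CUTOFF_YEAR = 2010  # "in or after 2010", as stated in the criteria


@dataclass
class Book:
    # Hypothetical record; the field names are illustrative only.
    title: str
    is_fiction: bool
    british_author: bool   # author meets the 'British author' criterion
    first_published: int   # year the work was first published
    edition_published: int  # year of the edition being scanned


def is_eligible(book: Book) -> bool:
    """Apply the selection criteria listed above to one book."""
    if not book.british_author:
        return False
    if book.is_fiction:
        # Fiction must be FIRST published in or after 2010.
        return book.first_published >= CUTOFF_YEAR
    # Non-fiction may be older, as long as the selected edition is recent.
    return book.edition_published >= CUTOFF_YEAR
```

Under this sketch, a novel first published in 2008 would be rejected even in a 2012 reprint, while a non-fiction book first published in 2008 would be accepted in its 2012 edition.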
How much to get from each book?
To make the samples representative, we need between 20,000 and 50,000 running words from each book. Taking copyright restrictions into account, we scanned about a third of each book. We divided into three groups, each assigned to scan either the first, middle or end section of a book.
We then uploaded the scanned pages onto the system (head over to the CASS project website, where there are links to click to contribute to the BNC). We could see the corpus growing in real time!
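As a rough illustration of the sampling described above, here is a minimal sketch that splits a book into thirds by page number and checks whether a sample falls within the target word range. The function names are my own, the split assumes a book of at least three pages, and counting words by splitting on whitespace is a simplification rather than the project's actual method.

```python
def section_bounds(total_pages: int, section: str) -> tuple[int, int]:
    """Return the 1-based, inclusive page range for one third of a book.

    `section` is 'first', 'middle' or 'end'. Any leftover pages from the
    integer division go to the end section.
    """
    third = total_pages // 3
    bounds = {
        "first": (1, third),
        "middle": (third + 1, 2 * third),
        "end": (2 * third + 1, total_pages),
    }
    return bounds[section]


def count_running_words(text: str) -> int:
    """Approximate the running word count by splitting on whitespace."""
    return len(text.split())


def within_target(word_count: int, low: int = 20_000, high: int = 50_000) -> bool:
    """Check a sample against the 20,000 to 50,000 word target."""
    return low <= word_count <= high
```

For a 300-page novel, the group assigned the middle section would scan pages 101 to 200 under this scheme.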
What’s next?
The new #BNC2014 is now hungry for your contributions! Here’s how you can feed the corpus:
You can choose to contribute to a collection of tweets, SMS messages, emails and books. Remember that we are interested in the British English variety, so the participants in any data you upload must fit this criterion.
Remember that for tweets, emails and text messages you will need to obtain consent from those you interact with!
Get involved and read more about the project at the CASS Blog!