Corpus and CLIL: a possible integration?

Our posts this week all come from the Forum on Research informing materials writing at the MaWSIG Showcase at IATEFL in Liverpool 2019. In today’s post, Andréa Geroldo dos Santos shares her research findings about the extent to which corpora should be used in the design of CLIL materials.

In my talk at the MaWSIG Showcase, I presented a snippet of my academic research as a PhD student at the University of São Paulo, Brazil, based on how CLIL (Content and Language Integrated Learning) materials may be designed taking Corpus Linguistics principles into consideration. I also demonstrated how the results of this research have been put into practice by providing examples of lexical and grammatical activities integrated into relevant content topics.

The task of applying research data to the design of teaching materials may sound like a dream to my fellow materials writers/editors and applied linguistics researchers – the best of both worlds … but it isn’t! Challenges come from all sides.


First, many researchers have noted the importance of using authentic texts and corpora for developing language teaching materials (Tomlinson, 2003; Mishan, 2005). However, Burton (2012) points out that Corpus Linguistics has powerfully influenced the production of ELT reference materials such as dictionaries and grammars, but holds less influence on the development of ELT coursebooks.

Why is that? According to Burton, it is because publishers have no motivation (and face no demand from potential consumers – students and, more specifically, teachers) to innovate in such a way.

The second challenge is a local one, yet it may be true in many other countries: the Brazilian publishing market is very conservative, with most of the locally produced coursebooks still relying on short author-made texts, lists of words to be memorised and long grammar sections with conjugation tables and mechanical exercises.

The third challenge is more specific, as it relates to the CLIL approach. Designing CLIL coursebooks based on corpora is a brand-new area in Brazil, where the CLIL approach has only recently gained ground. In fact, corpus-driven activities are not often used in CLIL environments in European countries either. In these countries, materials are mostly designed using dictionaries and texts translated from the mother tongue (Arhire; Gheorghe; Talabă 2014).

CLIL environments may benefit from corpus-based materials because they make it possible not only to search for specialised vocabulary (language), but also to obtain meaningful information (content) from contextualised authentic texts. In other words, they afford the study of both the language and the content, which is the very nature of teaching CLIL (Coyle; Hood; Marsh 2010). Moreover, they may promote autonomous learning for both teachers and students, inside and outside the classroom.

CLIL and corpus-based activities

Relying on my experience of designing ELT corpus-based materials (Santos 2015), I have focused on developing CLIL materials based on the observation of concordance lines (Johns, 1991; Tribble & Jones, 1997; Berber Sardinha, 2004; Gavioli, 2005) taken from COCA (Corpus of Contemporary American English[1]; Davies, 2008) available online, to approach both lexical and grammatical items. The activities proposed ask students to infer the rules and/or usage of these items as well as practise the relevant content studied.

To illustrate, I briefly demonstrated how I designed a unit of a CLIL coursebook for 13-year-old Brazilian students. This unit focused on travel, and my aim was to approach not only the students’ experiences (very likely, trips to the beach in Brazil, or to Disneyworld in the United States), but also the reasons why people used to travel in the past (mainly to trade goods).

One of the authentic texts of this unit, The Oldest Amusement Park in the World, talks about a Danish amusement park, founded in the 19th century; it provides discussions on history, geography and culture. Lexically speaking, this text helps Brazilian students become aware of how to describe an amusement park using vocabulary such as ride, rollercoaster and big/Ferris wheel, all words that are totally different from the ones in the students’ first language – Portuguese.

My next aim was to present the verbs that collocate with these nouns. A quick search of COCA online showed that the verb ride is the most frequent one with the nouns rollercoaster and big/Ferris wheel, as seen in Figure 1.

Figure 1: COCA interface displaying results (collocates) for ‘Ferris wheel’
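The kind of collocate count that the COCA interface reports can be approximated offline. What follows is only a minimal sketch, not the procedure used in the research: the sample sentences are invented for illustration (they are not COCA data), and the function name, the candidate verb list and the four-word left window are all assumptions.

```python
import re
from collections import Counter

# Invented sample sentences for illustration only; these are NOT COCA data.
SAMPLE_SENTENCES = [
    "We ride the Ferris wheel every summer.",
    "They ride the rollercoaster twice.",
    "She wanted to get on the Ferris wheel.",
    "Children ride the Ferris wheel at night.",
    "He refused to get on the rollercoaster.",
]

# Candidate verbs named in the post; the tuple itself is an assumption.
CANDIDATE_VERBS = ("ride", "get on", "go down")


def count_verb_collocates(sentences, node, verbs=CANDIDATE_VERBS, window=4):
    """Count candidate verbs found within `window` words to the left of `node`."""
    counts = Counter()
    node_tokens = node.lower().split()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i in range(len(tokens) - len(node_tokens) + 1):
            if tokens[i:i + len(node_tokens)] == node_tokens:
                # Text immediately to the left of the node word(s).
                left = " ".join(tokens[max(0, i - window):i])
                for verb in verbs:
                    if re.search(rf"\b{re.escape(verb)}\b", left):
                        counts[verb] += 1
    return counts


if __name__ == "__main__":
    print(count_verb_collocates(SAMPLE_SENTENCES, "Ferris wheel").most_common())
```

This sketch matches base forms only, so inflected forms such as rides or rode would be missed; a real corpus query would normally search by lemma.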

The other most common verbs found in COCA in the context of amusement park rides were get on and go down. To help students notice the three most commonly used verbs in this context, the coursebook presents two activities. The first features gapped sentences describing rides (informed by COCA); the students try to complete them with the correct collocates (ride/get on/go down + nouns). The other involves students writing these collocations into collocation forks, as in Figure 2.

Figure 2: Example of collocation forks used in the coursebook.
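A gapped-sentence activity of this kind could be drafted automatically from example sentences. The sketch below is one possible approach under stated assumptions, not how the coursebook activities were produced: the function name `make_gapped_sentence` is hypothetical, and the example sentence is invented rather than taken from COCA.

```python
import re


def make_gapped_sentence(sentence, collocates, gap="______"):
    """Blank out a collocate verb in `sentence`.

    Longer collocates are tried first, so that a multi-word verb such as
    "get on" is preferred over a shorter verb in the same sentence.
    Returns (gapped_sentence, answer), or (sentence, None) if nothing matches.
    """
    for verb in sorted(collocates, key=len, reverse=True):
        match = re.search(rf"\b{re.escape(verb)}\b", sentence, re.IGNORECASE)
        if match:
            gapped = sentence[:match.start()] + gap + sentence[match.end():]
            return gapped, match.group(0)
    return sentence, None


if __name__ == "__main__":
    # Invented example sentence, for illustration only.
    print(make_gapped_sentence(
        "We ride the Ferris wheel every summer.",
        ["ride", "get on", "go down"],
    ))
```

The answer is returned alongside the gapped sentence so that an answer key can be built at the same time; a teacher would still need to check that each gap has only one plausible answer.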

As a follow-up, students are invited to share their experiences in amusement parks in their country. They then build their own rollercoaster using recyclable materials and discuss centripetal force.

Addressing the challenges

Presented with the challenges mentioned above, I decided to address them by following McCarthy’s advice (2008): invest in teacher training, helping teachers to become familiar with Corpus Linguistics. In this way, they will better understand how they can benefit from using corpus tools that are available on the internet, both for their lessons and themselves. Our pilot group consisted of 50 tutors – all teacher trainers from a CLIL education system in Brazil. First, they took a six-hour online course in Corpus Linguistics. They then attended a practice session on the coursebook and the corpus-based activities (November 2019).

At the end of the practice session, 39 tutors assessed the whole course. Almost all of them (38) stated that the proposed activities could help students understand the content. Most justified their statements by pointing out, for instance, that ‘the use of language and content are integrated in the activities’ or that ‘they emphasise use and usage, not the structure’. Only one tutor seemed not to approve of them, declaring that ‘they were confusing and the steps were not very clear. Furthermore, the students need visual support of the contents.’

Overall, the partial results obtained would appear to show that, despite the challenges encountered, the development of CLIL teaching materials can be achieved more effectively by using corpora. However, there is plenty of work to do, both in the areas of publishing and teacher training.

Andréa Geroldo dos Santos has been teaching English for 25 years. She has also been a freelance ELT writer and editor for eight years. She holds an MA in English Language and Literature from the University of São Paulo, Brazil, and is a doctoral student at the same university, with a focus on developing ELT and CLIL materials for Brazilian students, based on the principles of Corpus Linguistics. She holds CPE and ‘Train the Trainer’ certificates.


Arhire, M., Gheorghe, M., & Talabă, D. (2014). A corpus-based approach to content and language integrated learning. Conference proceedings. ICT for language learning. 22–25.
Berber Sardinha, A. P. (2004). Linguística de Corpus. Barueri, SP: Manole.
Burton, G. (2012). Corpora and coursebooks: destined to be strangers forever? Corpora Vol. 7 (1): 91–108.
Coyle, D., Hood, P., & Marsh, D. (2010). CLIL: Content and Language Integrated Learning. Cambridge: Cambridge University Press.
Davies, M. (2008–) The Corpus of Contemporary American English (COCA): 560 million words, 1990–present. Available online at
dos Santos, A.G. (2015). Developing ELT coursebooks with corpora: the case of ‘Sistema Mackenzie de Ensino’. In: Corpus Linguistics 2015. Lancaster: UCREL, 2015. pp. 293–294.
Gavioli, L. (2005). Exploring Corpora for ESP Learning. John Benjamins Publishing. Studies in Corpus Linguistics, Vol.21.
Johns, T. (1991). Should you be persuaded: two samples of data-driven learning materials. In: Johns, T. & King, P. (eds.) Classroom Concordancing. ELR Journal 4. University of Birmingham. 1–16.
Mishan, F. (2005). Designing authenticity into language learning materials. Bristol: Intellect Books.
McCarthy, M. (2008). Language Teaching Vol. 41 (4): 563–574.
Tomlinson, B. (2003). Developing materials for language teaching. London: Continuum.
Tribble, C. & Jones, G. (1997). Concordances in the classroom. A resource guide for teachers. Houston: Athelstan Publications.
[1] By Mark Davies. Available at:

