Wmatrix corpus analysis and comparison tool
Wmatrix is a software tool for corpus analysis and comparison. It provides
a web interface to the USAS and
CLAWS corpus annotation tools, and
standard corpus linguistic methodologies such as frequency lists and
concordances. It also extends the keywords method to key grammatical
categories and key semantic domains.
Wmatrix allows the user to run these tools via a web browser such as Opera, Firefox or Internet Explorer,
and so will run on any computer (Mac, Windows, Linux, Unix) with a web browser and
a network connection.
Wmatrix was initially developed by Paul Rayson
in the
REVERE project,
extended and applied to corpus linguistics during his PhD work,
and is still being updated regularly. Earlier versions were available for Unix via
terminal-based command line access (tmatrix) and for Unix via Xwindows (Xmatrix),
but these only offered retrieval of text pre-annotated with USAS and CLAWS.
Introduction to Wmatrix
Folders
Wmatrix users can upload their own corpus data to the system,
so that it can be automatically
annotated and viewed within the web browser.
Each file is stored in a folder (equivalent to a folder in Windows
or directory on Unix).
Input format guidelines
The analysis may be improved with some pre-editing of the input text,
although pre-editing is not normally required. There are
guidelines
provided for texts to be tagged by CLAWS. Most important is the replacement
of less-than (<) and greater-than (>) characters by the corresponding SGML entity
references (&lt;) and (&gt;) respectively.
The text may contain well-formed HTML, SGML or XML tags. If the text
contains less-than or greater-than symbols in formulae, for example,
then CLAWS may mistake large quantities of the following text for SGML tags,
or fail to POS tag the file.
The guidelines mention start and end text markers, but these are not required
since they are inserted for you by Wmatrix.
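The replacement described above can be sketched in a few lines of Python. This is only an illustration of the pre-editing step, not part of Wmatrix; the function name is hypothetical, and it assumes the text contains no markup that should survive as tags, since every angle bracket is converted.

```python
def escape_angle_brackets(text: str) -> str:
    """Replace literal < and > with the entity references &lt; and &gt;.

    Assumption: the input is plain text. Any genuine HTML, SGML or XML
    tags in it would also be escaped, so tagged text should not be
    passed through this function.
    """
    return text.replace("<", "&lt;").replace(">", "&gt;")


sample = "The result holds if x < 5 and y > 2."
print(escape_angle_brackets(sample))
```

Running this on a text with formulae before upload should avoid the situation where CLAWS mistakes the following text for SGML tags.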
Tag wizard
Wmatrix users can upload their file and complete the
automatic tagging process by clicking on the tag
wizard. Once the file has been uploaded to the web server, it is POS tagged by
CLAWS
and semantically tagged by
USAS.
This process can be carried out step by step starting
with the 'load file without tagging' option in the advanced interface.
As a shortcut you can simply upload frequency profiles
if you have them.
The format for a frequency list is a very simple two column format
with a total line at the head of the file. You can
see an example of this. The column widths are not
significant.
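As a rough sketch, a frequency profile of this kind could be produced from raw text as follows. The details are assumptions, not the documented Wmatrix format: the column order (frequency, then item), the wording of the total line, the tab separator and the crude tokenisation are all chosen for illustration, so the output should be checked against the example file before uploading.

```python
import re
from collections import Counter


def build_frequency_profile(text: str) -> str:
    """Return a two-column frequency list with a total line at the head.

    Assumptions (hypothetical, check against the example file):
    frequency comes first, the item second, and the total line is
    labelled TOTAL.
    """
    # Crude tokenisation: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)

    lines = [f"{len(tokens)}\tTOTAL"]
    # Sort by descending frequency, then alphabetically for ties.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"{freq}\t{word}")
    return "\n".join(lines) + "\n"


print(build_frequency_profile("the cat and the dog"))
```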
My Tag Wizard
My Tag Wizard is a variant of the tag wizard which allows you to
override or extend the system dictionaries for your own data. There are
two main uses. First, you can override the current most likely tag for any
word or MWE. Second, you can extend the dictionaries in terms of coverage
of vocabulary and tagset. For example, you can create a new tag by
listing the words and MWEs that you wish to be tagged with it.
Viewing folders
By clicking on the folder name, the user can see its contents.
Following the application
of the tag wizard, the folder contains the original text, POS and semantically tagged
versions of that text, and a set of frequency profiles.
Simple and advanced interfaces
The user can toggle between simple and advanced interfaces in Wmatrix.
The advanced interface offers more options and more control over the data.
Frequency profiles
From the folder view, the user can click on a frequency list to see the
most frequent items in their corpus.
Frequency lists are available for words in the simple interface, and in the advanced interface
for POS tags and semantic tags.
The lists can be sorted alphabetically or by frequency.
Concordances
From the frequency list view, the user can click on 'concordance' and see standard
concordances. These can show the usual word based concordance as well as
all occurrences for words in one POS or semantic category.
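The idea behind both kinds of concordance can be illustrated with a small Python sketch. It is not how Wmatrix is implemented; the function name and the sample data are hypothetical, and the tags in the sample were written by hand in the style of CLAWS tags rather than produced by CLAWS.

```python
def concordance(tagged, matches, width=3):
    """Return (left context, node, right context) rows.

    tagged  : list of (word, tag) pairs
    matches : function (word, tag) -> bool selecting the node items
    width   : number of context words kept on each side
    """
    words = [word for word, _ in tagged]
    rows = []
    for i, (word, tag) in enumerate(tagged):
        if matches(word, tag):
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, word, right))
    return rows


# Hand-tagged sample (an assumption for illustration only).
tagged = [("the", "AT"), ("cat", "NN1"), ("sat", "VVD"),
          ("on", "II"), ("the", "AT"), ("mat", "NN1")]

# Word-based concordance for "the".
for row in concordance(tagged, lambda w, t: w == "the", width=2):
    print(row)

# Concordance for every word in one POS category.
for row in concordance(tagged, lambda w, t: t == "NN1", width=2):
    print(row)
```

Passing a test on the tag instead of the word is what turns the usual word concordance into a category concordance; a semantic tag could be used in the same way.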
Key words, key POS and key domains: comparison of frequency lists
From the folder view, the user can click on compare frequency list to
perform a comparison of the frequency list for their corpus against another larger
normative corpus such as the BNC sampler, or against another of their own texts
(once that text has been loaded into Wmatrix). This comparison can be carried out
at the word level to see keywords, or at the POS (in the advanced interface), or at the
semantic level (to see key concepts or domains). The log-likelihood statistic is employed by
Wmatrix. For more details, see the log-likelihood calculator.
In the simple interface, word and tag clouds are shown,
in which larger font sizes indicate more significant differences.
In the advanced interface, more detailed frequency information is
also displayed in table form.
The key comparison shows the most significant key items
towards the top of the list, since the results are sorted on the LL
(log-likelihood) field, which shows how significant the difference is.
Look only at items with a '+' code, since this shows overuse
in your text compared with the corpus you compared it against. For a
difference to be statistically significant, look at items with an LL value
over about 7, since 6.63 is the cut-off for 99% confidence of
significance (p < 0.01, 1 degree of freedom).
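For readers who want to check a single value by hand, the following Python sketch computes the commonly used two-corpus form of the log-likelihood statistic from an item's frequency in each corpus and the two corpus sizes. It is an illustration under stated assumptions, not Wmatrix's own code: the function names are hypothetical, and the '+' and '-' codes are assigned here simply by comparing relative frequencies.

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood for one item.

    a, b : frequency of the item in corpus 1 and corpus 2
    c, d : total size (in words) of corpus 1 and corpus 2
    """
    # Expected frequencies if the item were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if a > 0:
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2 * total


def use_code(a, b, c, d):
    """'+' if the item is relatively more frequent in corpus 1, else '-'."""
    return "+" if a / c > b / d else "-"


# 30 occurrences in a 1,000-word text against 10 in a 1,000-word reference.
ll = log_likelihood(30, 10, 1000, 1000)
print(round(ll, 2), use_code(30, 10, 1000, 1000))  # about 10.46, '+'
print(ll > 6.63)  # above the p < 0.01 cut-off
```

An item with the same relative frequency in both corpora gives an LL of 0, and the value grows as the observed frequencies move away from the expected ones.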
N-grams and c-grams
Recurrent sequences of words are called n-grams in Wmatrix. These are similar
to clusters in WordSmith and lexical bundles in Biber's work. You can calculate
n-grams of length 2 to 5 for each text. Collapsed-grams (or c-grams) are
a merged version of these lists. They show you which 2-grams are subsets of
3-grams, which 3-grams are subsets of 4-grams, and so on. The resulting c-gram
list is a tree structure with the longest n-grams on the left and
shortest n-grams on the right.
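The relationship between n-grams and c-grams can be sketched in Python as follows. The way Wmatrix merges the lists is not described here, so this only illustrates the containment relation that c-grams are built on; the function names are hypothetical and the tokenisation is a plain whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cgram_links(tokens, max_n=5):
    """Map each n-gram (3 <= n <= max_n) to the (n-1)-grams it contains.

    Every n-gram contains exactly two contiguous (n-1)-grams: the one
    obtained by dropping its last word and the one obtained by dropping
    its first word.
    """
    links = {}
    for n in range(3, max_n + 1):
        for gram in ngrams(tokens, n):
            links[gram] = [gram[:-1], gram[1:]]
    return links


tokens = "at the end of the day".split()

# Frequency lists for n-grams of length 2 to 5.
for n in range(2, 6):
    print(n, ngrams(tokens, n).most_common(3))

# Which 2-grams are subsets of which 3-grams.
for longer, shorter in cgram_links(tokens, max_n=3).items():
    print(" ".join(longer), "<-", [" ".join(g) for g in shorter])
```

Following these links from the longest n-grams down to the 2-grams gives a tree of the kind described above, with the longest sequences at the root.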
Acknowledgements:
Wmatrix was initially developed within the
REVERE project
(REVerse Engineering of Requirements)
funded by the EPSRC, project number
GR/M04846.
Lancaster University Proof of concept funding in July 2006
provided support for a new server and continued software development.
In December 2006, further interface design using XHTML/CSS was carried out by
Andrew Foote (InfoLab21 Knowledge Business Centre) funded under support from
the European Regional Development Fund. Through a Lancaster University small grant
(Towards an Online Conceptual Database of the Latin Vulgate Bible)
a 'reader' interface is being developed for pre-tagged corpora.
Please reference Wmatrix as one of the following:
Rayson, P. (2008).
From key words to key semantic domains.
International Journal of Corpus Linguistics.
13:4 pp. 519-549.
DOI: 10.1075/ijcl.13.4.06ray
Rayson, P. (2008) Wmatrix: a web-based corpus processing environment,
Computing Department, Lancaster University. http://ucrel.lancs.ac.uk/wmatrix/
Rayson, P. (2003).
Matrix: A statistical method and software tool for linguistic analysis through
corpus comparison.
Ph.D. thesis, Lancaster University.
(abstract or full text)
Publications and applications:
- Systems engineering: see the publications listed under the
REVERE project. For example:
Sawyer, P., Rayson, P. and Cosh, K. (2005)
Shallow Knowledge as an Aid to Deep Understanding in Early Phase Requirements Engineering.
IEEE Transactions on Software Engineering. Volume 31, number 11, November, 2005, pp. 969 - 981.
ISSN 0098-5589.
doi: http://doi.ieeecomputersociety.org/10.1109/TSE.2005.129
- Aspect oriented requirements engineering: identification of early aspects. See, for example:
Chitchyan, R., Sampaio, A., Rashid, A. and Rayson, P. (2006).
Evaluating EA-Miner: Are Early Aspect Mining Techniques Effective?
In proceedings of Towards Evaluation of Aspect Mining (TEAM 2006).
Workshop Co-located with ECOOP 2006, European Conference on Object-Oriented Programming, 20th edition,
July 3-7, Nantes, France, pp. 5-8.
- Corpus-based impact analysis of academic research:
Francois Taiani, Paul Grace, Geoff Coulson and Gordon Blair (2008)
Past and future of reflective middleware: Towards a corpus-based
impact analysis.
The 7th Workshop On Adaptive And Reflective Middleware (ARM'08)
December 1st 2008,
Leuven, Belgium, collocated with Middleware 2008.
- Ontology learning:
Gacitua, R., Sawyer, P., Rayson, P. (2008). A flexible framework to
experiment with ontology learning techniques. In Knowledge-Based
Systems, 21, 3, April 2008, pp. 192-199. DOI:
10.1016/j.knosys.2007.11.009
- Frequency profile comparison of written and spoken English: See
Leech, G., Rayson, P., and Wilson, A. (2001).
Word Frequencies in Written and Spoken English: based on the British National Corpus.
Longman, London.
(see the companion website for more details)
- Political science research:
Beigman Klebanov, B., Diermeier, D., and Beigman, E. 2008.
Automatic annotation of semantic fields for political science research.
Journal of Language Technology and Politics 5(1):95-120.
http://www.cs.huji.ac.il/~beata/publications.html
- Corpus stylistics (1):
Murphy, S. (2007). Now I am alone: A corpus stylistic approach to Shakespearian soliloquies.
Papers from the Lancaster University Postgraduate Conference in
Linguistics & Language Teaching, Vol. 1. Papers from LAEL PG 2006
Edited by Costas Gabrielatos, Richard Slessor & J.W. Unger.
- Corpus stylistics (2):
A number of papers were presented at the PALA 2007 conference
(29-30 July 2007, Kansai Gaidai University, Osaka, Japan)
including those by Geoffrey Leech, Yu-fang Ho, Dan McIntyre, Haruko Sera, Brian Walker.
Mick Short and Brian Walker also ran a Workshop: Using Wmatrix to compare scenes from Harold Pinter's Betrayal.
See the book of abstracts on the conference website for more details.
- Training chatbots: comparison of human-human and human-machine dialogues. See
Abu Shawar, Bayan; Atwell, Eric. Using dialogue corpora to train a chatbot. In Archer, D, Rayson, P, Wilson, A & McEnery, T (editors) Proceedings of CL2003: International Conference on Corpus Linguistics, pp. 681-690 Lancaster University. 2003.
- Computer content analysis: analysis of interview transcripts.
- Computer content analysis of political discourse. See
Xin Huang (2003) A Computer-aided Diachronic Content Analysis of Twentieth Century
Political Discourse in China. MA dissertation in Language Studies, Lancaster University.
- Key word analysis (1):
See Marilyn Deegan, Harold Short, Dawn Archer, Paul Baker,
Tony McEnery, Paul Rayson (2004)
Computational Linguistics Meets Metadata, or the Automatic Extraction of
Key Words from Full Text Content.
RLG Diginews,
Vol. 8, No. 2.
ISSN 1093-5371.
- Key word analysis (2):
Walkerdine, J. and Rayson, P. (2004)
P2P-4-DL: Digital Library over Peer-to-Peer.
In Caronni G., Weiler N., Shahmehri N. (eds.)
Proceedings of Fourth IEEE International Conference on Peer-to-Peer Computing
(P2P2004)
25-27 August 2004, Zurich, Switzerland.
IEEE Computer Society Press, pp. 264-265. ISBN 0-7695-2156-8.
- Key word-class analysis for EAP: See
Jones, M., Rayson, P. and Leech, G. (2004)
Key category analysis of a spoken corpus for EAP.
Presented at The 2nd Inter-Varietal Applied Corpus Studies
(IVACS)
International Conference on "Analyzing Discourse in Context"
The Graduate School of Education, Queen’s University, Belfast, Northern
Ireland, 25 - 26 June, 2004.
- Phraseology:
Magali Paquot, Sylviane Granger, Paul Rayson and Cédrick Fairon (2004)
Extraction of multi-word units from EFL and native English corpora:
The phraseology of the verb 'make'.
Presented at
Europhras, European Society of Phraseology,
26-29 August 2004, Basel, Switzerland.
- Comparison of political party manifestos: (Labour versus LibDem UK 2001 General Election)
Paul Rayson (2004).
Keywords are not enough.
Invited talk for JAECS (Japan Association for English Corpus Studies)
at Chuo University, Tokyo, Japan, 27th November 2004.
(slides)
- Key domain analysis (1):
Rayson, P. and Smith, N. (2006)
The key domain method for the study of language varieties.
The Third Inter-Varietal Applied Corpus Studies (IVACS) group International Conference on
"LANGUAGE AT THE INTERFACE".
University of Nottingham, UK, 23-24 June 2006.
- Key domain analysis (2):
Archer, D., Culpeper, J. and Rayson, P. (2005)
Love - a familiar or a devil? An exploration of key domains in Shakespeare’s
Comedies and Tragedies.
Presented at the AHRC ICT Methods Network Expert Seminar on Linguistics.
Lancaster University, 8 September 2005.
- Key domain analysis (3):
Yufang Ho. (2007) Investigating the key concept differences between the two
editions of John Fowles's The Magus - a corpus semantic approach. The
27th International Conference of the Poetics and Linguistics Association
(PALA), Kansai Gaidai University, Hirakata, Osaka, Japan, 31 July - 4
August 2007.
- Key domain analysis (4):
Vincent B.Y. Ooi, Peter K.W. Tan & Andy K.L. Chiang (2007)
Analyzing personal weblogs in Singapore English: the Wmatrix approach.
Studies in Variation, Contacts and Change in English.
Volume 2. Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki.
http://www.helsinki.fi/varieng/journal/volumes/02/ooi_et_al/
- Key domain analysis (5):
Afida Mohamad Ali (2007). Semantic fields of problem in business English:
Malaysian and British journalistic business texts.
Corpora, 2, 2, pp. 211-239.
- e-learning materials development: Nakano, T. and Koyama, Y. (2005).
e-Learning Materials Development Based on Abstract Analysis Using Web Tools.
Knowledge-Based Intelligent Information and Engineering Systems.
9th International Conference, KES 2005, Melbourne, Australia, September 14-16, 2005, Proceedings, Part I,
LNCS 3681, Springer, pp. 794-800. DOI 10.1007/11552413_113
- Linguistic modality study:
Gabrielatos, C. and McEnery, T. (2005). Epistemic modality in MA dissertations.
In Fuertes Olivera, P.A. (ed.) Lengua y Sociedad: Investigaciones recientes en
lingüística aplicada. Lingüística y Filología no. 61. Valladolid: Universidad de Valladolid, pp. 311-331.
- Entrepreneurship studies: Doherty, N., Lockett, N., Rayson, P. and Riley, S. (2006).
Electronic-CRM: a simple sales tool or facilitator of relationship
marketing? 29th Institute for Small Business & Entrepreneurship
Conference. International Entrepreneurship - from local to global
enterprise creation and development. 31 October - 2 November 2006,
Cardiff-Caerdydd, UK.
- Knowledge Transfer: see the EPSRC
InfoLab21 Knowledge Transfer Study Report and the
ICT
Knowledge Transfer Research Project