Automated Plagiarism Detection: Start to Finish
Draft Report
Note: not for distribution without prior author consent
Angie Chandler and Lynne Blair
Computing Department
Email contact: lb@comp.lancs.ac.uk
Abstract
This document discusses work done on enabling technologies that make common plagiarism detection systems more practical for lecturers and teachers, who are likely to be short of time. The stages required to submit the work are shown and described from start to finish.
Introduction
Although plagiarism is not a new issue, the growing use of the internet as a source of information is making plagiarism increasingly difficult for teachers to recognise. It is no longer enough to recognise the text from which a student may have copied material, or to notice that two students have submitted similar work; the teacher must now be able to detect work which may have been taken from any of millions of potential web sites, often simply by noticing a change in the student’s style of writing.
Naturally, as students’ use of such plagiarism methods increases, so does the authorities’ use of countermeasures, in this case programs designed to detect plagiarism from the web. A good deal of research has been done in this area, and computers exist which are dedicated to tracking down web sites which may supply students with resources.
The topic of this document is not the well known system of plagiarism detection, but the effective use of the supervisor’s time when it comes to submission to plagiarism systems [6]. Although in recent months the largest web plagiarism detection system, Turnitin, has provided a new batch submission system, prior to this there was no way to collect and submit documents without consuming unreasonable amounts of staff time. Similar difficulties also apply to other submission systems.
There are a number of separate stages to the submission of student work to a plagiarism detection system, ranging from signed permission forms from the students to software and storage concerns. This document will describe the stages of work submission, collection and analysis and step through the processing of a document from start to finish.
Step 1: Submission of Work
The first hurdle any electronic plagiarism detection system must clear is the electronic submission of students’ work. Even for computer literate students this can cause problems, and instructions on how to submit work must be completely clear or they will cause the students unnecessary anxiety. There are currently three feasible methods for submitting work electronically:
- via email
- into directories
- web submission form
In fact, the ideal would probably be a completely accurate document scanner, which could take in a stack of hard copies of work and convert them into electronic form, thus avoiding any discrepancies between the electronic and paper versions. Given that this is not yet possible, the best of these methods is probably a web submission form.

Figure 1 A simple web submission form
For the purposes of this work each of the three methods mentioned above has been trialled. The web submission form, however, has yet to reach full trial as there have been difficulties with security of documents as they are transmitted. Initial trials are nonetheless encouraging, as users would only need minimal familiarity with the computer and the format of the documents downloaded could be guaranteed. The structure of their storage is likely to be similar to that found in the directory method described below.
Copying of documents into preset directories has been trialled on several occasions, although only on computer science students, who can obviously be assumed to be highly computer literate. Each of the students is given his or her own directory to copy work into, on a per exercise basis, with each student only able to access his or her own directory.

Figure 2 The submission directories for coursework "www"
The supervisor must then have complete access rights to the directory structure in order to collect the data from the students. Collecting the data manually would still be a considerable task, so we have developed a program which checks through all relevant directories (including subdirectories within the students’ individual folders) and saves the files it finds, under unique file names, to a single folder.

Figure 3 Fetch Files
This program can either create separate files for each of a given student’s files, or combine the files into a single document, depending on the method of plagiarism testing to be used on the documents.
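The collection step described above can be sketched roughly as follows. This is only an illustration and not the Fetch Files program itself: the function name `fetch_files`, the assumption of one sub-directory per student under a coursework root, and the `student__path` naming scheme are all hypothetical. The destination folder is assumed to lie outside the coursework root.

```python
import shutil
from pathlib import Path


def fetch_files(root: Path, dest: Path, combine: bool = False) -> list[Path]:
    """Collect every file below each student's directory into `dest`.

    Assumption: `root` contains one sub-directory per student and
    `dest` is not itself inside `root`.
    """
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for student_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # rglob also descends into sub-directories of the student's folder.
        files = sorted(p for p in student_dir.rglob("*") if p.is_file())
        if not files:
            continue
        if combine:
            # One document per student, files appended in sorted order.
            target = dest / f"{student_dir.name}.txt"
            with target.open("w", encoding="utf-8") as out:
                for f in files:
                    out.write(f.read_text(encoding="utf-8", errors="replace"))
                    out.write("\n")
            written.append(target)
        else:
            for f in files:
                # Fold the relative path into the name so it stays unique.
                rel = f.relative_to(student_dir)
                target = dest / "__".join((student_dir.name, *rel.parts))
                shutil.copy2(f, target)
                written.append(target)
    return written
```

Calling `fetch_files(Path("www"), Path("collected"))` would give one copy per submitted file, while `combine=True` gives one text document per student, which mirrors the two modes mentioned above.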
The final method of electronic collection is via email. Initially, this may sound like a good, simple solution and, being more familiar to non-computing students, possibly an advantageous one. However, there are a number of overheads associated with an email submission system, primarily for the member of staff required to collect the documents and save them to a folder as they arrive, which can be a time consuming process. Attention must also be paid to whether each document has arrived intact, and to the appropriate naming of the document so that the student’s work can be recognised later. In addition, unlike the directory system, where students can view their submitted work, students using email are unable to determine whether the work has been received. Non-computing students were noticeably more likely to submit work twice to ensure its safe arrival, increasing staff overheads further.
Ultimately, the choice of submission method must rest with the department setting the work, because of the set-up overheads, which academic staff would not expect to have to deal with. Ideally, a single system would be set up uniformly across all departments and courses, saving work to a central server for staff use, but how practical this would be is difficult to determine.
Step 2: Processing of Work for Plagiarism Detection Devices
Once all work has been successfully obtained and arranged, it must be processed appropriately for the system to which it will be submitted. Each of the three systems used in this trial, Turnitin (plagiarism from the web) [3], CopyCatch (collusion checking) [4] and JPlag (program source code) [5], requires different initial processing. At this stage little else would need to be done for CopyCatch or JPlag, so we will focus on Turnitin.
In order to make use of Turnitin without the significant overhead of submitting each piece of work individually, we have developed a system through which work can be combined into a single submission, then post-processed to separate the work again. This is done in two stages.
Initially, the work must be converted into text format so that the program can manipulate it. This is done by first detecting the files to convert using the program shown below (simply select a directory) and then running a program called “Antiword”.

Figure 4 Convert to Text
This process will anonymise the files, but if post-processing is done on the same computer the results can be rematched with the students automatically. If post-processing is done on a different computer, the details relating students to documents are stored in a file called “matchdocs”, which can be used to rematch them manually.
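A minimal sketch of the anonymising step, assuming the files have already been converted to text, might look like the following. The layout of the real “matchdocs” file is not described here, so the JSON format, the `docNNNN.txt` naming and the function name `anonymise` are assumptions made purely for illustration.

```python
import json
from pathlib import Path


def anonymise(files: list[Path], out_dir: Path) -> dict[str, str]:
    """Copy text files to neutral names and record which is which."""
    out_dir.mkdir(parents=True, exist_ok=True)
    mapping = {}
    for index, src in enumerate(sorted(files), start=1):
        anon_name = f"doc{index:04d}.txt"  # hypothetical naming scheme
        text = src.read_text(encoding="utf-8", errors="replace")
        (out_dir / anon_name).write_text(text, encoding="utf-8")
        mapping[anon_name] = src.name
    # Assumption: the mapping is stored as JSON; the actual format of
    # "matchdocs" is not given in this document.
    (out_dir / "matchdocs").write_text(
        json.dumps(mapping, indent=2), encoding="utf-8"
    )
    return mapping
```

Keeping the mapping in a separate file means the anonymised documents can be passed on without the students' names, while the supervisor can still rematch results later.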
With the files converted into text, the text must then be sampled and combined. The sampling process, which will be necessary even with the new batch submission system on Turnitin itself, allows the supervisor to choose which sections of the text should be submitted, and also automatically removes the first section of text to protect the student’s identity.

Figure 5 Sampling
A file containing all of the documents is then submitted to the Turnitin system, and the supervisor must wait for the results.
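The sampling and combining stage, together with the later separation of the batch, can be sketched as below. This is an illustrative assumption rather than the tool shown in Figure 5: the marker line format, the choice of dropping a fixed number of leading lines, and the function names are all hypothetical.

```python
import math
import re

# Hypothetical separator line written between documents in the batch.
MARKER = "=====DOC {}====="
MARKER_RE = re.compile(r"^=====DOC (\S+)=====$")


def sample_text(text: str, skip_lines: int = 5, fraction: float = 0.5) -> str:
    """Drop the first `skip_lines` lines (which are assumed to hold the
    student's name), then keep the leading `fraction` of what remains."""
    lines = text.splitlines()[skip_lines:]
    keep = math.ceil(len(lines) * fraction)
    return "\n".join(lines[:keep])


def combine_batch(samples: dict[str, str]) -> str:
    """Join all samples into one submission, each preceded by a marker."""
    parts = []
    for doc_id, text in samples.items():
        parts.append(MARKER.format(doc_id))
        parts.append(text)
    return "\n".join(parts)


def split_batch(batch: str) -> dict[str, str]:
    """Reverse `combine_batch` by cutting the batch at the marker lines."""
    docs: dict[str, list[str]] = {}
    current = None
    for line in batch.splitlines():
        match = MARKER_RE.match(line)
        if match:
            current = match.group(1)
            docs[current] = []
        elif current is not None:
            docs[current].append(line)
    return {doc_id: "\n".join(lines) for doc_id, lines in docs.items()}
```

A `fraction` of 1.0 corresponds to a 100% sample. The marker only works if no document contains a line in exactly that form, which is one reason to treat this as a sketch.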
In the cases of CopyCatch and JPlag the requirements are slightly different. CopyCatch can collect files from different folders, but experience shows it works best with all files in the same directory and, if the number or size of the files is large, with the files converted into text format. JPlag requires all files to be within the same directory but, as it is intended to process program code, it is designed to handle the format in which the code will already be found.
Step 3: Post-Processing of Work
Each of the systems also requires a degree of post-processing, whether it simply requires interpretation, or further software processing. Again, the first system we will discuss is Turnitin. Results from Turnitin, if they are submitted as a batch, will be impossible to interpret manually, so they must first be run through the program shown below.

Figure 6 Decode Results
This process will not only separate and rank the students according to the degree of plagiarism detected, but can also download the source web documents from which the students may have copied. These web documents can then in turn be run through CopyCatch to detect similarities between documents which are not exact. This is done by downloading all the web documents referred to and then combining them for use in CopyCatch using the sampling tool shown earlier (taking a 100% sample). Later versions are expected to combine only the web sources shown to have been used within the same document, to improve results.
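As a rough illustration of the ranking step, the sketch below assumes that a percentage match has already been extracted for each anonymised document; how Turnitin reports its results is not covered here. The band labels follow the Turnitin table in the appendix, but treating each lower bound as inclusive is our assumption, as are the function names.

```python
# Bands as used in the appendix table; inclusive lower bounds are assumed.
BANDS = [(50, "50-100%"), (30, "30-50%"), (10, "10-30%"), (1, "1-10%"), (0, "<1%")]


def band(percent: float) -> str:
    """Return the band label for a percentage match."""
    for lower, label in BANDS:
        if percent >= lower:
            return label
    raise ValueError("percent must be non-negative")


def rank(scores: dict[str, float], matchdocs: dict[str, str]) -> list[tuple[str, float, str]]:
    """Sort documents by match percentage, highest first, and rematch
    each anonymised name with the original where the mapping knows it."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(matchdocs.get(doc, doc), pct, band(pct)) for doc, pct in ranked]
```

For example, `rank({"doc0001": 3.0, "doc0002": 55.0}, {"doc0002": "bob.doc"})` lists `bob.doc` first in the 50-100% band, followed by `doc0001`, which stays anonymised because the mapping has no entry for it.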
Although CopyCatch requires little or no pre- or post-processing, it does require a degree of interpretation. Essays written by different students on a given title are likely to show a degree of similarity, so the best approach is to take note of the words which are used only once (marked in red) in each document. This is part of the process CopyCatch uses to determine the similarities between documents, and is easily the best measure to use. If a significant number of the “only used once” words are written in the same context, then the likelihood of the documents being copied rather than coincidentally similar increases.
JPlag, used only for the detection of source code plagiarism, is far simpler. The results are provided as a list of pairs of code samples, with links to each matching section of more than a certain length. The difficulty, as with CopyCatch, is determining to what extent the apparent plagiarism is due to a shared environment rather than deliberate plagiarism. From the layout provided by this system, it is a relatively simple task for the course supervisor to pick out any suspect documents, particularly as with code assignments there are often multiple submissions in a relatively short time and so trends can be detected prior to action if necessary.
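The idea of words used only once can be illustrated with a short sketch. This is not CopyCatch's own algorithm, which is not described here; it simply finds the words that occur exactly once in each of two documents and reports those they share. The tokenisation rule and function names are assumptions.

```python
import re
from collections import Counter


def once_only_words(text: str) -> set[str]:
    """Words that occur exactly once in the text (case-insensitive)."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return {word for word, n in counts.items() if n == 1}


def shared_once_only(a: str, b: str) -> set[str]:
    """Words used only once in both documents."""
    return once_only_words(a) & once_only_words(b)
```

Common words such as "the" tend to appear several times in any essay and so drop out, leaving the rarer vocabulary. The supervisor would still need to check whether the shared words appear in the same context.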
Conclusions
In conclusion, this document does not consider the results obtained by the different submission systems, which have already been discussed in two separate documents by the authors [1][2]; instead it focuses on the work required to process the submissions from the students and interpret the results produced. The programs written to assist in this process have proved invaluable in this study, and are readily available for use by course supervisors in other departments.
The final hurdle to be overcome by this system, as a complete electronic submission processing system, is ensuring that students submit electronically the same work which is submitted on paper. Without significant overheads, either in staff time comparing work or in the cost of paper if the university undertakes to print out every piece of student work, and assuming that staff would largely prefer not to mark any quantity of work online, it is difficult to see an effective solution.
Our current procedure is to check a few random samples, to ensure that they do match up to the hard copies. However, we propose that the odds of finding a mismatch may be greatly increased with the use of CopyCatch to detect documents which are notably less similar than the others. As we proceed on to perform increasing numbers of trials and put the system into full use, we plan to test this assertion in the hope that it will strengthen the submission system as a whole.
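One way this proposal might be explored is sketched below. It does not use CopyCatch: as a stand-in, similarity is measured by the Jaccard overlap of the word sets of two documents, and a document is flagged when its average similarity to the rest of the class falls well below the class average. The measure, the `ratio` threshold and the function names are all assumptions chosen for illustration.

```python
import re
from itertools import combinations
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Shared words divided by all words used in either text."""
    words_a = set(re.findall(r"[a-z']+", a.lower()))
    words_b = set(re.findall(r"[a-z']+", b.lower()))
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def least_similar(docs: dict[str, str], ratio: float = 0.5) -> list[str]:
    """Names of documents whose mean similarity to the others is below
    `ratio` times the average over the whole set."""
    if len(docs) < 3:
        return []  # too few documents for the comparison to mean anything
    per_doc: dict[str, list[float]] = {name: [] for name in docs}
    for x, y in combinations(docs, 2):
        score = jaccard(docs[x], docs[y])
        per_doc[x].append(score)
        per_doc[y].append(score)
    means = {name: mean(scores) for name, scores in per_doc.items()}
    overall = mean(list(means.values()))
    return sorted(name for name, m in means.items() if m < ratio * overall)
```

Documents flagged in this way would be the first candidates to compare against their hard copies, in place of a purely random sample.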
Refs
[1] Batch Plagiarism Detection with Turnitin, Angie Chandler and Lynne Blair, 2002. http://www.comp.lancs.ac.uk/computing/users/angie/plagiarism/batchTurnitin.htm
[2] Plagiarism Detection versus Copy and Paste, Angie Chandler and Lynne Blair, 2002. http://www.comp.lancs.ac.uk/computing/users/angie/plagiarism/overall.htm
[3] Turnitin, iParadigms, 2003. http://www.turnitin.com
[4] CopyCatch, David Woolls, 2003. http://www.copycatchgold.com
[5] JPlag, Guido Malpohl, 2002. http://www.jplag.de/
[6] Plagiarism, Prevention, Deterrence and Detection, Fintan Culwin and Thomas Lancaster, 2001. http://www.ilt.ac.uk/resources/Culwin-Lancaster.htm (currently offline; try Centre for Interactive Systems Engineering, South Bank University)
Appendix
Summary of trials carried out and statistics of results.
Turnitin
CSc311: detected 7 cases of plagiarism (of over 30%) compared to 3 detected manually. There were 4 over 50% but these didn’t correspond to the ones caught by hand.
MSc projects: one case found; student was allowed to resubmit.
MSc AI course: some plagiarism found and agreed on by the lecturer involved.
MSc Management essay: known plagiarism. Many sources found but not a large enough percentage for concern.
MSc AISD, SE: no significant plagiarism found.
Econ 205: no plagiarism found.
CopyCatch
CSc311: no evidence of plagiarism.
MSc AI: inconclusive.
MSc Management essay: increased concern to 38% using downloaded URLs.
MSc SE: no significant plagiarism found.
MSc AISD: suspicious case of two students with 80% match (often get 50-60% on similar set essays). Awaiting verdict from lecturer.
Econ 205: one student had work found to be similar to that of a number of different students, which may have been due to plagiarism from a book; however, it was deemed acceptable by the lecturer.
JPlag
CSc210 Week 4: inconclusive.
Com120 QBasic: tested as text files, but still clearly showing the plagiarism between groups of people known to work together. Very similar to hand caught plagiarism.
CSc210 Chat program: again inconclusive.
CSc210 Chat program with both this year’s and last year’s students: plagiarism found between last year’s students (effects of warning about checking?) and some possible between years.
CSc210 WWW program: cumulative effect of testing showed two students consistently slightly higher than everyone else. Further investigation showed significant plagiarism and one student lost marks for the work involved.
Statistics of results
Turnitin
|                  | 50-100% | 30-50% | 10-30% | 1-10% | <1% | Total |
| CSc311           | 4       | 3      | 11     | 12    | 43  | 73    |
| MSc Proj         | 0       | 1      | 1      | 0     | 18  | 20    |
| MSc AI           | 1       | 1      | 1      | 2     | 9   | 14    |
| MSc SE           | 0       | 0      | 0      | 1     | 4   | 5     |
| MSc AISD         | 0       | 0      | 0      | 0     | 14  | 14    |
| Management essay | 0       | 0      | 0      | 1     | 0   | 1     |
| Econ 205         | 0       | 0      | 0      | 0     | 66  | 66    |
All MSc courses listed are part of the computing MSc in multimedia and distributed systems. It is known that the plagiarised management essay was largely plagiarised from books, and this will be retested once a full implementation of Turnitin (capable of detecting matches from books) is available. The total absence of detected plagiarism in the economics course also suggests that if there is any plagiarism it is likely to be from books, unlike essays written for the computing department, where the students are more likely to be comfortable with the internet.
CopyCatch
| Submission | No. Pairs >= 60% | Highest match | Comparing to URLs (highest) |
| CSc311     | 12               | 67%           | 29%                         |
| MSc Proj   | 0                | 0%            | N/A                         |
| MSc AI     | 15               | 71%           | 35%                         |
| MSc SE     | 10               | 78%           | 6%                          |
| MSc AISD   | 18               | 80%           | 21%                         |
| Econ 205   | 2                | 60%           | 23%                         |
The management essay could not be compared to any other documents for collusion. However, against the web sources which Turnitin matched at 3%, it matched 38%, which as can be seen from the table is quite high for that type of match.
JPlag
CSc210 results – current exercises, last year’s (old) exercises and combined. Potential cases of cross-year plagiarism are outlined in blue.
| Exercise   | 90% | 80% | 70% | 60% | 50% | 40% | 30% | 20% | 10% | 0%  |
| Week3      | 0   | 0   | 0   | 1   | 0   | 0   | 0   | 2   | 0   | 36  |
| Old Week3  | 1   | 0   | 2   | 0   | 0   | 2   | 8   | 12  | 5   | 30  |
| Both Week3 | 1   | 0   | 2   | 1   | 0   | 3   | 8   | 15  | -   | 36  |
| Week4      | 0   | 0   | 0   | 2   | 2   | 1   | 2   | 1   | 22  | 70  |
| Old Week4  | 1   | 0   | 0   | 2   | 0   | 0   | 1   | 1   | 25  | 30  |
| Both Week4 | 1   | 0   | 0   | 4   | 2   | 1   | 4   | 10  | 8   | 130 |
| Chat       | 0   | 0   | 0   | 0   | 1   | 5   | 9   | 12  | 0   | 70  |
| Old Chat   | 2   | 2   | 0   | 2   | 0   | 4   | 16  | 4   | 0   | 28  |
| Both Chat  | 2   | 2   | 0   | 2   | 5   | 15  | 3   | -   | -   | 128 |
| www        | 0   | 0   | 0   | 0   | 0   | 1   | 2   | 30  | 0   | 66  |
| Old www    | 0   | 0   | 0   | 2   | 4   | 3   | 1   | 15  | 3   | 27  |
| Both www   | 0   | 0   | 0   | 2   | 4   | 4   | 6   | 14  | -   | 126 |
As results for last year were noticeably higher than this year, we wondered whether simply the threat of plagiarism checking (the students were warned) was enough to act as a deterrent.