Plagiarism Detection versus Copy and Paste
Draft Report
Note: not for distribution without prior author consent
Angie Chandler and Lynne Blair
Computing Department,
Email contact:
Abstract
As the ease of, and inclination towards, copying and pasting
sections of text and presenting them as original increases, the software available
to combat this cycle of laziness is also on the rise. This report discusses measures
put in place for both written reports and source code for the academic year
2002/03 in the Computing Department at
Introduction
Plagiarism, although defined in a number of different
ways [5], is in essence presentation of someone else’s work as your own. It can
take many forms, varying from word for word copying from the internet to
approximate paraphrasing from a friend’s essay, and the level at which it is a
problem may vary between different disciplines [2]. However, in some form it
can be unequivocally agreed that plagiarism by a student will affect the degree
to which the student has learned the material in question, alongside any
assessment of the student’s work. As such, plagiarism is clearly a problem
which must be tackled to maintain the integrity of the degree scheme, or other
course, in question.
As the literature shows [1][3], as the facility to copy and plagiarise increases, so
does the incidence of plagiarism, leading to new plagiarism detection
requirements. Over the past few years, work in this area has increased
dramatically, and it is at this stage that the Computing Department has chosen
to make use of the new resources available. The aim of this report is that
the work done within the Computing Department may assist
other departments within the university to do the same.
Requirements
As noted above, the definition
of plagiarism can vary between subjects. In Computing, essays are
less likely to require an argument of the kind that, perhaps, an English
department essay must have, but they must nonetheless be written competently by the
student themselves rather than copied verbatim from a source. Similarly, the
programming elements of the course require that students can be shown to have
done the programming exercises themselves rather than copying them from a friend.
Finally, and perhaps most importantly, testing for these various forms of
plagiarism must not take up too much staff
time or other departmental resources. The time of academic staff is already
at a premium and, from the students’ point of view, is best spent teaching
rather than catching the few who may be
cheating the system. Similarly, financial resources are rarely readily
available to departments, and would be best aimed at the provision of student
facilities.
Our Solution
The solution of the Computing Department is currently
based on three separate programs for plagiarism detection. The first is Turnitin, which is at this time in the process of creating
a free access version for all British universities for at least two years via
Each of these solutions relies on the electronic
submission of student work. This may run in tandem with paper based submission (in
which case there must be a means to check that the work is the same);
otherwise the department’s staff must agree either to mark online or to print out vast
numbers of student documents. There must also be disk space on a secure
server to store the submitted documents and, in accordance with the Data
Protection Act, students must agree to their work being both stored and
submitted to external sources. None of these problems is insurmountable, with
the help of Computer Support at the university, but they do present an initial
barrier to setting up such a system, which may significantly delay
general university-wide plagiarism detection.
Once the server containing the work is set up, and
permission has been obtained from the students, the remainder of the proposed
system is simple to set up. CopyCatch and JPlag each rely on a Java [7] program packaged as a
Jar file. Jar files are Java programs
which can be executed in the same way as other executable programs (i.e. by double clicking on an icon) once
the JRE (Java Runtime Environment [8])
has been installed on the system. The JRE is free and is available for most common
platforms. Turnitin has a web based interface, but as
the department wishes to submit documents itself, rather than requiring students to
submit their work and return a Turnitin report alongside their
main report, this will also make use of three Jar files to enable the process to
take place quickly and painlessly. These home-grown Jar files are already
operational and will be made readily available to members of the university
once they have been more thoroughly tested [6].
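As an illustration of the launch step described above, a Jar file can also be started from a command line or script rather than by double clicking. The following is a minimal sketch in Python which builds the standard `java -jar` command. The function names and the file name `batchTurnitin.jar` are hypothetical, chosen for illustration only, and are not the names of the department's actual files.

```python
import shutil
import subprocess


def jar_command(jar_path, *args):
    """Build the command line that launches an executable Jar file."""
    return ["java", "-jar", jar_path, *args]


def run_jar(jar_path, *args):
    """Run the Jar if a Java runtime is on the PATH; return its exit code."""
    if shutil.which("java") is None:
        raise RuntimeError("No Java runtime found; install the JRE first")
    return subprocess.run(jar_command(jar_path, *args)).returncode


# Hypothetical file name, for illustration only; nothing is executed here.
print(jar_command("batchTurnitin.jar"))
```

Calling `run_jar` rather than `jar_command` would actually start the program, and fails early with a clear message if no JRE has been installed.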
Results
Initial results of the three plagiarism detection
programs have been favourable, although there remain a few problems to be
ironed out.
Turnitin
Run on a sample of student work from the
previous year, the batch submission system for Turnitin found approximately double the number of
plagiarised documents that the lecturer in question was able to identify
during marking. Most notably, this system made the process of plagiarism
detection less reliant on spotting a change in writing style, an approach which
unfairly disadvantaged students for whom English was a second language.

Figure 1 Turnitin - Batch Submission Results
Figure 1 shows a listing of students, ranked in order
of the amount of plagiarism found within their documents, following the use of the batch Turnitin Jar files. Further details of this, of Turnitin submission, and of standard Turnitin
procedures can be found in [10].
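The ranking in Figure 1 can be thought of as a simple sort on a per-document score. The following is a minimal sketch in Python and is not the batch Jar files themselves. The function name is an assumption, and the student identifiers and percentages are invented values used purely for illustration.

```python
def rank_by_score(scores):
    """Return (student, score) pairs, highest similarity score first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented example data: student identifier -> percentage of matched text.
example_scores = {"student_a": 4, "student_b": 61, "student_c": 23}

for student, score in rank_by_score(example_scores):
    print(f"{student}: {score}%")
```

Listing the highest scores first means that a marker only needs to inspect the top of the list, which fits the requirement that detection should take little staff time.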
CopyCatch [11]
Using the same batch of students as the initial Turnitin test, the CopyCatch
program analysed the documents and found a good number of them to be over 50%
similar. However, a cursory examination of these results showed the anomalies to
be nothing more than the students using the same words to describe technical
elements of the essay. In fact, this test showed that none of the students had
plagiarised from one another, those who were so inclined preferring to
plagiarise from the internet. In order to test the display of documents which
were plagiarised, a test was performed including some of the plagiarised
internet files found by Turnitin for comparison.

Figure 2 CopyCatch - Selecting Files

Figure 3 CopyCatch - Marked up files (plagiarised)
In Figure 3 (although the words themselves are too
small to read) the red text represents shared words which occur only once in each
document, whilst blue text represents shared words which occur multiple times.
The once-only shared words (unless they are technical words such
as, for instance, telecommunications) give a very clear indication of the
degree of plagiarism between the documents.
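The distinction between the two colours can be illustrated with a short sketch. This is one possible reading of the description above, written in Python, and not CopyCatch's own implementation: a shared word is treated as "once-only" if it occurs exactly once in each document, and as "repeated" otherwise. The function names and the two sample sentences are assumptions made for the example.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased words in a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def classify_shared_words(text_a, text_b):
    """Split the shared vocabulary into once-only and repeated words.

    A shared word is "once-only" if it occurs exactly once in each
    document; every other shared word is treated as "repeated".
    """
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    shared = set(counts_a) & set(counts_b)
    once_only = {w for w in shared if counts_a[w] == 1 and counts_b[w] == 1}
    return once_only, shared - once_only


# Invented sample sentences, for illustration only.
doc_a = "the packet is routed quickly and the packet arrives"
doc_b = "the packet is dropped quickly but the packet returns"
once_only, repeated = classify_shared_words(doc_a, doc_b)
print(sorted(once_only))  # ['is', 'quickly']
print(sorted(repeated))   # ['packet', 'the']
```

A real check would also need to set aside common technical vocabulary, as noted above, since such words are shared by honest essays on the same subject.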

Figure 4 CopyCatch - Marked up non plagiarised files
Figure 4 shows a pair of files which exceeded the
initial 50% threshold given to determine whether files are plagiarised, but are
clearly not plagiarised – this can be determined both from the relatively few
red words shown in the text and by studying the “phrases” section, which
displays all matching phrases and their location.
The CopyCatch program
offers a large number of ways to view information about every file submitted
in the class, including details of any matching phrases, the words used
only once in each document (a significant measure of the degree of plagiarism)
and a host of other statistics. The largest drawback at this time
would appear to be the accurate interpretation of these results. This
is not to say that the results are incomprehensible, far from it; merely that
the threshold for determining whether documents are plagiarised or simply on
the same subject must be carefully set in a field such as computing, and that in the
event of a match the information available must be examined carefully.
Intuitively, this program provides a means of highlighting documents which the
lecturer would themselves have detected as plagiarised had they been
marked in close enough succession.
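The effect of the threshold can be seen in a small sketch which compares every pair of documents in a class and reports those above a chosen percentage. The similarity measure used here is a plain Jaccard overlap of vocabularies, chosen for brevity; it is an assumption and is not the measure CopyCatch uses. The function names, the 50% default and the sample essays are likewise illustrative.

```python
import re
from itertools import combinations


def vocabulary(text):
    """Return the set of lower-cased words in a text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def overlap_percent(text_a, text_b):
    """Jaccard overlap of two vocabularies, as a percentage."""
    vocab_a, vocab_b = vocabulary(text_a), vocabulary(text_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return 100.0 * len(vocab_a & vocab_b) / len(union)


def flag_pairs(documents, threshold=50.0):
    """Return (name_a, name_b, percent) for pairs above the threshold."""
    flagged = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(documents.items()), 2):
        percent = overlap_percent(text_a, text_b)
        if percent > threshold:
            flagged.append((name_a, name_b, percent))
    return flagged


# Invented sample essays, for illustration only.
essays = {
    "essay1": "routers forward packets between networks",
    "essay2": "routers forward packets between separate networks",
    "essay3": "plagiarism harms learning",
}
print(flag_pairs(essays))
```

Raising the threshold, for example `flag_pairs(essays, threshold=90.0)`, removes the flagged pair, which shows why the value has to be tuned to the subject: essays on the same technical topic share much of their vocabulary without being copied.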
JPlag [9]
Without data on which to test this program
conclusively, remarks on its success are difficult to justify. Nonetheless,
following limited testing on a number of both similar and dissimilar code
samples, initial results have been shown to be both accurate and easy to read,
with any matching sections of code within a group of code submissions clearly
highlighted and linked to one another.

Figure 5 JPlag submission
Code is submitted to JPlag
through a Jar file (Figure 5), as in the previous examples.
Results can then be obtained later from a secure web page (Figure 6).

Figure 6 JPlag results
Any coloured sections in the JPlag
results documents show identical, plagiarised sections. They can be matched
using a pair of arrows in either document to search for the equivalent section
in the partner code. Each pair of matches is ranked according to the degree of
plagiarism. Early experiments have shown that even relatively similar, unplagiarised source code remains below the 3%
boundary on JPlag.
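To give a feel for how matching sections of code and a percentage score might be derived, the sketch below compares two code samples token by token. It uses Python's standard `difflib.SequenceMatcher` as a stand-in and is not JPlag's algorithm; the tokeniser, the function names and the `min_tokens` parameter are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(source):
    """Split source code into identifiers, numbers and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def matching_sections(source_a, source_b, min_tokens=3):
    """Return (start_a, start_b, length) for each matched run of tokens."""
    tokens_a, tokens_b = tokenize(source_a), tokenize(source_b)
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks()
            if m.size >= min_tokens]


def similarity_percent(source_a, source_b):
    """Percentage of tokens that fall inside matched runs."""
    tokens_a, tokens_b = tokenize(source_a), tokenize(source_b)
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return 100.0 * matcher.ratio()


# Invented code samples, for illustration only.
code_a = "int total = 0; for (int i = 0; i < n; i++) total += i;"
code_b = "int sum = 0; for (int k = 0; k < n; k++) sum += k;"
print(similarity_percent(code_a, code_b))
print(matching_sections(code_a, code_b))
```

In this simple version, renaming variables lowers the score because identifiers are compared literally. A tool designed for source code would normally need to be more robust to such cosmetic changes.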
Conclusions
As the plagiarism detection system nears the end of its first
year in service, with one successful “live” test already behind it, the hope is
that its use will discourage students from taking what, up
until now, must have seemed like an easy way out. Perhaps awareness of the
existence of the system will be all that is necessary. Certainly, given the overall aim of giving students the
education they come to the university for, deterrence of this kind would be the most effective use
of the system in future years. As the situation stands at the moment, the department’s
hopes are high that this goal can be achieved, but the focus remains heavily on
the procedure which must be followed once a plagiarised document has been
discovered.
References
[1] Plagiarism, can technology help? Gill
http://ctiwebct.york.ac.uk/LTSNPsych/PLAT2002/Programme/Full_Programme/full_programme.html#p18
[2] Plagiarism, Prevention, Deterrence & Detection, Fintan Culwin & Thomas Lancaster, South Bank University,
[3] JISC (Joint Information Systems Committee) Plagiarism
http://www.jisc.ac.uk/plagiarism
[4] The JISC Plagiarism Advisory
http://online.northumbria.ac.uk/faculties/art/information_studies/Imri/JISCPAS/site/jiscpas.asp
[5] Plagiary and the Art of Skillful Citation, John Rodgers, 1996
http://www.bcm.tmc.edu/immuno/citewell/
[6] Batch Turnitin, Angie Chandler and Lynne Blair, 2002
http://www.comp.lancs.ac.uk/computing/users/angie/plagiarism/batchTurnitin.htm
http://www.comp.lancs.ac.uk/computing/users/angie/computer.htm
[7] Java, 2002 http://java.sun.com
[8] Java Runtime Environment, 2002 http://java.sun.com/products/jdk/1.1/jre/
[9] JPlag, Guido Malpohl, 2002, http://www.jplag.de/
[10] Turnitin, 2002. http://www.turnitin.com/
[11] CopyCatch, David Woolls, 2002. http://www.copycatchgold.com/