Batch Plagiarism Detection with Turnitin
Draft Report
Note: not for distribution without prior author consent
Angie Chandler and Lynne Blair
Computing Department,
Email contact: lb@comp.lancs.ac.uk
Abstract
Turnitin.com is one of the
most widely recognised plagiarism detection sites on the market, now used by
approximately “5 million students and educators” [Turnitin 02]. The
effectiveness of this service is not in question; the subject of this paper is
instead the interface through which all documents must be submitted. For
educators who wish to submit large quantities of student work without relying
on the students themselves to submit, using Turnitin would quickly become so
time-consuming as to be impractical, unless it was limited to either random
samples of student text or already suspicious documents. This paper therefore
describes the submission and interpretation of multiple student texts as a
single entity, in order to encourage thorough plagiarism testing.
Introduction
Turnitin.com is a widely
renowned plagiarism detector for free text that will detect not only plagiarism
from the web but also collusion between students. Collusion detection can lead
indirectly to the detection not only of students copying from one another but
also of plagiarism from a separate external source such as a textbook. The
system provides a simple cut-and-paste interface where the user must paste
their document into a text area and then submit it for testing (Figure 1).

Figure 1 Turnitin submission
The testing process itself
can then take up to 24 hours, but will return with all plagiarised areas of the
text clearly highlighted and a comprehensive list of all links from which
documents may have been copied (Figure 3). Each document tested will be listed
on the user’s personal documents page, with an overall similarity rating and a
link to the document itself (Figure 2).

Figure 2 Turnitin list of results

Figure 3 Turnitin links and document
The links shown can be either
links to web pages or links to documents written by other students, whether
from within this group or from another set of unrelated submissions. For
documents submitted by another user, however, that user must first give
explicit permission before their documents can be studied.
Batch Submission
After even a very few uses
of the Turnitin system, it becomes clear that the system is not intended for
batch submission of a class of students. Turnitin itself gets round this by
giving each student on a course a login name and enabling them to submit their
own work directly. This brings its own problems, since the educator must then
view the class results and ensure that the work submitted matches the final
work submitted for marking. Whilst in the right circumstances this may still
provide an effective method of processing, for the purposes of the authors it
was unsatisfactory. Turnitin itself was clearly an effective choice of
plagiarism detector, but the submission process was far too time-consuming, to
the extent that the plagiarism detection service itself may not have been
worthwhile. However, with a simple series of pre- and post-submission
processing programs it is possible to submit an entire batch of student
documents simultaneously. This inevitably makes collusion checking through
Turnitin impossible; this is discussed further in Conclusions and Future Work.
The processing for batch
submission must be done in a series of simple stages. First of all, the
documents must all be converted to text format so that the program itself can
read them, rather than requiring the user to edit the documents by hand through
Word or other word processors. For the purposes of this initial trial, the
conversion is only carried out on Word documents. In our samples these comprise
the vast majority of the texts submitted, leaving the remaining 2% to be either
converted by hand or ignored until the initial processing program has been
extended.
Following the conversion of
the documents to text, the user must then run a second program to concatenate
the files into a single text document. This is done by taking anonymised
samples of every document and marking the separation point between each
student. The size of the sample taken can easily be adjusted, ranging from the
entire document down to just a tiny sample of the available text. The optimal
sample size will necessarily depend on the type of submission to be processed.
For example, final year projects will largely only require plagiarism testing
on the background section of the report, whereas a report on a given subject
may be plagiarised throughout and could be tested at any point. Anonymisation
is done automatically and is necessary for legal purposes.
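The sampling and concatenation stage might look something like the following. This is a minimal Python sketch and not the program used in these trials: it assumes the documents are already plain text, measures the sample in words, and uses a made-up separator line and `StudN` numbering scheme. The names `take_sample`, `build_batch` and `SEPARATOR` are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical separator line; the marker used by the real program is not described.
SEPARATOR = "\n=====DOC {doc_id}=====\n"


def take_sample(text: str, fraction: float, offset: float) -> str:
    """Return a contiguous sample of `text`, measured in words.

    fraction: share of the document to keep (0 < fraction <= 1).
    offset:   how far into the document the sample starts (0 <= offset < 1).
    """
    words = text.split()
    start = round(len(words) * offset)
    length = max(1, round(len(words) * fraction))
    return " ".join(words[start:start + length])


def build_batch(documents: Dict[str, str], fraction: float,
                offset: float) -> Tuple[str, Dict[str, str]]:
    """Concatenate anonymised samples into one text.

    Returns the joint text plus a key mapping anonymous IDs back to the
    real names. Only the label is anonymised here; names appearing inside
    the document text itself are not stripped.
    """
    key: Dict[str, str] = {}
    parts = []
    for index, name in enumerate(sorted(documents), start=1):
        doc_id = f"Stud{index}"
        key[doc_id] = name  # kept locally, never submitted
        parts.append(SEPARATOR.format(doc_id=doc_id))
        parts.append(take_sample(documents[name], fraction, offset))
    return "".join(parts), key
```

Calling `build_batch(docs, 0.3, 0.1)` would, under these assumptions, correspond to a 30% sample starting 10% of the way into each document.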
Once the joint file has been
generated for submission, the user must then log in to Turnitin and submit the
entire text file as any normal file would be submitted, ensuring that an
appropriate title is also added. This will then be processed over the course of
24 hours by Turnitin. The results will appear exactly as for any other
submission, with a link to the document maintained on the user’s documents
page.
The final stage of the
user’s task is to save the results in order to process them back into an easily
interpretable format, as they would have been had the documents all been
submitted separately. This simply requires the user to save the source file
(i.e. the HTML itself) of the results page to their computer and to run the
final processing program to reformat the results. If all of the available
results are stored within a single directory, the program will also generate a
simple list of all available sets of results, mimicking the content of Turnitin
as closely as possible (Figure 4).

Figure 4 Batches of results
Each of these links will
then lead to an entire set of results, listed in order of the percentage of
plagiarism detected in each document (Figure 5). An additional column provides
information on the longest plagiarised string. This is of more significance in
documents with lower percentages, where a short string length may imply a
series of one-line quotes rather than an entirely plagiarised paragraph.
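The ordering and the longest-string column can be illustrated with a short Python sketch. It is not the authors' post-processing program: it assumes the saved results page has already been parsed into a list of (text, flagged) segments per student, which is a hypothetical intermediate format, and it counts lengths in words, which is also an assumption.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, bool]  # (text, was this text flagged as matching a source?)


def summarise(segments: List[Segment]) -> Tuple[int, int]:
    """Return (percentage of words flagged, longest flagged run in words)."""
    total = flagged = longest = run = 0
    for text, is_flagged in segments:
        count = len(text.split())
        total += count
        if is_flagged:
            flagged += count
            run += count  # adjacent flagged segments extend the same run
            longest = max(longest, run)
        elif count:
            run = 0  # any unflagged text ends the current run
    percent = round(100 * flagged / total) if total else 0
    return percent, longest


def rank(results: Dict[str, List[Segment]]) -> List[Tuple[str, int, int]]:
    """List (student, percent, longest run), highest percentage first."""
    rows = [(doc_id,) + summarise(segments) for doc_id, segments in results.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

A document with a low percentage but a long flagged run would stand out in the third column, which is the case the additional column is meant to expose.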

Figure 5 A set of results
As with any of the Turnitin
results, these links are colour coded according to the level of plagiarism,
with the red links here being more than 50% plagiarised. The user may then
follow individual links to each separate student document and easily interpret
the computer’s assessment of the results given. Links outlined in yellow are at
least 10% plagiarised, which may or may not require further action, depending
on the nature of the plagiarism.
Results
The results obtained from
early experiments with this batch processing system have been highly
favourable. The documents of approximately 70 students were used in these
tests. Each test selected a different sample of text from the documents,
varying in both size and position: 10%, 30% or 50% of the document starting
near the beginning, or 10% taken towards the end of the document, with the
100% test acting as an experimental control.
The variation appeared to
have little effect on the overall outcome (Table 1), with a consistent number
of seriously plagiarised documents found in each test, roughly twice the number
detected by the lecturer marking the documents originally. Looking at the data
as a whole it would appear that the odds of catching a student who had
significantly plagiarised a document would be quite high.
Table 1 Proportions of plagiarised documents
| Test | Percentage text used | Position of text | Plagiarised: over 50% | Plagiarised: 30%-50% | Plagiarised: 10%-30% | Plagiarised: 1%-10% | Plagiarised: 0%-1% |
|------|----------------------|------------------|-----------------------|----------------------|----------------------|---------------------|--------------------|
| 1 | 10 | 10% in | 3 | 1 | 8 | 5 | 57 |
| 2 | 30 | 10% in | 6 | 0 | 7 | 8 | 53 |
| 3 | 50 | 10% in | 5 | 1 | 6 | 12 | 50 |
| 4 | 10 | 80% in | 4 | 1 | 10 | 11 | 48 |
| 5 | 100 (control) | - | 4 | 3 | 11 | 13 | 43 |
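The counts in each row of Table 1 are tallies of per-document scores falling into the five bands. A small Python sketch of such a tally is given below. The treatment of scores that fall exactly on a band edge is an assumption (only "over 50%" is clearly strict), and the names `band` and `tally` are illustrative.

```python
from collections import Counter
from typing import Iterable, List

BANDS = ["over 50%", "30%-50%", "10%-30%", "1%-10%", "0%-1%"]


def band(percent: float) -> str:
    """Assign a score to a Table 1 band (edge handling is an assumption)."""
    if percent > 50:
        return BANDS[0]
    if percent >= 30:
        return BANDS[1]
    if percent >= 10:
        return BANDS[2]
    if percent >= 1:
        return BANDS[3]
    return BANDS[4]


def tally(scores: Iterable[float]) -> List[int]:
    """Count scores per band, in the column order used by Table 1."""
    counts = Counter(band(p) for p in scores)
    return [counts[name] for name in BANDS]
```

For a hypothetical set of scores `[98, 49, 29, 82, 54, 29, 0, 0]`, `tally` gives three documents over 50%, one in 30%-50%, two in 10%-30% and two in the lowest band.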
Despite the relatively
favourable results shown in Table 1, the students as individuals must also be
considered. Whilst it may be acceptable to some extent to catch a consistent
proportion of the worst offenders, it would be preferable to catch all of the
very worst documents. Following the initial successful tests, further tests
were done keeping track of the records of the worst of the plagiarised
documents seen here (Table 2).
Table 2 The results for the worst plagiarised documents (% plagiarised)
| Test | Stud53 | Stud9 | Stud25 | Stud17 | Stud39 | Stud8 | Stud72 | Stud4 |
|------|--------|-------|--------|--------|--------|-------|--------|-------|
| 1 | 98 | 49 | 29 | 82 | 54 | 29 | 0 | 0 |
| 2 | 54 | 86 | 84 | 70 | 66 | 74 | 9 | 0 |
| 3 | 83 | 88 | 17 | 15 | 65 | 60 | 51 | 0 |
| 4 | 91 | 98 | 94 | 23 | 0 | 6 | 11 | 54 |
| 5 | 86 | 83 | 79 | 23 | 39 | 63 | 32 | 11 |
Each of the 8 students shown
here appeared at some point in the top (over 50%) boundary of at least one of
tests 1-4. It can be seen from test 5 (the control) that there were no further
students with over 50% of their document plagiarised. Only one student not
included here plagiarised over 30% of their document, ultimately scoring higher
than several of the students shown here; this student consistently appeared in
the 30-50% boundary and would also have been detected. Interestingly, the
results shown for student 25 on test 3 (50% of the document) and the control
are clearly impossible: even if the entire unsampled half of the document were
plagiarised, a score of 17% on the sampled half could not produce 79% overall.
Retesting gave the same verdict, even with manual verification of the sampling
program, so it can only be assumed that there are small variations in the
testing of documents within Turnitin itself.
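The reason the student 25 figures cannot both hold is a matter of simple bounds, sketched here in Python. The sketch assumes that the percentages are proportions of text length, that the sample is exactly the stated fraction of the document, and that the sampled text is contained in the full document; the function names are illustrative.

```python
def min_full_percent(sample_percent: float, sample_fraction: float) -> float:
    """Lowest whole-document score: nothing outside the sample is plagiarised."""
    return sample_percent * sample_fraction


def max_full_percent(sample_percent: float, sample_fraction: float) -> float:
    """Highest whole-document score: everything outside the sample is plagiarised."""
    return sample_percent * sample_fraction + 100.0 * (1.0 - sample_fraction)


def is_consistent(sample_percent: float, sample_fraction: float,
                  full_percent: float) -> bool:
    """Check whether a sample score and a whole-document score can both hold."""
    low = min_full_percent(sample_percent, sample_fraction)
    high = max_full_percent(sample_percent, sample_fraction)
    return low <= full_percent <= high
```

With a 50% sample scoring 17%, the whole document can score at most 0.5 × 17 + 50 = 58.5%, so the control value of 79% falls outside the possible range. By contrast, student 53's 83% on the same test allows anything from 41.5% to 91.5%, which is compatible with the control value of 86%.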
Figure 6 shows a graph of
the worst documents’ results for the 30% sample, compared to the actual results
from the control. A diagonal line would have implied that the 30% sample was
completely accurate; as can be seen, there is a clear dip in this graph where
the 30% result would have shown the documents to be far more heavily
plagiarised than they were in reality.
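The same comparison can be made numerically from the Table 2 figures for test 2 (the 30% sample) and test 5 (the control). The Python sketch below simply subtracts one from the other; the function name and the choice of a plain difference as the measure are illustrative assumptions.

```python
from typing import Dict, List

STUDENTS = ["Stud53", "Stud9", "Stud25", "Stud17", "Stud39", "Stud8", "Stud72", "Stud4"]
SAMPLE_30 = [54, 86, 84, 70, 66, 74, 9, 0]   # Table 2, test 2
CONTROL = [86, 83, 79, 23, 39, 63, 32, 11]   # Table 2, test 5


def deviations(names: List[str], sample: List[int],
               control: List[int]) -> Dict[str, int]:
    """Sample score minus control score; positive means the sample overstated."""
    return {name: s - c for name, s, c in zip(names, sample, control)}


diffs = deviations(STUDENTS, SAMPLE_30, CONTROL)
worst_overestimate = max(diffs, key=diffs.get)
print(worst_overestimate, diffs[worst_overestimate])  # Stud17 47
```

On these figures the largest overstatement is for student 17, whose 30% sample scored 70% against a control value of 23%.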

Figure 6 Comparing results for the worst plagiarised documents
Results for the remaining
tests, when compared to the control, showed similar anomalies, with one or two
documents varying notably from their expected value. It is clear that with only
one sample taken, it is difficult to guarantee that detection is completely
accurate: most of these students show a single result that is far removed from
their actual plagiarism score. Nonetheless, as a guide for educators, any one
of these results will highlight documents for later scrutiny. Regularly taking
more than one sample would require the educator to put in more time than should
be required of them, and taking a sample of 100%, even for relatively short
documents such as these, would require a prohibitively long testing period (in
this case 60 hours, compared to an average of 3 or 4).
To follow on from these
tests, a selection of the worst documents was taken based on the results of the
30% sample, but taking students with results as low as the 1% boundary, rather
than merely the 50% boundary. These students, making up approximately 1/3 of
the total number of students, were then taken and a 100% sample was resubmitted
to Turnitin. From this sample a total of 59 links were found to have been used
from the web, one greater than the number found in the control sample, which
must by definition have contained all of the possible links.
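The selection step for this second round amounts to a threshold filter over the first-round scores. A minimal Python sketch follows, assuming the first-round results are held as a mapping from anonymous ID to percentage; the function name and the sample scores are hypothetical.

```python
from typing import Dict, List


def select_for_retest(first_round: Dict[str, float],
                      threshold: float = 1.0) -> List[str]:
    """Return IDs scoring at or above `threshold`, highest score first."""
    chosen = [doc_id for doc_id, percent in first_round.items()
              if percent >= threshold]
    return sorted(chosen, key=lambda doc_id: first_round[doc_id], reverse=True)


# Hypothetical first-round scores from a partial sample.
scores = {"StudA": 54.0, "StudB": 0.0, "StudC": 9.0, "StudD": 0.5, "StudE": 86.0}
print(select_for_retest(scores))  # ['StudE', 'StudA', 'StudC']
```

The selected documents would then be resubmitted in full as a second, smaller batch.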
Table 3 Data from the worst students
| Number of students | Plagiarised: over 50% | Plagiarised: 30%-50% | Plagiarised: 10%-30% | Plagiarised: 1%-10% | Plagiarised: 0% |
|--------------------|-----------------------|----------------------|----------------------|---------------------|-----------------|
| 26 | 5 | 1 | 7 | 10 | 3 |
| 74 | 4 | 3 | 11 | 13 | 43 |
As can be seen in Table 3,
the students selected made up the entire high end of the spectrum, and this
second test enabled the weeding out of those few students who may have been
either missed or falsely accused. A more detailed examination of these results
shows that all of the students with plagiarism results of greater than 12% were
detected.
The small variations in numbers
and links found are in fact the result of small changes in Turnitin’s
detection itself, as has previously been seen for student 25 in the 50% sample.
A closer study of this phenomenon has revealed that results from Turnitin are
occasionally inconsistent. Following the discovery of the results for student
25, the student’s document was submitted again, separately as both a 50% and a
100% sample, exactly as before. Once again, the results were shown to be
impossible, with a value of 16% plagiarised found in the 50% sample and 91% for
the 100% sample. We originally assumed this discrepancy to be due to a slow
increase in the number of links available, but in response to our enquiry
Turnitin attributed it to glitches in the webcrawler. The webcrawler in
question is now being upgraded.
Conclusions and Future Work
From the experiments
performed here, it can be concluded that the use of this batch submission
system would indeed be worthwhile, saving a great deal of time for educators
not wishing to rely on student submission. For the class of 70 submitted here,
it is estimated that the educator would have taken approximately 2 minutes to
prepare the work, 1 minute to submit, and 1-2 minutes later to process the data
into a readable form, once a reasonable degree of familiarity had been achieved
with the system. For an equivalent submission performed by hand, it can be
assumed that it would take the educator at least 1 minute per student to select
the required text from the document and submit it with an appropriate title.
This amounts to a minimum of 70 minutes and 490 mouse clicks for this class,
compared to 5 minutes and 20 mouse clicks with batch submission. The results
obtained are also listed in order of percentage plagiarised, starting with the
highest and grouped exclusively by class, making it easy for the educator to
make decisions on the relative levels of plagiarism and to determine any work
which requires further investigation or action.
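The effort comparison can be written out as a small calculation. The Python sketch below uses the figures quoted above; the value of 7 clicks per student is inferred from 490 clicks for 70 students, and treating the batch cost as fixed regardless of class size is an assumption.

```python
from typing import Tuple

BATCH_MINUTES = 5   # prepare, submit and post-process, as estimated above
BATCH_CLICKS = 20


def manual_cost(students: int, minutes_each: float = 1.0,
                clicks_each: int = 7) -> Tuple[float, int]:
    """Minimum time and clicks for submitting each document by hand."""
    return students * minutes_each, students * clicks_each


def savings(students: int) -> Tuple[float, float]:
    """How many times cheaper the batch route is, in time and in clicks."""
    minutes, clicks = manual_cost(students)
    return minutes / BATCH_MINUTES, clicks / BATCH_CLICKS


print(manual_cost(70))  # (70.0, 490)
print(savings(70))      # (14.0, 24.5)
```

For this class the batch route is therefore around 14 times quicker and needs roughly a twenty-fifth of the clicks, under the stated assumptions.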
Testing with, for
example, a 30% sample maintains a good deal of accuracy. For more rigorous
trials, the second round of tests could optionally be applied, using a larger
sample size with fewer students. The automation of this process has yet to be
implemented, but will be included in the next version of the system. It adds
only a small amount of extra effort for the educator, who will only be required
to set a testing threshold and wait for a second Turnitin test, in return for
increased reliability in the testing of the few students who have plagiarised
work, prior to any accusations.
Once the Turnitin batch
submission system is in place, the next step of the project will be to put this
into use alongside CopyCatch [Woolls 02] for the purposes of detecting the
collusion which was overlooked as a result of the batch submission, and ensure
that the steps to be taken in order to batch submit the documents do not
outweigh the benefits of submitting the documents separately. With prolonged
use, this should become clear.
Turnitin is now available
for free (to
References
[Turnitin 02] http://www.turnitin.com
[Woolls 02] David Woolls, How CopyCatch Works. http://www.copycatch.freeserve.co.uk