Batch Plagiarism Detection with Turnitin

 

Draft Report, 9/8/02

Note: not for distribution without prior author consent

 

Angie Chandler and Lynne Blair

Computing Department, Lancaster University, Lancaster, LA1 4YR

 

Email contact: lb@comp.lancs.ac.uk

 

Abstract

Turnitin.com is one of the most widely recognised plagiarism detection sites on the market, now used by approximately “5 million students and educators” [Turnitin 02]. The effectiveness of this service is not in question; the concern of this paper is instead the interface through which all documents must be submitted. For educators who wish to submit large quantities of student work without relying on the students themselves to submit it, using Turnitin would quickly become so time consuming as to be ineffectual, unless it was limited to either random samples of student text or already suspicious documents. This paper therefore addresses the submission and interpretation of multiple student texts as a single entity, in order to encourage thorough plagiarism testing.

 

Introduction

Turnitin.com is a widely renowned plagiarism detector for free text that will detect not only plagiarism from the web but also collusion between students. Detecting collusion can in turn lead indirectly to the detection of plagiarism from a separate external source, such as a textbook. The system provides a simple cut and paste interface where the user must paste their document into a text area and then submit it for testing (Figure 1).

 

Figure 1 Turnitin submission

The testing process itself can then take up to 24 hours, but will return with all plagiarised areas of the text clearly highlighted and a comprehensive list of all links from which documents may have been copied (Figure 3). Each document tested will be listed on the user’s personal documents page, with an overall similarity rating and a link to the document itself (Figure 2).

 

 

 

Figure 2 Turnitin list of results

 

Figure 3 Turnitin links and document

The links shown may point either to web pages or to documents written by other students, whether from within this group or from another set of unrelated submissions. For documents submitted by another user, however, that user must first give explicit permission before the documents can be viewed.

 

Batch Submission

After even a very few uses of the Turnitin system it becomes clear that the system is not intended for batch submission of a class of students. Turnitin itself gets round this by giving each student on a course a login name and enabling them to submit their own work directly. This brings its own problems, requiring the educator to view the class results and ensure that the work submitted to Turnitin matches the final work submitted for marking. Whilst in the right circumstances this may still provide an effective method of processing, for the purposes of the authors it was unsatisfactory. Turnitin itself was clearly an effective choice of plagiarism detector, but the submission process was far too time consuming, to the extent that the plagiarism detection service itself might not have been worthwhile. However, with a simple series of pre- and post-submission processing programs it is possible to submit an entire batch of student documents simultaneously. This inevitably makes collusion checking through Turnitin impossible, a point discussed further in Conclusions and Future Work.

 

The processing for batch submission must be done in a series of simple stages. First of all, the documents must all be converted to text format in order to allow the program itself to read them, rather than requiring the user to edit the documents by hand through Word or other word processors. For the purposes of this initial trial, the conversion is only carried out on Word documents. In our samples these comprise the vast majority of the texts submitted, leaving the remaining 2% to be either converted by hand or ignored until the initial processing program has been extended.

 

Following the conversion of the documents to text, the user must then run a second program to concatenate the files into a single text document. This is done by taking anonymised samples of every document and marking the separation point between each student. The size of sample taken can be easily adjusted, ranging from the entire document down to just a tiny sample of the available text. The optimal size of the sample will necessarily depend on the type of submission to be processed. For example, final year projects will largely only require plagiarism testing on the background section of the report, whereas a report on a given subject may be plagiarised throughout and could be tested at any point. Anonymisation is done automatically, and is necessary for legal purposes.
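The concatenation stage described above can be illustrated with a minimal sketch. It is not the authors' program: the separator format, the function names `take_sample` and `build_batch`, and the choice to measure samples in words are all assumptions made for illustration. Anonymisation is represented here only by replacing file names with labels of the form `StudN`; removing names from inside the text itself is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical marker written between students; the real format is not given.
SEPARATOR = "=====BATCH-SEPARATOR {label}====="


def take_sample(text: str, fraction: float, start: float) -> str:
    """Return a contiguous run of words covering `fraction` of the text,
    beginning `start` of the way through (both values between 0 and 1)."""
    words = text.split()
    if not words:
        return ""
    size = min(len(words), max(1, round(len(words) * fraction)))
    # Clamp the start so the sample never runs past the end of the document.
    first = min(int(len(words) * start), len(words) - size)
    return " ".join(words[first:first + size])


def build_batch(documents: Dict[str, str], fraction: float,
                start: float) -> Tuple[str, Dict[str, str]]:
    """Concatenate anonymised samples into one text.

    Returns the joint text and a label -> original name mapping, which
    is kept locally and never submitted."""
    parts: List[str] = []
    key: Dict[str, str] = {}
    for number, name in enumerate(sorted(documents), start=1):
        label = f"Stud{number}"
        key[label] = name
        parts.append(SEPARATOR.format(label=label))
        parts.append(take_sample(documents[name], fraction, start))
    return "\n".join(parts), key
```

Calling `build_batch(documents, 0.3, 0.1)` would correspond to a 30% sample taken 10% of the way into each document, and `build_batch(documents, 1.0, 0.0)` to submitting every document in full.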

 

Once the joint file has been generated for submission, the user must then log in to Turnitin and submit the entire text file as any normal file would be submitted, ensuring that an appropriate title is also added. This will then be processed by Turnitin over the course of up to 24 hours. The results will appear exactly as for any other submission, with a link to the document maintained on the user’s documents page.

 

The final stage of the user’s task is to save the results in order to process them back into an easily interpretable format, as they would have been had the documents all been submitted separately. This simply requires the user to save the source file (i.e. the html itself) of the results page to their computer and to run the final processing program to reformat the results. If all of the available results are stored within a single directory, the program will also generate a simple list of all available sets of results, mimicking the content of Turnitin as closely as possible (Figure 4).

 

Figure 4 Batches of results

Each of these links will then lead to an entire set of results, listed in order of the percentage of plagiarism detected in each document (Figure 5). An additional column gives the length of the longest plagiarised string. This is of more significance for documents with lower percentages, where a short string length may imply a series of one-line quotes rather than an entirely plagiarised paragraph.

 

Figure 5 A set of results

As with any of the Turnitin results, these links are colour coded according to the level of plagiarism: red links here are more than 50% plagiarised, and links outlined in yellow are at least 10% plagiarised, which may or may not require further action, depending on the nature of the plagiarism. The user may then follow individual links to each separate student document and easily interpret the computer’s assessment of the results given.
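The per-student summary described in this section (percentage plagiarised, longest plagiarised string, colour band, sorted order) can be sketched as follows. The layout of Turnitin's results page is not described here, so the sketch starts from an assumed intermediate form: one list of booleans per student, with `True` for each word that was highlighted. The function names, the use of words as the unit of length, and the `"other"` band for results under 10% are illustrative assumptions only.

```python
from typing import Dict, List, Tuple


def summarise(flags: List[bool]) -> Tuple[float, int]:
    """Return (percentage of words flagged, longest run of flagged words)."""
    longest = current = 0
    for flagged in flags:
        current = current + 1 if flagged else 0
        longest = max(longest, current)
    percent = 100.0 * sum(flags) / len(flags) if flags else 0.0
    return percent, longest


def colour(percent: float) -> str:
    """Band a result using the two thresholds given in the text."""
    if percent > 50:
        return "red"
    if percent >= 10:
        return "yellow"
    return "other"  # colours below 10% are not specified in the text


def rank(batch: Dict[str, List[bool]]) -> List[Tuple[str, float, int, str]]:
    """Rows of (label, percent, longest run, colour), highest percent first."""
    rows = []
    for label, flags in batch.items():
        percent, longest = summarise(flags)
        rows.append((label, percent, longest, colour(percent)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Reporting the longest run alongside the percentage is what allows a document with scattered one-line quotes to be told apart from one with a single long copied passage at the same overall percentage.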

 

Results

The results obtained from early experiments with this batch processing system have been highly favourable. The work of approximately 70 students was checked in a series of tests, each of which used a different sample of text from the documents. The samples ranged in both size and position, from 10% of the document taken near the start, to 50% near the start, or 10% towards the end of the document, with a 100% test acting as an experimental control.

 

The variation appeared to have little effect on the overall outcome (Table 1), with a consistent number of seriously plagiarised documents found in each test, roughly twice the number detected by the lecturer marking the documents originally. Looking at the data as a whole it would appear that the odds of catching a student who had significantly plagiarised a document would be quite high.

Table 1 Proportions of plagiarised documents

| Test | Percentage text used | Position of text | Plagiarised: over 50% | Plagiarised: 30%-50% | Plagiarised: 10%-30% | Plagiarised: 1%-10% | Plagiarised: 0%-1% |
|---|---|---|---|---|---|---|---|
| 1 | 10 | 10% in | 3 | 1 | 8 | 5 | 57 |
| 2 | 30 | 10% in | 6 | 0 | 7 | 8 | 53 |
| 3 | 50 | 10% in | 5 | 1 | 6 | 12 | 50 |
| 4 | 10 | 80% in | 4 | 1 | 10 | 11 | 48 |
| 5 | 100 (control) | - | 4 | 3 | 11 | 13 | 43 |

 

Despite the relatively favourable results shown in Table 1, the students as individuals must also be considered. Whilst it may be acceptable to some extent to catch a consistent proportion of the worst offenders, it would be preferable to catch all of the very worst documents. Following the initial successful tests, further tests were done keeping track of the records of the worst of the plagiarised documents seen here (Table 2).

 

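Table 2 The results for the worst plagiarised documents

| Test | Stud53 | Stud9 | Stud25 | Stud17 | Stud39 | Stud8 | Stud72 | Stud4 |
|---|---|---|---|---|---|---|---|---|
| 1 | 98 | 49 | 29 | 82 | 54 | 29 | 0 | 0 |
| 2 | 54 | 86 | 84 | 70 | 66 | 74 | 9 | 0 |
| 3 | 83 | 88 | 17 | 15 | 65 | 60 | 51 | 0 |
| 4 | 91 | 98 | 94 | 23 | 0 | 6 | 11 | 54 |
| 5 | 86 | 83 | 79 | 23 | 39 | 63 | 32 | 11 |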

 

Each of the 8 students shown here appeared at some point in the top (over 50%) band of at least one of tests 1-4. It can be seen from test 5 (the control) that there were no further students with over 50% of their document plagiarised. Only one student not included here plagiarised over 30% of their document, ultimately scoring higher than several of the students shown; this student consistently appeared in the 30-50% band and would also have been detected. Interestingly, the results shown for student 25 on test 3 (50% of the document) and on the control are clearly impossible when taken together. Retesting gave the same verdict, even with manual verification of the sampling program, so it can only be assumed that there are small variations in the testing of documents within Turnitin itself.
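The reason the two figures for student 25 cannot both hold can be made concrete with a little arithmetic. The sketch below assumes that both percentages measure the same thing (the share of the submitted text that is flagged), that the sample is a contiguous part of the full document, and that detection is deterministic; the function name is illustrative. Under those assumptions, at most the unsampled part of the document can hide plagiarised text, which gives a lower bound on what the sample must show.

```python
def min_flagged_in_sample(full_percent: float, sample_percent: float) -> float:
    """Lower bound on the percentage of a sample that must be flagged.

    full_percent:   percentage of the whole document flagged (the control)
    sample_percent: percentage of the document covered by the sample
    """
    unsampled = 100.0 - sample_percent
    # Flagged text that cannot fit in the unsampled part must lie in the sample.
    forced = max(0.0, full_percent - unsampled)
    return 100.0 * forced / sample_percent


# Student 25 in Table 2: 79% in the control, 17% in the 50% sample (test 3).
bound = min_flagged_in_sample(79, 50)
print(f"A 50% sample should show at least {bound:.0f}% flagged")
```

With 79% of the full document flagged, even if the entire unsampled half were plagiarised, the sampled half would still have to be well over half plagiarised, far above the 17% reported for test 3.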

 

Figure 6 shows a graph of the worst documents’ results for the 30% sample, compared to the actual results from the control. A diagonal line would have implied that the 30% sample was completely accurate; as can be seen, there is a clear dip in this graph where the 30% result would have shown the documents to be far more heavily plagiarised than they were in reality.

 

Figure 6 Comparing results for the worst plagiarised documents

Results for the remaining tests, when compared to the control, showed similar anomalies, with one or two documents varying notably from their expected value. It is clear that with only one sample taken it is difficult to guarantee that detection is completely accurate: most of these students show a single result that is far removed from their actual plagiarism score. Nonetheless, as a guide for educators, any one of these results will highlight documents for later scrutiny. Regularly taking more than one sample would require the educator to put in more time than should be required of them, and taking a sample of 100%, even for relatively short documents such as these, would require a prohibitively long testing period (in this case 60 hours compared to an average of 3 or 4).

 

To follow on from these tests, a selection of the worst documents was taken based on the results of the 30% sample, but including students with results as low as the 1% boundary, rather than merely the 50% boundary. For these students, who made up approximately 1/3 of the total number of students, a 100% sample was resubmitted to Turnitin. From this sample a total of 59 links were found to have been used from the web, one greater than the number found in the control sample, which must by definition have contained all of the possible links.

Table 3 Data from the worst students

| Number of students | Plagiarised: over 50% | Plagiarised: 30%-50% | Plagiarised: 10%-30% | Plagiarised: 1%-10% | Plagiarised: 0% |
|---|---|---|---|---|---|
| 26 | 5 | 1 | 7 | 10 | 3 |
| 74 | 4 | 3 | 11 | 13 | 43 |

As can be seen in Table 3, the students selected made up the entire high end of the spectrum, and this second test enabled the weeding out of those few students who may have been either missed or falsely accused. A more detailed examination of these results shows that all of the students with plagiarism results of greater than 12% were detected.

The small variations in the numbers of documents and links found are in fact the result of small changes in Turnitin’s detection itself, as previously seen for student 25 in the 50% sample. A closer study of this phenomenon has revealed that results from Turnitin are occasionally inconsistent. Following the discovery of the results for student 25, the student’s document was submitted again, separately, as both a 50% and a 100% sample, exactly as before. Once again the results were impossible, with 16% of the 50% sample and 91% of the 100% sample reported as plagiarised. We originally assumed the discrepancy to be due to a slow increase in the number of links available; in response to our enquiry, however, Turnitin attributed it to glitches in its webcrawler, which is now being upgraded.

 


Conclusions and Future Work

From the experiments performed here, it can be concluded that the use of this batch submission system would indeed be worthwhile, saving a great deal of time for educators not wishing to rely on student submission. For the class of 70 submitted here, it is estimated that the educator would take approximately 2 minutes to prepare the work, 1 minute to submit it, and a further 1-2 minutes later to process the data into a readable form, once a reasonable degree of familiarity with the system had been achieved. For an equivalent submission performed by hand, it can be assumed that the educator would need at least 1 minute per student to select the required text from the document and submit it with an appropriate title. This amounts to a minimum of 70 minutes and 490 mouse clicks for this class, compared to 5 minutes and 20 mouse clicks with batch submission. The results obtained are also listed in order of percentage plagiarised, starting with the highest, and grouped exclusively by class, making it easy for the educator to judge the relative levels of plagiarism and identify any work which requires further investigation or action.

 

Testing with, for example, a 30% sample retains a good deal of accuracy. For more rigorous trials, the second round of tests could optionally be applied, using a larger sample size on fewer students. The automation of this process has yet to be implemented, but will be included in the next version of the system. It will add only a small amount of extra effort for the educator, who will only be required to set a testing threshold and wait for a second Turnitin test, in return for increased reliability in testing the few students who have plagiarised work, prior to any accusations.
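Since this second round has not yet been automated, the following is only a sketch of how the threshold step might look, not a description of the planned implementation. The function name `select_for_retest` is hypothetical; the sample data are the test 2 (30% sample) figures for the eight students in Table 2, and the threshold of 1% follows the experiment described in the Results.

```python
from typing import Dict, List


def select_for_retest(sample_results: Dict[str, float],
                      threshold: float) -> List[str]:
    """Return the labels whose sampled result is at or above `threshold`,
    highest result first, for resubmission as a 100% sample."""
    chosen = [label for label, percent in sample_results.items()
              if percent >= threshold]
    return sorted(chosen, key=lambda label: sample_results[label],
                  reverse=True)


# Test 2 (30% sample) results for the eight students listed in Table 2.
test2 = {"Stud53": 54, "Stud9": 86, "Stud25": 84, "Stud17": 70,
         "Stud39": 66, "Stud8": 74, "Stud72": 9, "Stud4": 0}

print(select_for_retest(test2, threshold=1))
```

A low threshold keeps the second submission small while still including borderline students, whose full documents can then be checked before any accusation is made.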

 

Once the Turnitin batch submission system is in place, the next step of the project will be to put it into use alongside CopyCatch [Woolls 02], in order to detect the collusion which is overlooked as a result of the batch submission, and to ensure that the steps required to batch submit the documents do not outweigh its benefits over submitting the documents separately. With prolonged use, this should become clear.

 

Turnitin is now available for free (to UK universities) from http://www.jisc.ac.uk - arranged through Northumbria University.

 

References

[Turnitin 02] http://www.turnitin.com

[Woolls 02] How CopyCatch Works, David Woolls. http://www.copycatch.freeserve.co.uk