In 2014 I started re-writing Plagiarism Guard in Python 3 using the Django framework (after starting off writing it in PHP). Plagiarism Guard was a plagiarism detection service. I mainly did this to teach myself Python (kudos to Learn Python the Hard Way) and also implement some (very!) basic NLP.
I recently came across the source code and, since this was always more of a learning experience for me, I decided to write up broadly how it works and push the whole source code to GitHub in case anyone finds any use in it.
The basic premise for the plagiarism detection was to accept resources in a few different formats (URL being the most popular, but also text files and Office-type documents). The resources were then periodically scanned, and a few fairly unique phrases were pulled out of them. These phrases were then searched for online (using Bing’s search engine, since it was significantly cheaper than Google’s search API!), and Bing’s results became the ‘plagiarism-detected’ candidates. Finally, these candidates were scanned to rule out any false positives and to calculate an approximate duplication score.
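To make the "fairly unique phrases" step more concrete, here is a minimal sketch of one way to pick distinctive search queries from a document. It is not the code from the repo: the function name pick_query_phrases, its parameters and the scoring heuristic (long words that occur rarely in the document score higher) are all assumptions made for illustration.

```python
import re
from collections import Counter


def pick_query_phrases(text, num_phrases=3, words_per_phrase=8):
    """Return a few quoted word windows that look distinctive within `text`.

    Hypothetical heuristic (an assumption, not the project's actual method):
    a window scores higher when its words are long and rare in the document.
    """
    words = re.findall(r"\w+", text)
    if not words:
        return []

    counts = Counter(w.lower() for w in words)

    def score(window):
        # Long words that appear only once contribute the most.
        return sum(len(w) / counts[w.lower()] for w in window)

    if len(words) <= words_per_phrase:
        windows = [words]
    else:
        # Non-overlapping windows, so the phrases come from different passages.
        windows = [
            words[i:i + words_per_phrase]
            for i in range(0, len(words) - words_per_phrase + 1, words_per_phrase)
        ]

    best = sorted(windows, key=score, reverse=True)[:num_phrases]
    # Wrap each phrase in quotes so a search engine can treat it as an exact phrase.
    return ['"{}"'.format(" ".join(window)) for window in best]
```

Each returned string could then be sent to a search API as a single query, with the returned URLs stored as candidates for later checking.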
The files for this project are organised into three main folders:
- /plag/ – this is the bulk of the Django application, so it contains the models, forms, routes, services etc.
  - The /plag/templates/ folder contains the HTML pages covering both the public (unauthenticated) website pages such as the order form and legal documents (under static/), and the account (authenticated) pages (under dynamic/).
  - /plag/templatetags/custom_tags.py contains the Django custom template tags used in various parts of the HTML frontend.
  - /plag/management/commands contains the custom management commands:
    - scan_resources.py chooses ProtectedResource entries which are due to be scanned, and then calls the relevant ‘utility’ methods in /util/getqueriespertype/ to get a few (hopefully distinct) queries from the document/resource. Bing’s search engine API is then called in /util/handlequeries.py to get any potential plagiarism matches for each query. These results are saved back to the DB.
    - post_processing.py then looks at each potential plagiarism match URL, loads up the URL and parses the text content to see whether this is a false positive or not. If it’s a real match, a duplication percentage score is calculated. This then appears on the user’s account.
    - recent_blog_posts.py parses a blog’s RSS feed and saves the latest results to the database, so that the blog results can be shown in a cached/efficient way.
- /PlagiarismGuard/ – these are the standard Django files used to configure and power the application.
- /util/ – as covered a little above, these are a set of ‘utilities’ which perform the bulk of the plagiarism detection work.
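As an illustration of the kind of check post_processing.py performs, the sketch below estimates a duplication percentage by comparing overlapping word sequences ("shingles") between the protected text and a candidate page. This is a commonly used technique swapped in for the example, not necessarily what the repo does; the function names, the shingle size of 5 and the 10% threshold are assumptions.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping `size`-word sequences in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def duplication_percentage(protected_text, candidate_text, size=5):
    """Percentage of the protected text's shingles also found in the candidate."""
    protected = shingles(protected_text, size)
    if not protected:
        return 0.0
    shared = protected & shingles(candidate_text, size)
    return 100.0 * len(shared) / len(protected)


def is_false_positive(protected_text, candidate_text, threshold=10.0):
    """Treat a candidate as a false positive when the overlap is below `threshold` percent.

    The threshold is a hypothetical value chosen for the example.
    """
    return duplication_percentage(protected_text, candidate_text) < threshold
```

Measuring overlap relative to the protected text (rather than the candidate) means a long page that copies one short article in full still scores close to 100%.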
Getting the text out of docx, pptx and txt files was fairly easy as they’re relatively open formats. Webpages weren’t too bad either, since the ‘real’ content tends to be grouped together – hence the approach in url.py of looking at the text surrounding each piece of content to help choose the best candidate snippets.
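To show why docx counts as a relatively open format: a .docx file is a zip archive whose main text lives in word/document.xml, so the standard library is enough to read it. The sketch below is an illustration under that assumption and is not taken from the repo (which may well use a dedicated library); docx_to_text is a hypothetical name.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Return the non-empty paragraphs of a .docx file, joined by newlines."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

This ignores headers, footers and footnotes, which are stored in separate XML parts of the archive.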
PDF and .doc were trickier, but the pdftotext and antiword utilities proved very useful:
- pdftotext can be installed with yum install poppler-utils on Linux distributions that use yum, whilst the Windows build, xpdfbin-win-3.04.zip, can be found online and installed.
- antiword can be installed from winfield for Unix-based systems, however the Windows source for it now returns a 404.
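Both tools are command-line programs, so calling them from Python usually means shelling out and reading their standard output. The sketch below is a rough illustration rather than the repo's code: build_command and extract_text are hypothetical names, and it assumes pdftotext and antiword are on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(path):
    """Return the command line that prints the text of `path` to stdout."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # A "-" output file tells pdftotext to write to stdout.
        return ["pdftotext", str(path), "-"]
    if suffix == ".doc":
        # antiword prints the document text to stdout by default.
        return ["antiword", str(path)]
    raise ValueError("Unsupported file type: {}".format(suffix))


def extract_text(path, timeout=60):
    """Run the matching tool and return its output as a string."""
    command = build_command(path)
    if shutil.which(command[0]) is None:
        raise RuntimeError("{} is not installed or not on PATH".format(command[0]))
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the tool fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the command as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.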
The GitHub repo now contains the requirements.txt file for the project, naturally with the 5+ year-old dependencies which ran on Python 3.4.
This was a fairly fun project overall, and I liked the automatic Django admin panel and also the forms.py idea of generating forms which could be tied directly to models. I’m sure some of the code here isn’t perfect, especially the layout of some of the Python, the lack of tests.py and also the order form jQuery in order.js. But I hope there’s something interesting in this project for you.
Finally, I had previously only worked with PHP (on the scripting language front) but I found Python a significantly better language, in terms of syntax, consistency and syntactic sugar.
A user’s account dashboard, showing general scan results in graphs (powered by Chart.js):
A set of plagiarism results from scanning a particular PDF: