VSM Based Models and Integration of Exact and Fuzzy Similarity For Improving Detection of External Textual Plagiarism


Nasreen J. Kadhim, Mohannad T. Mohammed




VSM, TF-IDF, TF-ISF, exact similarity, Jaccard similarity, fuzzy similarity


Plagiarism, the act of taking someone else's work (ideas or even words) and presenting it as one's own, has grown rapidly with the explosive growth of the Internet, which makes a massive volume of information effortless to access and reuse. Detecting plagiarism has therefore become necessary in many areas to ensure originality, so that those who intend to plagiarize must instead put considerable effort into producing work based on their own research. This paper proposes a system for detecting textual plagiarism, with models for both the candidate retrieval and the detailed comparison phases. For candidate retrieval, two models are proposed that adopt the vector space model (VSM) as the retrieval model and differ in how they represent text documents. The first model represents documents as vectors of average term tf-isf weights instead of vectors of term tf-idf weights. The second model assigns each term in the document a weight computed as a weighted sum of the term's tf-idf weight and its average tf-isf weight, and uses the resulting vector as the query for retrieval. For detailed comparison, the second phase, a method is proposed that integrates two different similarity measures, exact similarity and fuzzy similarity, into a single similarity measure. Experiments were conducted using PAN-PC-10 as the evaluation dataset. Because the problem addressed in this paper is restricted to extrinsic plagiarism detection in English documents, the experiments used only the portion of the corpus dedicated to extrinsic detection and only documents in English.
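The weighting used by the second retrieval model can be illustrated with a small sketch. The abstract does not state the exact equation, so the mixing coefficient `alpha`, the whitespace tokenizer, the logarithmic idf and isf, and the choice to average tf-isf over all sentences of a document are assumptions made here for illustration, as are all function names.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop-word removal.
    return text.lower().split()


def tf_idf(doc_tokens, corpus_tokens):
    """tf-idf weight of each term of one document, relative to a corpus of token lists."""
    n_docs = len(corpus_tokens)
    doc_sets = [set(d) for d in corpus_tokens]
    weights = {}
    for term, freq in Counter(doc_tokens).items():
        df = sum(1 for s in doc_sets if term in s)
        # Terms unseen in the corpus get weight 0 rather than dividing by zero.
        weights[term] = freq * math.log(n_docs / df) if df else 0.0
    return weights


def avg_tf_isf(sentences):
    """Average tf-isf weight of each term over the sentences of one document."""
    sent_tokens = [tokenize(s) for s in sentences]
    n_sent = len(sent_tokens)
    # Sentence frequency: number of sentences that contain the term.
    sf = Counter(t for toks in sent_tokens for t in set(toks))
    totals = Counter()
    for toks in sent_tokens:
        for term, freq in Counter(toks).items():
            totals[term] += freq * math.log(n_sent / sf[term])
    # Assumption: the average is taken over all sentences of the document.
    return {term: totals[term] / n_sent for term in sf}


def combined_weights(tfidf, tfisf, alpha=0.5):
    """Weighted sum of tf-idf and average tf-isf; alpha is a hypothetical tuning parameter."""
    terms = set(tfidf) | set(tfisf)
    return {t: alpha * tfidf.get(t, 0.0) + (1 - alpha) * tfisf.get(t, 0.0)
            for t in terms}


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Under these assumptions, candidate source documents would be ranked by `cosine` between the weight vector of the suspicious document (the query) and the weight vector of each source document.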
The English extrinsic-detection documents were randomly split into training and testing sets. The training data was used for parameter tuning, whereas the testing data was used to evaluate the performance of the proposed system and to compare it against existing methods. The models proposed for candidate retrieval were evaluated using Precision, Recall, and F-measure. The overall performance of the proposed system was assessed using the five standard PAN measures: Precision, Recall, F-measure, Granularity, and Plagdet. The experimental results show that the proposed system is either comparable to or outperforms other state-of-the-art methods.
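For reference, Plagdet combines the other measures into a single score: in the PAN evaluation framework it is the F-measure divided by log2(1 + Granularity), so a detector that reports one plagiarism case as several fragments is penalized. A minimal sketch, with function names chosen here for illustration:

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision, recall, granularity):
    """F-measure discounted by granularity; granularity is 1 when each case is detected once."""
    if granularity < 1:
        raise ValueError("granularity must be >= 1")
    return f_measure(precision, recall) / math.log2(1 + granularity)
```

With a granularity of 1 the denominator is log2(2) = 1, so Plagdet equals the F-measure; larger granularity values lower the score.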


J. Mech. Cont. & Math. Sci., Vol. 14, No. 3, May-June (2019), pp. 555-578
Copyright reserved © J. Mech. Cont. & Math. Sci.