Moving Beyond Keyword Matching using AI
Objective: Build & compare a traditional Lexical Checker with an advanced AI-powered Semantic Checker
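from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
# emb_a, emb_b: embeddings from model.encode([...]) for the two texts being compared
similarity = cosine_similarity(emb_a, emb_b)
# Result: 73.45% semantic match ✅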
Lexical vs. Semantic Plagiarism
Lexical plagiarism: direct copy-pasting of text. Traditional systems catch this easily via exact string matching.
Semantic plagiarism: paraphrasing, synonym replacement, and restructured sentences that keep the same meaning in different words.
Original: "The quick brown fox jumps over the lazy dog."
Paraphrase: "A fast dark canine leaps above a sleepy hound."
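To make the gap concrete, the short sketch below measures how many distinct words the two example sentences share. It uses Jaccard word overlap as a simple stand-in for a lexical check; this measure and the helper names `word_set` and `jaccard_overlap` are illustrative assumptions, not the method used by the system described here.

```python
import string


def word_set(text):
    """Lowercase, strip punctuation, and return the set of distinct words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


def jaccard_overlap(text_a, text_b):
    """Fraction of distinct words shared by the two texts (0.0 to 1.0)."""
    words_a, words_b = word_set(text_a), word_set(text_b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


original = "The quick brown fox jumps over the lazy dog."
paraphrase = "A fast dark canine leaps above a sleepy hound."

print(jaccard_overlap(original, original))    # identical text: 1.0
print(jaccard_overlap(original, paraphrase))  # no shared words: 0.0
```

The paraphrase shares no words with the original, so any purely word-based check scores it as unrelated even though the meaning is the same.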
The AI backbone of our system
Natural Language Processing (NLP) is the branch of AI that enables computers to process and interpret text and speech.
Cleaning text: removing punctuation, lowercasing, stop-word removal, tokenization
"Hello, World!" → "hello world"
Converting words into numbers: dense or sparse vector representations
"hello" → [0.23, -0.81, 0.44…]
Calculating relationships between vectors: similarity scores, clustering, classification
sim(A, B) → 0.7345
TF-IDF — Term Frequency · Inverse Document Frequency
Counts how often a word appears in a document (Term Frequency)
Weighs it against how rare the word is across all documents (Inverse Document Frequency)
TF-IDF(t,d) = TF(t,d) × log(N / df(t))
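As a worked example of the formula above, here is a minimal pure-Python sketch on a three-document toy corpus. The corpus and the function name `tf_idf` are made up for illustration, and TF is taken as the raw count of the term in the document.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF(t, d) = TF(t, d) * log(N / df(t)), with TF as a raw count."""
    tf = Counter(doc_tokens)[term]
    # df(t): number of documents in the corpus that contain the term
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    if df == 0:
        return 0.0  # term never appears in the corpus
    return tf * math.log(len(corpus_tokens) / df)


# Toy corpus (illustrative only), already lowercased and punctuation-free
corpus = [
    "the fox jumps over the dog",
    "the dog sleeps",
    "the fox runs",
]
corpus_tokens = [doc.split() for doc in corpus]
first_doc = corpus_tokens[0]

print(tf_idf("the", first_doc, corpus_tokens))    # in all 3 docs: 2 * log(3/3) = 0.0
print(tf_idf("jumps", first_doc, corpus_tokens))  # in 1 doc: 1 * log(3/1)
```

A word that appears in every document gets zero weight, while a word unique to one document gets the largest weight. Note that scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from this textbook form.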
Sentence-BERT — Deep Semantic Understanding
Sentence-Bidirectional Encoder Representations from Transformers
Pre-trained on millions of documents, so it captures contextual meaning rather than surface keyword overlap
Maps entire sentences into a dense, multi-dimensional vector space where meaning = proximity
Siamese architecture: passes Doc A and Doc B through identical, weight-sharing networks and compares the resulting embeddings
Cosine Similarity — Measuring Vector Relationships
cos(θ) = (A · B) / (‖A‖ × ‖B‖)
A · B: dot product of the vectors
‖A‖: magnitude of vector A
‖B‖: magnitude of vector B
Why cosine? It measures the angle between vectors, not their magnitude. A 5-page essay and a 1-page summary can still match based on direction, that is, meaning.
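The magnitude-independence can be checked with a few lines of plain Python. This is a minimal sketch; the three vectors are made-up stand-ins for document embeddings, and returning 0.0 for a zero vector is a convention chosen here for illustration.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)


summary = [1.0, 2.0, 3.0]
essay = [5.0, 10.0, 15.0]      # same direction, five times the magnitude
unrelated = [3.0, 0.0, -1.0]   # orthogonal to summary: 3 + 0 - 3 = 0

print(cosine_similarity(summary, essay))      # approximately 1.0
print(cosine_similarity(summary, unrelated))  # 0.0
```

Scaling a vector changes its length but not its direction, so the long "essay" vector still scores approximately 1.0 against the short "summary" vector.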
End-to-end pipeline — from raw text to plagiarism report
Python Source Code — PlagiarismChecker Class
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer


class PlagiarismChecker:
    def __init__(self):
        # ✨ AI Model (Semantic)
        self.sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
        # Baseline Model (Lexical)
        self.tfidf_vectorizer = TfidfVectorizer(stop_words='english')

    def check_tfidf_baseline(self, source_docs, suspicious_doc):
        # Fit on all documents together so they share one vocabulary
        all_docs = [suspicious_doc] + source_docs
        tfidf_matrix = self.tfidf_vectorizer.fit_transform(all_docs)
        # Row 0 is the suspicious document; the remaining rows are the sources
        return cosine_similarity(
            tfidf_matrix[0:1], tfidf_matrix[1:]
        ).flatten()

    def check_semantic_ai(self, source_docs, suspicious_doc):
        # Encode both sides with the same SBERT model
        suspicious_embedding = self.sbert_model.encode([suspicious_doc])
        source_embeddings = self.sbert_model.encode(source_docs)
        # One similarity score per source document
        return cosine_similarity(
            suspicious_embedding, source_embeddings
        ).flatten()
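SentenceTransformer: loads the pre-trained all-MiniLM-L6-v2 model (about 80 MB, 384-dimensional embeddings, optimized for semantic similarity)
TfidfVectorizer: scikit-learn's TF-IDF implementation with English stop-word removal, used for the baseline comparison
cosine_similarity: applied to both models' vectors; returns one score per source document. TF-IDF scores fall in [0, 1]; SBERT scores can range over [-1, 1], though they are usually positive for natural-language text.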
Testing a heavily paraphrased sentence against the original
TF-IDF (lexical): no overlapping keywords detected, so the baseline is completely blind to paraphrasing.
SBERT (semantic): embeddings capture meaning, so synonyms and paraphrasing are detected.
Verdict: SBERT identifies plagiarism that TF-IDF completely misses. A 73.45% semantic similarity score triggers the plagiarism alert.
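The alert step can be sketched as a simple threshold over the per-source scores. The deck does not state the cutoff it uses, so the `threshold=0.70` default below is purely an illustrative assumption, as is the helper name `flag_matches`. The TF-IDF score of 0.0 is what cosine similarity yields when no keywords overlap.

```python
def flag_matches(scores, threshold=0.70):
    """Return (index, score) pairs for sources scoring at or above threshold.

    The 0.70 default is an assumed cutoff for illustration only.
    """
    return [(i, s) for i, s in enumerate(scores) if s >= threshold]


tfidf_scores = [0.0]      # lexical baseline: no shared keywords
sbert_scores = [0.7345]   # semantic score reported above

print(flag_matches(tfidf_scores))  # []
print(flag_matches(sbert_scores))  # [(0, 0.7345)]
```

In practice the cutoff would need tuning against labeled examples, since a value that is too low flags unrelated text on the same topic and one that is too high misses heavier paraphrasing.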
What we built, what comes next
Keyword matching alone is no longer sufficient for academic integrity checks. Deep learning models like SBERT add context-aware semantic analysis that catches paraphrased text which keyword-based systems miss entirely.
Detect if a student translates an article from Kannada or Hindi to English using multilingual SBERT models
Add an adversarial network to detect text generated by LLMs like ChatGPT — the next frontier of academic dishonesty
Deploy as a web API that institutions can integrate directly into their submission portals