01
Introduction to AI 6th Semester · ECE

Semantic Plagiarism Detection

Moving Beyond Keyword Matching using AI

Objective: Build & compare a traditional Lexical Checker with an advanced AI-powered Semantic Checker

🔍NLP
🤖SBERT
📐Cosine Similarity
TF-IDF
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
similarity = cosine_similarity(emb_a, emb_b)
# Result: 73.45% semantic match ✅
02

The Problem Statement

Lexical vs. Semantic Plagiarism

📋

Lexical Plagiarism

Direct copy-pasting of text. Traditional systems catch this easily via exact string matching.

Easy to Catch ✅
🧠

Semantic Plagiarism

Paraphrasing, synonym replacement, restructured sentences — same meaning, different words.

Hard to Catch ❌
Live Example
Source

"The quick brown fox jumps over the lazy dog."

Submission

"A fast dark canine leaps above a sleepy hound."

TF-IDF Match 0% ❌ Fails
AI Semantic Match ~75% ✅ Detected
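The live example above can be reproduced for the baseline side with a few lines of scikit-learn (a minimal sketch, assuming scikit-learn is installed; this is not the project's full code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source = "The quick brown fox jumps over the lazy dog."
submission = "A fast dark canine leaps above a sleepy hound."

# Fit TF-IDF on both texts; stop words like "a", "the", "over" are dropped
vectorizer = TfidfVectorizer(stop_words='english')
matrix = vectorizer.fit_transform([source, submission])

# The two sentences share no content words, so the lexical score is zero
score = cosine_similarity(matrix[0:1], matrix[1:2])[0][0]
print(f"TF-IDF match: {score:.0%}")  # → TF-IDF match: 0%
```

Every content word was swapped for a synonym, so the TF-IDF vectors are orthogonal and the similarity is exactly 0, which is the failure the semantic checker is built to fix.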
03

Natural Language Processing

The AI backbone of our system

💬

NLP is the branch of AI that gives computers the ability to understand text and spoken words in much the same way human beings can.

The NLP Pipeline
01
🧹

Preprocessing

Cleaning text — removing punctuation, lowercasing, stop-word removal, tokenization

"Hello, World!" → "hello world"
02
🔢

Vectorization

Converting words into mathematical numbers — dense or sparse vector representations

"hello" → [0.23, -0.81, 0.44…]
03
📊

Analysis

Calculating relationships between vectors — similarity scores, clustering, classification

sim(A, B) → 0.7345
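The three pipeline stages can be sketched end to end in plain Python (an illustrative toy, not the project's actual implementation; the stop-word list and count-based vectorizer here are deliberately minimal):

```python
import math
import string

STOP_WORDS = {"the", "a", "an", "is", "over"}  # tiny illustrative stop list

def preprocess(text: str) -> list[str]:
    # Stage 1: lowercase, strip punctuation, tokenize, drop stop words
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def vectorize(tokens: list[str], vocab: list[str]) -> list[float]:
    # Stage 2: map tokens to a count vector over a fixed vocabulary
    return [float(tokens.count(word)) for word in vocab]

def similarity(a: list[float], b: list[float]) -> float:
    # Stage 3: cosine similarity between the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

tokens = preprocess("Hello, World!")
print(tokens)  # → ['hello', 'world']
```

Real systems replace each stage with something stronger (subword tokenizers, TF-IDF or learned embeddings), but the three-step shape stays the same.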
04

The Baseline Method

TF-IDF — Term Frequency · Inverse Document Frequency

Algorithm: TF-IDF

How it works

Counts how often a word appears in a document (Term Frequency)

Weighs it against how rare the word is across all documents (Inverse Document Frequency)

TF-IDF(t,d) = TF(t,d) × log(N / df(t))

✅ Pros

  • Very fast execution
  • Computationally cheap
  • Excellent for keyword search
  • No training required

❌ Cons

  • Blind to context & meaning
  • Cannot handle synonyms
  • Word order ignored
  • "Bank" (river) = "Bank" (money)
🏦 "Bank" (money)
=
🌊 "Bank" (river)
TF-IDF treats these as identical — a critical flaw
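The "bank" flaw is easy to demonstrate (a sketch with illustrative sentences of my own, assuming scikit-learn): TF-IDF rewards the shared surface token even though the meanings are unrelated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two unrelated meanings of "bank" (illustrative sentences)
money = "She deposited her savings at the bank"
river = "They fished from the muddy bank of the river"

matrix = TfidfVectorizer(stop_words='english').fit_transform([money, river])
score = cosine_similarity(matrix[0:1], matrix[1:2])[0][0]

# TF-IDF sees the shared token "bank" and reports a nonzero match,
# even though the sentences are about completely different things
print(f"similarity: {score:.2f}")
```

The inverse situation (the fox/canine example) scores 0% for different words with the same meaning, so the baseline errs in both directions.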
05

The Advanced AI Method

Sentence-BERT — Deep Semantic Understanding

Algorithm: SBERT

Sentence-BERT: BERT (Bidirectional Encoder Representations from Transformers) adapted to produce sentence-level embeddings

🧬

Deep Neural Network

Pre-trained on large text corpora, so it captures contextual meaning that keyword counts miss

🗺️

Semantic Embeddings

Maps entire sentences into a dense, multi-dimensional vector space where meaning = proximity

🔀

Siamese Architecture

Passes Doc A and Doc B through identical networks simultaneously to compare outputs

Document A
BERT Encoder
[0.23, -0.81, 0.44…]
Cosine
Similarity
73.45%
Document B
BERT Encoder
[0.19, -0.76, 0.51…]
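Once both documents have been encoded, the comparison step in the diagram is pure vector math. A sketch using the truncated illustrative vectors shown above (real all-MiniLM-L6-v2 embeddings are 384-dimensional; truncating to 3 values inflates the score, so this will not reproduce the 73.45% figure):

```python
import numpy as np

# Stand-in 3-d embeddings; real SBERT vectors have 384 dimensions
emb_a = np.array([0.23, -0.81, 0.44])
emb_b = np.array([0.19, -0.76, 0.51])

# Both documents pass through the *same* encoder; comparing the
# outputs is just cosine similarity between the two vectors
score = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(f"{score:.4f}")
```

The Siamese design means encoding is done once per document; any pair can then be compared in microseconds without re-running the network.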
06

The Mathematical Engine

Cosine Similarity — Measuring Vector Relationships

Core Formula
$$\text{Cosine Similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}$$
A · B : Dot product of vectors
‖A‖ : Magnitude of vector A
‖B‖ : Magnitude of vector B

Angle θ | Cosine | Meaning
0°      | 1.0    | Identical
45°     | 0.71   | Similar
90°     | 0.0    | Unrelated
💡

Why Cosine? It measures the angle between vectors, not the magnitude. A 5-page essay and a 1-page summary can still match based on direction/meaning.
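The angle table and the magnitude-invariance point can both be checked with a few lines of NumPy (a minimal sketch, assuming NumPy is available):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity = (A · B) / (‖A‖ ‖B‖)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The angle table: 0° → 1.0, 45° → ~0.71, 90° → 0.0
print(cosine([1, 0], [1, 0]))            # → 1.0
print(round(cosine([1, 0], [1, 1]), 2))  # → 0.71
print(cosine([1, 0], [0, 1]))            # → 0.0

# Magnitude invariance: scaling a vector 5x leaves the score unchanged,
# which is why a long essay and a short summary can still match
print(cosine([5, 0], [1, 0]))            # → 1.0
```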

07

System Architecture

End-to-end pipeline — from raw text to plagiarism report

📄
Input Source Documents & Student Submission (Raw Text)
✏️
Text Pre-processing Tokenization · Lowercasing · Stop-word Removal
Baseline Model
🔗
TF-IDF Vectorization Statistical Word Frequency
📐
Cosine Similarity Calculation
Advanced AI Model
🤖
Sentence-BERT Encoding Contextual Semantic Embeddings
📐
Cosine Similarity Calculation
⚖️
Comparison Logic Threshold evaluation & scoring
📊
Plagiarism Report Displays both TF-IDF % and SBERT % scores with color-coded alerts
08

Core Implementation

Python Source Code — PlagiarismChecker Class

plagiarism_checker.py
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

class PlagiarismChecker:

    def __init__(self):
        # ✨ AI Model (Semantic)
        self.sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
        # Baseline Model (Lexical)
        self.tfidf_vectorizer = TfidfVectorizer(stop_words='english')

    def check_tfidf_baseline(self, source_docs, suspicious_doc):
        all_docs = [suspicious_doc] + source_docs
        tfidf_matrix = self.tfidf_vectorizer.fit_transform(all_docs)
        return cosine_similarity(
            tfidf_matrix[0:1], tfidf_matrix[1:]
        ).flatten()

    def check_semantic_ai(self, source_docs, suspicious_doc):
        suspicious_embedding = self.sbert_model.encode([suspicious_doc])
        source_embeddings = self.sbert_model.encode(source_docs)
        return cosine_similarity(
            suspicious_embedding, source_embeddings
        ).flatten()
SentenceTransformer

Loads the pre-trained all-MiniLM-L6-v2 model — 80MB, 384-dim embeddings, optimized for semantic similarity

TfidfVectorizer

Sklearn's TF-IDF implementation with English stop-word removal for the baseline comparison

cosine_similarity()

Applied to both models; the score is always in [0, 1] for non-negative TF-IDF vectors, and in practice falls in roughly the same range for SBERT embeddings (though cosine similarity can in principle be negative)
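A quick usage sketch of the baseline path (standalone here so it runs without the ~80MB SBERT model download; calling the class's `check_semantic_ai` works the same way, just swapping the vectorizer for `sbert_model.encode`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Baseline half of PlagiarismChecker, extracted as a function
def check_tfidf_baseline(source_docs, suspicious_doc):
    all_docs = [suspicious_doc] + source_docs
    matrix = TfidfVectorizer(stop_words='english').fit_transform(all_docs)
    # Row 0 is the suspicious doc; compare it against every source row
    return cosine_similarity(matrix[0:1], matrix[1:]).flatten()

sources = [
    "The quick brown fox jumps over the lazy dog.",
    "Plagiarism detection compares student submissions to known sources.",
]
scores = check_tfidf_baseline(sources, "The quick brown fox jumps over the lazy dog.")
print(scores.round(2))  # exact copy scores 1.0 against its source, ~0 elsewhere
```

Each entry in the returned array is the similarity of the submission to one source document, which is what the report stage turns into color-coded alerts.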

09

Demonstration & Results

Testing a heavily paraphrased sentence against the original

Original Source
"The quick brown fox jumps over the lazy dog."
VS
Student Submission
"A fast dark colored canine leaps above a sleepy hound."
Baseline (TF-IDF) ❌ FAILS
0%

No overlapping keywords detected. Completely blind to paraphrasing.

AI Semantic (SBERT) ✅ DETECTED
73.45%

Semantic embeddings capture meaning — synonyms and paraphrasing detected.

🎯

Verdict: SBERT successfully identifies plagiarism that TF-IDF completely misses — a 73.45% semantic similarity score triggers the plagiarism alert.

10

Conclusion & Future Scope

What we built, what comes next

🏁 Conclusion

Traditional keyword matching is no longer sufficient for modern academic integrity. Deep learning models like SBERT provide robust, context-aware semantic analysis that catches paraphrasing rule-based systems miss entirely.

TF-IDF Lexical Only
SBERT Semantic + Context

🚀 Future Upgrades

01

Cross-Lingual Plagiarism

Detect if a student translates an article from Kannada or Hindi to English using multilingual SBERT models

02

AI-Generated Text Detection

Add an adversarial network to detect text generated by LLMs like ChatGPT — the next frontier of academic dishonesty

03

Real-time Web Integration

Deploy as a web API that institutions can integrate directly into their submission portals

Thank You
Introduction to AI · 6th Semester ECE