Introduction to AI · 6th Semester · ECE

Semantic Plagiarism Detection

Moving Beyond Keyword Matching using AI

Objective: Build & compare a traditional Lexical Checker with an advanced AI-powered Semantic Checker

🔍 NLP

🤖 SBERT

📐 Cosine Similarity

⚡ TF-IDF

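from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
# text_a and text_b are the two documents being compared
emb_a, emb_b = model.encode([text_a]), model.encode([text_b])
similarity = cosine_similarity(emb_a, emb_b)[0][0]
# Result: 73.45% semantic match ✅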

02

The Problem Statement

Lexical vs. Semantic Plagiarism

📋

Lexical Plagiarism

Direct copy-pasting of text. Traditional systems catch this easily via exact string matching.

Easy to Catch ✅

🧠

Semantic Plagiarism

Paraphrasing, synonym replacement, restructured sentences — same meaning, different words.

Hard to Catch ❌

Live Example

Source

"The quick brown fox jumps over the lazy dog."

→

Submission

"A fast dark canine leaps above a sleepy hound."

TF-IDF Match 0% ❌ Fails

AI Semantic Match ~75% ✅ Detected
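The 0% lexical score can be checked by hand, because the two sentences share no words at all. Below is a minimal sketch, assuming a simple regex tokenizer and Jaccard overlap as a stand-in for lexical similarity. Neither is specified on the slide, and `tokenize` and `jaccard` are illustrative names.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of distinct words common to both texts (0.0 to 1.0)."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

source = "The quick brown fox jumps over the lazy dog."
submission = "A fast dark canine leaps above a sleepy hound."

shared = tokenize(source) & tokenize(submission)
print(shared)                       # empty set: no word appears in both
print(jaccard(source, submission))  # 0.0
```

Any method that scores only shared surface tokens has nothing to work with here, which is why the lexical baseline reports no match.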

03

Natural Language Processing

The AI backbone of our system

💬

NLP is the branch of AI that gives computers the ability to understand text and spoken words in much the same way human beings can.

The NLP Pipeline

01

🧹

Preprocessing

Cleaning text — removing punctuation, lowercasing, stop-word removal, tokenization

"Hello, World!" → "hello world"

02

🔢

Vectorization

Converting text into numeric vectors, using either dense or sparse representations

"hello" → [0.23, -0.81, 0.44…]

03

📊

Analysis

Calculating relationships between vectors — similarity scores, clustering, classification

sim(A, B) → 0.7345
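The preprocessing step of the pipeline can be sketched in a few lines of standard-library Python. The stop-word list here is a tiny hypothetical one chosen for illustration, and `preprocess` is an assumed helper name, not part of the project code.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "over", "and"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(" ".join(preprocess("Hello, World!")))  # hello world
print(preprocess("The quick brown fox jumps over the lazy dog."))
```

Sentence-transformer models typically run their own tokenizer on raw text, so this kind of cleaning mainly matters for the TF-IDF path.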

04

The Baseline Method

TF-IDF — Term Frequency · Inverse Document Frequency

Algorithm: TF-IDF

How it works

Counts how often a word appears in a document (Term Frequency)

Weighs it against how rare the word is across all documents (Inverse Document Frequency)

TF-IDF(t,d) = TF(t,d) × log(N / df(t))
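The formula above can be worked through directly. The sketch below assumes raw term counts for TF and whitespace tokenization; the three-document corpus and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using TF(t,d) * log(N / df(t))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df(t): number of documents that contain term t at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["the fox jumps", "the dog sleeps", "the fox sleeps"]
weights = tf_idf(docs)
for w in weights:
    print({t: round(v, 3) for t, v in w.items()})
```

A word that appears in every document ("the") gets weight 0 because log(3/3) = 0, while a word unique to one document ("jumps") gets the largest weight. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this textbook form.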

✅ Pros

Very fast execution
Computationally cheap
Excellent for keyword search
No training required

❌ Cons

Blind to context & meaning
Cannot handle synonyms
Word order ignored
"Bank" (river) = "Bank" (money)

🏦 "Bank" (money)

=

🌊 "Bank" (river)

TF-IDF treats these as identical — a critical flaw
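The homonym problem follows from how a bag-of-words model stores text. A minimal sketch, assuming plain word counts stand in for the lexical representation (the two example sentences and `bag_of_words` are illustrative, not from the project):

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercase word; this is all a lexical model sees."""
    return Counter(text.lower().split())

money = bag_of_words("she deposited cash at the bank")
river = bag_of_words("he fished from the river bank")

# Both sentences contribute to the same 'bank' entry: the surrounding
# words that disambiguate the sense are not attached to the count.
print(money["bank"], river["bank"])     # 1 1
print(sorted(set(money) & set(river)))  # ['bank', 'the']
```

Because the count for "bank" carries no information about its neighbors, the two senses are indistinguishable at the feature level.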

05

The Advanced AI Method

Sentence-BERT — Deep Semantic Understanding

Algorithm: SBERT

Sentence-Bidirectional Encoder Representations from Transformers

🧬

Deep Neural Network

Pre-trained on large text corpora, so it captures contextual meaning well beyond keyword overlap

🗺️

Semantic Embeddings

Maps entire sentences into a dense, multi-dimensional vector space where meaning = proximity

🔀

Siamese Architecture

Passes Doc A and Doc B through identical networks simultaneously to compare outputs

Document A

BERT Encoder

[0.23, -0.81, 0.44…]

↓

Cosine
Similarity

↓

73.45%

Document B

BERT Encoder

[0.19, -0.76, 0.51…]

06

The Mathematical Engine

Cosine Similarity — Measuring Vector Relationships

Core Formula

$$\text{Cosine Similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}$$

A · B: Dot product of vectors

‖A‖: Magnitude of vector A

‖B‖: Magnitude of vector B

Angle θ | Cosine | Meaning
0° | 1.0 | Identical
45° | 0.71 | Similar
90° | 0.0 | Unrelated

💡

Why Cosine? It measures the angle between vectors, not the magnitude. A 5-page essay and a 1-page summary can still match based on direction/meaning.
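The formula and the angle values can be reproduced with a short standard-library sketch. The function name `cosine_sim` and the example vectors are illustrative; the project code itself uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

print(round(cosine_sim([1, 0], [1, 0]), 2))  # 1.0  (0 degrees)
print(round(cosine_sim([1, 0], [1, 1]), 2))  # 0.71 (45 degrees)
print(round(cosine_sim([1, 0], [0, 1]), 2))  # 0.0  (90 degrees)

# Scaling a vector changes its magnitude but not its direction.
shorter, longer = [1, 2, 3], [10, 20, 30]
print(round(cosine_sim(shorter, longer), 2))  # 1.0
```

The last line shows the magnitude invariance: a vector ten times longer still points the same way, so the score stays at 1.0.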

07

System Architecture

End-to-end pipeline — from raw text to plagiarism report

📄

Input Source Documents & Student Submission (Raw Text)

↓

✏️

Text Pre-processing Tokenization · Lowercasing · Stop-word Removal

Baseline Model

🔗

TF-IDF Vectorization Statistical Word Frequency

↓

📐

Cosine Similarity Calculation

Advanced AI Model

🤖

Sentence-BERT Encoding Contextual Semantic Embeddings

↓

📐

Cosine Similarity Calculation

⚖️

Comparison Logic Threshold evaluation & scoring

↓

📊

Plagiarism Report Displays both TF-IDF % and SBERT % scores with color-coded alerts
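The last two stages of the pipeline could look roughly like the sketch below. The cut-off values, color names, and the helpers `alert_level` and `build_report` are all assumptions for illustration; the slides do not state the thresholds actually used.

```python
# Hypothetical cut-offs; the slides do not state the thresholds used.
HIGH_THRESHOLD = 0.70
MEDIUM_THRESHOLD = 0.40

def alert_level(score):
    """Map a similarity score to a color-coded alert."""
    if score >= HIGH_THRESHOLD:
        return "RED"
    if score >= MEDIUM_THRESHOLD:
        return "YELLOW"
    return "GREEN"

def build_report(tfidf_score, sbert_score):
    """Return report rows showing both scores as percentages with alerts."""
    rows = []
    for name, score in (("TF-IDF", tfidf_score), ("SBERT", sbert_score)):
        rows.append(f"{name}: {score * 100:.2f}% [{alert_level(score)}]")
    return rows

for line in build_report(0.0, 0.7345):
    print(line)
# TF-IDF: 0.00% [GREEN]
# SBERT: 73.45% [RED]
```

Reporting both scores side by side makes the gap visible: a low TF-IDF score with a high SBERT score is the signature of paraphrasing.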

08

Core Implementation

Python Source Code — PlagiarismChecker Class

plagiarism_checker.py

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

class PlagiarismChecker:

    def __init__(self):
        # ✨ AI Model (Semantic)
        self.sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
        # Baseline Model (Lexical)
        self.tfidf_vectorizer = TfidfVectorizer(stop_words='english')

    def check_tfidf_baseline(self, source_docs, suspicious_doc):
        all_docs = [suspicious_doc] + source_docs
        tfidf_matrix = self.tfidf_vectorizer.fit_transform(all_docs)
        return cosine_similarity(
            tfidf_matrix[0:1], tfidf_matrix[1:]
        ).flatten()

    def check_semantic_ai(self, source_docs, suspicious_doc):
        suspicious_embedding = self.sbert_model.encode([suspicious_doc])
        source_embeddings = self.sbert_model.encode(source_docs)
        return cosine_similarity(
            suspicious_embedding, source_embeddings
        ).flatten()

SentenceTransformer

Loads the pre-trained all-MiniLM-L6-v2 model — 80MB, 384-dim embeddings, optimized for semantic similarity

TfidfVectorizer

Sklearn's TF-IDF implementation with English stop-word removal for the baseline comparison

cosine_similarity()

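Applied to both models; returns an array of similarity scores, one per source document. TF-IDF scores lie in [0, 1]; SBERT scores can in principle be negative, since the cosine of two embeddings ranges over [-1, 1]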

09

Demonstration & Results

Testing a heavily paraphrased sentence against the original

Original Source

"The quick brown fox jumps over the lazy dog."

VS

Student Submission

"A fast dark colored canine leaps above a sleepy hound."

Baseline (TF-IDF) ❌ FAILS

0%

No overlapping keywords detected. Completely blind to paraphrasing.

AI Semantic (SBERT) ✅ DETECTED

73.45%

Semantic embeddings capture meaning — synonyms and paraphrasing detected.

🎯

Verdict: SBERT successfully identifies plagiarism that TF-IDF completely misses — a 73.45% semantic similarity score triggers the plagiarism alert.

10

Conclusion & Future Scope

What we built, what comes next

🏁 Conclusion

Keyword matching alone is no longer sufficient for academic integrity checks, because it misses paraphrased text. Deep learning models like SBERT add context-aware semantic analysis that catches the rewording which purely lexical methods such as TF-IDF miss.

TF-IDF Lexical Only

→

SBERT Semantic + Context

🚀 Future Upgrades

01

Cross-Lingual Plagiarism

Detect if a student translates an article from Kannada or Hindi to English using multilingual SBERT models

02

AI-Generated Text Detection

Add an adversarial network to detect text generated by LLMs like ChatGPT — the next frontier of academic dishonesty

03

Real-time Web Integration

Deploy as a web API that institutions can integrate directly into their submission portals

Thank You

Introduction to AI · 6th Semester ECE