Academic Positions

  • Dec-2013 to Present

    Assistant Professor

    Computer Sciences Department, COMSATS University Islamabad, Lahore Campus, Pakistan.

  • Feb-2011 to Dec-2013

    Lecturer

    Computer Sciences Department, COMSATS Institute of Information Technology, Lahore, Pakistan.

  • Feb-2009 to Mar-2011

    Lecturer

    Computer Sciences Department, University of Lahore, Lahore, Pakistan.

  • Oct-2006 to Feb-2007

    Visiting Lecturer and Web Developer

    Benchmark Education Academy, Peshawar, Pakistan.

  • Mar-2005 to Aug-2005

    Information Technology Professional

    Peshawar1.com, Peshawar, Pakistan

Education

  • Ph.D. 2020

    Doctor of Philosophy in Computer Science

    Lancaster University, Lancaster, United Kingdom.

  • M.Sc. 2009

    Master of Science in Computer Science

    Swinburne University of Technology, Melbourne, Australia.

  • B.C.S. 2006

    Bachelor of Science in Computer Science

    University of Peshawar, Pakistan.

  • H.S.S.C. 2001

    F.Sc (Pre-Engineering)

    P.E.F Model Degree College for Boys, Peshawar, Pakistan.

  • S.S.C. 1999

    Matric (Science)

    F.G. Boys High School for Boys, Peshawar, Pakistan.

Honors, Awards, and Grants

  • TA 2016
    SCC130 - Information Systems at Lancaster University
  • HEC-FDP 2015
    Split-site PhD Scholarship Award
  • RPA 2013/17
    CIIT Research Productivity Award
  • ACS 2009
    Australian Computer Society Member
  • CCNP 2007
    Cisco Certified Network Professional [BSCI, BCMSN]
  • CCNA 2006
    Cisco Certified Network Associate
  • DataSciMI 2026
    Best Session Chair Award - IEEE DataSciMI
  • AbjadNLP 2026
    Organizer and Review Committee Member
  • AbjadNLP 2025
    Organizer and Review Committee Member
  • MAPonSMS 2018
    Organizer, Multilingual Author Profiling Task

Great Personnel

Dr. Paul Clough

Research Mentor


Dr. Mark Stevenson

Research Mentor


Dr. Rao Muhammad Adeel Nawab

PhD Supervisor


Dr. Paul Rayson

PhD Supervisor


Dr. Alberto Barrón-Cedeño

Research Mentor


Jawad Shafi Mian

Postdoctoral fellow


Touseef Tahir

Postdoctoral fellow


The list above shows my supervisors, my research mentors in the field of NLP, and a few colleagues who inspire and motivate me to work hard every day.

Research Projects

  • COUNTER

    Corpus Of Urdu News TExt Reuse

    COUNTER - Corpus Of Urdu News TExt Reuse is an Urdu text reuse corpus developed at CIIT Lahore in partnership with Lancaster University. The corpus is released with the intention of fostering research on mono-lingual text reuse detection systems, specifically for the Urdu language. The corpus has 600 source and 600 derived (suspicious) documents. In total, it contains 275,387 words (tokens), 21,426 unique words and 10,841 sentences. It has been manually annotated at document level with three levels of reuse: wholly derived (135), partially derived (288) and non derived (177).


  • TRUE

    Text Reuse Urdu English

    TRUE - Text Reuse Urdu English is a research project between NLPT at CIIT Lahore, Pakistan and UCREL at Lancaster University, Lancaster, UK. It aims to develop cross-script, cross-language corpora and methods to detect text reuse at both document and sentence level. An initial corpus is under development that contains 2,500 source-derived document pairs. The source and derived documents are from the field of journalism and contain real examples of text reuse.


  • UPlag

    Urdu Plagiarism

    UPlag is a project that aims to contribute a benchmark Urdu plagiarism corpus with simulated as well as artificial examples of plagiarism. The project has a secondary focus on developing (or modifying) state-of-the-art techniques for Urdu plagiarism detection systems.

  • UPPC

    Urdu Paraphrase Plagiarism Corpus

    UPPC is a corpus that contains 160 documents (20 source documents and 140 suspicious ones). The source documents are original Wikipedia articles on 20 personalities, while the suspicious documents are either manually paraphrased versions produced by applying different rewriting techniques or independently written (non-plagiarised) documents. The resource is the first of its kind developed for the Urdu language and we believe that it will be a valuable contribution to the evaluation of paraphrase plagiarism detection systems. The corpus can be used for: (1) the development, analysis and evaluation of automated paraphrase plagiarism detection systems for the Urdu language, (2) identifying which types of obfuscation (paraphrase strategies) are easy or difficult to detect and (3) the Urdu paraphrase identification task.


Urdu Sentential Paraphrased Plagiarism Detection Using Large Language Models

Hafiz Rizwan Iqbal, Muhammad Sharjeel, Jawad Shafi, Usama Mehmood, Agha Ali Raza
Journal Paper: ACM Transactions on Asian and Low-Resource Language Information Processing, IF: 2.0

Abstract

Plagiarism, the unauthorized reuse of text, fueled by the ease of access to online content, is a pressing concern for academia, publishers, and authors. Paraphrasing, a common tactic in textual plagiarism, compounds the problem further. The automatic detection of paraphrased plagiarism in text documents is a fundamental task in Natural Language Processing (NLP), crucial for maintaining academic integrity and authenticity. This article presents an extensive investigation into Urdu sentential paraphrased plagiarism detection leveraging advanced Deep Neural Networks (DNNs) and Large Language Models (LLMs). The study builds upon the foundational work and proposes modifications to the Deep Text Reuse and Paraphrased Plagiarism Detection (D-TRaPPD) architecture to incorporate state-of-the-art pre-trained LLMs. The proposed approach, SELLM-D-TRaPPD, integrates various language models, including contextualized sentence embedding-based LLMs, language-agnostic and multilingual transformer-based LLMs, and multilingual knowledge-distilled transformer-based LLMs. We evaluated these models against three benchmark Urdu sentential paraphrase corpora—Urdu Sentential Paraphrase Corpus, Urdu Short Text Reuse Corpus, and Semi-automatic Urdu Sentential Paraphrase Corpus. The results demonstrate the effectiveness of SELLM-D-TRaPPD with LLMs, achieving F1 scores of 92.09%, 96.70%, and 98.23%, respectively. A comparative analysis with existing state-of-the-art methods shows significant performance improvements, establishing SELLM-D-TRaPPD as the new leading approach for Urdu sentential paraphrased plagiarism detection. These findings highlight the value of leveraging advanced neural network architectures and pre-trained LLMs in improving the accuracy and effectiveness of paraphrased plagiarism detection in Urdu, addressing a crucial gap in Urdu NLP research.

Urdu Paraphrased Text Reuse and Plagiarism Detection using Pre-trained Large Language Models and Deep Hybrid Neural Networks

Hafiz Rizwan Iqbal, Muhammad Sharjeel, Jawad Shafi, Usama Mehmood, Saeed Ul Hassan, Agha Ali Raza
Journal Paper: Multimedia Tools and Applications, IF: 3.0

Abstract

The growing prevalence of text reuse and plagiarism in various fields has led to an urgent need for reliable computational methods for detection. However, current commercial plagiarism detection systems are ineffective in identifying paraphrased cases of text reuse, highlighting the need for improvement. Previous research on paraphrased text reuse and plagiarism detection has mainly focused on English, European, Persian, and Arabic languages, and very few studies have been reported on the under-resourced Urdu language. This study aims to overcome this research gap by using a Deep Neural Network (DNN) based architecture and pre-trained Large Language Models (LLMs) for the task of Urdu paraphrased text reuse and plagiarism detection. The architecture, called Deep Text Reuse and Paraphrased Plagiarism Detection (D-TRaPPD), relies on LLMs for input and utilizes CNN and LSTM to extract essential textual features. Moreover, we have proposed and evaluated two D-TRaPPD variants, Word Embeddings-D-TRaPPD (WE-D-TRaPPD) and Sentence Embeddings-D-TRaPPD (SE-D-TRaPPD), using two gold standard document-level corpora containing both real and simulated cases of Urdu paraphrased text reuse and plagiarism. The results demonstrate the effectiveness of the D-TRaPPD architecture, with SE-D-TRaPPD achieving the highest F1 scores of 91.77 for real cases and 95.15 for simulated cases. Furthermore, the results highlight the superiority of our approaches over the state-of-the-art methods for Urdu paraphrased text reuse and plagiarism detection.

Automated NLP‐Based Classification of Nonfunctional Requirements in Blockchain and Cross‐Domain Software Systems Using BERT and Machine Learning

Touseef Tahir, Bilal Hassan, Hamid Jahankhani, Nimra Zia, Muhammad Sharjeel
Journal Paper: IET Software, IF: 1.3

Abstract

Automated nonfunctional requirements (NFRs) classification enhances consistency and traceability by systematically labeling requirements, saving effort, supporting early architectural and testing decisions, improving stakeholder communication, and enabling quality across diverse software domains. While prior work has applied natural language processing (NLP) and machine learning (ML) to NFR classification, existing datasets are often limited in size, domain diversity, and contextual richness. This study presents a novel dataset comprising over 2400 NFRs spanning 269 software projects across 26 software application domains, including nine blockchain projects. The raw requirements are standardized using Rupp’s boilerplate to reduce vagueness and ambiguity, and the classification of NFR types follows ISO/IEC 25010 definitions. We employ a range of traditional ML, deep learning (DL), and a transformer-based model (i.e., BERT-base) for automated classification of NFRs, evaluating performance across cross-domain and blockchain-specific NFRs. Results highlight that domain-aware adaptation significantly enhances classification accuracy, with traditional ML and DL models showing strong performance on blockchain requirements. This work contributes a publicly available, context-rich dataset and provides empirical insights into the effectiveness of NLP-based NFR classification in both general and blockchain-specific settings.

Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair

Muhammad Sharjeel, Iqra Muneer, Sumaira Nosheen, Rao Muhammad Adeel Nawab, Paul Rayson
Journal Paper: ACM Transactions on Asian and Low-Resource Language Information Processing, IF: 2.0

Abstract

In recent years, the problem of Cross-Lingual Text Reuse Detection (CLTRD) has gained the interest of the research community due to the availability of large digital repositories and automatic Machine Translation (MT) systems. These systems are readily available and openly accessible, which makes it easier to reuse text across languages but hard to detect. In previous studies, different corpora and methods have been developed for CLTRD at the sentence/passage level for the English-Urdu language pair. However, there is a lack of large standard corpora and methods for CLTRD for the English-Urdu language pair at the document level. To overcome this limitation, the significant contribution of this study is the development of a large benchmark cross-lingual (English-Urdu) text reuse corpus, called the TREU (Text Reuse for English-Urdu) corpus. It contains English to Urdu real cases of text reuse at the document level. The corpus is manually labelled into three categories (Wholly Derived = 672, Partially Derived = 888, and Non Derived = 697) with the source text in English and the derived text in the Urdu language. Another contribution of this study is the evaluation of the TREU corpus using a diversified range of methods to show its usefulness and how it can be utilized in the development of automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level. The best evaluation results, for both binary (F1 = 0.78) and ternary (F1 = 0.66) classification tasks, are obtained using a combination of all Translation plus Mono-lingual Analysis (T+MA) based methods. The TREU corpus is publicly available to promote CLTRD research in an under-resourced language, i.e., Urdu.

Urdu Short Paraphrase Detection at Sentence Level

Hamza Hafeez, Iqra Muneer, Muhammad Sharjeel, Muhammad Adnan Ashraf, Rao Muhammad Adeel Nawab
Journal Paper: ACM Transactions on Asian and Low-Resource Language Information Processing, IF: 2.0

Abstract

Paraphrase detection systems uncover the relationship between two text fragments and classify them as paraphrased when they convey the same idea; otherwise non-paraphrased. Previously, researchers have mainly focused on developing resources for the English language for paraphrase detection. There have been very few efforts for paraphrase detection in South Asian languages. However, no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resourced language. It is mainly due to the unavailability of corpora that focus on the sentence level. The available related studies on the Urdu language only focus on text reuse detection tasks at the passage and document levels. Therefore, this study aims to develop a large-scale manually annotated benchmark Urdu paraphrase detection corpus at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. Moreover, several techniques were proposed, developed, and compared as a secondary contribution, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques. N-gram is treated as the baseline technique for our research. The experimental results indicate that our proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task. Furthermore, the performance increases when features of the proposed (ST) and baseline (N-gram) techniques are combined for the classification task. In addition, the proposed techniques have also been applied to the UPPC corpus to check their performance at the document level. The best result was obtained using the feature-fusion technique (F1 = 0.855). Our corpus is available and free to download for research purposes.

CLEU - A Cross-Language English-Urdu Corpus and Benchmark for Text Reuse Experiments

Iqra Muneer, Muhammad Sharjeel, Muntaha Iqbal, Rao Muhammad Adeel Nawab, Paul Rayson
Journal Paper: Journal of the Association for Information Science and Technology, IF: 3.244

Abstract

Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair. The Cross-Language English-Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories, that is, near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (F1 = 0.732 binary, F1 = 0.552 ternary classification) indicate that it is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.

Measuring Short Text Reuse For The Urdu Language

Sara Sameen, Muhammad Sharjeel, Rao Muhammad Adeel Nawab, Paul Rayson, Iqra Muneer
Journal Paper: IEEE Access, IF: 3.244

Abstract

Text reuse occurs when one borrows the text (either verbatim or paraphrased) from an earlier written text. A large and increasing amount of digital text is easily and readily available, making it simpler to reuse but difficult to detect. As a result, automatic detection of text reuse has attracted the attention of the research community due to the wide variety of applications associated with it. To develop and evaluate automatic methods for text reuse detection, standard evaluation resources are required. In this work, we propose one such resource for a significantly under-resourced language - Urdu, which is widely used in day-to-day communication and has a large digital footprint particularly in the Indian subcontinent. Our proposed Urdu Short Text Reuse Corpus contains 2,684 short Urdu text pairs, manually labelled as verbatim (496), paraphrased (1,329), and independently written (859). In addition, we describe an evaluation of the corpus using various state-of-the-art text reuse detection methods with binary and multi-classification settings and a set of four classifiers. Output results show that Character n-gram Overlap using the J48 classifier outperforms other methods for the Urdu short text reuse detection task.

COUNTER - COrpus of Urdu News TExt Reuse

Muhammad Sharjeel, Rao Muhammad Adeel Nawab, Paul Rayson
Journal Paper: Language Resources and Evaluation (LREV), IF: 0.738

Abstract

Text reuse is the process of creating new texts using existing ones. Freely available and easily accessible large on-line repositories are not only making reuse of text more common in society but also harder to detect programmatically. A major hindrance in the development and evaluation of existing mono-lingual text reuse detection methods, especially for South Asian languages, is the unavailability of standardized benchmark corpora. Amongst other things, a gold standard corpus enables researchers to directly compare with existing state-of-the-art methods. In our study, we address this gap by developing a benchmark corpus for one of the widely spoken but under resourced languages i.e. Urdu. The COUNTER corpus contains 1,200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at document level with three levels of reuse: wholly derived, partially derived and non derived. In this paper, we also apply two simple similarity estimation methods (n-gram overlap and longest common subsequence) on our corpus to show how it can be used in the evaluation of text reuse detection systems. The corpus is a vital resource for the development and evaluation of text reuse detection systems in general and specifically for Urdu language.
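
As a rough illustration of the two similarity estimation methods named in the abstract above (n-gram overlap and longest common subsequence), here is a minimal sketch. It is not the implementation used in the paper: the whitespace tokenisation, the containment-style normalisation, the division by the shorter text's length, and all function names are assumptions made for this example.

```python
from typing import List, Set, Tuple


def word_ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(source: List[str], derived: List[str], n: int = 1) -> float:
    """Share of the derived text's n-grams that also occur in the source."""
    derived_grams = word_ngrams(derived, n)
    if not derived_grams:
        return 0.0
    return len(derived_grams & word_ngrams(source, n)) / len(derived_grams)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(source: List[str], derived: List[str]) -> float:
    """LCS length normalised by the shorter text (an assumed normalisation)."""
    shorter = min(len(source), len(derived))
    return lcs_length(source, derived) / shorter if shorter else 0.0


# Toy English token lists stand in for tokenised Urdu news text.
source = "the quick brown fox jumps".split()
derived = "the brown fox quietly jumps".split()

print(ngram_containment(source, derived, n=1))  # unigram overlap
print(ngram_containment(source, derived, n=2))  # bigram overlap
print(lcs_similarity(source, derived))          # order-preserving overlap
```

Scores like these could then be thresholded or passed to a classifier to separate wholly derived, partially derived and non derived documents; the thresholds and classifier are left open here.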

LLM-Agent Powered Automated Literature Review and Hypothesis Generation for Cancer Biomarker Discovery

Haseeb Younis, Hasnain Younis, Muhammad Sharjeel, Muhammad Azeem, Rosane Minghim
Conference Paper: IEEE 19th International Conference on Open Source Systems and Technologies (ICOSST) 2025

Abstract

The rapid expansion of biomedical publications poses a significant challenge for researchers attempting to synthesise up-to-date knowledge and generate testable hypotheses. Especially, cancer biomarker discovery needs continuous monitoring of thousands of articles across multiple databases. Traditional manual curation is time-consuming and prone to subjectivity. In this paper, we present a Retrieval-Augmented Generation (RAG) agent that automates the collection, indexing, and analysis of biomedical literature to propose structured hypotheses for biomarkers. The system integrates EuropePMC literature search, UniProt protein knowledge, and ClinicalTrials.gov data with a large language model (LLM) hosted locally through Ollama. Retrieved documents were encoded using biomedical sentence-transformers, indexed using FAISS, and queried with cosine similarity to return the most relevant abstracts. The agent then synthesises responses in JSON format containing the biomarker symbol, cancer type, rationale, key evidence, and PubMed identifiers. We evaluated the pipeline on cancer biomarker topics, showing effective retrieval quality, grounded hypothesis generation, and citation accuracy. Results demonstrated that domain-specific embeddings outperform general-purpose encoders, and retrieval depth (k) balances evidence coverage with LLM focus. The proposed methodology significantly reduces the burden of manual literature review and provides structured, machine-readable hypotheses that can be used by researchers. This study illustrates the potential of deploying RAG-driven agents as research assistants in oncology and related fields, as well as offering a pathway toward expandable, reliable, and up-to-date biomarker discovery.
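
The retrieval step described in the abstract above (embed documents, then return the k most similar abstracts by cosine similarity) can be sketched as follows. This is a pure-Python brute-force stand-in and not the paper's pipeline: it uses no FAISS index or sentence-transformer model, the tiny two-dimensional vectors are made up for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          docs: List[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine(query, d)) for i, d in enumerate(docs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings; a real system would obtain these from an encoder model.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [1.0, 0.0]

for index, score in top_k(query_vector, doc_vectors, k=2):
    print(index, round(score, 4))
```

The parameter k plays the role of the retrieval depth mentioned in the abstract: a larger k passes more evidence to the language model, at the cost of a longer and less focused context.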

MAPonSMS-Overview of the Multilingual SMS-based Author Profiling Task at FIRE'18.

Muhammad Sharjeel, Mehwish Fatima, Saba Anwar, Rao Muhammad Adeel Nawab
Conference Paper: 10th Annual Meeting of the Forum for Information Retrieval Evaluation 2018

Abstract

This paper presents an overview of the 1st international shared task on Multilingual Author Profiling on SMS (MAPonSMS) at the Forum for Information Retrieval Evaluation (FIRE’18). The aim of the MAPonSMS task is to identify the author's gender and age for a given multilingual (Roman Urdu and English) SMS message profile, where each profile consists of an aggregation of SMS messages from a single author. This paper provides the details of the dataset and its distribution, an overview of the submitted approaches, and the evaluation framework used for measuring the performance of the submitted multilingual author profiling systems.

Multilingual Author Profiling on SMS Track at FIRE'18

Muhammad Sharjeel, Mehwish Fatima, Saba Anwar, Rao Muhammad Adeel Nawab
Conference Paper: 10th Annual Meeting of the Forum for Information Retrieval Evaluation 2018


UPPC - Urdu Paraphrase Plagiarism Corpus

Muhammad Sharjeel, Paul Rayson, Rao Muhammad Adeel Nawab
Conference Paper: Language Resources and Evaluation Conference (LREC) 2016

Abstract

Paraphrase plagiarism is a significant and widespread problem and research shows that it is hard to detect. Several methods and automatic systems have been proposed to deal with it. However, evaluation and comparison of such solutions is not possible because of the unavailability of benchmark corpora with manual examples of paraphrase plagiarism. To deal with this issue, we present the novel development of a paraphrase plagiarism corpus containing simulated (manually created) examples in the Urdu language - a language widely spoken around the world. This resource is the first of its kind developed for the Urdu language and we believe that it will be a valuable contribution to the evaluation of paraphrase plagiarism detection systems.

Current Teaching

  • Spring 2026

    CSC454 - Pattern Recognition

    CSC103 - Programming Fundamentals

Teaching History

COMSATS University Islamabad, Lahore Campus

  • Fall 2025

    CSC354 - Machine Learning

    CSC668/PCS716 - Special Topics in Machine Learning

    MAI727 - Quantum Machine Learning

  • Spring 2025

    CSC432 - Information Security

  • Fall 2024

    CSC668/PCS716 - Advanced Machine Learning

    CSC103 - Programming Fundamentals

  • Spring 2024

    CSC103 - Programming Fundamentals

    CSC461 - Introduction to Data Science

    CSC354 - Machine Learning

  • Fall 2023

    CSC461 - Introduction to Data Science

    CSC103 - Programming Fundamentals

  • Spring 2023

    CSC103 - Programming Fundamentals

  • Fall 2022

    CSC461 - Introduction to Data Science

    CSC683 - Advanced Algorithm Analysis

  • Spring 2022

    CSC103 - Programming Fundamentals

    CSC354 - Machine Learning

  • Fall 2021

    CSC103 - Programming Fundamentals

    CSC683 - Advanced Algorithm Analysis

  • Spring 2021

    CSC101 - Introduction to Information and Communication Technologies

    CSC461 - Introduction to Data Science

  • Fall 2020

    CSC101 - Introduction to Computing

    SED348 - Data Security and Encryption

  • Spring 2020

    CSC101 - Introduction to Information and Communication Technologies

  • Fall 2019

    CSC101 - Introduction to Information and Communication Technologies

  • Spring 2019

    CSC101 - Introduction to Information and Communication Technologies

  • Fall 2018

    CSC101 - Introduction to Computing

  • Spring 2018

    CSC101 - Introduction to Information and Communication Technologies

  • Fall 2017

    CSC101 - Introduction to Computing

  • Spring 2017

    CSC111 - Algorithms

  • Fall 2015

    CSC101 - Introduction to Computing

  • Fall 2014

    CSC101 - Introduction to Computing

    CSC332 - Network Security

  • Spring 2014

    CSC101 - Introduction to Computing

    CSC332 - Network Security

  • Fall 2013

    CSC101 - Introduction to Computing

    CSC344 - Wireless and Mobile Computing

  • Spring 2013

    CSC101 - Introduction to Computing

    CSC344 - Wireless and Mobile Computing

  • Fall 2012

    CSC401 - Computing for Management

    CSC344 - Wireless and Mobile Computing

  • Spring 2012

    CSC401 - Computing for Management

    CSC141 - Introduction to Computer Programming

  • Fall 2011

    CSC401 - Computing for Management

    CSC101 - Introduction to Computing

  • Spring 2011

    CSC101 - Introduction to Computing

    CSC112 - Algorithms and Data Structures


University of Lahore

  • Fall 2010

    CSC1012 - Programming Fundamentals

    CSC3535 - Computer Networks

    ECE3323 - Data Communications

    CS522 - Network Security and Cryptography

  • Winter 2010

    CSC1011 - Introduction to Computing

    CSC3535 - Computer Networks

    ECE3323 - Data Communications

    CS521 - Advanced Computer Networks

At Office (Lahore, Pakistan)

You can find me at my office located at H-Block, cabin # 16.

I am at my office on working days from 8:30 am until 6:30 pm (apart from my scheduled lecture slots), but please consider calling or sending an email (preferred) to fix an appointment.

At Lab (Lancaster, United Kingdom)

You can find me at InfoLab21, Room # C30.

I am there on weekdays from 9:00 am until 8:00 pm.