Assistant Professor
Computer Sciences Department, COMSATS University Islamabad, Lahore Campus, Pakistan.
I am an Assistant Professor at COMSATS University Islamabad, Lahore Campus, Pakistan, and an active member of two research groups: UCREL at Lancaster University and NLPT at CUI Lahore.
I hold a PhD in Computer Science and have over 18 years of teaching and research experience, specializing in machine learning, data-driven systems, and applied artificial intelligence. My research contributions include publications across diverse areas such as healthcare analytics, natural language processing, and multimodal learning.
Teaching Interests: Machine Learning, Data Science, Natural Language Processing, Pattern Recognition, Computer Programming, Data Structures and Algorithms, Wireless and Mobile Computing, Network Security, Data Security and Encryption, Information Security.
Research Interests: Data Science, Machine Learning, Computational Linguistics, Natural Language Processing, Mono- and Cross-lingual Text Reuse and Plagiarism Detection, Urdu Paraphrase Generation, Multilingual Author Profiling, Urdu Text Analysis, Development of Resources for the Urdu Language, Software Requirement Engineering, Software Risk Prediction.
Computer Sciences Department, COMSATS University Islamabad, Lahore Campus, Pakistan.
Computer Sciences Department, COMSATS Institute of Information Technology, Lahore, Pakistan.
Computer Sciences Department, University of Lahore, Lahore, Pakistan.
Benchmark Education Academy, Peshawar, Pakistan.
Peshawar1.com, Peshawar, Pakistan
Doctor of Philosophy in Computer Science
Lancaster University, Lancaster, United Kingdom.
Master of Science in Computer Science
Swinburne University of Technology, Melbourne, Australia.
Bachelor of Science in Computer Science
University of Peshawar, Pakistan.
F.Sc (Pre-Engineering)
P.E.F Model Degree College for Boys, Peshawar, Pakistan.
Matric (Science)
F.G. Boys High School for Boys, Peshawar, Pakistan.
In recent years, the easy access to the vast amount of multi-lingual information readily available on the Web has considerably increased cases of cross-language text reuse and plagiarism, and they have become a matter of concern. Consequently, their detection has become equally important. However, research indicates that current systems fail to detect reuse when the source text has been obfuscated, i.e., paraphrased after translation from another language. Due to the complexity involved, research in this field is still in its infancy.
To develop and evaluate state-of-the-art methods for cross-language text reuse and plagiarism detection, one obstacle is the shortage of benchmark corpora containing real or simulated examples. The majority of available corpora are for the English language (mono-lingual) or English-European language pairs (cross-lingual), and less attention has been devoted to developing resources for South Asian languages. The methods proposed in the literature for the cross-language text reuse and plagiarism detection task are based on language syntax (CL-CNG), parallel corpora (CL-ASA) or comparable corpora (CL-ESA); some require statistical dictionaries or knowledge bases (CL-CTS), and others (T+MA) imply language normalisation at the preprocessing step. These methods have been shown to produce fair results on syntactically similar languages and on verbatim cases of reuse. However, they have not been evaluated on cross-script cross-language plagiarism detection, as most of them require supporting resources that are not ample for under-resourced languages (e.g. Urdu).
Therefore, the aim of my research is to develop large-scale mono- and cross-language text reuse and plagiarism detection corpora (for the English-Urdu language pair) and to develop and evaluate automatic methods that can detect text reuse across languages by overcoming the limitations of existing methods.
The list on the left displays my supervisors, research mentors in the field of NLP, and a few colleagues who inspire and motivate me to work hard every day.
COUNTER - Corpus Of Urdu News TExt Reuse is an Urdu text reuse corpus developed at CIIT Lahore in partnership with Lancaster University. The corpus is released with the intention of fostering research in mono-lingual text reuse detection systems, specifically for the Urdu language. The corpus has 600 source and 600 derived (suspicious) documents. It contains in total 275,387 words (tokens), 21,426 unique words and 10,841 sentences. It has been manually annotated at document level with three levels of reuse: wholly derived (135), partially derived (288) and non derived (177).
TRUE - Text Reuse English Urdu is a research project between NLPT at CIIT Lahore, Pakistan and UCREL at Lancaster University, Lancaster, UK. It aims to develop cross-script cross-language corpora and methods to detect text reuse at both document and sentence level. An initial corpus under development contains 2,500 source-derived document pairs. The source and derived documents are from the field of journalism and contain real examples of text reuse.
UPlag is a project that aims to contribute a benchmark Urdu plagiarism corpus with simulated as well as artificial examples of plagiarism. Moreover, the project has a secondary focus on developing (or modifying) state-of-the-art techniques for an Urdu plagiarism detection system.
UPPC is a corpus that contains 160 documents (20 source documents and 140 suspicious ones). The source documents are original Wikipedia articles on 20 personalities, while the suspicious documents are either manually paraphrased versions produced by applying different rewriting techniques or independently written (non-plagiarised) documents. The resource is the first of its kind developed for the Urdu language and we believe that it will be a valuable contribution to the evaluation of paraphrase plagiarism detection systems. The corpus can be used for: (1) the development, analysis and evaluation of automated paraphrase plagiarism detection systems for the Urdu language, (2) identifying which types of obfuscation (paraphrase strategies) are easy or difficult to detect, and (3) the Urdu paraphrase identification task.
Plagiarism, the unauthorized reuse of text, fueled by the ease of access to online content, is a pressing concern for academia, publishers, and authors. Paraphrasing, a common tactic in textual plagiarism, compounds the problem further. The automatic detection of paraphrased plagiarism in text documents is a fundamental task in Natural Language Processing (NLP), crucial for maintaining academic integrity and authenticity. This article presents an extensive investigation into Urdu sentential paraphrased plagiarism detection leveraging advanced Deep Neural Networks (DNNs) and Large Language Models (LLMs). The study builds upon the foundational work and proposes modifications to the Deep Text Reuse and Paraphrased Plagiarism Detection (D-TRaPPD) architecture to incorporate state-of-the-art pre-trained LLMs. The proposed approach, SELLM-D-TRaPPD, integrates various language models, including contextualized sentence embedding-based LLMs, language-agnostic and multilingual transformer-based LLMs, and multilingual knowledge-distilled transformer-based LLMs. We evaluated these models against three benchmark Urdu sentential paraphrase corpora—Urdu Sentential Paraphrase Corpus, Urdu Short Text Reuse Corpus, and Semi-automatic Urdu Sentential Paraphrase Corpus. The results demonstrate the effectiveness of SELLM-D-TRaPPD with LLMs, achieving F1 scores of 92.09%, 96.70%, and 98.23%, respectively. A comparative analysis with existing state-of-the-art methods shows significant performance improvements, establishing SELLM-D-TRaPPD as the new leading approach for Urdu sentential paraphrased plagiarism detection. These findings highlight the value of leveraging advanced neural network architectures and pre-trained LLMs in improving the accuracy and effectiveness of paraphrased plagiarism detection in Urdu, addressing a crucial gap in Urdu NLP research.
The growing prevalence of text reuse and plagiarism in various fields has led to an urgent need for reliable computational methods for detection. However, current commercial plagiarism detection systems are ineffective in identifying paraphrased cases of text reuse, highlighting the need for improvement. Previous research on paraphrased text reuse and plagiarism detection has mainly focused on English, European, Persian, and Arabic languages, and very few studies have been reported on the under-resourced Urdu language. This study aims to overcome this research gap by using a Deep Neural Network (DNN) based architecture and pre-trained Large Language Models (LLMs) for the task of Urdu paraphrased text reuse and plagiarism detection. The architecture, called Deep Text Reuse and Paraphrased Plagiarism Detection (D-TRaPPD), relies on LLMs for input and utilizes CNN and LSTM to extract essential textual features. Moreover, we have proposed and evaluated two D-TRaPPD variants, Word Embeddings-D-TRaPPD (WE-D-TRaPPD) and Sentence Embeddings-D-TRaPPD (SE-D-TRaPPD), using two gold standard document-level corpora containing both real and simulated cases of Urdu paraphrased text reuse and plagiarism. The results demonstrate the effectiveness of the D-TRaPPD architecture, with SE-D-TRaPPD achieving the highest F1 scores of 91.77 for real cases and 95.15 for simulated cases. Furthermore, the results highlight the superiority of our approaches over the state-of-the-art methods for Urdu paraphrased text reuse and plagiarism detection.
Automated nonfunctional requirements (NFRs) classification enhances consistency and traceability by systematically labeling requirements, saving effort, supporting early architectural and testing decisions, improving stakeholder communication, and enabling quality across diverse software domains. While prior work has applied natural language processing (NLP) and machine learning (ML) to NFR classification, existing datasets are often limited in size, domain diversity, and contextual richness. This study presents a novel dataset comprising over 2,400 NFRs spanning 269 software projects across 26 software application domains, including nine blockchain projects. The raw requirements are standardized using Rupp's boilerplate to reduce vagueness and ambiguity, and the classification of NFR types follows ISO/IEC 25010 definitions. We employ a range of traditional ML, deep learning (DL), and a transformer-based model (i.e., BERT-base) for automated classification of NFRs, evaluating performance across cross-domain and blockchain-specific NFRs. Results highlight that domain-aware adaptation significantly enhances classification accuracy, with traditional ML and DL models showing strong performance on blockchain requirements. This work contributes a publicly available, context-rich dataset and provides empirical insights into the effectiveness of NLP-based NFR classification in both general and blockchain-specific settings.
In recent years, the problem of Cross-Lingual Text Reuse Detection (CLTRD) has gained the interest of the research community due to the availability of large digital repositories and automatic Machine Translation (MT) systems. These systems are readily available and openly accessible, which makes it easier to reuse text across languages but hard to detect. In previous studies, different corpora and methods have been developed for CLTRD at the sentence/passage level for the English-Urdu language pair. However, there is a lack of large standard corpora and methods for CLTRD for the English-Urdu language pair at the document level. To overcome this limitation, the significant contribution of this study is the development of a large benchmark cross-lingual (English-Urdu) text reuse corpus, called the TREU (Text Reuse for English-Urdu) corpus. It contains English to Urdu real cases of text reuse at the document level. The corpus is manually labelled into three categories (Wholly Derived = 672, Partially Derived = 888, and Non Derived = 697) with the source text in English and the derived text in the Urdu language. Another contribution of this study is the evaluation of the TREU corpus using a diversified range of methods to show its usefulness and how it can be utilized in the development of automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level. The best evaluation results, for both binary (F1 = 0.78) and ternary (F1 = 0.66) classification tasks, are obtained using a combination of all Translation plus Mono-lingual Analysis (T+MA) based methods. The TREU corpus is publicly available to promote CLTRD research in an under-resourced language, i.e., Urdu.
Paraphrase detection systems uncover the relationship between two text fragments and classify them as paraphrased when they convey the same idea, and otherwise as non-paraphrased. Previously, researchers have mainly focused on developing resources for the English language for paraphrase detection. There have been very few efforts for paraphrase detection in South Asian languages. However, no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resourced language. This is mainly due to the unavailability of corpora that focus on the sentence level; the available related studies on the Urdu language only address text reuse detection tasks at the passage and document levels. Therefore, this study aims to develop a large-scale manually annotated benchmark Urdu paraphrase detection corpus at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. Moreover, several techniques were proposed, developed, and compared as a secondary contribution, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques. N-gram is treated as the baseline technique for our research. The experimental results indicate that our proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task. Furthermore, the performance increases when features of the proposed (ST) and baseline (N-gram) techniques are combined for the classification task. In addition, the proposed techniques have also been applied to the UPPC corpus to check their performance at the document level. The best result (F1 = 0.855) was obtained using the feature-fusion technique. Our corpus is available and free to download for research purposes.
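To illustrate the feature-fusion idea described above, the sketch below concatenates shallow character n-gram counts with a dense sentence embedding into a single feature vector for a downstream classifier. This is a minimal sketch, assuming toy values: the Urdu sentence and the embedding numbers are invented stand-ins, not drawn from the USP corpus or a real sentence-transformer model.

```python
def char_ngrams(text, n=3):
    """Character n-gram counts for one sentence (shallow surface features)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return {g: grams.count(g) for g in set(grams)}

def vectorize(counts, vocab):
    """Project a count dictionary onto a fixed vocabulary order."""
    return [counts.get(g, 0) for g in vocab]

def fuse(ngram_vec, embedding_vec):
    """Feature fusion: concatenate shallow and dense features."""
    return list(ngram_vec) + list(embedding_vec)

sentence = "وزیر نے نئی پالیسی کا اعلان کیا"   # toy Urdu sentence
embedding = [0.12, -0.40, 0.88]                 # stand-in sentence-transformer vector
counts = char_ngrams(sentence)
vocab = sorted(counts)
fused = fuse(vectorize(counts, vocab), embedding)
print(len(fused) == len(vocab) + len(embedding))  # → True
```

In practice both sentences of a pair would be vectorized over a shared n-gram vocabulary and the fused vectors passed to a classifier; the point here is only the concatenation step.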
Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair. The Cross-Language English-Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories, that is, near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (F1 = 0.732 for binary and F1 = 0.552 for ternary classification) indicate that it is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.
Text reuse occurs when one borrows text (either verbatim or paraphrased) from an earlier written text. A large and increasing amount of digital text is easily and readily available, making it simpler to reuse but difficult to detect. As a result, automatic detection of text reuse has attracted the attention of the research community due to the wide variety of applications associated with it. To develop and evaluate automatic methods for text reuse detection, standard evaluation resources are required. In this work, we propose one such resource for a significantly under-resourced language - Urdu, which is widely used in day-to-day communication and has a large digital footprint, particularly in the Indian subcontinent. Our proposed Urdu Short Text Reuse Corpus contains 2,684 short Urdu text pairs, manually labelled as verbatim (496), paraphrased (1,329), and independently written (859). In addition, we describe an evaluation of the corpus using various state-of-the-art text reuse detection methods with binary and multi-classification settings and a set of four classifiers. Output results show that Character n-gram Overlap using the J48 classifier outperforms other methods for the Urdu short text reuse detection task.
Text reuse is the process of creating new texts using existing ones. Freely available and easily accessible large on-line repositories are not only making reuse of text more common in society but also harder to detect programmatically. A major hindrance in the development and evaluation of existing mono-lingual text reuse detection methods, especially for South Asian languages, is the unavailability of standardized benchmark corpora. Amongst other things, a gold standard corpus enables researchers to directly compare with existing state-of-the-art methods. In our study, we address this gap by developing a benchmark corpus for one of the widely spoken but under-resourced languages, i.e. Urdu. The COUNTER corpus contains 1,200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at document level with three levels of reuse: wholly derived, partially derived and non derived. In this paper, we also apply two simple similarity estimation methods (n-gram overlap and longest common subsequence) on our corpus to show how it can be used in the evaluation of text reuse detection systems. The corpus is a vital resource for the development and evaluation of text reuse detection systems in general and specifically for the Urdu language.
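The two similarity estimation methods mentioned above can be sketched as follows. This is a minimal word-level illustration with invented English example sentences, not the exact configuration used in the COUNTER evaluation (which works on Urdu documents).

```python
def ngram_overlap(source, derived, n=2):
    """Fraction of the derived text's word n-grams that also appear in the source."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    A, B = ngrams(source.split(), n), ngrams(derived.split(), n)
    return len(A & B) / len(B) if B else 0.0

def lcs_length(source, derived):
    """Length of the longest common subsequence of words (dynamic programming)."""
    x, y = source.split(), derived.split()
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xi == yj else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

source = "the minister announced a new policy on education today"
derived = "the minister announced a revised policy on education"
print(ngram_overlap(source, derived))  # shared bigrams / derived bigrams
print(lcs_length(source, derived))     # words surviving in order despite edits
```

A high n-gram overlap suggests verbatim reuse, while a long common subsequence with low n-gram overlap is a hint of paraphrasing, since word order survives but local wording changes.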
The rapid expansion of biomedical publications poses a significant challenge for researchers attempting to synthesise up-to-date knowledge and generate testable hypotheses. In particular, cancer biomarker discovery requires continuous monitoring of thousands of articles across multiple databases. Traditional manual curation is time-consuming and prone to subjectivity. In this paper, we present a Retrieval-Augmented Generation (RAG) agent that automates the collection, indexing, and analysis of biomedical literature to propose structured hypotheses for biomarkers. The system integrates EuropePMC literature search, UniProt protein knowledge, and ClinicalTrials.gov data with a large language model (LLM) hosted locally through Ollama. Retrieved documents were encoded using biomedical sentence transformers, indexed using FAISS, and queried with cosine similarity to return the most relevant abstracts. The agent then synthesises responses in JSON format containing the biomarker symbol, cancer type, rationale, key evidence, and PubMed identifiers. We evaluated the pipeline on cancer biomarker topics, showing effective retrieval quality, grounded hypothesis generation, and citation accuracy. Results demonstrated that domain-specific embeddings outperform general-purpose encoders, and retrieval depth (k) balances evidence coverage with LLM focus. The proposed methodology significantly reduces the burden of manual literature review and provides structured, machine-readable hypotheses that can be used by researchers. This study illustrates the potential of deploying RAG-driven agents as research assistants in oncology and related fields, as well as offering a pathway toward expandable, reliable, and up-to-date biomarker discovery.
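The retrieval step described above (embedding abstracts and ranking them against a query by cosine similarity) can be illustrated with a small pure-Python sketch. The three-dimensional "embeddings" and PubMed ids are toy stand-ins, not real sentence-transformer vectors; in the actual pipeline, FAISS performs this nearest-neighbour search at scale.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, doc_vectors, k=2):
    """Return the ids of the k abstracts most similar to the query embedding."""
    ranked = sorted(doc_vectors, key=lambda pid: cosine(query, doc_vectors[pid]),
                    reverse=True)
    return ranked[:k]

# Toy "embeddings" of three abstracts, keyed by hypothetical PubMed ids.
abstracts = {
    "PMID:111": [0.90, 0.10, 0.05],
    "PMID:222": [0.20, 0.85, 0.10],
    "PMID:333": [0.05, 0.15, 0.95],
}
query = [0.88, 0.12, 0.08]            # embedding of the user's biomarker question
print(top_k(query, abstracts, k=1))   # → ['PMID:111']
```

The retrieval depth k mentioned in the abstract corresponds to the `k` parameter here: a larger k returns more supporting abstracts for the LLM to ground its hypothesis on, at the cost of a less focused context.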
This paper presents an overview of the first international shared task on Multilingual Author Profiling on SMS (MAPonSMS) at the Forum for Information Retrieval Evaluation (FIRE'18). The aim of the MAPonSMS task is to identify the author's gender and age for a given multilingual (Roman Urdu and English) SMS profile, where each profile consists of an aggregation of SMS messages from a single author. This paper provides the details of the dataset and its distribution, an overview of the submitted approaches, and the evaluation framework used for measuring the performance of the submitted multilingual author profiling systems.
Paraphrase plagiarism is a significant and widespread problem and research shows that it is hard to detect. Several methods and automatic systems have been proposed to deal with it. However, evaluation and comparison of such solutions is not possible because of the unavailability of benchmark corpora with manual examples of paraphrase plagiarism. To deal with this issue, we present the novel development of a paraphrase plagiarism corpus containing simulated (manually created) examples in the Urdu language - a language widely spoken around the world. This resource is the first of its kind developed for the Urdu language and we believe that it will be a valuable contribution to the evaluation of paraphrase plagiarism detection systems.
I am in the 17th year of my teaching career and have taught diverse subjects (from basic to advanced). Currently, I am involved in full-time teaching alongside academic research work.
I would be happy to talk to you (though I have limited time) if you need my assistance with your research or if any students need help with their studies.
You can find me at my office located at H-Block, cabin # 16.
I am at my office (apart from my scheduled lecture slots) on working days from 8:30 am until 6:30 pm, but you may consider calling or, preferably, dropping an email to fix an appointment.
You can find me at InfoLab21, Room # C30.
I am there weekdays from 9:00 am until 8:00 pm.