Assistant Professor
Computer Sciences Department, COMSATS University Islamabad, Lahore Campus, Pakistan.
I am an Assistant Professor at COMSATS University Islamabad, Lahore Campus, Pakistan, and a PhD scholar at Lancaster University, Lancaster, United Kingdom. I am a member of and active participant in two research groups: UCREL at Lancaster University and NLPT at CIIT Lahore. I am currently working on "Cross-lingual Text Reuse and Plagiarism Detection" with Dr. Paul Rayson and Dr. Rao Muhammad Adeel Nawab.
Teaching Interests: Data Science, Natural Language Processing, Machine Learning, Network Security, Computer Programming.
Research Interests: Urdu Natural Language Processing, Mono- and Cross-lingual Text Reuse and Plagiarism Detection, Natural Language Processing for Requirements Engineering.
Computer Sciences Department, COMSATS University Islamabad, Lahore Campus, Pakistan.
Computer Sciences Department, COMSATS Institute of Information Technology, Lahore, Pakistan.
Computer Sciences Department, University of Lahore, Lahore, Pakistan.
Doctor of Philosophy in Computer Science
Lancaster University, Lancaster, United Kingdom.
Master of Science in Computer Science
Swinburne University of Technology, Melbourne, Australia.
Bachelor of Science in Computer Science
University of Peshawar, Pakistan.
F.Sc (Pre-Engineering)
P.E.F Model Degree College for Boys, Peshawar, Pakistan.
Matric (Science)
F.G. Boys High School for Boys, Peshawar, Pakistan.
Since its foundation, one of the chief aims of CIIT has been to promote quality research, by engaging its faculty, students and researchers to challenge existing ideas and by providing a research-friendly environment. To encourage its faculty and promote quality research, the CIIT Research Productivity Awards (RPA) are an annual feature. The awards recognise research papers published by faculty, staff and students in a calendar year in impact-factor and ISI-indexed journals; awardees receive a certificate and a cash prize.
The Australian Computer Society is the professional association for Australia's Information and Communication Technology (ICT) sector. ACS is about recognising professionalism, developing ICT skills and building a community with a true sense of belonging. It helps members realise their professional ambitions in the global economy, making the most of an era of extraordinary possibility.
In recent years, due to easy access to the vast amount of multi-lingual information readily available on the Web, cases of cross-language text reuse and plagiarism have increased considerably and become a matter of concern, making their detection equally important. However, research indicates that current systems fail to detect reuse when the source text has been obfuscated, i.e. paraphrased after translation from another language. Due to the complexity involved, research in this field is still in its infancy.
One obstacle to developing and evaluating state-of-the-art methods for cross-language text reuse and plagiarism detection is the shortage of benchmark corpora containing real or simulated examples. The majority of the available corpora are for the English language (mono-lingual) or English-European language pairs (cross-lingual), and less attention has been devoted to developing resources for South Asian languages. The methods proposed in the literature for the cross-language text reuse and plagiarism detection task are based on language syntax (CL-CNG), parallel corpora (CL-ASA) or comparable corpora (CL-ESA); some require statistical dictionaries or knowledge bases (CL-CTS), and others (T+MA) imply language normalisation at the preprocessing step. These methods have been shown to produce fair results on syntactically similar languages and on verbatim cases of reuse. However, they have not been evaluated on cross-script cross-language plagiarism detection, as most of them require supporting resources which are scarce for under-resourced languages (e.g. Urdu).
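Of the methods above, CL-CNG is the simplest to illustrate: it compares documents across languages by the overlap of their character n-gram profiles, which works reasonably well when the two languages share a script. The following is a minimal Python sketch of that idea only; the function names and the choice of n = 3 are illustrative, not taken from any particular implementation.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Build a frequency profile of overlapping character n-grams.

    Lower-casing and replacing spaces with '_' is a common (illustrative)
    normalisation; CL-CNG variants differ in these details.
    """
    text = text.lower().replace(" ", "_")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram frequency profiles."""
    shared = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Two sentences in different (but same-script) languages still share
# character trigrams such as 'det', 'pla' and 'lan', so the score is non-zero.
src = char_ngrams("plagiarism detection across languages")
sus = char_ngrams("detection de plagiat entre langues")
score = cosine_similarity(src, sus)
```

This also makes the limitation noted above concrete: for a cross-script pair such as English-Urdu, the two texts share almost no character n-grams, so the profiles barely overlap and the method breaks down.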
Therefore, the aim of my research is to develop large scale mono and cross language text reuse and plagiarism detection corpora (for English-Urdu language pair) and develop and evaluate automatic methods that can detect text reuse across languages by overcoming the limitations of existing methods.
The list on the left displays my supervisors, research mentors in the field of NLP, and a few colleagues who inspire and motivate me to work hard every day.
COUNTER - Corpus Of Urdu News TExt Reuse is an Urdu text reuse corpus developed at CIIT Lahore in partnership with Lancaster University. The corpus is released with the intention of fostering research on mono-lingual text reuse detection systems, specifically for the Urdu language. The corpus has 600 source and 600 derived (suspicious) documents. It contains in total 275,387 words (tokens), 21,426 unique words and 10,841 sentences. It has been manually annotated at document level with three levels of reuse: wholly derived (135), partially derived (288) and non derived (177).
TRUE - Text Reuse English Urdu is a research project between NLPT at CIIT Lahore, Pakistan and UCREL at Lancaster University, Lancaster, UK. It aims to develop cross-script cross-language corpora and methods to detect text reuse at both document and sentence level. An initial corpus under development contains 2,500 source-derived document pairs. The source and derived documents are from the field of journalism and contain real examples of text reuse.
UPlag is a project that aims to contribute a benchmark Urdu plagiarism corpus with simulated as well as artificial examples of plagiarism. The project has a secondary focus on developing (or modifying) state-of-the-art techniques for Urdu plagiarism detection.
UPPC is a corpus that contains 160 documents (20 source documents and 140 suspicious ones). The source documents are original Wikipedia articles on 20 personalities, while the suspicious documents are either manually paraphrased versions produced by applying different rewriting techniques or independently written (non-plagiarised) documents. The resource is the first of its kind developed for the Urdu language and we believe it will be a valuable contribution to the evaluation of paraphrase plagiarism detection systems. The corpus can be used for: (1) the development, analysis and evaluation of automated paraphrase plagiarism detection systems for the Urdu language; (2) identifying which types of obfuscation (paraphrase strategies) are easy or difficult to detect; and (3) the Urdu paraphrase identification task, for which it is a valuable resource.
Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair. The Cross-Language English-Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories: near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (f1=0.732 binary, f1=0.552 ternary classification) indicate that it is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.
Text reuse occurs when one borrows text (either verbatim or paraphrased) from an earlier written text. A large and increasing amount of digital text is easily and readily available, making it simpler to reuse but difficult to detect. As a result, automatic detection of text reuse has attracted the attention of the research community due to the wide variety of applications associated with it. To develop and evaluate automatic methods for text reuse detection, standard evaluation resources are required. In this work, we propose one such resource for a significantly under-resourced language - Urdu, which is widely used in day-to-day communication and has a large digital footprint, particularly in the Indian subcontinent. Our proposed Urdu Short Text Reuse Corpus contains 2,684 short Urdu text pairs, manually labelled as verbatim (496), paraphrased (1,329), and independently written (859). In addition, we describe an evaluation of the corpus using various state-of-the-art text reuse detection methods with binary and multi-classification settings and a set of four classifiers. Output results show that Character n-gram Overlap using the J48 classifier outperforms other methods for the Urdu short text reuse detection task.
Text reuse is the process of creating new texts using existing ones. Freely available and easily accessible large on-line repositories are not only making reuse of text more common in society but also harder to detect programmatically. A major hindrance in the development and evaluation of existing mono-lingual text reuse detection methods, especially for South Asian languages, is the unavailability of standardised benchmark corpora. Amongst other things, a gold standard corpus enables researchers to directly compare with existing state-of-the-art methods. In our study, we address this gap by developing a benchmark corpus for one of the widely spoken but under-resourced languages, i.e. Urdu. The COUNTER corpus contains 1,200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at document level with three levels of reuse: wholly derived, partially derived and non derived. In this paper, we also apply two simple similarity estimation methods (n-gram overlap and longest common subsequence) on our corpus to show how it can be used in the evaluation of text reuse detection systems. The corpus is a vital resource for the development and evaluation of text reuse detection systems in general and specifically for the Urdu language.
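The two similarity estimation methods mentioned above can be sketched in a few lines of Python. This is a minimal illustration of the general techniques, not the exact configuration used in the paper: the function names, the containment-style normalisation by the derived document's length, and the word-level tokenisation by whitespace are all assumptions made here for clarity.

```python
def word_ngrams(text, n):
    """Set of word n-grams in a whitespace-tokenised text (illustrative tokeniser)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(source, derived, n=1):
    """Fraction of the derived text's n-grams that also appear in the source."""
    src, der = word_ngrams(source, n), word_ngrams(derived, n)
    return len(src & der) / len(der) if der else 0.0


def lcs_length(a_tokens, b_tokens):
    """Classic dynamic-programming longest common subsequence over token lists."""
    m, n = len(a_tokens), len(b_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a_tokens[i - 1] == b_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def lcs_score(source, derived):
    """LCS length normalised by the derived document's length (one common choice)."""
    a, b = source.split(), derived.split()
    return lcs_length(a, b) / len(b) if b else 0.0
```

A wholly derived document tends to score high on both measures, while LCS additionally rewards preserved word order, so a heavily reordered paraphrase can score high on unigram containment yet noticeably lower on LCS.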
Paraphrase plagiarism is a significant and widespread problem and research shows that it is hard to detect. Several methods and automatic systems have been proposed to deal with it. However, evaluation and comparison of such solutions is not possible because of the unavailability of benchmark corpora with manual examples of paraphrase plagiarism. To deal with this issue, we present the novel development of a paraphrase plagiarism corpus containing simulated (manually created) examples in the Urdu language - a language widely spoken around the world. This resource is the first of its kind developed for the Urdu language and we believe that it will be a valuable contribution to the evaluation of paraphrase plagiarism detection systems.
I am in the 13th year of my teaching career and have taught a diverse range of subjects, from basic to advanced. Currently I am involved in full-time teaching alongside academic research work.
I would be happy to talk to you (though I have limited time) if you need my assistance in your research, or if you are a student needing help with your studies.
You can find me at my office located at H-Block, cabin # 16.
I am at my office (apart from my scheduled lecture slots) on working days from 8:30 am until 6:30 pm, but you may call or, preferably, drop an email to fix an appointment.
You can find me at InfoLab21, Room # C30.
I am there on weekdays from 9:00 am until 8:00 pm.