Assistant Professor
Computer Sciences Department, COMSATS University Islamabad, Lahore Campus, Pakistan.
I am an Assistant Professor at COMSATS University Islamabad, Lahore Campus, Pakistan, and a PhD scholar at Lancaster University, Lancaster, United Kingdom. I am a member of and active participant in two research groups: UCREL at Lancaster University and NLPT at CIIT Lahore. I am currently working on "Cross-lingual Text Reuse and Plagiarism Detection" with Dr. Paul Rayson and Dr. Rao Muhammad Adeel Nawab.
Teaching Interests: Data Science, Natural Language Processing, Machine Learning, Network Security, Computer Programming.
Research Interests: Urdu Natural Language Processing, Mono- and Cross-lingual Text Reuse and Plagiarism Detection, Natural Language Processing for Requirements Engineering.
Computer Sciences Department, COMSATS University Islamabad, Lahore Campus, Pakistan.
Computer Sciences Department, COMSATS Institute of Information Technology, Lahore, Pakistan.
Computer Sciences Department, University of Lahore, Lahore, Pakistan.
Doctor of Philosophy in Computer Science
Lancaster University, Lancaster, United Kingdom.
Master of Science in Computer Science
Swinburne University of Technology, Melbourne, Australia.
Bachelor of Science in Computer Science
University of Peshawar, Pakistan.
F.Sc (Pre-Engineering)
P.E.F Model Degree College for Boys, Peshawar, Pakistan.
Matric (Science)
F.G. Boys High School for Boys, Peshawar, Pakistan.
Since its foundation, one of the chief aims of CIIT has been to promote quality research, by engaging its faculty, students and researchers to challenge existing ideas and by providing a research-friendly environment. To encourage its faculty and promote quality research, the CIIT Research Productivity Awards (RPA) are an annual feature. The awards recognise research papers published by faculty, staff and students in a calendar year in impact-factor and ISI-indexed journals; awardees receive a certificate and a cash prize.
The Australian Computer Society is the professional association for Australia's Information and Communication Technology (ICT) sector. ACS is about recognising professionalism, developing ICT skills and building a community with a true sense of belonging. It helps members realise their professional ambitions in the global economy, making the most of an era of extraordinary possibility.
In recent years, due to easy access to the vast amount of multi-lingual information readily available on the Web, cases of cross-language text reuse and plagiarism have increased considerably and become a matter of concern, making their detection equally important. However, research indicates that current systems fail to detect reuse when the source text has been obfuscated, i.e. paraphrased after translation from another language. Due to the complexity involved, research in this field is still in its infancy.
One obstacle to developing and evaluating state-of-the-art methods for cross-language text reuse and plagiarism detection is the shortage of benchmark corpora containing real or simulated examples. The majority of the available corpora are for the English language (mono-lingual) or English-European language pairs (cross-lingual), and less attention has been devoted to developing resources for South Asian languages. The methods proposed in the literature for the cross-language text reuse and plagiarism detection task are based on language syntax (CL-CNG), parallel corpora (CL-ASA) or comparable corpora (CL-ESA); some require statistical dictionaries or knowledge bases (CL-CTS), and others (T+MA) imply language normalisation at the preprocessing step. These methods have been shown to produce fair results on syntactically similar languages and on verbatim cases of reuse. However, they have not been evaluated on cross-script cross-language plagiarism detection, as most of them require supporting resources which are scarce for under-resourced languages (e.g. Urdu).
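Of the methods above, CL-CNG is the simplest to illustrate: it compares documents across languages by the overlap of their character n-gram profiles, which works reasonably well when the two languages share a script. The following is a minimal Python sketch of that idea only; the function names and the choice of n = 3 are illustrative, not taken from any particular implementation.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Build a frequency profile of overlapping character n-grams.

    Lower-casing and replacing spaces with '_' is a common (illustrative)
    normalisation; CL-CNG variants differ in these details.
    """
    text = text.lower().replace(" ", "_")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram frequency profiles."""
    shared = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Two sentences in different (but same-script) languages still share
# character trigrams such as 'det', 'pla' and 'lan', so the score is non-zero.
src = char_ngrams("plagiarism detection across languages")
sus = char_ngrams("detection de plagiat entre langues")
score = cosine_similarity(src, sus)
```

This also makes the limitation noted above concrete: for a cross-script pair such as English-Urdu, the two texts share almost no character n-grams, so the profiles barely overlap and the method breaks down.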
Therefore, the aim of my research is to develop large scale mono and cross language text reuse and plagiarism detection corpora (for English-Urdu language pair) and develop and evaluate automatic methods that can detect text reuse across languages by overcoming the limitations of existing methods.
The list on the left displays my supervisors, research mentors in the field of NLP, and a few colleagues who inspire and motivate me to work hard every day.
COUNTER - Corpus Of Urdu News TExt Reuse is an Urdu text reuse corpus developed at CIIT Lahore in partnership with Lancaster University. The corpus is released with the intention of fostering research on mono-lingual text reuse detection systems, specifically for the Urdu language. The corpus has 600 source and 600 derived (suspicious) documents. It contains in total 275,387 words (tokens), 21,426 unique words and 10,841 sentences. It has been manually annotated at document level with three levels of reuse: wholly derived (135), partially derived (288) and non derived (177).
TRUE - Text Reuse English Urdu is a research project between NLPT at CIIT Lahore, Pakistan and UCREL at Lancaster University, Lancaster, UK. It aims to develop cross-script cross-language corpora and methods to detect text reuse at both document and sentence level. An initial corpus under development contains 2,500 source-derived document pairs. The source and derived documents are from the field of journalism and contain real examples of text reuse.
UPlag is a project that aims to contribute a benchmark Urdu plagiarism corpus with simulated as well as artificial examples of plagiarism. The project has a secondary focus on developing (or modifying) state-of-the-art techniques for Urdu plagiarism detection.
UPPC is a corpus that contains 160 documents (20 source documents and 140 suspicious ones). The source documents are original Wikipedia articles on 20 personalities, while the suspicious documents are either manually paraphrased versions produced by applying different rewriting techniques or independently written (non-plagiarised) documents. The resource is the first of its kind developed for the Urdu language and we believe it will be a valuable contribution to the evaluation of paraphrase plagiarism detection systems. The corpus can be used for: (1) the development, analysis and evaluation of automated paraphrase plagiarism detection systems for the Urdu language; (2) identifying which types of obfuscation (paraphrase strategies) are easy or difficult to detect; and (3) the Urdu paraphrase identification task, for which it is a valuable resource.
Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair. The Cross-Language English-Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories: near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (f1=0.732 binary, f1=0.552 ternary classification) indicate that it is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.
Text reuse occurs when one borrows text (either verbatim or paraphrased) from an earlier written text. A large and increasing amount of digital text is easily and readily available, making it simpler to reuse but difficult to detect. As a result, automatic detection of text reuse has attracted the attention of the research community due to the wide variety of applications associated with it. To develop and evaluate automatic methods for text reuse detection, standard evaluation resources are required. In this work, we propose one such resource for a significantly under-resourced language - Urdu, which is widely used in day-to-day communication and has a large digital footprint, particularly in the Indian subcontinent. Our proposed Urdu Short Text Reuse Corpus contains 2,684 short Urdu text pairs, manually labelled as verbatim (496), paraphrased (1,329), and independently written (859). In addition, we describe an evaluation of the corpus using various state-of-the-art text reuse detection methods with binary and multi-classification settings and a set of four classifiers. Output results show that Character n-gram Overlap using the J48 classifier outperforms other methods for the Urdu short text reuse detection task.
Text reuse is the process of creating new texts using existing ones. Freely available and easily accessible large on-line repositories are not only making reuse of text more common in society but also harder to detect programmatically. A major hindrance in the development and evaluation of existing mono-lingual text reuse detection methods, especially for South Asian languages, is the unavailability of standardised benchmark corpora. Amongst other things, a gold standard corpus enables researchers to directly compare with existing state-of-the-art methods. In our study, we address this gap by developing a benchmark corpus for one of the widely spoken but under-resourced languages, i.e. Urdu. The COUNTER corpus contains 1,200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at document level with three levels of reuse: wholly derived, partially derived and non derived. In this paper, we also apply two simple similarity estimation methods (n-gram overlap and longest common subsequence) on our corpus to show how it can be used in the evaluation of text reuse detection systems. The corpus is a vital resource for the development and evaluation of text reuse detection systems in general and specifically for the Urdu language.
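The two similarity estimation methods mentioned above can be sketched in a few lines of Python. This is a minimal illustration of the general techniques, not the exact configuration used in the paper: the function names, the containment-style normalisation by the derived document's length, and the word-level tokenisation by whitespace are all assumptions made here for clarity.

```python
def word_ngrams(text, n):
    """Set of word n-grams in a whitespace-tokenised text (illustrative tokeniser)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(source, derived, n=1):
    """Fraction of the derived text's n-grams that also appear in the source."""
    src, der = word_ngrams(source, n), word_ngrams(derived, n)
    return len(src & der) / len(der) if der else 0.0


def lcs_length(a_tokens, b_tokens):
    """Classic dynamic-programming longest common subsequence over token lists."""
    m, n = len(a_tokens), len(b_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a_tokens[i - 1] == b_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def lcs_score(source, derived):
    """LCS length normalised by the derived document's length (one common choice)."""
    a, b = source.split(), derived.split()
    return lcs_length(a, b) / len(b) if b else 0.0
```

A wholly derived document tends to score high on both measures, while LCS additionally rewards preserved word order, so a heavily reordered paraphrase can score high on unigram containment yet noticeably lower on LCS.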
Paraphrase plagiarism is a significant and widespread problem and research shows that it is hard to detect. Several methods and automatic systems have been proposed to deal with it. However, evaluation and comparison of such solutions is not possible because of the unavailability of benchmark corpora with manual examples of paraphrase plagiarism. To deal with this issue, we present the novel development of a paraphrase plagiarism corpus containing simulated (manually created) examples in the Urdu language - a language widely spoken around the world. This resource is the first of its kind developed for the Urdu language and we believe that it will be a valuable contribution to the evaluation of paraphrase plagiarism detection systems.
I am in the 13th year of my teaching career and have taught a diverse range of subjects, from basic to advanced. Currently I am involved in full-time teaching alongside academic research work.
I would be happy to talk to you (though I have limited time) if you need my assistance in your research, or if you are a student needing help with your studies.
You can find me at my office located at H-Block, cabin # 16.
I am at my office (apart from my scheduled lecture slots) on working days from 8:30 am until 6:30 pm, but you may call or, preferably, drop an email to fix an appointment.
You can find me at InfoLab21, Room # C30.
I am there on weekdays from 9:00 am until 8:00 pm.