Natural language processing: state of the art, current trends and challenges

  • Published: 14 July 2022
  • Volume 82, pages 3713–3744 (2023)


  • Diksha Khurana 1,
  • Aditya Koli 1,
  • Kiran Khatter (ORCID: orcid.org/0000-0002-1000-6102) 2 &
  • Sukhdev Singh 3


Natural language processing (NLP) has recently gained much attention for representing and analyzing human language computationally. It has spread its applications in various fields such as machine translation, email spam detection, information extraction, summarization, medicine, and question answering. In this paper, we first distinguish four phases by discussing different levels of NLP and components of Natural Language Generation, followed by presenting the history and evolution of NLP. We then discuss in detail the state of the art, presenting the various applications of NLP as well as current trends and challenges. Finally, we present a discussion on some available datasets, models, and evaluation metrics in NLP.


1 Introduction

A language can be defined as a set of rules and symbols, where symbols are combined and used for conveying or broadcasting information. Since not all users are well-versed in machine-specific languages, Natural Language Processing (NLP) caters to those users who do not have enough time to learn or perfect a new language. In fact, NLP is a branch of Artificial Intelligence and Linguistics devoted to making computers understand statements or words written in human languages. It came into existence to ease the user's work and to satisfy the wish to communicate with the computer in natural language. It can be classified into two parts, i.e., Natural Language Understanding (or Linguistics) and Natural Language Generation, which cover the tasks of understanding and generating text. Linguistics is the science of language, which includes Phonology (sound), Morphology (word formation), Syntax (sentence structure), Semantics (meaning), and Pragmatics (understanding). Noam Chomsky, one of the foremost linguists of the twentieth century and a pioneer of syntactic theory, marked a unique position in the field of theoretical linguistics because he revolutionized the area of syntax (Chomsky, 1965) [ 23 ]. Further, Natural Language Generation (NLG) is the process of producing phrases, sentences, and paragraphs that are meaningful from an internal representation. The first objective of this paper is to give insights into the various important terminologies of NLP and NLG.

In the existing literature, most of the work in NLP has been conducted by computer scientists, while various other professionals, such as linguists, psychologists, and philosophers, have also shown interest. One of the most interesting aspects of NLP is that it adds to our knowledge of human language. The field of NLP is related to different theories and techniques that deal with the problem of communicating with computers in natural language. A few of the researched tasks of NLP are Automatic Summarization (producing an understandable summary of a set of texts and providing summaries or detailed information for text of a known type), Co-Reference Resolution (determining, within a sentence or larger span of text, all words that refer to the same object), Discourse Analysis (identifying the discourse structure of connected text, i.e., the study of text in relation to its social context), Machine Translation (automatic translation of text from one language to another), Morphological Segmentation (breaking words into individual meaning-bearing morphemes), Named Entity Recognition (NER; recognizing named entities in text for information extraction and classifying them into different classes), Optical Character Recognition (OCR; automatic text recognition by translating printed and handwritten text into machine-readable format), and Part-of-Speech Tagging (determining the part of speech for each word in a sentence). Some of these tasks have direct real-world applications, such as machine translation, named entity recognition, and optical character recognition. Though NLP tasks are very closely interwoven, they are frequently treated separately for convenience. Some of the tasks, such as automatic summarization and co-reference analysis, act as subtasks that are used in solving larger tasks. Nowadays NLP draws attention because of its various applications and recent developments, although in the late 1940s the term wasn't even in existence. So, it is interesting to know about the history of NLP, the progress made so far, and some of the ongoing projects that make use of NLP. The second objective of this paper focuses on these aspects. The third objective concerns datasets, approaches, evaluation metrics, and the challenges involved in NLP. The rest of this paper is organized as follows. Section 2 deals with the first objective, covering the various important terminologies of NLP and NLG. Section 3 deals with the history of NLP, applications of NLP, and a walkthrough of the recent developments. Datasets used in NLP and various approaches are presented in Section 4 , and Section 5 covers evaluation metrics and challenges involved in NLP. Finally, a conclusion is presented in Section 6 .

2 Components of NLP

NLP can be classified into two parts, i.e., Natural Language Understanding and Natural Language Generation, which cover the tasks of understanding and generating text. Figure 1 presents the broad classification of NLP. The objective of this section is to discuss Natural Language Understanding (NLU) and Natural Language Generation (NLG).

Figure 1: Broad classification of NLP

NLU enables machines to understand natural language and analyze it by extracting concepts, entities, emotions, keywords, etc. It is used in customer-care applications to understand the problems reported by customers either verbally or in writing. Linguistics is the science that involves the meaning of language, language context, and the various forms of language. It is therefore important to understand the various important terminologies of NLP and the different levels of NLP. We next discuss some of the commonly used terminologies at the different levels of NLP.

Phonology is the part of Linguistics that refers to the systematic arrangement of sound. The term phonology comes from Ancient Greek: the term phono means voice or sound, and the suffix -logy refers to word or speech. Nikolai Trubetzkoy defined Phonology as "the study of sound pertaining to the system of language," whereas Lass (1998) [ 66 ] wrote that phonology deals broadly with the sounds of language, as the sub-discipline of linguistics concerned with the behavior and organization of sounds. Phonology includes the systematic use of sound to encode meaning in any human language.

The different parts of a word represent the smallest units of meaning, known as morphemes. Morphology, which studies the formation of words, is built on morphemes. As an example, the word precancellation can be morphologically analyzed into three separate morphemes: the prefix pre, the root cancella, and the suffix -tion. The interpretation of a morpheme stays the same across words, so to understand the meaning of an unknown word, humans can break it into its morphemes. For example, adding the suffix -ed to a verb conveys that the action of the verb took place in the past. Words that cannot be divided further and have meaning by themselves are called lexical morphemes (e.g., table, chair). The affixes (e.g., -ed, -ing, -est, -ly, -ful) that combine with lexical morphemes are known as grammatical morphemes (e.g., worked, consulting, smallest, likely, useful). Grammatical morphemes that occur only in combination with other morphemes are called bound morphemes (e.g., -ed, -ing). Bound morphemes can be divided into inflectional morphemes and derivational morphemes. Adding an inflectional morpheme to a word changes grammatical categories such as tense, gender, person, mood, aspect, definiteness, and animacy. For example, adding the inflectional morpheme -ed changes the root park to parked. A derivational morpheme changes the semantic meaning of the word it is combined with. For example, in the word normalize, the addition of the bound morpheme -ize to the root normal changes the word from an adjective (normal) to a verb (normalize).

At the lexical level, humans, as well as NLP systems, interpret the meaning of individual words. Several types of processing contribute to word-level understanding, the first of these being the assignment of a part-of-speech tag to each word. In this processing, words that can act as more than one part of speech are assigned the most probable part-of-speech tag based on the context in which they occur. At the lexical level, words with a single meaning can be replaced by a semantic representation of that meaning; the nature of the representation varies according to the semantic theory deployed in the NLP system. Therefore, at the lexical level, the structure of words is analyzed with respect to their lexical meaning and PoS. In this analysis, text is divided into paragraphs, sentences, and words, and assigning the correct PoS tag improves the understanding of the intended meaning of a sentence. The lexical level is also used for cleaning and feature extraction through techniques such as removal of stop words, stemming, and lemmatization. Stop words such as 'in', 'the', and 'and' are removed as they don't contribute to any meaningful interpretation, and their frequency is also high, which may affect the computation time. Stemming reduces the words of a text by removing the suffix of a word to obtain its root form. For example, consulting and consultant are converted to the word consult after stemming, while using gets converted to us and driver is reduced to driv. Lemmatization does not simply remove the suffix of a word; instead, it returns the source word with the use of a vocabulary. For example, for the token drived, stemming results in "driv", whereas lemmatization attempts to return the correct basic form, either drive or drived, depending on the context in which it is used.
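As a toy illustration of these lexical-level preprocessing steps, the following Python sketch applies stop-word removal, Porter stemming, and WordNet lemmatization with NLTK; it assumes the relevant NLTK resources (stopwords, wordnet) have already been downloaded, and the exact outputs depend on the NLTK version.

```python
# Lexical-level preprocessing: stop-word removal, stemming, lemmatization.
# Assumes: nltk.download('stopwords'); nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = ["the", "consultant", "was", "consulting", "in", "the", "morning"]

stop = set(stopwords.words("english"))
content = [t for t in tokens if t not in stop]   # drops 'the', 'was', 'in'

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])        # ['consult', 'consult', 'morn']

lemmatizer = WordNetLemmatizer()                 # lemmas stay real words,
print([lemmatizer.lemmatize(t, pos="v") for t in content])  # unlike 'morn'
```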

After PoS tagging at the lexical level, the syntactic level groups words into phrases, phrases into clauses, and clauses into sentences. It emphasizes the correct formation of a sentence by analyzing its grammatical structure. The output of this level is a sentence that reveals the structural dependencies between words. This analysis is also known as parsing, which uncovers phrases that convey more meaning than the meanings of the individual words alone. The syntactic level examines word order, stop words, morphology, and the PoS of words, which the lexical level does not consider. Changing word order changes the dependencies among words and may also affect the comprehension of sentences. For example, in the sentences "Ram beats Shyam in a competition" and "Shyam beats Ram in a competition", only the word order differs, yet they convey different meanings [ 139 ]. This level retains stop words, as their removal changes the meaning of the sentence. It does not use lemmatization or stemming, because converting words to their basic forms changes the grammar of the sentence. It focuses on identifying the correct PoS of words in sentences. For example, in the phrase "frowns on his face", "frowns" is a noun, whereas it is a verb in the sentence "he frowns".
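The following minimal sketch shows this syntactic-level PoS disambiguation using NLTK's off-the-shelf tagger on the "frowns" example; the tagger model must be downloaded first, and the exact tags are model-dependent.

```python
# PoS tagging: the same surface form receives different tags in different
# syntactic contexts. Assumes: nltk.download('punkt');
# nltk.download('averaged_perceptron_tagger')
import nltk

for sentence in ["Frowns on his face worried us.", "He frowns at the news."]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# Expected: 'Frowns' tagged as a noun (NNS) in the first sentence,
# 'frowns' as a verb (VBZ) in the second.
```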

At the semantic level, the most important task is to determine the proper meaning of a sentence. To understand the meaning of a sentence, human beings rely on their knowledge of the language and of the concepts present in that sentence, but machines cannot count on these resources. Semantic processing determines the possible meanings of a sentence by processing its logical structure to recognize the most relevant words and to understand the interactions among words or concepts in the sentence. For example, it can recognize that a sentence is about "movies" even if it doesn't contain that actual word, provided it contains related concepts such as "actor", "actress", "dialogue", or "script". This level of processing also incorporates the semantic disambiguation of words with multiple senses (Elizabeth D. Liddy, 2001) [ 68 ]. For example, the word "bark" as a noun can mean either the sound that a dog makes or the outer covering of a tree. The semantic level examines words for their dictionary interpretation, or for an interpretation derived from the context of the sentence. For example, consider the sentence "Krishna is good and noble." This sentence is either talking about Lord Krishna or about a person named Krishna. To get the proper meaning of the sentence, the appropriate interpretation is chosen by looking at the rest of the sentence [ 44 ].
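As a brief sketch of semantic-level word sense disambiguation, NLTK ships a simplified Lesk implementation that picks the WordNet sense whose definition best overlaps the context; it assumes the wordnet corpus is downloaded, and Lesk's choice on the invented "bark" sentence below may vary.

```python
# Simplified Lesk word sense disambiguation over WordNet senses of 'bark'.
# Assumes: nltk.download('wordnet')
from nltk.wsd import lesk

context = "the dog 's loud bark scared the mailman".split()
sense = lesk(context, "bark", pos="n")    # choose the noun sense with the
print(sense, "->", sense.definition())    # greatest definition overlap
```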

While the syntactic and semantic levels deal with sentence-length units, the discourse level of NLP deals with more than one sentence. It analyzes the logical structure of text by making connections among words and sentences that ensure its coherence. It focuses on the properties of the text that convey meaning by interpreting the relations between sentences and uncovering linguistic structures from texts at several levels (Liddy, 2001) [ 68 ]. Two of the most common tasks are Anaphora Resolution and Coreference Resolution. Anaphora resolution is achieved by recognizing the entity referenced by an anaphor, resolving references used within the text in the same sense. For example: (i) Ram topped in the class. (ii) He was intelligent. Here (i) and (ii) together form a discourse. Human beings can quickly understand that the pronoun "he" in (ii) refers to "Ram" in (i). The interpretation of "he" depends on another word, "Ram", presented earlier in the text. Without determining the relationship between these two structures, it would not be possible to decide why Ram topped the class and who was intelligent. Coreference resolution is achieved by finding all expressions that refer to the same entity in a text. It is an important step in various NLP applications that involve high-level tasks such as document summarization and information extraction. In fact, anaphora is encoded through one of the processes of co-reference.
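To make the anaphora example concrete, here is a deliberately naive, self-contained sketch that resolves a pronoun to the most recent preceding capitalized word; real resolvers use gender, number, and syntactic constraints, none of which are modeled here.

```python
# A toy anaphora resolver: link each pronoun to the most recent proper noun.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

def resolve_pronouns(sentences):
    """Return (pronoun, antecedent) pairs across a discourse."""
    candidates, links = [], []
    for sent in sentences:
        for token in sent.split():
            word = token.strip(".,!?")
            if word.lower() in PRONOUNS and candidates:
                links.append((word, candidates[-1]))  # naive: latest entity wins
            elif word[:1].isupper():
                candidates.append(word)               # treat as a proper noun
    return links

print(resolve_pronouns(["Ram topped in the class.", "He was intelligent."]))
# -> [('He', 'Ram')]
```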

The pragmatic level focuses on knowledge or content that comes from outside the content of the document. It deals with what the speaker implies and what the listener infers; in fact, it analyzes what is not directly spoken in the sentences. Real-world knowledge is used to understand what is being talked about in the text, and by analyzing the context, a meaningful representation of the text is derived. Pragmatic ambiguity arises when a sentence is not specific and the context does not provide the information needed to clarify it (Walton, 1996) [ 143 ]; it occurs when different persons derive different interpretations of the text, depending on its context. The context of a text may include the references of other sentences in the same document, which influence the understanding of the text, and the background knowledge of the reader or speaker, which gives meaning to the concepts expressed in that text. Semantic analysis focuses on the literal meaning of words, whereas pragmatic analysis focuses on the inferred meaning that readers perceive based on their background knowledge. For example, the sentence "Do you know what time it is?" is interpreted as asking for the current time in semantic analysis, whereas in pragmatic analysis the same sentence may express resentment toward someone who missed a due time. Thus, semantic analysis is the study of the relationship between various linguistic utterances and their meanings, while pragmatic analysis is the study of the context that influences our understanding of linguistic expressions. Pragmatic analysis helps users to uncover the intended meaning of a text by applying contextual background knowledge.

The goal of NLP is to accommodate one or more specialties of an algorithm or system, and evaluating an NLP system on such capabilities allows for the integration of language understanding and language generation. NLP is even used in multilingual event detection. Rospocher et al. [ 112 ] proposed a novel modular system for cross-lingual event extraction from English, Dutch, and Italian texts, using different pipelines for different languages. The system incorporates a modular set of leading multilingual NLP tools. The pipeline integrates modules for basic NLP processing as well as more advanced tasks such as cross-lingual named entity linking, semantic role labeling, and time normalization. Thus, the cross-lingual framework allows for the interpretation of events, participants, locations, and times, as well as the relations between them. The output of these individual pipelines is intended to be used as input for a system that builds event-centric knowledge graphs. Each module takes standard input, performs its annotation, and produces standard output, which in turn becomes the input for the next module in the pipeline. The pipelines are built on a data-centric architecture so that modules can be adapted and replaced. Furthermore, the modular architecture allows for different configurations and for dynamic distribution.

Ambiguity is one of the major problems of natural language, occurring when one sentence can lead to different interpretations. It is usually faced at the syntactic, semantic, and lexical levels. In syntactic-level ambiguity, one sentence can be parsed into multiple syntactic forms. Semantic ambiguity occurs when the meaning of words can be misinterpreted. Lexical-level ambiguity refers to the ambiguity of a single word that can have multiple senses. Each of these levels can produce ambiguities that can be resolved using knowledge of the complete sentence. Ambiguity can be handled by various strategies such as Minimizing Ambiguity, Preserving Ambiguity, Interactive Disambiguation, and Weighting Ambiguity [ 125 ]. One of the approaches taken by researchers is to preserve ambiguity, e.g., (Shemtov 1997; Emele & Dorna 1998; Knight & Langkilde 2000; Tong Gao et al. 2015; Umber & Bajwa 2011) [ 39 , 46 , 65 , 125 , 139 ]. Their objectives are closely in line with the removal or minimization of ambiguity. They cover a wide range of ambiguities, and there is a statistical element implicit in their approach.

Natural Language Generation (NLG) is the process of producing phrases, sentences, and paragraphs that are meaningful from an internal representation. It is a part of Natural Language Processing and happens in four phases: identifying the goals, planning how the goals may be achieved by evaluating the situation and the available communicative resources, and realizing the plans as text (Fig. 2 ). It is the reverse of Natural Language Understanding.

Speaker and Generator

Figure 2: Components of NLG

To generate a text, we need to have a speaker or an application and a generator or a program that renders the application’s intentions into a fluent phrase relevant to the situation.

Components and Levels of Representation

The process of language generation involves the following interwoven tasks, sketched in code below. Content selection: information must be selected for inclusion; depending on how this information is parsed into representational units, parts of the units may have to be removed, while others may be added by default. Textual organization: the information must be textually organized according to the grammar; it must be ordered both sequentially and in terms of linguistic relations such as modifications. Linguistic resources: to support the information's realization, linguistic resources must be chosen; in the end these resources come down to choices of particular words, idioms, syntactic constructs, etc. Realization: the selected and organized resources must be realized as actual text or voice output.
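A toy end-to-end sketch of these four tasks follows; the weather facts, thresholds, and templates are invented purely for illustration.

```python
# Toy NLG pipeline: content selection -> textual organization ->
# linguistic resources -> realization.
facts = {"city": "Delhi", "temp_c": 41, "humidity": 0.2, "wind_kmh": 7}

def select_content(f):
    # Content selection: keep only the facts worth reporting.
    return {k: v for k, v in f.items() if k in ("city", "temp_c")}

def organize(content):
    # Textual organization: fix an order over the selected units.
    return [("city", content["city"]), ("temp_c", content["temp_c"])]

def choose_resources(units):
    # Linguistic resources: map units onto words and a syntactic frame.
    d = dict(units)
    adjective = "hot" if d["temp_c"] > 35 else "mild"
    return {"subject": f"The weather in {d['city']}",
            "predicate": f"is {adjective}",
            "detail": f"around {d['temp_c']} degrees Celsius"}

def realize(r):
    # Realization: render the plan as an actual sentence.
    return f"{r['subject']} {r['predicate']}, {r['detail']}."

print(realize(choose_resources(organize(select_content(facts)))))
# -> The weather in Delhi is hot, around 41 degrees Celsius.
```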

Application or Speaker

This component only maintains the model of the situation. The speaker just initiates the process and doesn't take part in the language generation itself. It stores the history, structures the potentially relevant content, and deploys a representation of what it knows. All of this forms the situation while the speaker selects a subset of the propositions it holds. The only requirement is that the speaker must make sense of the situation [ 91 ].

3 NLP: Then and now

In the late 1940s the term NLP did not yet exist, but work on machine translation (MT) had started; research in this period was not completely localized. Russian and English were the dominant languages for MT (Andreev, 1967) [ 4 ]. MT/NLP research almost died in 1966 following the ALPAC report, which concluded that MT was going nowhere. But later, some MT production systems were providing output to their customers (Hutchins, 1986) [ 60 ]. By this time, work on the use of computers for literary and linguistic studies had also started. As early as 1960, signature work influenced by AI began, with the BASEBALL Q-A system (Green et al., 1961) [ 51 ]. LUNAR (Woods, 1978) [ 152 ] and Winograd's SHRDLU were natural successors of these systems, representing a step up in sophistication in terms of their linguistic and task-processing capabilities. There was a widespread belief that progress could only be made on two fronts: one was the ARPA Speech Understanding Research (SUR) project (Lea, 1980), and the other comprised major system development projects building database front ends. The front-end projects (Hendrix et al., 1978) [ 55 ] were intended to go beyond LUNAR in interfacing with large databases. In the early 1980s computational grammar theory became a very active area of research, linked with logics for meaning and knowledge able to deal with the user's beliefs and intentions and with functions like emphasis and themes.

By the end of the decade, powerful general-purpose sentence processors like SRI's Core Language Engine (Alshawi, 1992) [ 2 ] and Discourse Representation Theory (Kamp and Reyle, 1993) [ 62 ] offered a means of tackling more extended discourse within the grammatico-logical framework. This was a period of growing community: practical resources, grammars, tools, and parsers became available (for example, the Alvey Natural Language Tools) (Briscoe et al., 1987) [ 18 ]. The (D)ARPA speech recognition and message understanding (information extraction) conferences were significant not only for the tasks they addressed but for their emphasis on heavy evaluation, starting a trend that became a major feature of the 1990s (Young and Chase, 1998; Sundheim and Chinchor, 1993) [ 131 , 157 ]. Work on user modeling (Wahlster and Kobsa, 1989) [ 142 ] was one strand of this research. Cohen et al. (2002) [ 28 ] put forward a first approximation of a compositional theory of tune interpretation, together with the phonological assumptions on which it is based and the evidence from which they drew their proposals. At the same time, McKeown (1985) [ 85 ] demonstrated that rhetorical schemas could be used for producing both linguistically coherent and communicatively effective text. Some research in NLP marked important topics for the future, such as word sense disambiguation (Small et al., 1988) [ 126 ] and probabilistic networks; statistically colored NLP and work on the lexicon also pointed in this direction. Statistical language processing became a major force in the 1990s (Manning and Schuetze, 1999) [ 75 ], drawing in more than just data analysts. Information extraction and automatic summarizing (Mani and Maybury, 1999) [ 74 ] were also points of focus. Next, we present a walkthrough of the developments from the early 2000s.

3.1 A walkthrough of recent developments in NLP

The main objectives of NLP include the interpretation, analysis, and manipulation of natural language data for an intended purpose, with the use of various algorithms, tools, and methods. However, many challenges are involved, which may depend on the natural language data under consideration, and this makes it difficult to achieve all the objectives with a single approach. Therefore, the development of different tools and methods in NLP and relevant areas of study has received much attention from several researchers in the recent past. The developments can be seen in Fig. 3 :

Figure 3: A walkthrough of recent developments in NLP

In the early 2000s came neural language modeling, in which the probability of the next word (token) is determined given the n previous words. Bengio et al. [ 12 ] proposed a feed-forward neural network together with a lookup table that represents the n previous words in sequence. Collobert et al. [ 29 ] proposed the application of multitask learning in NLP, where two convolutional models with max pooling were used to perform part-of-speech and named entity recognition tagging. Mikolov et al. [ 87 ] proposed a word embedding process that addressed the dense vector representation of text; they also reported the challenges faced by the traditional sparse bag-of-words representation. After the advancement of word embeddings, neural networks that take variable-length input were introduced in NLP for further processing. Sutskever et al. [ 132 ] proposed a general framework for sequence-to-sequence mapping, where encoder and decoder networks are used to map from sequence to vector and from vector to sequence, respectively. In fact, the use of neural networks has played a very important role in NLP. One can observe from the existing literature that neural networks saw little use in the early 2000s, but by 2013 enough discussion had taken place about their use in NLP to transform many things and pave the way for implementing various neural networks in the field. Convolutional neural networks (CNNs) first contributed to image classification and the analysis of visual imagery; later, CNNs can be observed tackling NLP tasks such as Sentence Classification [ 127 ], Sentiment Analysis [ 135 ], Text Classification [ 118 ], Text Summarization [ 158 ], Machine Translation [ 70 ], and Answer Relations [ 150 ]. An article by Newatia (2019) [ 93 ] illustrates the general architecture behind any CNN model and how it can be used in the context of NLP. One can also refer to the work of Wang and Gang [ 145 ] for applications of CNN in NLP. Further, neural networks that are recurrent in nature, performing the same function for every element of a sequence, known as Recurrent Neural Networks (RNNs), have also been used in NLP and are found ideal for sequential data such as text, time series, financial data, speech, audio, and video, among others; see the article by Thomas (2019) [ 137 ]. One modified version of the RNN is the Long Short-Term Memory (LSTM) network, which is very useful in cases where only the desired important information needs to be retained for a much longer time while discarding irrelevant information; see [ 52 , 58 ]. Further development of the LSTM has also led to a slightly simpler variant, the gated recurrent unit (GRU), which has shown better results than standard LSTMs on many tasks [ 22 , 26 ]. Attention mechanisms [ 7 ], which let a network learn what to pay attention to in accordance with the current hidden state and annotation, together with the use of transformers, have also driven significant developments in NLP; see [ 141 ]. It should be noted that transformers have the potential to learn longer-term dependencies but are limited by a fixed-length context in the setting of language modeling. In this direction, Dai et al. [ 30 ] recently proposed a novel neural architecture, Transformer-XL (XL for extra-long), which enables learning dependencies beyond a fixed length of words. Further, the work of Rae et al. [ 104 ] on the Compressive Transformer, an attentive sequence model that compresses memories for long-range sequence learning, may be helpful for readers. One may also refer to the recent work by Otter et al. [ 98 ] on the uses of deep learning for NLP, and the relevant references cited therein. The BERT (Bidirectional Encoder Representations from Transformers) model [ 33 ] and its successors have also played an important role in NLP.
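To ground the neural language modeling thread above, here is a minimal PyTorch sketch of the standard recipe: an embedding lookup table feeding an LSTM that scores the next token at each position. All sizes and the random batch are illustrative choices, not values from any cited paper.

```python
# Minimal neural language model: embedding lookup -> LSTM -> next-token scores.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # dense lookup table
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # scores per vocab word

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)         # (batch, seq_len, hidden_dim)
        return self.out(h)          # (batch, seq_len, vocab_size)

model = LSTMLanguageModel()
batch = torch.randint(0, 1000, (2, 10))  # two dummy sequences of 10 token ids
print(model(batch).shape)                # torch.Size([2, 10, 1000])
```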

Many researchers have worked on NLP, building the tools and systems that make NLP what it is today. Tools like sentiment analysers, Parts of Speech (POS) taggers, chunking, Named Entity Recognition (NER), emotion detection, and Semantic Role Labeling have made a huge contribution to NLP and are good topics for research. Sentiment analysis (Nasukawa et al., 2003) [ 156 ] works by extracting sentiments about a given topic, and consists of topic-specific feature term extraction, sentiment extraction, and association by relationship analysis. It utilizes two linguistic resources for the analysis: a sentiment lexicon and a sentiment pattern database. It analyzes documents for positive and negative words and tries to give ratings on a scale of −5 to +5. The mainstream of currently used tagsets is obtained from English: the most widely used tagsets serving as standard guidelines are designed for Indo-European languages, while tagsets for Asian or Middle Eastern languages are less researched. Various authors have done research on building part-of-speech taggers for languages such as Arabic (Zeroual et al., 2017) [ 160 ], Sanskrit (Tapswi & Jain, 2012) [ 136 ], and Hindi (Ranjan & Basu, 2003) [ 105 ] to efficiently tag and classify words as nouns, adjectives, verbs, etc. The authors in [ 136 ] used a treebank technique to create a rule-based POS tagger for the Sanskrit language: Sanskrit sentences are parsed to assign the appropriate tag to each word using a suffix-stripping algorithm, wherein the longest suffix is searched in a suffix table and tags are assigned. Diab et al. (2004) [ 34 ] used a supervised machine learning approach, adopting Support Vector Machines (SVMs) trained on the Arabic Treebank to automatically tokenize, part-of-speech tag, and annotate base phrases in Arabic text.
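As a flavor of the lexicon-based scoring described for sentiment analysis, the toy sketch below rates a text on the −5 to +5 scale mentioned above; the lexicon entries, negation handling, and clipping scheme are all invented for illustration.

```python
# Toy lexicon-based sentiment rating on a -5..+5 scale.
LEXICON = {"good": 2, "great": 3, "excellent": 4, "bad": -2,
           "awful": -4, "boring": -2, "love": 3, "hate": -3}
NEGATIONS = {"not", "never", "no"}

def rate(text):
    score, negate = 0, False
    for raw in text.lower().split():
        w = raw.strip(".,!?")
        if w in NEGATIONS:
            negate = True                       # flip the next opinion word
        elif w in LEXICON:
            score += -LEXICON[w] if negate else LEXICON[w]
            negate = False
    return max(-5, min(5, score))               # clip to the -5..+5 scale

print(rate("The plot was great and the acting excellent"))  # 5 (clipped from 7)
print(rate("Not good, in fact quite boring"))                # -4
```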

Chunking is the process of separating phrases from unstructured text. Since simple tokens may not represent the actual meaning of the text, it is advisable to use phrases such as "North Africa" as a single unit instead of the separate words 'North' and 'Africa'. Chunking, also known as "shallow parsing", labels parts of sentences with syntactic constituents such as Noun Phrase (NP) and Verb Phrase (VP). Chunking is often evaluated using the CoNLL 2000 shared task. Various researchers (Sha and Pereira, 2003; McDonald et al., 2005; Sun et al., 2008) [ 83 , 122 , 130 ] used the CoNLL test data for chunking, with features composed of words, POS tags, and chunk tags.
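A minimal chunking sketch with NLTK's rule-based RegexpParser is shown below; it assumes the punkt and tagger resources are downloaded, and the single NP rule is only a starting point compared with the learned chunkers cited above.

```python
# Shallow parsing: extract noun-phrase (NP) chunks with one regex rule.
# Assumes: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")  # det + adjs + nouns

tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox visited North Africa"))
tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# Expected: 'The quick brown fox' and 'North Africa' as single NP chunks
```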

There are particular words in a document that refer to specific entities or real-world objects like locations, people, and organizations. To find words that have a unique context and are more informative, noun phrases are considered in the text documents. Named entity recognition (NER) is a technique to recognize and separate named entities and group them under predefined classes. In the era of the Internet, however, people use slang rather than traditional or standard English, which standard natural language processing tools cannot process well. Ritter (2011) [ 111 ] proposed the classification of named entities in tweets because standard NLP tools did not perform well on them, rebuilding the NLP pipeline starting from PoS tagging, then chunking, and then NER. This improved the performance in comparison to standard NLP tools.
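For contrast with tweet-specific pipelines, the sketch below shows off-the-shelf NER on standard English using spaCy; it assumes the small English model is installed (python -m spacy download en_core_web_sm), and the exact entities found are model-dependent.

```python
# Off-the-shelf named entity recognition with spaCy's small English model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in New Delhi in September, "
          "said Tim Cook.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical output (model-dependent):
#   Apple -> ORG, New Delhi -> GPE, September -> DATE, Tim Cook -> PERSON
```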

Emotion detection investigates and identifies types of emotion from speech, facial expressions, gestures, and text. Sharma (2016) [ 124 ] analyzed conversations in Hinglish (a mix of the English and Hindi languages) and identified the usage patterns of PoS. Their work was based on identifying the language and POS tagging of the mixed script. They tried to detect emotions in the mixed script by combining machine learning with human knowledge. They categorized sentences into six groups based on emotion and used the TLBO technique to help users prioritize their messages based on the emotions attached to each message. Seal et al. (2020) [ 120 ] proposed an efficient emotion detection method that searches for emotional words in a pre-defined emotional keyword database and analyzes the emotion words, phrasal verbs, and negation words. Their proposed approach exhibited better performance than recent approaches.

Semantic Role Labeling (SRL) works by assigning semantic roles within a sentence. For example, in the PropBank formalism (Palmer et al., 2005) [ 100 ], one assigns roles to words that are arguments of a verb in the sentence. The precise arguments depend on the verb frame, and if multiple verbs exist in a sentence, a word might have multiple tags. State-of-the-art SRL systems comprise several stages: creating a parse tree, identifying which parse tree nodes represent the arguments of a given verb, and finally classifying these nodes to compute the corresponding SRL tags.

Benson et al. (2011) [ 13 ] addressed event discovery in social media feeds, using a graphical model to analyze a feed and determine whether it contains the name of a person, the name of a venue, a place, a time, etc. The model operates on noisy feeds of data, extracting records of events by aggregating information across multiple messages; despite irrelevant messages and very irregular message language, it was able to extract records with a broad array of features.

Having given insights into some of the mentioned tools and the relevant work done, we now move to the broad applications of NLP.

3.2 Applications of NLP

Natural Language Processing can be applied in various areas like Machine Translation, Email Spam Detection, Information Extraction, Summarization, and Question Answering. Next, we discuss some of these areas and the relevant work done in those directions.

Machine Translation

As most of the world is online, the task of making data accessible and available to all is a challenge, and a major obstacle is the language barrier: there is a multitude of languages with different sentence structures and grammars. Machine translation is, generally, the translation of phrases from one language to another with the help of a statistical engine like Google Translate. The challenge with machine translation technologies is not in directly translating words but in keeping the meaning of sentences intact, along with grammar and tenses. Statistical machine learning gathers as much data as it can find that appears to be parallel between two languages and crunches that data to find the likelihood that something in language A corresponds to something in language B. As for Google, in September 2016 it announced a new machine translation system based on artificial neural networks and deep learning. In recent years, various methods have been proposed to automatically evaluate machine translation quality by comparing hypothesis translations with reference translations. Examples of such methods are word error rate, position-independent word error rate (Tillmann et al., 1997) [ 138 ], generation string accuracy (Bangalore et al., 2000) [ 8 ], multi-reference word error rate (Nießen et al., 2000) [ 95 ], BLEU score (Papineni et al., 2002) [ 101 ], and NIST score (Doddington, 2002) [ 35 ]. All of these criteria try to approximate human assessment and often achieve an astonishing degree of correlation with human subjective evaluations of fluency and adequacy (Papineni et al., 2001; Doddington, 2002) [ 35 , 101 ].
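As a small illustration of the reference-comparison metrics above, the sketch below computes a sentence-level BLEU score with NLTK; the sentence pair is invented, and smoothing is applied because BLEU is unstable on very short segments.

```python
# Sentence-level BLEU: n-gram overlap between a hypothesis and reference(s).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]  # human translation(s)
hypothesis = ["the", "cat", "sat", "on", "the", "mat"]   # system output

smooth = SmoothingFunction().method1   # avoid zero scores on short sentences
print(f"BLEU: {sentence_bleu(reference, hypothesis, smoothing_function=smooth):.3f}")
# Closer to 1.0 means closer to the reference translation.
```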

Text Categorization

Categorization systems take as input a large flow of data, such as official documents, military casualty reports, market data, and newswires, and assign it to predefined categories or indices. For example, the Carnegie Group's Construe system (Hayes, 1991) [ 54 ] takes Reuters articles as input and saves much time by doing the work otherwise done by human indexers. Some companies have used categorization systems to categorize trouble tickets or complaint requests and route them to the appropriate desks. Another application of text categorization is email spam filtering. Spam filters are becoming important as the first line of defence against unwanted emails. The false-negative and false-positive issues of spam filters sit at the heart of NLP technology, coming down to the challenge of extracting meaning from strings of text. A filtering solution applied to an email system uses a set of protocols to determine which incoming messages are spam and which are not. There are several types of spam filters available. Content filters review the content within a message to determine whether it is spam. Header filters review the email header looking for fake information. General blacklist filters stop all emails from blacklisted senders. Rules-based filters use user-defined criteria, such as stopping mail from a specific person or stopping mail that includes a specific word. Permission filters require anyone sending a message to be pre-approved by the recipient. Challenge-response filters require anyone sending a message to enter a code to gain permission to send email.

Spam Filtering

Spam filtering works using text categorization, and in recent times various machine learning techniques have been applied to text categorization or anti-spam filtering, such as Rule Learning (Cohen, 1996) [ 27 ], Naïve Bayes (Sahami et al., 1998; Androutsopoulos et al., 2000; Rennie, 2000) [ 5 , 109 , 115 ], Memory-based Learning (Sakkis et al., 2000b) [ 117 ], Support Vector Machines (Drucker et al., 1999) [ 36 ], Decision Trees (Carreras and Marquez, 2001) [ 19 ], Maximum Entropy Models (Berger et al., 1996) [ 14 ], and Hash Forest with a rule encoding method (T. Xia, 2020) [ 153 ], sometimes combining different learners (Sakkis et al., 2001) [ 116 ]. Using these approaches is preferable, as the classifier is learned from training data rather than built by hand. Naïve Bayes is often preferred because of its performance despite its simplicity (Lewis, 1998) [ 67 ]. In text categorization, two types of generative models have been used (McCallum and Nigam, 1998) [ 77 ]. Both models assume that a fixed vocabulary is present. In the first model, a document is generated by first choosing a subset of the vocabulary and then using each selected word any number of times, at least once, irrespective of order. This is called the multi-variate Bernoulli model; it captures which words are used in a document, irrespective of word counts and order. In the second model, a document is generated by choosing a set of word occurrences and arranging them in any order. This is called the multinomial model; in addition to what the multi-variate Bernoulli model captures, it also captures how many times each word is used in a document. Most text categorization approaches to anti-spam email filtering have used the multi-variate Bernoulli model (Androutsopoulos et al., 2000) [ 5 , 15 ].
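The contrast between the two models can be made concrete with scikit-learn, where binary presence features approximate the multi-variate Bernoulli model and count features the multinomial one; the four-message corpus is invented for illustration.

```python
# Multi-variate Bernoulli vs multinomial naive Bayes on a tiny spam corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

train = ["win cash prize now", "cheap pills buy now",
         "meeting agenda attached", "lunch at noon tomorrow"]
labels = [1, 1, 0, 0]                       # 1 = spam, 0 = ham
test = ["win a cheap prize", "agenda for the meeting"]

# Bernoulli model: binary features record only which words occur.
bvec = CountVectorizer(binary=True)
bern = BernoulliNB().fit(bvec.fit_transform(train), labels)
print(bern.predict(bvec.transform(test)))   # expected: [1 0]

# Multinomial model: count features also record how often words occur.
mvec = CountVectorizer()
multi = MultinomialNB().fit(mvec.fit_transform(train), labels)
print(multi.predict(mvec.transform(test)))  # expected: [1 0]
```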

Information Extraction

Information extraction is concerned with identifying phrases of interest in textual data. For many applications, extracting entities such as names, places, events, dates, times, and prices is a powerful way of summarizing the information relevant to a user's needs. In the case of a domain-specific search engine, the automatic identification of important information can increase the accuracy and efficiency of a directed search. Hidden Markov models (HMMs) have been used to extract the relevant fields of research papers; the extracted text segments are used to allow searches over specific fields, to provide effective presentation of search results, and to match references to papers. For example, consider the pop-up ads on websites showing recent items you may have looked at in an online store, offered with discounts. In information retrieval, the same two generative models discussed under spam filtering have been used (McCallum and Nigam, 1998) [ 77 ]: the multi-variate Bernoulli model, which records only which vocabulary words appear in a document, and the multinomial model, which additionally captures how many times each word is used.

Discovery of knowledge has become an important area of research in recent years. Knowledge discovery research uses a variety of techniques to extract useful information from source documents, such as Parts of Speech (POS) tagging; chunking or shallow parsing; stop-word removal (keywords that must be removed before processing documents); stemming (mapping words to some base form; it has two methods, dictionary-based stemming and Porter-style stemming (Porter, 1980) [ 103 ]; the former has higher accuracy but a higher implementation cost, while the latter has a lower implementation cost and is usually sufficient for IR); compound or statistical phrases (which index multi-token units instead of single tokens); and word sense disambiguation (the task of determining the correct sense of a word in context; when used for information retrieval, terms are replaced by their senses in the document vector).

The extracted information can be applied for a variety of purposes, for example to prepare a summary, build databases, identify keywords, or classify text items according to pre-defined categories. For example, CONSTRUE, developed for Reuters, is used to classify news stories (Hayes, 1992) [ 54 ]. It has been suggested that while many IE systems can successfully extract terms from documents, acquiring relations between the terms is still a difficulty. PROMETHEE is a system that extracts lexico-syntactic patterns relative to a specific conceptual relation (Morin, 1999) [ 89 ]. IE systems should work at many levels, from word recognition to discourse analysis at the level of the complete document. Bondale et al. (1999) [ 16 ] applied the Blank Slate Language Processor (BSLP) approach to the analysis of a real-life natural language corpus consisting of responses to open-ended questionnaires in the field of advertising.

There is a system called MITA (MetLife's Intelligent Text Analyzer) (Glasgow et al., 1998) [ 48 ] that extracts information from life insurance applications. Ahonen et al. (1998) [ 1 ] suggested a mainstream framework for text mining that uses pragmatic and discourse-level analyses of text.

Summarization

Information overload is a real problem in this digital age, and our reach and access to knowledge and information already exceed our capacity to understand it. This trend is not slowing down, so the ability to summarize data while keeping the meaning intact is highly required. This is important not just for allowing us to recognize and understand the important information in a large set of data; it is also used to extract deeper emotional meaning. For example, a company may determine the general sentiment on social media toward its latest product offering; such an application is a valuable marketing asset.

Text summarization can be categorized by the number of documents involved, the two important categories being single-document summarization and multi-document summarization (Zajic et al. 2008 [ 159 ]; Fattah and Ren 2009 [ 43 ]). Summaries can also be of two types: generic or query-focused (Gong and Liu 2001 [ 50 ]; Dunlavy et al. 2007 [ 37 ]; Wan 2008 [ 144 ]; Ouyang et al. 2011 [ 99 ]). The summarization task can be either supervised or unsupervised (Mani and Maybury 1999 [ 74 ]; Fattah and Ren 2009 [ 43 ]; Riedhammer et al. 2010 [ 110 ]). A supervised system requires training data for selecting relevant material from the documents, and a large amount of annotated data is needed for the learning techniques. A few techniques are as follows (a minimal extractive sketch follows the list):

Bayesian Sentence based Topic Model (BSTM) uses both term-sentences and term document associations for summarizing multiple documents. (Wang et al. 2009 [ 146 ])

Factorization with Given Bases (FGB) is a language model in which sentence bases are the given bases and which utilizes document-term and sentence-term matrices. This approach groups and summarizes documents simultaneously. (Wang et al. 2011 [ 147 ])

Topic Aspect-Oriented Summarization (TAOS) is based on topic factors. These topic factors are various features that describe topics, e.g., capitalized words are used to represent entities. Various topics can have various aspects, and various preferences over features are used to represent the various aspects. (Fang et al. 2015 [ 42 ])
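The promised sketch: a minimal unsupervised extractive summarizer that scores each sentence by the frequency of its content words and keeps the top-ranked sentence, a far simpler relative of the models above; the stop-word list and scoring are invented for illustration.

```python
# Frequency-based extractive summarization: keep the highest-scoring sentence.
from collections import Counter
import re

STOP = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "from"}

def summarize(text, n_sentences=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    freq = Counter(words)                    # document-level term frequencies
    def score(sent):
        toks = [w for w in re.findall(r"[a-z]+", sent.lower()) if w not in STOP]
        return sum(freq[w] for w in toks) / max(len(toks), 1)
    keep = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in keep)   # keep original order

doc = ("NLP systems analyze text. Summarization systems compress text. "
       "Extractive summarization selects representative sentences from text.")
print(summarize(doc))   # -> "Summarization systems compress text."
```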

Dialogue System

Dialogue systems are very prominent in real-world applications, ranging from providing support to performing particular actions. Support dialogue systems require context awareness, whereas a system that performs an action doesn't require much context awareness. Earlier dialogue systems focused on small applications such as home theater systems and utilized the phonemic and lexical levels of language. Habitable dialogue systems offer the potential for fully automated dialog systems by utilizing all levels of a language (Liddy, 2001) [ 68 ]. This leads to systems that enable robots to interact with humans in natural language, such as Google Assistant, Microsoft's Cortana, Apple's Siri, and Amazon's Alexa.

NLP is applied in the medical field as well. The Linguistic String Project-Medical Language Processor (LSP-MLP) is one of the large-scale NLP projects in medicine [ 21 , 53 , 57 , 71 , 114 ]. The LSP-MLP helps enable physicians to extract and summarize information on signs or symptoms, drug dosage, and response data, with the aim of identifying possible side effects of a medicine while highlighting or flagging data items [ 114 ]. The National Library of Medicine is developing The Specialist System [ 78 , 79 , 80 , 82 , 84 ], which is expected to function as an information extraction tool for biomedical knowledge bases, particularly Medline abstracts. Its lexicon was created using MeSH (Medical Subject Headings), Dorland's Illustrated Medical Dictionary, and general English dictionaries. The Centre d'Informatique Hospitaliere of the Hopital Cantonal de Geneve is working on an electronic archiving environment with NLP features [ 81 , 119 ]. In the first phase, patient records were archived. At a later stage the LSP-MLP was adapted for French [ 10 , 72 , 94 , 113 ], and finally a proper NLP system called RECIT [ 9 , 11 , 17 , 106 ] was developed using a method called Proximity Processing [ 88 ]. Its task was to implement a robust multilingual system able to analyze and comprehend medical sentences and to convert the knowledge in free text into a language-independent knowledge representation [ 107 , 108 ]. Columbia University in New York has developed an NLP system called MEDLEE (MEDical Language Extraction and Encoding System) that identifies clinical information in narrative reports and transforms the textual information into a structured representation [ 45 ].

3.3 NLP in talk

We next discuss some of the recent NLP projects implemented by various companies:

ACE Powered GDPR Robot Launched by RAVN Systems [ 134 ]

RAVN Systems, a leading expert in Artificial Intelligence (AI), Search and Knowledge Management Solutions, announced the launch of its ACE ("Applied Cognitive Engine") powered software robot to help and facilitate compliance with the GDPR ("General Data Protection Regulation"). The robot uses AI techniques to automatically analyze documents and other types of data in any business system that is subject to GDPR rules. It allows users to search, retrieve, flag, classify, and report on data deemed super-sensitive under GDPR quickly and easily. Users can also identify personal data in documents, view feeds on the latest personal data that requires attention, and produce reports on the data suggested to be deleted or secured. RAVN's GDPR robot is also able to hasten requests for information (Data Subject Access Requests, "DSARs") in a simple and efficient way, removing the need for a manual approach to these requests, which tends to be very labor-intensive. Peter Wallqvist, CSO at RAVN Systems, commented, "GDPR compliance is of universal paramountcy as it will be exploited by any organization that controls and processes data concerning EU citizens."

Link: http://markets.financialcontent.com/stocks/news/read/33888795/RAVN_Systems_Launch_the_ACE_Powered_GDPR_Robot

Eno: A Natural Language Chatbot Launched by Capital One [ 56 ]

Capital One announced a chatbot for customers called Eno. Eno is a natural language chatbot with which people interact through texting. Capital One claims that Eno is the first natural language SMS chatbot from a U.S. bank that allows customers to ask questions using natural language. Customers can interact with Eno through a text interface, asking questions about their savings and other matters. Eno creates an environment in which it feels as if a human is interacting. This provides a different platform from other brands that launch chatbots on Facebook Messenger and Skype. Capital One believed that Facebook has too much access to the private information of a person, which could get the bank into trouble with the privacy laws that U.S. financial institutions work under; for example, a Facebook Page admin can access full transcripts of a bot's conversations. If that were the case, the admins could easily view the personal banking information of customers, which would not be right.

Link: https://www.macobserver.com/analysis/capital-one-natural-language-chatbot-eno/

Future of BI in Natural Language Processing [ 140 ]

Several companies in the BI space are trying to get with the trend and are working hard to ensure that data becomes more friendly and easily accessible, but there is still a long way to go. NLP will also make BI easier to access, as a GUI is not needed: nowadays queries are made by text or voice command on smartphones. One of the most common examples is that Google might tell you today what tomorrow's weather will be. But soon enough, we will be able to ask a personal data chatbot about customer sentiment today and how customers will feel about the brand next week, all while walking down the street. Today, NLP tends to be based on turning natural language into machine language. But as the technology matures, especially the AI component, the computer will get better at "understanding" the query and start to deliver answers rather than search results. Initially, the data chatbot will probably ask the question 'how have revenues changed over the last three quarters?' and then return pages of data for you to analyze. But once it learns the semantic relations and inferences of the question, it will be able to automatically perform the filtering and formulation necessary to provide an intelligible answer, rather than simply showing you data.

Link: http://www.smartdatacollective.com/eran-levy/489410/here-s-why-natural-language-processing-future-bi

Using Natural Language Processing and Network Analysis to Develop a Conceptual Framework for Medication Therapy Management Research [ 97 ]

This work describes a theory derivation process used to develop a conceptual framework for medication therapy management (MTM) research. The MTM service model and the chronic care model were selected as parent theories. Review article abstracts targeting medication therapy management in chronic disease care were retrieved from Ovid Medline (2000–2016). Unique concepts in each abstract were extracted using MetaMap, and their pair-wise co-occurrence was determined. This information was then used to construct a network graph of concept co-occurrence, which was further analyzed to identify content for the new conceptual model. 142 abstracts were analyzed. Medication adherence was the most-studied drug therapy problem and co-occurred with concepts related to patient-centered interventions targeting self-management. The enhanced model consists of 65 concepts clustered into 14 constructs. The framework requires additional refinement and evaluation to determine its relevance and applicability across a broad audience, including underserved settings.

Link: https://www.ncbi.nlm.nih.gov/pubmed/28269895?dopt=Abstract

Meet the Pilot, world’s first language translating earbuds [ 96 ]

The world's first smart earpiece, Pilot, will soon translate over 15 languages. According to Springwise, Waverly Labs' Pilot can already translate five spoken languages, English, French, Italian, Portuguese, and Spanish, and seven additional written languages, German, Hindi, Russian, Japanese, Arabic, Korean, and Mandarin Chinese. The Pilot earpiece is connected via Bluetooth to the Pilot speech translation app, which uses speech recognition, machine translation, machine learning, and speech synthesis technology. Simultaneously, the user hears the translated version of the speech on the second earpiece. Moreover, a conversation need not take place between only two people; multiple users can join in and discuss as a group. As of now, the user may experience a lag of a few seconds between the speech and its translation, which Waverly Labs is pursuing to reduce. The Pilot earpiece will be available from September but can be pre-ordered now for $249. The earpieces can also be used for streaming music, answering voice calls, and getting audio notifications.

Link: https://www.indiegogo.com/projects/meet-the-pilot-smart-earpiece-language-translator-headphones-travel#/

4 Datasets in NLP and state-of-the-art models

The objective of this section is to present the various datasets used in NLP and some state-of-the-art models in NLP.

4.1 Datasets in NLP

A corpus is a collection of linguistic data, either compiled from written texts or transcribed from recorded speech. Corpora are intended primarily for testing linguistic hypotheses, e.g., to determine how a certain sound, word, or syntactic construction is used across a culture or language. There are various types of corpus. In an annotated corpus, the implicit information in the plain text has been made explicit by specific annotations, while an un-annotated corpus contains plain text in its raw state. Different languages can be compared using a reference corpus. Monitor corpora are non-finite collections of texts, mostly used in lexicography. A multilingual corpus contains small collections of monolingual corpora based on the same sampling procedure and categories for different languages. A parallel corpus contains texts in one language and their translations into other languages, aligned phrase by phrase. A reference corpus contains texts of spoken (formal and informal) and written (formal and informal) language representing various social and situational contexts. A speech corpus contains recorded speech, along with transcriptions of the recordings and the time each word occurred in the recorded speech. There are various datasets available for natural language processing; some of these are listed below for different use cases:

Sentiment Analysis: Sentiment analysis is a rapidly expanding field of natural language processing (NLP) used in a variety of fields such as politics, business etc. Majorly used datasets for sentiment analysis are:

Stanford Sentiment Treebank (SST): Socher et al. introduced SST, containing sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences from movie reviews, posing novel sentiment compositionality challenges [ 127 ].

Sentiment140: It contains 1.6 million tweets annotated with negative, neutral and positive labels.

Paper Reviews: It provides reviews of computing and informatics conferences written in English and Spanish. It has 405 reviews evaluated on a 5-point scale ranging from very negative to very positive.

IMDB: For natural language processing, text analytics, and sentiment analysis, this dataset offers thousands of movie reviews split into training and test sets. It was introduced by Maas et al. in 2011 [ 73 ].

G. Rama Rohit Reddy of the Language Technologies Research Centre, KCIS, IIIT Hyderabad, generated the corpus “Sentiraama.” The corpus is divided into four datasets, each annotated with a two-value scale that distinguishes between positive and negative sentiment at the document level. The corpus contains data from a variety of fields, including book reviews, product reviews, movie reviews, and song lyrics. The annotators meticulously followed the annotation guidelines for each of them. The folder “Song Lyrics” in the corpus contains 339 Telugu song lyrics written in Telugu script [ 121 ].

Language Modelling: Language models analyse text data to assign probabilities to words and sequences of words. They use an algorithm to interpret the data, which establishes rules for context in natural language. The model then uses these rules to accurately predict or construct new sentences: it learns the basic characteristics and features of the language and applies them to new phrases (a minimal count-based sketch follows this dataset list). The most widely used datasets for language modeling are as follows:

Salesforce’s WikiText-103 dataset has 103 million tokens collected from 28,475 featured articles from Wikipedia.

WikiText-2 is a scaled-down version of WikiText-103. It contains 2 million tokens with a vocabulary size of 33,278.

The Penn Treebank portion of the Wall Street Journal corpus includes 929,000 tokens for training, 73,000 tokens for validation, and 82,000 tokens for testing. Its context is limited since it comprises sentences rather than paragraphs [ 76 ].

The Ministry of Electronics and Information Technology’s Technology Development Programme for Indian Languages (TDIL) launched its own data distribution portal ( www.tdil-dc.in ) which has cataloged datasets [ 24 ].
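To make the word-probability idea above concrete, the following is a minimal count-based bigram language model in Python; the toy corpus, the special <s>/</s> boundary tokens, and the add-one smoothing constant are illustrative assumptions rather than details of any dataset listed above.

from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# Count unigrams and bigrams over whitespace-tokenized sentences.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word, alpha=1.0):
    """P(word | prev) with Laplace (add-alpha) smoothing."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

def sentence_prob(sentence):
    """Probability of a whole sentence under the bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("the cat sat on the rug"))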

Machine Translation: The task of converting text in one natural language into another language while keeping the sense of the input intact is known as machine translation. The most widely used datasets are as follows:

Tatoeba is a collection of multilingual sentence pairs. A tab-delimited pair of an English text sequence and the translated French text sequence appears on each line of the dataset (a short parsing sketch appears after this dataset list). Each text sequence may be as short as a single sentence or as long as a paragraph of several sentences.

The Europarl parallel corpus is derived from the European Parliament’s proceedings. It is available in 21 European languages [ 40 ].

WMT14 provides machine translation pairs for English-German and English-French. These datasets comprise 4.5 million and 35 million sentence pairs, respectively. Byte-Pair Encoding with 32K merge operations is used to encode the sentences.

There are around 160,000 sentence pairs in the IWSLT 14 dataset. It includes text in English-German (En-De) and German-English (De-En). The IWSLT 13 dataset contains around 200K training sentence pairs.

The IIT Bombay English-Hindi corpus comprises parallel corpora for English-Hindi as well as monolingual Hindi corpora gathered from several existing sources and corpora generated over time at IIT Bombay’s Centre for Indian Language Technology.
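As mentioned in the Tatoeba entry above, each line pairs an English sequence with its French translation, separated by a tab. A minimal parsing sketch follows; the file name fra.txt is an assumption for a locally downloaded extract, and the length check simply skips any malformed or metadata-only lines.

from pathlib import Path

# Hypothetical local extract of the Tatoeba English-French pairs,
# one "english<TAB>french" pair per line.
pairs = []
for line in Path("fra.txt").read_text(encoding="utf-8").splitlines():
    fields = line.split("\t")
    if len(fields) >= 2:
        english, french = fields[0], fields[1]
        pairs.append((english, french))

print(len(pairs), "sentence pairs loaded")
print(pairs[:3])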

Question Answering System: Question answering systems provide real-time responses and are widely used in customer care services. The datasets used for dialogue systems and question answering are as follows:

Stanford Question Answering Dataset (SQuAD): It is a reading comprehension dataset made up of questions posed by crowd workers on a collection of Wikipedia articles.

Natural Questions: It is a large-scale corpus presented by Google used for training and assessing open-domain question answering systems. It includes 300,000 naturally occurring queries as well as human-annotated responses from Wikipedia pages for use in QA system training.

Question Answering in Context (QuAC): This dataset is used for modeling, understanding, and participating in information-seeking conversation. Instances consist of an interactive discussion between two crowd workers: a student who asks a series of open-ended questions about a hidden Wikipedia text, and a teacher who responds by offering brief excerpts from the text.

Neural learning models are overtaking traditional models in NLP [ 64 , 127 ]. In [ 64 ], the authors used a CNN (convolutional neural network) model for sentiment analysis of movie reviews and achieved 81.5% accuracy, illustrating that CNNs are an appropriate replacement for state-of-the-art methods. The authors of [ 127 ] combined SST with a Recursive Neural Tensor Network for sentiment analysis of single sentences; this model improves accuracy by 5.4% for sentence classification compared to traditional NLP models. The authors of [ 135 ] proposed a combined Recurrent Neural Network and Transformer model for sentiment analysis. This hybrid model was tested on three different datasets (Twitter US Airline Sentiment, IMDB, and Sentiment140) and achieved F1 scores of 91%, 93%, and 90%, respectively, outperforming state-of-the-art methods.

Santoro et al. [ 118 ] introduced a relational recurrent neural network with the capacity to learn to classify information and perform complex reasoning based on the interactions between compartmentalized information, using a relational memory core to handle such interactions. The model was tested for language modeling on three different datasets (GigaWord, Project Gutenberg, and WikiText-103), and its performance was compared with traditional approaches to relational reasoning over compartmentalized information. The results achieved with the RMC show improved performance.

Merity et al. [ 86 ] extended conventional word-level language models based on the Quasi-Recurrent Neural Network and LSTM to handle granularity at both the character and word level. They tuned the parameters for character-level modeling using the Penn Treebank dataset and for word-level modeling using WikiText-103. In both cases, their model outperformed state-of-the-art methods.

Luong et al. [ 70 ] applied neural machine translation to the WMT14 dataset to translate English text to French. The model demonstrated a significant improvement of up to 2.8 BLEU (bilingual evaluation understudy) points over various neural machine translation systems and outperformed the commonly used MT system on the WMT14 dataset.

Fan et al. [ 41 ] introduced a gradient-based neural architecture search algorithm that automatically finds architectures with better performance than Transformer and conventional NMT models. They tested their model on WMT14 (English-German translation), IWSLT14 (German-English translation), and WMT18 (Finnish-English translation) and achieved 30.1, 36.1, and 26.4 BLEU points, respectively, outperforming Transformer baselines.

Wiese et al. [ 150 ] introduced a deep learning approach based on domain adaptation techniques for handling biomedical question answering tasks. Their model achieved state-of-the-art performance on biomedical question answering, outperforming previous methods in this domain.

Seunghak et al. [ 158 ] designed a Memory-Augmented Machine Comprehension Network (MAMCN) to handle the long-range dependencies faced in reading comprehension. The model achieved state-of-the-art performance at the document level using the TriviaQA and QUASAR-T datasets, and at the paragraph level using the SQuAD dataset.

Xie et al. [ 154 ] proposed a neural architecture in which candidate answers and their representation learning are constituent-centric, guided by a parse tree. Under this architecture, the search space of candidate answers is reduced while preserving the hierarchical, syntactic, and compositional structure among constituents. The model delivers state-of-the-art performance on SQuAD.

4.2 State-of-the-art models in NLP

The rationalist or symbolic approach assumes that a crucial part of the knowledge in the human mind is not derived from the senses but is fixed in advance, presumably by genetic inheritance. Noam Chomsky was the strongest advocate of this approach. It was believed that machines could be made to function like the human brain by providing some fundamental knowledge and a reasoning mechanism: linguistic knowledge is directly encoded in rules or other forms of representation, which supports the automatic processing of natural languages [ 92 ]. Statistical and machine learning approaches, by contrast, entail the development of algorithms that allow a program to infer patterns from data; an iterative learning phase optimizes the algorithm’s numerical parameters with respect to a numerical performance measure. Machine-learning models can be predominantly categorized as either generative or discriminative. Generative methods model rich probability distributions and can therefore generate synthetic data, whereas discriminative methods are more pragmatic, directly estimating posterior probabilities from observations. Srihari [ 129 ] illustrates the distinction with language identification: a generative approach to spotting an unknown speaker’s language would require deep knowledge of numerous languages to perform the match, while discriminative methods rely on a less knowledge-intensive approach, using only the distinctions between languages. Generative models can become troublesome when many features are used, whereas discriminative models allow the use of more features [ 38 ]. Examples of discriminative methods are logistic regression and conditional random fields (CRFs); examples of generative methods are naive Bayes classifiers and hidden Markov models (HMMs).

Naive Bayes Classifiers

Naive Bayes is a probabilistic classifier based on Bayes’ theorem, used to predict the tag of a text such as a news article or customer review. It calculates the probability of each tag for the given text and returns the tag with the highest probability. Bayes’ theorem is used to predict the probability of a feature based on prior knowledge of conditions that might be related to that feature. Naive Bayes classifiers are applied in NLP to common tasks such as segmentation and translation, but they have also been explored in less usual areas such as segmentation for infant learning and distinguishing documents expressing opinions from those stating facts. Anggraeni et al. (2019) [ 61 ] used ML and AI to create a question-and-answer system for retrieving information about hearing loss. They developed I-Chat Bot, which understands the user input, provides an appropriate response, and produces a model that can be used in the search for information about required hearing impairments. A known problem with naive Bayes is that we may end up with zero probabilities when we encounter words in the test data for a certain class that are not present in the training data; this is commonly mitigated with Laplace (add-one) smoothing.
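As a concrete illustration, here is a minimal sentiment classifier using scikit-learn’s multinomial naive Bayes (an assumed toolkit; the paper does not prescribe one). The four toy reviews are placeholders, and alpha=1.0 applies the Laplace smoothing mentioned above; this is a sketch, not the system of Anggraeni et al.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled reviews; a real system would use a dataset such as IMDB.
texts = ["loved this movie", "great acting and plot",
         "terrible waste of time", "boring and predictable"]
labels = ["pos", "pos", "neg", "neg"]

# alpha=1.0 applies Laplace smoothing, which avoids the zero-probability
# problem for words unseen in a class's training data.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["great movie"]))        # expected: ['pos']
print(model.predict_proba(["boring plot"]))  # per-class probabilities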

Hidden Markov Model (HMM)

An HMM is a system in which transitions take place between several hidden states, with each transition generating a feasible output symbol. The sets of viable states and unique symbols may be large, but they are finite and known. We can observe the outputs, but the system’s internals are hidden. Several classic problems can be solved with HMMs: inference - given a certain sequence of output symbols, compute the probabilities of one or more candidate state sequences; decoding - find the state-transition sequence most likely to have generated a particular output-symbol sequence; and training - given output-symbol data, estimate the state-transition and output probabilities that fit this data best.

Hidden Markov Models are extensively used for speech recognition, where the output sequence is matched to the sequence of individual phonemes. HMMs are not restricted to this application; they have several others, such as bioinformatics problems, for example, multiple sequence alignment [ 128 ]. Sonnhammer mentioned that Pfam holds multiple alignments and hidden Markov model-based profiles (HMM-profiles) of entire protein domains. The determination of domain boundaries, family membership, and alignment is done semi-automatically, based on expert knowledge, sequence similarity, other protein family databases, and the capability of HMM-profiles to correctly identify and align the members. HMMs may be used for a variety of NLP applications, including word prediction, sentence production, quality assurance, and intrusion detection systems [ 133 ].
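To illustrate the decoding problem described above, here is a minimal Viterbi sketch for part-of-speech tagging, where the hidden states are tags and the output symbols are words. All states, words, and probabilities are made-up toy values, and the 1e-8 floor for unseen words is a crude smoothing assumption.

states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET":  {"DET": 0.05, "NOUN": 0.9,  "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3,  "VERB": 0.6},
    "VERB": {"DET": 0.5,  "NOUN": 0.4,  "VERB": 0.1},
}
emit_p = {
    "DET":  {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "cat": 0.3, "walk": 0.2},
    "VERB": {"walks": 0.6, "runs": 0.3, "walk": 0.1},
}

def viterbi(words):
    """Return the most likely hidden state sequence for the observed words."""
    # Each cell holds (probability of best path so far, that path).
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-8), [s]) for s in states}]
    for word in words[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(word, 1e-8),
                 V[-1][prev][1] + [s])
                for prev in states
            )
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]

print(viterbi(["the", "dog", "walks"]))  # expected: ['DET', 'NOUN', 'VERB']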

Neural Network

Earlier machine learning techniques such as naive Bayes and HMMs were predominantly used for NLP, but starting around 2010, neural networks transformed and enhanced NLP tasks by learning multilevel features. A major use of neural networks in NLP is word embedding, where words are represented as vectors. These vectors can be used to recognize similar words by observing their closeness in the vector space; other uses of neural networks are found in information retrieval, text summarization, text classification, machine translation, sentiment analysis, and speech recognition. The initial focus was on feedforward [ 49 ] and CNN (convolutional neural network) architectures [ 69 ], but researchers later adopted recurrent neural networks to capture the context of a word with respect to the surrounding words of a sentence. LSTM (Long Short-Term Memory), a variant of the RNN, is used in tasks such as word prediction and sentence topic prediction [ 47 ]. To observe word arrangement in both forward and backward directions, researchers have explored bi-directional LSTMs [ 59 ]. For machine translation, an encoder-decoder architecture is used, where the dimensionality of the input and output vectors is not fixed in advance. Neural networks can also be used to anticipate a state that has not yet been seen, such as future states for which predictors exist, whereas an HMM predicts over a fixed set of hidden states.
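A minimal word-embedding sketch using the gensim library’s Word2Vec is shown below (the library choice and toy corpus are assumptions). Embeddings trained on this little text are only illustrative - meaningful similarity structure emerges only with large corpora.

from gensim.models import Word2Vec

# Tiny tokenized corpus; real embeddings need millions of sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
    ["the", "cat", "chases", "the", "mouse"],
]

# vector_size is the embedding dimensionality; window is the context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 epochs=100, seed=1)

# Nearby vectors indicate distributionally similar words.
print(model.wv.most_similar("king", topn=3))
print(model.wv.similarity("king", "queen"))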

Bidirectional Encoder Representations from Transformers (BERT) is a model pre-trained on unlabeled text from BookCorpus and English Wikipedia. It can be fine-tuned to capture context for various NLP tasks such as question answering, sentiment analysis, text classification, sentence embedding, and interpreting ambiguity in text [ 25 , 33 , 90 , 148 ]. Earlier language models examine the text in only one direction, which suits sentence generation by predicting the next word, whereas the BERT model examines the text in both directions simultaneously for better language understanding. BERT provides a contextual embedding for each word in the text, unlike context-free models (word2vec and GloVe). For example, in the sentences “he is going to the river bank for a walk” and “he is going to the bank to withdraw some money”, word2vec has a single vector representation for “bank” in both sentences, whereas BERT produces different vector representations for “bank”. Müller et al. [ 90 ] used the BERT model to analyze tweets on covid-19 content. The use of the BERT model in the legal domain was explored by Chalkidis et al. [ 20 ].
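The “bank” example above can be reproduced with the Hugging Face transformers library (an assumption; the paper does not prescribe a particular toolkit). The sketch extracts the contextual vector of “bank” from each sentence and compares them; “river bank” is written as two words so that “bank” appears as a standalone token.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, target):
    """Return the contextual embedding of `target` within `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(target)]

v_river = vector_for("he is going to the river bank for a walk", "bank")
v_money = vector_for("he is going to the bank to withdraw some money", "bank")

# Unlike word2vec, the two "bank" vectors differ with context,
# so their cosine similarity is well below 1.
print(torch.cosine_similarity(v_river, v_money, dim=0))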

Since BERT considers at most 512 tokens, a longer text sequence must be divided into multiple shorter sequences of up to 512 tokens each. This is a limitation of BERT, as it struggles to handle long text sequences; a common workaround is to split the input into overlapping windows, as sketched below.
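The sketch below assumes already-tokenized input and splits a long sequence into overlapping windows of at most 512 tokens; the stride value, an illustrative choice, controls how much context is shared between consecutive chunks.

def chunk_token_ids(token_ids, max_len=512, stride=128):
    """Split a long token-id sequence into overlapping windows of at most
    max_len tokens; the overlap (stride) preserves context across chunks."""
    chunks, start = [], 0
    step = max_len - stride
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        start += step
    return chunks

ids = list(range(1300))           # stand-in for a long tokenized document
windows = chunk_token_ids(ids)
print([len(w) for w in windows])  # [512, 512, 512, 148]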

5 Evaluation metrics and challenges

The objective of this section is to discuss the evaluation metrics used to assess a model’s performance and the challenges involved.

5.1 Evaluation metrics

Since the number of labels in most classification problems is fixed, it is easy to determine a score for each class and, as a result, the loss from the ground truth. In image generation problems the output resolution and ground truth are both fixed, so we can calculate the loss at the pixel level. In NLP, however, even though the output format is predetermined, its dimensions cannot be specified, because a single statement can be expressed in multiple ways without changing its intent and meaning. Evaluation metrics are therefore important for judging a model’s performance, particularly when one model is applied to several problems.

BLEU (BiLingual Evaluation Understudy) Score: Each word in the output sentence scores 1 if it appears in any of the reference sentences and 0 if it does not. The number of words that appear in one of the reference translations is then divided by the total number of words in the output sentence, normalizing the count so it always lies between 0 and 1. For example, if the ground truth is “He is playing chess in the backyard” and the output sentences are S1: “He is playing tennis in the backyard”, S2: “He is playing badminton in the backyard”, S3: “He is playing movie in the backyard”, and S4: “backyard backyard backyard backyard backyard backyard backyard”, then the scores of S1, S2, and S3 would all be 6/7, even though the information in S1 and S3 is not the same. This is because BLEU treats every word in a sentence as contributing equally to its meaning, which is not the case in real-world scenarios. Using a combination of unigram, bigram, and higher-order n-gram precision, we can capture the word order of a sentence. We may also clip how many times each word is counted based on how many times it appears in each reference sentence, which prevents excessive repetition (as illustrated below).
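The clipped unigram precision described above can be written in a few lines of Python; this is a sketch of the unigram component only, not the full BLEU metric (which also combines higher-order n-grams and a brevity penalty). Note how clipping reduces the score of the repetitive sentence S4 from 7/7 to 1/7.

from collections import Counter

def modified_unigram_precision(candidate, reference):
    """Clipped unigram precision as used in BLEU: each candidate word
    counts at most as often as it appears in the reference."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    clipped = sum(min(count, ref_counts[word])
                  for word, count in Counter(cand).items())
    return clipped / len(cand)

reference = "He is playing chess in the backyard"
print(modified_unigram_precision(
    "He is playing tennis in the backyard", reference))  # 6/7
print(modified_unigram_precision(
    "backyard " * 7, reference))                         # clipped to 1/7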

GLUE (General Language Understanding Evaluation) score: Previously, NLP models were almost always built to perform effectively on a single task. Models such as LSTM and Bi-LSTM were trained solely for that task and rarely generalized to others; a model trained for named entity recognition, for instance, would rarely transfer to textual entailment. GLUE is a collection of datasets for training, assessing, and comparing NLP models. It includes nine diverse task datasets designed to test a model’s language understanding. To acquire a comprehensive assessment of a model’s performance, GLUE tests the model on a variety of tasks rather than a single one, including single-sentence tasks, similarity and paraphrase tasks, and inference tasks. For example, in sentiment analysis of customer reviews, we might be interested in analyzing ambiguous reviews and determining which product the client is referring to. Thus, the model obtains a good “knowledge” of language in general after generalized pre-training, and this universal “knowledge” gives it an advantage when the time comes to test it on a given task. With GLUE, researchers can evaluate their model and score it on all nine tasks; the final performance score is the average of those nine scores. It matters little how the model looks or works as long as it can analyze inputs and predict outcomes for all of the tasks.
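Assuming the Hugging Face datasets library (not prescribed by the paper), each of the nine GLUE tasks can be loaded by configuration name, e.g.:

from datasets import load_dataset

# Each GLUE task is available as a configuration of the "glue" dataset.
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])           # {'sentence': ..., 'label': ..., 'idx': ...}

mrpc = load_dataset("glue", "mrpc")
print(mrpc["validation"][0])      # sentence pair with a paraphrase label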

Keeping these metrics in mind helps in evaluating the performance of an NLP model on a particular task or across a variety of tasks.

5.2 Challenges

The applications of NLP are growing day by day, and with them new challenges keep arising despite a great deal of work in the recent past. Some of the common challenges are as follows. Contextual words and phrases: the same words and phrases can have different meanings in a sentence, which is easy for humans to understand but makes for a challenging task. Similar challenges arise with synonyms, because humans use many different words to express the same idea, and different levels of intensity (such as large, huge, and big) may be used by different people, making it hard to design algorithms that accommodate all these variations. Further, homonyms - words that are pronounced the same but have different definitions - are problematic for question answering and speech-to-text applications, because the intended word cannot be determined from sound alone. Sentences using sarcasm and irony may be understood by humans in the opposite of their literal sense, so designing models to deal with such sentences is a genuinely challenging task in NLP. Furthermore, sentences with any kind of ambiguity - that is, admitting more than one interpretation - remain an area where more accuracy can be achieved.

Language containing informal phrases, expressions, idioms, and culture-specific lingo makes it difficult to design models intended for broad use. Having a lot of data for training, and updating it regularly, may improve the models, but it remains a real challenge to deal with words that carry different meanings in different geographic areas. Similar issues occur across domains: the meaning of a word or sentence may be one thing in the education industry but something different in health, law, or defense. So models for NLP may work well for an individual domain or geographic area, but for broad use such challenges need to be tackled. Misspelled or misused words create further problems; although autocorrect and grammar-correction applications have improved greatly thanks to continuous development, predicting the intention of a writer from a specific domain or geographic area, while accounting for sarcasm, expressions, informal phrases, and so on, remains a big challenge. There is no doubt that models for the most widely used languages are performing well and improving daily, but there is still a need for models that serve all users rather than requiring specific knowledge of a particular language and technology. One may further refer to the work of Sharifirad and Matwin (2019) [ 123 ] for the classification of different online harassment categories and challenges, Baclic et al. (2020) [ 6 ] and Wong et al. (2018) [ 151 ] for challenges and opportunities in public health, Kang et al. (2020) [ 63 ] for a detailed literature survey and technological challenges relevant to management research and NLP, and the recent review by Alshemali and Kalita (2020) [ 3 ], and the references cited therein.

In the recent past, models combining Visual Commonsense Reasoning [ 31 ] and NLP have also been attracting the attention of several researchers, and this seems a promising and challenging area to work in. These models try to extract information from an image or video using a visual reasoning paradigm, inferring from a given image or video beyond what is visually obvious - such as objects’ functions, people’s intents, and mental states - much as humans do. In this direction, Wen and Peng (2020) [ 149 ] recently proposed a model to capture knowledge from different perspectives and perceive common sense in advance, and the results of experiments on the visual commonsense reasoning dataset VCR seem very satisfactory and effective. The work of Peng and Chi (2019) [ 102 ], which proposes a Domain Adaptation with Scene Graph approach to transfer knowledge from a source domain with the objective of improving cross-media retrieval in the target domain, and that of Yan et al. (2019) [ 155 ], are also very useful for further exploring the use of NLP in its relevant domains.

6 Conclusion

This paper is written with three objectives. The first objective is to give insights into the various important terminologies of NLP and NLG; it can be useful for readers interested in starting an early career in NLP and in work relevant to its applications. The second objective focuses on the history, applications, and recent developments in the field of NLP. The third objective is to discuss the datasets, approaches, and evaluation metrics used in NLP. The relevant work in the existing literature, with its findings, and some of the important applications and projects in NLP are also discussed in the paper. The last two objectives may serve as a literature survey for readers already working in NLP and relevant fields, and can further provide motivation to explore the fields mentioned in this paper. It should be noted that even though a great deal of survey work on natural language processing is available in the literature (one may refer to [ 15 , 32 , 63 , 98 , 133 , 151 ], each focusing on one domain such as the use of deep-learning techniques in NLP, techniques for email spam filtering, medication safety, management research, intrusion detection, or the Gujarati language), there is still not much work on regional languages, which can be a focus of future research.

Change history

25 July 2022

Affiliation 3 has been added to the online PDF.

Ahonen H, Heinonen O, Klemettinen M, Verkamo AI (1998) Applying data mining techniques for descriptive phrase extraction in digital document collections. In research and technology advances in digital libraries, 1998. ADL 98. Proceedings. IEEE international forum on (pp. 2-11). IEEE

Alshawi H (1992) The core language engine. MIT press

Alshemali B, Kalita J (2020) Improving the reliability of deep neural networks in NLP: A review. Knowl-Based Syst 191:105210

Andreev ND (1967) The intermediary language as the focal point of machine translation. In: Booth AD (ed) Machine translation. North Holland Publishing Company, Amsterdam, pp 3–27

Androutsopoulos I, Paliouras G, Karkaletsis V, Sakkis G, Spyropoulos CD, Stamatopoulos P (2000) Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach. arXiv preprint cs/0009009

Baclic O, Tunis M, Young K, Doan C, Swerdfeger H, Schonfeld J (2020) Artificial intelligence in public health: challenges and opportunities for public health made possible by advances in natural language processing. Can Commun Dis Rep 46(6):161

Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In ICLR 2015

Bangalore S, Rambow O, Whittaker S (2000) Evaluation metrics for generation. In proceedings of the first international conference on natural language generation-volume 14 (pp. 1-8). Assoc Comput Linguist

Baud RH, Rassinoux AM, Scherrer JR (1991) Knowledge representation of discharge summaries. In AIME 91 (pp. 173–182). Springer, Berlin Heidelberg

Baud RH, Rassinoux AM, Scherrer JR (1992) Natural language processing and semantical representation of medical texts. Methods Inf Med 31(2):117–125

Baud RH, Alpay L, Lovis C (1994) Let’s meet the users with natural language understanding. Knowledge and Decisions in Health Telematics: The Next Decade 12:103

Bengio Y, Ducharme R, Vincent P (2001) A neural probabilistic language model. Proceedings of NIPS

Benson E, Haghighi A, Barzilay R (2011) Event discovery in social media feeds. In proceedings of the 49th annual meeting of the Association for Computational Linguistics: human language technologies-volume 1 (pp. 389-398). Assoc Comput Linguist

Berger AL, Della Pietra SA, Della Pietra VJ (1996) A maximum entropy approach to natural language processing. Computational Linguistics 22(1):39–71

Blanzieri E, Bryl A (2008) A survey of learning-based techniques of email spam filtering. Artif Intell Rev 29(1):63–92

Bondale N, Maloor P, Vaidyanathan A, Sengupta S, Rao PV (1999) Extraction of information from open-ended questionnaires using natural language processing techniques. Computer Science and Informatics 29(2):15–22

Borst F, Sager N, Nhàn NT, Su Y, Lyman M, Tick LJ, ..., Scherrer JR (1989) Analyse automatique de comptes rendus d'hospitalisation. In Degoulet P, Stephan JC, Venot A, Yvon PJ (eds) Informatique et Santé, Informatique et Gestion des Unités de Soins, Comptes Rendus du Colloque AIM-IF, Paris (pp. 246–56)

Briscoe EJ, Grover C, Boguraev B, Carroll J (1987) A formalism and environment for the development of a large grammar of English. IJCAI 87:703–708

Carreras X, Marquez L (2001) Boosting trees for anti-spam email filtering. arXiv preprint cs/0109015

Chalkidis I, Fergadiotis M, Malakasiotis P, Aletras N, Androutsopoulos I (2020) LEGAL-BERT: the muppets straight out of law school. arXiv preprint arXiv:2010.02559

Chi EC, Lyman MS, Sager N, Friedman C, Macleod C (1985) A database of computer-structured narrative: methods of computing complex relations. In proceedings of the annual symposium on computer application in medical care (p. 221). Am Med Inform Assoc

Cho K, Van Merriënboer B, Bahdanau D, Bengio Y, (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259

Chomsky N (1965) Aspects of the theory of syntax. MIT Press, Cambridge, Massachusetts

Choudhary N (2021) LDC-IL: the Indian repository of resources for language technology. Lang Resources & Evaluation 55:855–867. https://doi.org/10.1007/s10579-020-09523-3

Chouikhi H, Chniter H, Jarray F (2021) Arabic sentiment analysis using BERT model. In international conference on computational collective intelligence (pp. 621-632). Springer, Cham

Chung J, Gulcehre C, Cho K, Bengio Y, (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555

Cohen WW (1996) Learning rules that classify e-mail. In AAAI spring symposium on machine learning in information access (Vol. 18, p. 25)

Cohen PR, Morgan J, Ramsay AM (2002) Intention in communication, Am J Psychol 104(4)

Collobert R, Weston J (2008) A unified architecture for natural language processing. In proceedings of the 25th international conference on machine learning (pp. 160–167)

Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R, (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860

Davis E, Marcus G (2015) Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun ACM 58(9):92–103

Desai NP, Dabhi VK (2022) Resources and components for Gujarati NLP systems: a survey. Artif Intell Rev:1–19

Devlin J, Chang MW, Lee K, Toutanova K, (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

Diab M, Hacioglu K, Jurafsky D (2004) Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of HLT-NAACL 2004: Short papers (pp. 149–152). Assoc Computat Linguist

Doddington G (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In proceedings of the second international conference on human language technology research (pp. 138-145). Morgan Kaufmann publishers Inc

Drucker H, Wu D, Vapnik VN (1999) Support vector machines for spam categorization. IEEE Trans Neural Netw 10(5):1048–1054

Dunlavy DM, O’Leary DP, Conroy JM, Schlesinger JD (2007) QCS: A system for querying, clustering and summarizing documents. Inf Process Manag 43(6):1588–1605

Elkan C (2008) Log-Linear Models and Conditional Random Fields. http://cseweb.ucsd.edu/welkan/250B/cikmtutorial.pdf accessed 28 Jun 2017.

Emele MC, Dorna M (1998) Ambiguity preserving machine translation using packed representations. In proceedings of the 36th annual meeting of the Association for Computational Linguistics and 17th international conference on computational linguistics-volume 1 (pp. 365-371). Association for Computational Linguistics

Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In MT Summit 2005

Fan Y, Tian F, Xia Y, Qin T, Li XY, Liu TY (2020) Searching better architectures for neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28:1574–1585

Fang H, Lu W, Wu F, Zhang Y, Shang X, Shao J, Zhuang Y (2015) Topic aspect-oriented summarization via group selection. Neurocomputing 149:1613–1619

Fattah MA, Ren F (2009) GA, MR, FFNN, PNN and GMM based models for automatic text summarization. Comput Speech Lang 23(1):126–144

Feldman S (1999) NLP meets the jabberwocky: natural language processing in information retrieval. Online-Weston Then Wilton 23:62–73

Friedman C, Cimino JJ, Johnson SB (1993) A conceptual model for clinical radiology reports. In proceedings of the annual symposium on computer application in medical care (p. 829). Am Med Inform Assoc

Gao T, Dontcheva M, Adar E, Liu Z, Karahalios K (2015) DataTone: managing ambiguity in natural language interfaces for data visualization. In UIST '15: proceedings of the 28th annual ACM symposium on user interface software & technology, pp 489–500. https://doi.org/10.1145/2807442.2807478

Ghosh S, Vinyals O, Strope B, Roy S, Dean T, Heck L (2016) Contextual lstm (clstm) models for large scale nlp tasks. arXiv preprint arXiv:1602.06291

Glasgow B, Mandell A, Binney D, Ghemri L, Fisher D (1998) MITA: an information-extraction approach to the analysis of free-form text in life insurance applications. AI Mag 19(1):59

Goldberg Y (2017) Neural network methods for natural language processing. Synthesis lectures on human language technologies 10(1):1–309

Gong Y, Liu X (2001) Generic text summarization using relevance measure and latent semantic analysis. In proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (pp. 19-25). ACM

Green Jr, BF, Wolf AK, Chomsky C, Laughery K (1961) Baseball: an automatic question-answerer. In papers presented at the may 9-11, 1961, western joint IRE-AIEE-ACM computer conference (pp. 219-224). ACM

Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J (2016) LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems 28(10):2222–2232

Grishman R, Sager N, Raze C, Bookchin B (1973) The linguistic string parser. In proceedings of the June 4-8, 1973, national computer conference and exposition (pp. 427-434). ACM

Hayes PJ (1992) Intelligent high-volume text processing using shallow, domain-specific techniques. Text-based intelligent systems: current research and practice in information extraction and retrieval, 227-242.

Hendrix GG, Sacerdoti ED, Sagalowicz D, Slocum J (1978) Developing a natural language interface to complex data. ACM Transactions on Database Systems (TODS) 3(2):105–147

"Here’s Why Natural Language Processing is the Future of BI (2017) " SmartData Collective. N.p., n.d. Web. 19

Hirschman L, Grishman R, Sager N (1976) From text to structured information: automatic processing of medical reports. In proceedings of the June 7-10, 1976, national computer conference and exposition (pp. 267-275). ACM

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

Huang Z, Xu W, Yu K (2015) Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991

Hutchins WJ (1986) Machine translation: past, present, future (p. 66). Ellis Horwood, Chichester

Jurafsky D, Martin JH (2008) Speech and language processing, 2nd edn. Prentice-Hall, Englewood Cliffs, NJ

Kamp H, Reyle U (1993) Tense and aspect. In from discourse to logic (pp. 483-689). Springer Netherlands

Kang Y, Cai Z, Tan CW, Huang Q, Liu H (2020) Natural language processing (NLP) in management research: A literature review. Journal of Management Analytics 7(2):139–172

Kim Y. (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882

Knight K, Langkilde I (2000) Preserving ambiguities in generation via automata intersection. In AAAI/IAAI (pp. 697-702)

Lass R (1998) Phonology: an introduction to basic concepts. Cambridge University Press, Cambridge

Lewis DD (1998) Naive (Bayes) at forty: The independence assumption in information retrieval. In European conference on machine learning (pp. 4–15). Springer, Berlin Heidelberg

Liddy ED (2001). Natural language processing

Lopez MM, Kalita J (2017) Deep learning applied to NLP. arXiv preprint arXiv:1703.03091

Luong MT, Sutskever I, Le Q V, Vinyals O, Zaremba W (2014) Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206

Lyman M, Sager N, Friedman C, Chi E (1985) Computer-structured narrative in ambulatory care: its use in longitudinal review of clinical data. In proceedings of the annual symposium on computer application in medical care (p. 82). Am Med Inform Assoc

Lyman M, Sager N, Chi EC, Tick LJ, Nhan NT, Su Y, ..., Scherrer, J. (1989) Medical Language Processing for Knowledge Representation and Retrievals. In Proceedings. Symposium on Computer Applications in Medical Care (pp. 548–553). Am Med Inform Assoc

Maas A, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies (pp. 142-150)

Mani I, Maybury MT (eds) (1999) Advances in automatic text summarization, vol 293. MIT press, Cambridge, MA

Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT press, Cambridge

Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of english: the penn treebank. Comput Linguist 19(2):313–330

McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization (Vol. 752, pp. 41-48)

McCray AT (1991) Natural language processing for intelligent information retrieval. In Engineering in Medicine and Biology Society, 1991. Vol. 13: 1991., Proceedings of the Annual International Conference of the IEEE (pp. 1160–1161). IEEE

McCray AT (1991) Extending a natural language parser with UMLS knowledge. In proceedings of the annual symposium on computer application in medical care (p. 194). Am Med Inform Assoc

McCray AT, Nelson SJ (1995) The representation of meaning in the UMLS. Methods Inf Med 34(1–2):193–201

McCray AT, Razi A (1994) The UMLS knowledge source server. Medinfo MedInfo 8:144–147

McCray AT, Srinivasan S, Browne AC (1994) Lexical methods for managing variation in biomedical terminologies. In proceedings of the annual symposium on computer application in medical care (p. 235). Am Med Inform Assoc

McDonald R, Crammer K, Pereira F (2005) Flexible text segmentation with structured multilabel classification. In proceedings of the conference on human language technology and empirical methods in natural language processing (pp. 987-994). Assoc Comput Linguist

McGray AT, Sponsler JL, Brylawski B, Browne AC (1987) The role of lexical knowledge in biomedical text understanding. In proceedings of the annual symposium on computer application in medical care (p. 103). Am Med Inform Assoc

McKeown KR (1985) Text generation. Cambridge University Press, Cambridge

Merity S, Keskar NS, Socher R (2018) An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240

Mikolov T, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In advances in neural information processing systems

Morel-Guillemaz AM, Baud RH, Scherrer JR (1990) Proximity processing of medical text. In medical informatics Europe’90 (pp. 625–630). Springer, Berlin Heidelberg

Morin E (1999) Automatic acquisition of semantic relations between terms from technical corpora. In proc. of the fifth international congress on terminology and knowledge engineering-TKE’99

Müller M, Salathé M, Kummervold PE (2020) Covid-twitter-bert: A natural language processing model to analyse covid-19 content on twitter. arXiv preprint arXiv:2005.07503

"Natural Language Processing (2017) " Natural Language Processing RSS. N.p., n.d. Web. 25

"Natural Language Processing" (2017) Natural Language Processing RSS. N.p., n.d. Web. 23

Newatia R (2019) https://medium.com/saarthi-ai/sentence-classification-using-convolutional-neural-networks-ddad72c7048c . Accessed 15 Dec 2021

Nhàn NT, Sager N, Lyman M, Tick LJ, Borst F, Su Y (1989) A medical language processor for two indo-European languages. In proceedings. Symposium on computer applications in medical care (pp. 554-558). Am Med Inform Assoc

Nießen S, Och FJ, Leusch G, Ney H (2000) An evaluation tool for machine translation: fast evaluation for MT research. In LREC

Ochoa A (2016) Meet the pilot: smart earpiece language translator. https://www.indiegogo.com/projects/meet-the-pilot-smart-earpiece-language-translator-headphones-travel . Accessed 10 Apr 2017

Ogallo, W., & Kanter, A. S. (2017). Using natural language processing and network analysis to develop a conceptual framework for medication therapy management research. https://www.ncbi.nlm.nih.gov/pubmed/28269895?dopt=Abstract . Accessed April 10, 2017

Otter DW, Medina JR, Kalita JK (2020) A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32(2):604–624

Ouyang Y, Li W, Li S, Lu Q (2011) Applying regression models to query-focused multi-document summarization. Inf Process Manag 47(2):227–237

Palmer M, Gildea D, Kingsbury P (2005) The proposition bank: an annotated corpus of semantic roles. Computational linguistics 31(1):71–106

Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In proceedings of the 40th annual meeting on association for computational linguistics (pp. 311-318). Assoc Comput Linguist

Peng Y, Chi J (2019) Unsupervised cross-media retrieval using domain adaptation with scene graph. IEEE Transactions on Circuits and Systems for Video Technology 30(11):4368–4379

Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137

Rae JW, Potapenko A, Jayakumar SM, Lillicrap TP, (2019) Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507

Ranjan P, Basu HVSSA (2003) Part of speech tagging and local word grouping techniques for natural language parsing in Hindi. In Proceedings of the 1st International Conference on Natural Language Processing (ICON 2003)

Rassinoux AM, Baud RH, Scherrer JR (1992) Conceptual graphs model extension for knowledge representation of medical texts. MEDINFO 92:1368–1374

Rassinoux AM, Michel PA, Juge C, Baud R, Scherrer JR (1994) Natural language processing of medical texts within the HELIOS environment. Comput Methods Prog Biomed 45:S79–S96

Rassinoux AM, Juge C, Michel PA, Baud RH, Lemaitre D, Jean FC, Scherrer JR (1995) Analysis of medical jargon: The RECIT system. In Conference on Artificial Intelligence in Medicine in Europe (pp. 42–52). Springer, Berlin Heidelberg

Rennie J (2000) ifile: An application of machine learning to e-mail filtering. In Proc. KDD 2000 Workshop on text mining, Boston, MA

Riedhammer K, Favre B, Hakkani-Tür D (2010) Long story short–global unsupervised models for keyphrase based meeting summarization. Speech Comm 52(10):801–815

Ritter A, Clark S, Etzioni O (2011) Named entity recognition in tweets: an experimental study. In proceedings of the conference on empirical methods in natural language processing (pp. 1524-1534). Assoc Comput Linguist

Rospocher M, van Erp M, Vossen P, Fokkens A, Aldabe I, Rigau G, Soroa A, Ploeger T, Bogaard T(2016) Building event-centric knowledge graphs from news. Web Semantics: Science, Services and Agents on the World Wide Web, In Press

Sager N, Lyman M, Tick LJ, Borst F, Nhan NT, Revillard C, … Scherrer JR (1989) Adapting a medical language processor from English to French. Medinfo 89:795–799

Sager N, Lyman M, Nhan NT, Tick LJ (1995) Medical language processing: applications to patient data representation and automatic encoding. Methods Inf Med 34(1–2):140–146

Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A Bayesian approach to filtering junk e-mail. In learning for text categorization: papers from the 1998 workshop (Vol. 62, pp. 98-105)

Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V, Spyropoulos CD, Stamatopoulos P (2001) Stacking classifiers for anti-spam filtering of e-mail. arXiv preprint cs/0106040

Sakkis G, Androutsopoulos I, Paliouras G et al (2003) A memory-based approach to anti-spam filtering for mailing lists. Inf Retr 6:49–73. https://doi.org/10.1023/A:1022948414856

Santoro A, Faulkner R, Raposo D, Rae J, Chrzanowski M, Weber T, ..., Lillicrap T (2018) Relational recurrent neural networks. Adv Neural Inf Proces Syst, 31

Scherrer JR, Revillard C, Borst F, Berthoud M, Lovis C (1994) Medical office automation integrated into the distributed architecture of a hospital information system. Methods Inf Med 33(2):174–179

Seal D, Roy UK, Basak R (2020) Sentence-level emotion detection from text based on semantic rules. In: Tuba M, Akashe S, Joshi A (eds) Information and communication Technology for Sustainable Development. Advances in intelligent Systems and computing, vol 933. Springer, Singapore. https://doi.org/10.1007/978-981-13-7166-0_42

Gangula RRR, Mamidi R (n.d.) Sentiraama corpus. Language Technologies Research Centre, KCIS, IIIT Hyderabad. ltrc.iiit.ac.in/showfile.php?filename=downloads/sentiraama/

Sha F, Pereira F (2003) Shallow parsing with conditional random fields. In proceedings of the 2003 conference of the north American chapter of the Association for Computational Linguistics on human language technology-volume 1 (pp. 134-141). Assoc Comput Linguist

Sharifirad S, Matwin S, (2019) When a tweet is actually sexist. A more comprehensive classification of different online harassment categories and the challenges in NLP. arXiv preprint arXiv:1902.10584

Sharma S, Srinivas PYKL, Balabantaray RC (2016) Emotion Detection using Online Machine Learning Method and TLBO on Mixed Script. In Proceedings of Language Resources and Evaluation Conference 2016 (pp. 47–51)

Shemtov H (1997) Ambiguity management in natural language generation. Stanford University

Small SL, Cortell GW, Tanenhaus MK (1988) Lexical Ambiguity Resolutions. Morgan Kauffman, San Mateo, CA

Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642)

Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26(1):320–322

Srihari S (2010) Machine learning: generative and discriminative models. http://www.cedar.buffalo.edu/wsrihari/CSE574/Discriminative-Generative.pdf . Accessed 31 May 2017

Sun X, Morency LP, Okanohara D, Tsujii JI (2008) Modeling latent-dynamic in shallow parsing: a latent conditional model with improved inference. In proceedings of the 22nd international conference on computational linguistics-volume 1 (pp. 841-848). Assoc Comput Linguist

Sundheim BM, Chinchor NA (1993) Survey of the message understanding conferences. In proceedings of the workshop on human language technology (pp. 56-60). Assoc Comput Linguist

Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems

Sworna ZT, Mousavi Z, Babar MA (2022) NLP methods in host-based intrusion detection Systems: A systematic review and future directions. arXiv preprint arXiv:2201.08066

RAVN Systems (2017) RAVN Systems launch the ACE powered GDPR robot - artificial intelligence to expedite GDPR compliance. PR Newswire. Web

Tan KL, Lee CP, Anbananthen KSM, Lim KM (2022) RoBERTa-LSTM: a hybrid model for sentiment analysis with transformer and recurrent neural network. IEEE Access

Tapaswi N, Jain S (2012) Treebank based deep grammar acquisition and part-of-speech tagging for Sanskrit sentences. In software engineering (CONSEG), 2012 CSI sixth international conference on (pp. 1-4). IEEE

Thomas C (2019)  https://towardsdatascience.com/recurrent-neural-networks-and-natural-language-processing-73af640c2aa1 . Accessed 15 Dec 2021

Tillmann C, Vogel S, Ney H, Zubiaga A, Sawaf H (1997) Accelerated DP based search for statistical translation. In Eurospeech

Umber A, Bajwa I (2011) “Minimizing ambiguity in natural language software requirements specification,” in Sixth Int Conf Digit Inf Manag, pp. 102–107

"Using Natural Language Processing and Network Analysis to Develop a Conceptual Framework for Medication Therapy Management Research (2017) " AMIA ... Annual Symposium proceedings. AMIA Symposium. U.S. National Library of Medicine, n.d. Web. 19

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I, (2017) Attention is all you need. In advances in neural information processing systems (pp. 5998-6008)

Wahlster W, Kobsa A (1989) User models in dialog systems. In user models in dialog systems (pp. 4–34). Springer, Berlin Heidelberg

Walton D (1996) A pragmatic synthesis. In: fallacies arising from ambiguity. Applied logic series, vol 1. Springer, Dordrecht

Wan X (2008) Using only cross-document relationships for both generic and topic-focused multi-document summarizations. Inf Retr 11(1):25–49

Wang W, Gang J (2018) Application of convolutional neural network in natural language processing. In 2018 international conference on information systems and computer aided education (ICISCAE) (pp. 64-70). IEEE

Wang D, Zhu S, Li T, Gong Y (2009) Multi-document summarization using sentence-based topic models. In proceedings of the ACL-IJCNLP 2009 conference short papers (pp. 297-300). Assoc Comput Linguist

Wang D, Zhu S, Li T, Chi Y, Gong Y (2011) Integrating document clustering and multidocument summarization. ACM Transactions on Knowledge Discovery from Data (TKDD) 5(3):14–26

Wang Z, Ng P, Ma X, Nallapati R, Xiang B (2019) Multi-passage bert: A globally normalized bert model for open-domain question answering. arXiv preprint arXiv:1908.08167

Wen Z, Peng Y (2020) Multi-level knowledge injecting for visual commonsense reasoning. IEEE Transactions on Circuits and Systems for Video Technology 31(3):1042–1054

Wiese G, Weissenborn D, Neves M (2017) Neural domain adaptation for biomedical question answering. arXiv preprint arXiv:1706.03610

Wong A, Plasek JM, Montecalvo SP, Zhou L (2018) Natural language processing and its implications for the future of medication safety: a narrative review of recent advances and challenges. Pharmacotherapy: The Journal of Human Pharmacology and Drug Therapy 38(8):822–841

Woods WA (1978) Semantics and quantification in natural language question answering. Adv Comput 17:1–87

Xia T (2020) A constant time complexity spam detection algorithm for boosting throughput on rule-based filtering Systems. IEEE Access 8:82653–82661. https://doi.org/10.1109/ACCESS.2020.2991328

Xie P, Xing E (2017) A constituent-centric neural architecture for reading comprehension. In proceedings of the 55th annual meeting of the Association for Computational Linguistics (volume 1: long papers) (pp. 1405-1414)

Yan X, Ye Y, Mao Y, Yu H (2019) Shared-private information bottleneck method for cross-modal clustering. IEEE Access 7:36045–36056

Yi J, Nasukawa T, Bunescu R, Niblack W (2003) Sentiment analyzer: extracting sentiments about a given topic using natural language processing techniques. In data mining, 2003. ICDM 2003. Third IEEE international conference on (pp. 427-434). IEEE

Young SJ, Chase LL (1998) Speech recognition evaluation: a review of the US CSR and LVCSR programmes. Comput Speech Lang 12(4):263–279

Yu S et al (2018) A multi-stage memory augmented neural network for machine reading comprehension. In proceedings of the workshop on machine reading for question answering

Zajic DM, Dorr BJ, Lin J (2008) Single-document and multi-document summarization techniques for email threads using sentence compression. Inf Process Manag 44(4):1600–1610

Zeroual I, Lakhouaja A, Belahbib R (2017) Towards a standard part of speech tagset for the Arabic language. J King Saud Univ Comput Inf Sci 29(2):171–178

Download references

Acknowledgements

The authors would like to express their gratitude to the Research Mentors from CL Educate: Accendere Knowledge Management Services Pvt. Ltd. for their comments on earlier versions of the manuscript, although any errors are our own and should not tarnish the reputations of these esteemed persons. We would also like to thank the Editor, Associate Editor, and anonymous referees for their constructive suggestions, which led to many improvements on an earlier version of this manuscript.

Author information

Authors and Affiliations

Department of Computer Science, Manav Rachna International Institute of Research and Studies, Faridabad, India

Diksha Khurana & Aditya Koli

Department of Computer Science, BML Munjal University, Gurgaon, India

Kiran Khatter

Department of Statistics, Amity University Punjab, Mohali, India

Sukhdev Singh

Corresponding author

Correspondence to Kiran Khatter.

Ethics declarations

Conflict of interest

The first draft of this paper was written under the supervision of Dr. Kiran Khatter and Dr. Sukhdev Singh, associated with CL- Educate: Accendere Knowledge Management Services Pvt. Ltd. and deputed at the Manav Rachna International University. The draft is also available on arxiv.org at https://arxiv.org/abs/1708.05148

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Khurana, D., Koli, A., Khatter, K. et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl 82 , 3713–3744 (2023). https://doi.org/10.1007/s11042-022-13428-4

Received : 03 February 2021

Revised : 23 March 2022

Accepted : 02 July 2022

Published : 14 July 2022

Issue Date : January 2023

DOI : https://doi.org/10.1007/s11042-022-13428-4


Keywords

  • Natural language processing
  • Natural language understanding
  • Natural language generation
  • NLP applications
  • NLP evaluation metrics
Innovative 12+ Natural Language Processing Thesis Topics

Generally, natural language processing, otherwise known as NLP, is a sub-branch of Artificial Intelligence (AI). It is capable of dealing with multilingual text and converts the text into a machine-readable form that computers can process. Primarily, the device understands the text and then responds according to the questions asked; these processes are carried out with the help of several techniques. As this article is concentrated on delivering natural language processing thesis topics, we are going to reveal each and every aspect that is needed for an effective NLP thesis.

NLP has a wide range of areas to explore, and enormous amounts of research are being conducted in them. As a matter of fact, NLP systems analyze emotions, process images, summarize texts, answer questions, translate automatically, and so on.

Thesis writing is one of the important steps in research: a thesis delivers the exact perceptions of the researcher to the examiners, so it is advisable to frame it properly. Let us begin this article with an overview of the NLP system. Are you ready to sail with us? Come on, guys!!!

“This article is framed for NLP enthusiasts in order to offer natural language processing thesis topics.”

What Actually is NLP?

  • NLP is the process of retrieving the meaning of a given sentence
  • For this, it uses techniques & algorithms to extract features
  • It also involves the following:
  • Audio capturing
  • Text processing
  • Conversion of audio into text
  • Human-computer interaction

This is a crisp overview of the NLP system. NLP is one of the major technologies used in day-to-day life; without it, we could hardly imagine a single scenario. In fact, it saves people time by means of spell checking and grammar correction, and most importantly it is highly capable of handling audio data. In this regard, let us get an idea of how NLP works in general. Shall we get into that section? Come, let's move on!!!

How Does NLP Work?

  • Unstructured Data Inputs
  • Lingual Knowledge
  • Domain Knowledge
  • Domain Model
  • Corpora Model Training
  • Tools & Methods

The above listed are necessary when input is given to the model; the NLP model needs these aspects to turn unstructured data into structured data by means of parsing, stemming, lemmatization, and so on. In fact, NLP is classified by its two eminent functions: generation and understanding. Yes, my dear students, we are going to cover the next section with the NLP classifications.

Classifications of NLP

  • Natural Language-based Generation
  • Natural Language-based Understanding

The above listed are the 2 major classifications of NLP technology. Of these, let us have a further brief explanation of natural language-based understanding, which is sub-classified as follows:

  • Biometric Domains
  • Spam Detection
  • Opinion/Data Mining
  • Entity Linking
  • Named Entity Recognition
  • Relationship Extraction

This is how natural language-based understanding is sub-classified according to its functions. In recent days, NLP is booming: various research works and projects are being investigated and implemented successfully by our technical team. Generally, NLP processes are performed in a structured manner, overlapping in several steps when crafting natural language processing thesis topics. Yes dears, we are going to cover the next section with the steps that natural language processing is built on.

Natural Language Processing (NLP) Steps

  • Segmentation of Sentences
  • Tokenization of Words
  • PoS Tagging
  • Parsing of Syntactic Contexts
  • Removing of Stop Words
  • Lemmatization & Stemming
  • Classification of Texts
  • Emotion/Sentiment Analysis

Here, PoS stands for parts of speech. These are some of the steps involved in natural language processing, and NLP performs according to the inputs given. You might need examples in these areas, so for your better understanding we illustrate the same with clear bullet points. Come, let us try to understand them.

  • Take text and speech as inputs
  • Text inputs are analyzed by “word tokenization”
  • Speech inputs are analyzed by “phonetics”

In addition, both are further processed through the same stages:

  • Morphological Analysis
  • Syntactic Analysis
  • Semantic Understanding
  • Speech Processing

The above listed are the steps involved in NLP tasks in general. Word tokenization is one of the major steps, as it picks out the vocabulary words present in word groups; a minimal code sketch of such a pipeline is given below. However, NLP processes are subject to numerous challenges; our technical team has pointed out the challenges involved these days for a better understanding. Let's move on to the current challenges section.
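
To make these steps concrete, here is a minimal sketch of a preprocessing pipeline using the NLTK library; the sample sentence is invented for illustration, and the required NLTK data packages are downloaded at the top. Treat it as a sketch of the idea, not a production pipeline.

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # One-time downloads of the required NLTK data packages
    for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
        nltk.download(pkg, quiet=True)

    text = "The students were reading papers on natural language processing."

    tokens = nltk.word_tokenize(text)          # word tokenization
    tagged = nltk.pos_tag(tokens)              # PoS tagging

    stops = set(stopwords.words("english"))    # stop-word removal
    content = [t for t in tokens if t.isalpha() and t.lower() not in stops]

    lemmatizer = WordNetLemmatizer()           # lemmatization
    lemmas = [lemmatizer.lemmatize(t.lower()) for t in content]

    print(tagged)
    print(lemmas)   # e.g. ['student', 'reading', 'paper', ...]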

Before going to the next section, we would like to highlight ourselves here: we are a trusted crew of technicians who dynamically and effectively perform NLP-based projects and research. As a matter of fact, we have delivered many successful projects all over the world using emerging techniques. Now we can have the next section.

Current Challenges in NLP

  • Context/Intention Understanding
  • Voice Ambiguity/Vagueness
  • Data Transformation
  • Semantic Context Extracting
  • Word Phrase Matching
  • Vocabulary/Terminologies Creation
  • PoS Tagging & Tokenization

The above listed are the current challenges involved in natural language processing. Besides, we can overcome these challenges by improving the performance of NLP models; our technical experts routinely test natural language processing approaches to remove these constraints.

In the following passage, our technical team elaborately explains the various natural language processing approaches for ease of understanding. In fact, our researchers always focus on students' understanding, so they categorize every edge needed for NLP-oriented tasks and approaches. Are you interested to know about that? Now let's jump into the section.

Different NLP Approaches

Domain Model-based Approaches

  • Loss Centric
  • Feature Centric
  • Pre-Training
  • Pseudo Labeling
  • Data Selection
  • Model + Data-Centric

Machine Learning-based Approaches

  • Association
  • K-Means Clustering
  • Anomalies Recognition
  • Data Parsing
  • Regular Expressions
  • Syntactic Interpretations
  • Pattern Matching
  • BFS Co-location Data
  • BERT & BioBERT
  • Decision Trees
  • Logistic Regression
  • Linear Regression
  • Random Forests
  • Support Vector Machine
  • Gradient-based Networks
  • Convolutional Neural Network
  • Deep Neural Networks
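
As one concrete illustration of the machine learning route listed above, the short sketch below trains a tiny sentiment classifier with scikit-learn; the toy texts and labels are invented placeholders, and a real project would use a benchmark corpus.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy labeled data (placeholders only)
    texts = ["great product, works well", "terrible, broke in a day",
             "really happy with it", "waste of money"]
    labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

    # Bag-of-words features feeding a logistic regression classifier
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(texts, labels)

    print(clf.predict(["happy with this great product"]))  # expected: [1]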

Text Mining Approaches

  • K-nearest Neighbor
  • Naïve Bayes
  • Predictive Modeling
  • Association Rules
  • Classification
  • Document Indexing
  • Term & Inverse Document Frequency
  • Document Term Matrix
  • Distribution
  • Keyword Frequency
  • Term Reduction/Compression
  • Stemming/lemmatization
  • Tokenization
  • NLP & Log Parsing
  • Text Taxonomies
  • Text Classifications
  • Text Categorization
  • Text Clustering
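
Several of the text mining items above (document term matrix, term & inverse document frequency, keyword frequency) can be realized in a few lines; here is a minimal scikit-learn sketch over three invented documents.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]

    vectorizer = TfidfVectorizer()
    dtm = vectorizer.fit_transform(docs)   # sparse document-term matrix

    # TF-IDF weights of the terms occurring in the first document
    terms = vectorizer.get_feature_names_out()
    for term, weight in zip(terms, dtm.toarray()[0]):
        if weight > 0:
            print(term, round(weight, 3))

Terms that occur in many documents receive a lower inverse document frequency, which is exactly the weighting referred to in the list above.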

The above listed are the 3 major approaches mainly used for natural language processing in real time. However, each approach comes with merits and demerits. It is also important to know the advantages and disadvantages of the NLP approaches, which will help you focus on the constraints and lead you to developments. Shall we discuss the pros and cons of NLP approaches? Come on, guys!

Advantages & Disadvantages of NLP Approaches

Machine learning advantages:

  • Effortless Debugging
  • Effective Precision
  • Multiple Perspectives
  • Short-Form Reading

Machine learning disadvantages:

  • Ineffective Parsing
  • Poor Recall
  • Excessive Skill Requirements
  • Low Scalability

Text mining advantages:

  • Speedy Processing
  • Resilient Results
  • Effective Documentation
  • Better Recall
  • High Scalability

Text mining disadvantages:

  • Narrow Understanding
  • Poor at Reading Messages
  • Huge Annotation Effort
  • Complex Debugging

The foregoing passage conveyed the pros and cons of the two approaches, machine learning and text mining; even the best approach has both. If you want further explanation or clarification, feel free to approach our researchers and benefit from us. Generally, NLP models are trained to perform every task in order to recognize the inputs, in line with the latest natural language processing project ideas. Yes, you guessed right! The next section is all about training models for NLP.

Training Models in NLP

  • Pre-training from scratch: language-specific BERTs and multilingual BERT are trained this way on large corpora
  • Auxiliary pre-training: additional labeled tasks are used for adaptive pre-training
  • Multi-phase pre-training: domain-specific and broad tasks form the secondary phases of pre-training, and the unlabeled data sources make the difference between phases
  • TAPT, DAPT, AdaptaBERT & BioBERT are commonly used examples
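
To make this concrete, here is a minimal sketch of domain-adaptive pre-training (DAPT) using the Hugging Face transformers and datasets libraries: a BERT checkpoint continues masked language model training on an unlabeled in-domain corpus. The model name, the two-sentence corpus, and the hyperparameters are placeholder assumptions, not a recommended setup.

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    # Placeholder in-domain corpus; real DAPT uses a large unlabeled domain corpus
    corpus = ["aspirin inhibits platelet aggregation",
              "the patient presented with acute dyspnea"]

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    ds = Dataset.from_dict({"text": corpus})
    tokenized = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                       batched=True, remove_columns=["text"])

    # Standard MLM objective: 15% of the tokens are masked
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(output_dir="dapt-bert", num_train_epochs=1,
                             per_device_train_batch_size=2)
    Trainer(model=model, args=args, train_dataset=tokenized,
            data_collator=collator).train()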

As this article is titled natural language processing thesis topics, here we point out the latest thesis topics in NLP for your reference. Commonly, a thesis is the best illustration of the projects or research done in the chosen areas; in fact, it conveys the researcher's perspectives and thoughts to the examiners through an effective structure. If you are searching for thesis writing assistance, then this is the right platform; you can approach our team at any time.

In the following passage, we have itemized some of the latest thesis topics in NLP, which we think will help you a lot. As this is an important section, you are advised to pay close attention here. Come, let us learn them.

Latest Natural Language Processing Thesis Topics

  • Cross-lingual & Multilingual NLP Methods
  • Multi-modal NLP Methodologies
  • Provocative NLP Systems
  • Graph-oriented NLP Techniques
  • Data Amplification in NLP
  • Reinforcement Learning-based NLP
  • Dialogue/Voice Assistants
  • Market & Customer Behavior Modeling
  • Text Classification by Zero-shot/Semi-supervised Learning & Sentiment Analysis
  • Text Generation & Summarization
  • Relation & Knowledge Extraction for Fine-grained Entity Recognition
  • Knowledge & Open-domain based Question & Answering

These are some of the latest thesis topics in NLP. As a matter of fact, we have delivered around 200 to 300 theses with fruitful outcomes; they are very innovative and unique in their features, and our thesis writing approach impresses institutes incredibly. At this point, we would like to reveal the future directions of NLP for ease of understanding.

How to select the best thesis topics in NLP?

  • See the latest IEEE and other benchmark papers
  • Understand the NLP Project ideas recently proposed
  • Highlight the problems and gaps
  • Get the future scope of each existing work

Come let’s move on to the next section.

Future Research Directions of Natural Language Processing

  • Logical Reasoning Chains
  • Statistical Integrated Multilingual & Domain Knowledge Processing
  • Combination of Interacting Modules

On the whole, NLP requires a better understanding of texts; in fact, a system understands a text's meaning by relating the word phrases presented in it. Converting natural language into reasoning logic will lead NLP in future directions, and allowing modules to interact can enhance NLP pipelines. So far, we have covered the areas of natural language processing thesis topics and every aspect needed to write a thesis. If you are in a dilemma, you can have the valuable opinions of our technical experts.

“Let’s begin to work on your experimental areas and yield the stunning outcomes”

MILESTONE 1: Research Proposal

Finalize Journal (Indexing)

Before sitting down to write the research proposal, we need to decide on the exact journals, e.g., SCI, SCI-E, ISI, SCOPUS.

Research Subject Selection

As a doctoral student, subject selection is a big problem. Phdservices.org has a team of world-class experts experienced in assisting with all subjects. When you decide to work in networking, we assign our experts in your specific area for assistance.

Research Topic Selection

We help you with the right and perfect topic selection, which will sound interesting to the other fellows of your committee. For example, if your interest is in networking, the research topic could be VANET, MANET, or any other.

Literature Survey Writing

To ensure the novelty of research, we find research gaps in 50+ latest benchmark papers (IEEE, Springer, Elsevier, MDPI, Hindawi, etc.)

Case Study Writing

After the literature survey, we identify the main issue/problem that your research topic will aim to resolve and provide elegant writing support to establish the relevance of the issue.

Problem Statement

Based on the research gaps found and the importance of your research, we formulate the appropriate and specific problem statement.

Writing Research Proposal

Writing a good research proposal needs a lot of time. We span only a few days to cover all major aspects (reference paper collection, deficiency finding, drawing the system architecture, highlighting novelty).

MILESTONE 2: System Development

Fix Implementation Plan

We prepare a clear project implementation plan that narrates your proposal step by step and contains the software and OS specification. We recommend very suitable tools/software that fit your concept.

Tools/Plan Approval

We get approval for the implementation tools, software, and programming language, and finally for the implementation plan, to start the development process.

Pseudocode Description

Our source code is original, since we write the code after pseudocode, algorithm writing, and mathematical equation derivations.

Develop Proposal Idea

We implement our novel idea in the step-by-step process given in the implementation plan. We can help scholars with the implementation.

Comparison/Experiments

We perform the comparison between proposed and existing schemes in both a quantitative and a qualitative manner, since it is the most crucial part of any journal paper.

Graphs, Results, Analysis Table

We evaluate and analyze the project results by plotting graphs, computing numerical results, and giving a broader discussion of the quantitative results in tables.

Project Deliverables

For every project order, we deliver the following: reference papers, source code, screenshots, project video, and installation and running procedures.

MILESTONE 3: Paper Writing

Choosing the Right Format

We intend to write the paper in a customized layout. If you are interested in any specific journal, we are ready to support you; otherwise, we prepare it at IEEE Transactions level.

Collecting Reliable Resources

Before paper writing, we collect reliable resources such as 50+ journal papers, magazines, news, encyclopedias (books), benchmark datasets, and online resources.

Writing Rough Draft

We create an outline of the paper first and then write under each heading and sub-heading. It consists of the novel idea and resources.

Proofreading & Formatting

We proofread and format the paper to fix typesetting errors and avoid misspelled words, misplaced punctuation marks, and so on.

Native English Writing

We check the communication of the paper by having it rewritten by native English writers who completed their English literature studies at the University of Oxford.

Scrutinizing Paper Quality

We examine the paper quality with top experts who can easily fix issues in journal paper writing and also confirm the level of the journal paper (SCI, Scopus, or normal).

Plagiarism Checking

We at phdservices.org give a 100% guarantee of original journal paper writing. We never use previously published works.

MILESTONE 4: Paper Publication

Finding an Apt Journal

We play a crucial role in this step, since it is very important for the scholar's future. Our experts will help you in choosing high impact factor (SJR) journals for publishing.

Lay Paper to Submit

We organize your paper for journal submission, which covers the preparation of the authors' biography, cover letter, highlights of novelty, and suggested reviewers.

Paper Submission

We upload the paper and submit all prerequisites required by the journal. We completely remove the frustration from paper publishing.

Paper Status Tracking

We track your paper status, answer the questions raised before the review process, and give you frequent updates on your paper as received from the journal.

Revising Paper Precisely

When we receive the decision to revise the paper, we prepare a point-by-point response to address all reviewers' queries and resubmit it to attain final acceptance.

Get Accept & e-Proofing

We receive the final mail with the acceptance confirmation letter, and the editors send e-proofing and licensing to ensure originality.

Publishing Paper

The paper is published online, and we inform you of the paper title, author information, journal name, volume, issue number, page numbers, and DOI link.

MILESTONE 5: Thesis Writing

Identifying the University Format

We pay special attention to your thesis writing, and our 100+ thesis writers are proficient and clear in writing theses in all university formats.

Gathering Adequate Resources

We collect primary and adequate resources for writing a well-structured thesis, using published research articles, 150+ reputed reference papers, a writing plan, and so on.

Writing Thesis (Preliminary)

We write the thesis chapter by chapter without any empirical mistakes, and we provide a completely plagiarism-free thesis.

Skimming & Reading

Skimming involves reading the thesis and looking at the abstract, conclusions, sections and sub-sections, paragraphs, sentences, and words, and writing the thesis in the chronological order of the papers.

Fixing Crosscutting Issues

This step is tricky when a thesis is written by amateurs. Proofreading and formatting are done by our world-class thesis writers, who avoid verbosity and brainstorm for significant writing.

Organize Thesis Chapters

We organize thesis chapters by completing the following: elaborating each chapter, structuring chapters, flow of writing, citation correction, etc.

Writing Thesis (Final Version)

We pay attention to the details that matter: the importance of the thesis contribution, a well-illustrated literature review, sharp and broad results and discussion, and a relevant applications study.

How Does PhDservices.org Deal with Significant Issues?

1. Novel Ideas

Novelty is essential for a PhD degree. Our experts bring novel ideas to the particular research area; novelty can only be determined after a thorough literature search (state-of-the-art works published in IEEE, Springer, Elsevier, ACM, ScienceDirect, Inderscience, and so on). Reviewers and editors of SCI and Scopus journals will always demand novelty in each published work. Our experts have in-depth knowledge in all major and sub-research fields to introduce new methods and ideas. MAKING NOVEL IDEAS IS THE ONLY WAY OF WINNING A PHD.

2. Plagiarism-Free

To improve the quality and originality of works, we strictly avoid plagiarism, since plagiarism is not allowed or acceptable for any type of journal (SCI, SCI-E, or Scopus) from the editorial and reviewer point of view. We have anti-plagiarism software that examines the similarity score of documents with good accuracy, and we use various plagiarism tools like Viper and Turnitin. Students and scholars get their work with zero tolerance for plagiarism. DON'T WORRY ABOUT YOUR PHD; WE WILL TAKE CARE OF EVERYTHING.

3. Confidential Info

We intend to keep your personal and technical information secret; this is a basic worry for all scholars.

  • Technical Info: We never share your technical details with any other scholar, since we know the importance of the time and resources that scholars give us.
  • Personal Info: Access to scholars' personal details is restricted; only our organization's leading team will have your basic and necessary info.

CONFIDENTIALITY AND PRIVACY OF THE INFORMATION WE HOLD ARE OF VITAL IMPORTANCE AT PHDSERVICES.ORG. WE ARE HONEST WITH ALL CUSTOMERS.

4. Publication

Most PhD consultancy services end their services at paper writing, but PhDservices.org is different: we guarantee both paper writing and publication in reputed journals. With our 18+ years of experience in delivering PhD services, we meet all requirements of journals (reviewers, editors, and editors-in-chief) for rapid publication. From the beginning of paper writing, we lay out our smart work. PUBLICATION IS THE ROOT OF A PHD DEGREE; WE ARE LIKE A FRUIT GIVING A SWEET FEELING TO ALL SCHOLARS.

5. No Duplication

After completion of your work, it is not available in our library; we erase it after completion of your PhD work, so we avoid giving duplicate content to scholars. This makes our experts bring new ideas, applications, methodologies, and algorithms. Our work is standard, high-quality, and universal; everything we make is new for each scholar. INNOVATION IS THE ABILITY TO SEE ORIGINALITY. EXPLORATION IS THE ENGINE THAT DRIVES INNOVATION, SO LET'S ALL GO EXPLORING.

Client Reviews

I ordered a research proposal in the research area of Wireless Communications, and it was as good as I could have hoped.

I wished to complete my implementation using the latest software/tools and had no idea where to order it. My friend suggested this place, and it delivered what I expected.

It is a really good platform to get all PhD services, and I have used it many times because of the reasonable price, best customer service, and high quality.

My colleague recommended this service to me, and I'm delighted with their services. They guided me a lot and gave worthy content for my research paper.

I'm never disappointed with any kind of service. I still work with professional writers and get a lot of opportunities.

- Christopher

Once I entered this organization, I just felt relaxed, because lots of my colleagues and family relations had suggested using this service, and I received the best thesis writing.

I recommend phdservices.org. They have professional writers for all types of writing (proposal, paper, thesis, assignment) support at an affordable price.

You guys did a great job and saved me money and time. I will keep working with you, and I recommend you to others also.

These experts are fast, knowledgeable, and dedicated to working under a short deadline. I got a good conference paper in a short span.

Guys! You are the great and real experts for paper writing, since it exactly matches my demands. I will approach you again.

I am fully satisfied with the thesis writing. Thank you for your faultless service; I will soon come back again.

You offer trusted customer service. I don't have any cons to mention.

I was at the edge of my doctorate graduation, since my thesis was totally unconnected chapters. You people did magic, and I got my complete thesis!!!

- Abdul Mohammed

A good family environment with collaboration, and a lot of hardworking team members who actually share their knowledge by offering PhD services.

I enjoyed working with PhD Services hugely. I asked several questions about my system development and was amazed by their smoothness, dedication, and care.

I had not provided any specific requirements for my proposal work, but you guys are very awesome, because I received a proper proposal. Thank you!

- Bhanuprasad

I read my entire research proposal, and I liked how the concept suits my research issues. Thank you so much for your efforts.

- Ghulam Nabi

I am extremely happy with your project development support, and the source code is easy to understand and execute.

Hi!!! You guys supported me a lot. Thank you, and I am 100% satisfied with the publication service.

- Abhimanyu

I found this to be a wonderful platform for scholars, so I highly recommend this service to all. I ordered a thesis proposal, and they covered everything. Thank you so much!!!


Theses @ NLP Group

The NLP Group is continuously looking for students who would like to write their bachelor's or master's thesis in the area of natural language processing, possibly with connections to information retrieval and general artificial intelligence.

All thesis topics should be related to the main research directions of the NLP Group, which include computational argumentation, computational sociolinguistics, and computational explanation.

Below, we provide a selection of currently available topics. Details of the topics are discussed and shaped jointly in the beginning of the thesis process. Other topics are possible, including own ideas from the student's side, if they go hand in hand with our research interests.

Dealing with student responses and identifying the underlying student conceptions is a core competency that teacher students, as well as career changers without teacher training („Quer- und Seiteneinsteigende“), need to practice. A way to better prepare future teachers to interact with their students is to provide them with a virtual classroom environment where they can develop these educational competencies. As a first step towards this goal, we aim to simulate statements that individual students would make in a classroom setting using natural language processing methods. For this, we want to develop a language model that can generate text that reflects the thinking and understanding of students at different levels of science competency. Existing transcripts from 111 German chemistry student reports provide the basis for development and evaluation.

Advisor: Maja Stahl

Metaphors in language are central to explaining concepts. Especially in political opinion pieces, multiple studies have shown that the use of metaphorical language affects the views of liberals and conservatives differently, for example by persuading them to change their opinion about a political topic. In this thesis, we would explore how metaphors affect large language models (LLMs) in the same regard. We would further inspect to what extent metaphors affect the results of post-hoc explanation algorithms. Prior working knowledge of Python is required. Keywords: LLMs, Metaphors, Argumentation, Post-hoc explainability

Advisor: Meghdut Sengupta

Working on the outlined and similar topics involves dealing with state-of-the-art technologies such as neural transformers, contrastive learning, multitask learning, and various others. Most topics target the development and empirical evaluation of NLP methods for specific tasks.

Interested?

Candidates should have very good programming skills (preferably in Python) as well as some experience with machine learning and other AI methods (ideally with NLP). You should be enrolled in one of the computer science programs at Leibniz University Hannover.

In case you are interested in a specific topic, please send a mail to the advisor of that topic, including information about the prior knowledge and experience you have:

  • What relevant courses did you take?
  • What experience with AI development and evaluation do you have?
  • What other relevant knowledge do you have?

In case you are unsure about the topic, but are interested in writing your thesis with the NLP Group, please send a mail to the head of the group.

The grading of a thesis is based on weighted grades for two parts:

  • The developed solution to the problem tackled in the thesis (45%)
  • The written thesis presenting the solution (55%)

The grading of the developed solution takes five criteria into account:

  • Difficulty / Complexity.  How difficult was it to develop the solution? How much effort was put into it? Is the complexity justified? ... 
  • Technical quality.  Is the design and realization of the solution well-made? Are the experiments systematic and scientifically sound? ...
  • Novelty and own ideas.  Does the solution have scientific novelty? Have own ideas been developed and realized in the solution? ...
  • Impact / Publishability.  Does the solution improve the state of the art? Are the results worth publishing? Can they be published as is? ...
  • Implementation and data.  How easy is it to read and reuse the code? If data has been created, is it well-organized? Are they well-documented? ...

The grading of the written thesis takes six criteria into account:

  • Abstract, introduction, and conclusion.  Are problem, solution, and results well-introduced? Are the right conclusions made? Is the whole story told? ... 
  • Background and related work.  Are basics well-described and relevant? Is the connection to the thesis clear? Is the state of the art well-discussed? ...
  • Approaches and data.  Is the presentation of the developed approaches and data clear, complete, and on the right technical level? ...
  • Experiments, evaluation, and discussion.  Are the experiments described systematically? Are the results clearly presented and correctly interpreted? ...
  • Form, layout, and style.  Is the structure convincing? Is the writing clear and error-free? Do tables and figures support it? Are citations correct? ...
  • Scientific quality.  Does the thesis adhere to scientific standards? Does the presentation follow community principles? …

Past Theses (as of Winter 2022)

  • Evaluating Data-Driven Approaches to Improve Word Lists for Measuring Social Bias in Word Embeddings.  Master's thesis, Vinay Kaundinya Ronur Prakash, UPB.
  • Audience Aware Counterargument Generation.  Master's thesis. Mahammad Namazov, 2023, UPB.
  • Improving Learners’ Arguments by Detecting and Generating Missing Argument Components.  Master's thesis, Nick Düsterhus, 2023, UPB.
  • Gender-inclusive Coreference Resolution using Pronoun Preference.  Master's thesis, Jan-Luca Hansel, 2023, UPB.
  • Dialect-aware Social Bias Detection using Ensemble and Multi-Task Learning.  Master's thesis, Sai Nikhil Menon, 2022, UPB.
  • Counter Argument Generation Using a Knowledge Graph. Master's thesis, Indranil Ghosh, 2022, UPB.
  • Domain-aware Text Professionalization using Sequence-to-Sequence Neural Networks.  Bachelor's thesis, Juela Palushi, 2022, UPB.

Past Theses (Summer 2018 – Summer 2022)

  • Detection and Mitigation of Subjective Bias in Argumentative Text.  Master's thesis, Sambit Mallick, 2022, UPB.
  • Cross-domain analysis of argument quality and its connection to offensive language. Bachelor's thesis, Patrick Bollmann, 2022, UPB.
  • Cross-domain Aspect-based Sentiment Analysis with Multimodal Sources . Master's thesis, Pavan Kumar Sheshanarayana, 2022, UPB.
  • Comparative Evaluation of Automatic Summarization Techniques for German Court Decision Documents.  Master's thesis, Josua Köhler, 2022, UPB.
  • Computational Analysis of Cultural Differences in Learner Argumentation. Master's thesis, Garima Mudgal, 2022, UPB.
  • Propaganda Technique Detection Using Connotation Frames.  Master's thesis, Vinaykumar Budanurmath, 2022, UPB.
  • Contrastive Argument Summarization using Supervised and Unsupervised Learning.  Master's thesis, Jonas Rieskamp, 2022, UPB.
  • Mitigation of Gender Bias in Text using Unsupervised Controllable Rewriting. Master's thesis, Maja Brinkmann, 2021, UPB.
  • Assessing Stereotypical Social Biases in Text Sequences using Language.  Master's thesis, Meher Vivek Dheram, 2021, UPB.
  • Modeling Context and Argumentativeness of Sentences in Argument Snippet Generation.  Master's thesis, Harsh Shah, 2021, UPB.
  • Political Speaker Transfer: Learning to Generate Text in the Styles of Barack Obama and Donald Trump.  Master's thesis, Jonas Bülling, 2021, UPB.
  • Quantifying Social Biases in News Articles with Word Embeddings.  Bachelor's thesis, Maximilian Keiff, 2021, UPB.
  • Computational Text Professionalization using Neural Sequence-to-Sequence Models.  Master's thesis, Avishek Mishra, 2021, UPB.
  • Assessing the Argument Quality of Persuasive Essays using Neural Text Generation .  Master's thesis, Timon Gurcke, 2021, UPB.
  • Automatic Conclusion Generation using Neural Networks.  Bachelor's thesis, Torben Zöllner, 2020, UPB.
  • Computational Analysis of Metaphors based on Word Embeddings.  Bachelor's thesis,  Simon Krenzler, 2020, UPB. 
  • Semi-supervised Cleansing of Web-based Argument Corpora.  Bachelor's thesis, Jonas Dorsch, 2020, BUW.
  • Countering Natural Language Arguments using Neural Sequence-to-Sequence Generation.  Master's thesis, Arkajit Dhar, 2020, UPB.
  • Snippet Generation for Argument Search.  Bachelor's thesis, Nick Düsterhus, 2019, UPB.
  • Argument Quality Assessment in Natural Language using Machine Learning.  Bachelor's thesis, Till Werner, 2019, UPB.
  • Stance Classification in Argument Search.  Master's thesis, Philipp Heinisch, 2019, UPB.
  • Towards a Large-scale Causality Graph.  Bachelor's thesis, Yan Scholten, 2019, UPB.

Past Theses (Summer 2009 – Winter 2017)

  • Cross-Domain Mining of Argumentation Strategies using Natural Language Processing .  Master's thesis, 2017, BUW.
  • Mining Relevant Arguments at Web Scale .  Master's thesis, 2017, BUW.
  • Identifying Controversial Topics in Large-Scale Social Media Data .  Master's thesis, 2016, BUW.
  • Efficiency and Effectiveness of Multi-Stage Machine Learning Algorithms for Text Quality Assessment.  Master's thesis, 2013, UPB.
  • An Expert System for the Automatic Construction of Information Extraction Pipelines.  Master's thesis, 2012, UPB.
  • Efficiency and Effectiveness of Text Classification in Information Extraction Pipelines.  Master's thesis, 2012, UPB.
  • Efficient Information Extraction for Creating Use Case Diagrams from Text.  Master's thesis, 2012, UPB.
  • Heuristic Search for the Run-time Optimization of Information Extraction Pipelines.  Master's thesis, 2012, UPB.
  • Aggregation and Visualization of Market Forecasts.  Bachelor's thesis, 2011, UPB.
  • Branch Categorization based on Statistical Analysis of Information Retrieval Results.  Bachelor's thesis 2011, UPB.
  • Evaluation of Cooperative Robot Motion Strategies in Simbad.  Bachelor's thesis, 2009, UPB.

LUH: Leibniz University Hannover, UPB: Paderborn University, BUW: Bauhaus-Universität Weimar


NLP Thesis Topics

NLP is expanded as natural language processing. It is a method supporting computational approaches to learning human languages, aimed at the automated analysis and interpretation of human language in a natural way. We provide 10+ interesting, latest NLP thesis topics. Let's check the two steps by which NLP processes input:

  • An NLP system usually takes a series of words or phrases as input.
  • It then processes the input to analyze its meaning and generates a structured representation as output. The exact form of the output differs with the task at hand.
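
As a minimal sketch of this two-step flow, the snippet below tokenizes a sentence and emits a structured part-of-speech representation with NLTK. The download calls refer to NLTK's standard resources; exact resource names can vary across NLTK versions.

```python
# Minimal sketch: words/phrases in, structured representation out (NLTK).
# Assumes: pip install nltk; resource names may differ in newer NLTK versions.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Natural language processing turns raw text into structure."

tokens = nltk.word_tokenize(sentence)   # input: a series of words/phrases
tagged = nltk.pos_tag(tokens)           # output: structured (token, POS) pairs

print(tagged)
# e.g. [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ...]
```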

From this page, you can gain meaningful insight into Natural Language Processing from different research perspectives!!!

To support you in every research direction, we have well-equipped resource teams that serve you in both NLP research and development . We also include a writing team that prepares a well-structured Thesis . Here, we list a few important services that we provide for the NLP PhD / MS study.

Our Motivations for NLP Thesis Writing

  • Evolving Concepts
  • Growing NLP Models
  • Advanced NLP approaches
  • New benchmark datasets
  • Programming languages / Frameworks
  • NLP Thesis Topics
  • And many more
  • Provide keen guidance on modern algorithms for solving NLP problems
  • Give end-to-end assistance on project development using appropriate, friendly tools and resources
  • Assess experimental results and contribute new findings

General Approach to NLP

To provide you with the best NLP research support , we deeply study new frameworks that are well suited to implementing textual data science tasks , since the framework largely determines how efficient your NLP and text mining operations are. Here are some high-level stages performed in the majority of NLP projects (a minimal sketch of the first stages follows the list).

  • Data Acquisition
  • Data Preprocessing
  • Data Investigation
  • Model Assessment
  • Data Visualization
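
A hedged sketch of the first three stages, assuming a hypothetical reviews.csv with "text" and "label" columns; adapt the names to your own corpus.

```python
# Hedged sketch of data acquisition, preprocessing, and investigation.
import re
import pandas as pd

# 1. Data acquisition (reviews.csv is a placeholder file name)
df = pd.read_csv("reviews.csv")

# 2. Data preprocessing: lowercase, strip non-letters, collapse whitespace
def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

df["clean"] = df["text"].astype(str).map(normalize)

# 3. Data investigation: simple corpus statistics
df["n_tokens"] = df["clean"].str.split().str.len()
print(df["label"].value_counts())
print(df["n_tokens"].describe())

# 4./5. Model assessment and visualization would follow, e.g.
# df["n_tokens"].hist()   # requires matplotlib
```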

Our experts are glad to suggest the best-fitting framework for your project , and we ensure that the proposed framework can execute all necessary NLP approaches . Our developers are proficient in these and other approaches. Note that such a framework is iterative rather than strictly linear, as visualization and assessment often feed back into earlier stages; the KDD process is one example.

Further, if you need more details about the framework or significant approaches, then connect with us. We are ready to fulfill your needs in a timely manner.

What are models in NLP?

NLP models are structured representations of language patterns that a machine can learn from. Here, we list some important NLP models that tend to yield accurate results in the implementation phase. All these models help a machine learn human instructions and act accordingly, and we design them to achieve high performance in system automation (a Bi-LSTM sketch follows the list below).

Which NLP models give the best accuracy?

  • DMN and Bidirectional LSTM
  • Multichannel CNN
  • CRF with Dilated CNN
  • Linking with Semi-CRF
  • Paragraph Vector
  • DP with Manual Characteristics
  • K-Max Pooling with DCNN
  • CNN-assisted Parsing Features
  • Lexicon Infused-Phrase Embedding
  • Recursive Neural Tensor Network
  • LSTM-based Constituency Tree
  • Highway links with Bidirectional LSTM
  • Bi-LSTM / Bi-LSTM-CRF along with word+char embedding
  • Advanced Word Embedding with Tree-LSTM
  • Bi-LSTM along with Lexicon+word+char Embedding
  • MLP along with Gazetteer+word Embedding
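
Several entries above revolve around bidirectional LSTMs. Below is a minimal, illustrative PyTorch sketch of a Bi-LSTM sequence tagger; all dimensions and the vocabulary size are placeholders. A Bi-LSTM-CRF variant would replace the per-token argmax over these logits with a CRF decoding layer.

```python
# Minimal Bi-LSTM sequence-tagger sketch in PyTorch (illustrative sizes only).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=128, n_tags=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # 2 * hidden_dim because forward and backward states are concatenated
        self.out = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, token_ids):        # (batch, seq_len)
        x = self.embed(token_ids)        # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)              # (batch, seq_len, 2 * hidden_dim)
        return self.out(h)               # (batch, seq_len, n_tags) logits

model = BiLSTMTagger()
dummy = torch.randint(0, 5000, (2, 12))  # batch of 2 sentences, 12 tokens each
print(model(dummy).shape)                # torch.Size([2, 12, 9])
```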

How do I choose a thesis for NLP?

Now we turn to the importance of NLP thesis topics. When choosing an NLP thesis topic, think of the areas of interest that motivate you to do research in the NLP field. Your chosen topic should explicitly showcase your passion for research, and your interest in it should hold throughout the research journey until thesis submission and acceptance.

In general, you should choose the thesis topic from the current research areas of NLP , so it is essential to know the present developments in NLP projects. For that, refer to recent research articles and magazines. Focus mainly on widely known concepts so that you have ample reference material. At the same time, your chosen topic should improve on the existing process in a way no one has achieved before.

Once you have confirmed your research areas of interest, analyze the existing research gaps. For this purpose, survey reputed research journals such as Springer, IEEE, ScienceDirect, Emerald, etc. Then assess the pros and cons of the techniques used in the related research papers, select the set of possible research issues, and choose the optimal one. Finally, consult your mentor or field experts on the feasibility of your research issue in a real-world environment.

Prior to finalizing your research topic, analyze the future research possibilities and current research limitations: a lack of future scope makes a topic a poor choice, and too many limitations can make the research issue hard to solve and time-consuming to complete. After considering all these aspects, choose the unsolved questions in your desired NLP research area and seek the best solutions informed by past work.

Next, we present the most important NLP thesis topics from recent research areas. All these topics play a significant role in driving innovation in the field of natural language processing. For each topic, we also include the primary research issue, candidate techniques, and supporting datasets.

Once you contact us, we provide guidance on all suitable development requirements, and we assure you that our proposed research solutions are advanced enough to attain the expected results.

List of Natural Language Processing NLP Thesis Topics

  • Automated essay grading
    • Use an ML approach to grade essay reviews automatically; requires a feature-engineering method
    • Technique: linear regression over data features (sentiments, lexical diversity, entity counts, etc.)
    • Dataset: human-graded essay scores
  • Duplicate question detection on Quora
    • Compute semantic equivalence over question pairs and flag the closest one with a binary value; requires feature engineering and sentence-level methods such as parsing
    • Techniques: Naïve Bayes classifier, support vector machines
    • Dataset: Quora question pairs (about 400,000 pairs)
  • Tag prediction for Stack Overflow
    • Predict tags over Stack Overflow Q&A using an ML approach; a conventional multi-label text classification task (every query has multiple tags)
    • Technique: Labeled LDA
    • Dataset: Stack Overflow questions and tags
  • SMS spam detection
    • Rule-based spam filters are easy for spammers to detect and circumvent; an ML model can forecast spam SMS and be retrained as spammers add new spam terms
    • Technique: Naïve Bayes classifier
    • Dataset: SMS Spam Collection
  • Topic modeling on news headlines (see the sketch after this list)
    • Perform topic modeling with an unsupervised algorithm, cluster into K clusters, and inspect the clusters manually
    • Techniques: latent semantic analysis / LDA
    • Dataset: news headlines
  • Named entity recognition in medical data
    • Conventional named-entity extraction is not flexible enough to extract health entities in medical data (symptoms, diseases, procedures, medications, disorders, etc.)
    • Techniques: named entity recognition (NER) with conditional random fields
    • Dataset: Informatics for Integrating Biology and the Bedside (i2b2)
  • Tweet language identification
    • Forecast the language of a tweet via natural language recognition; a short-text language identification task
  • Automated spell checking
    • Construct an automated spell-checker model using a correction method
    • Datasets: collections of sentences with misspellings, where the main file holds tags such as <ERR targ=sister> siter </ERR> ("siter" stands for "sister") and other files hold statistics such as the number of mistakes; also a set of misspellings from Wikipedia, e.g. "broad soldiers", where "soldiers" should read "shoulders"
  • Twitter sentiment analysis
    • Feed collected tweets as input and train a model to classify human opinions/emotions in tweets into neutral, negative, and positive
    • Technique: deep random forest
    • Dataset: tweets with human-tagged sentiment
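
As one concrete illustration, the topic-modeling topic above can be prototyped in a few lines. This is a hedged sketch using scikit-learn's LDA on a handful of toy "headlines"; a real project would substitute an actual news-headlines dataset.

```python
# Hedged sketch: unsupervised topic modeling with scikit-learn's LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

headlines = [
    "government passes new budget bill",
    "team wins championship after overtime thriller",
    "parliament debates budget and taxes",
    "star striker scores twice in final",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(headlines)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Inspect each topic's top terms (the manual cluster-investigation step)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")
```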

Specifically, here we present some key datasets for language processing, data mining, and text mining. All these datasets are widely accepted by developers for implementing NLP projects, and each dataset is designed to support a specific set of NLP operations.

There are several commercial and non-commercial datasets for NLP research. We help you choose the best freely downloadable datasets for your project based on its purpose, since the outcome of the project depends heavily on the chosen dataset.

Benchmark Datasets for NLP Projects

  • Dataset: ~6.5M entities and ~5.4M resources; categories: 845K places, 1.6M persons, 56K plants, 280K companies, 5K diseases, 310K species; purpose: classification and ontology
  • Purpose: information retrieval
  • Dataset: Semantic Web; purpose: textual reasoning, language understanding, etc.
  • Dataset: ~10,900+ news documents in ~20 categories; purpose: clustering and classification
  • Dataset: US profiling of world territories and countries; categories: government, transportation, etc.; purpose: translation, processing, and analysis
  • Purpose: dialogue models, speech collection, speech recognition, speech synthesis, etc.
  • Dataset: ~21,575+ text documents in a group of categorized documents; purpose: classification

In addition, we present some important open-source development frameworks and programming languages for NLP projects. Once the project dataset is confirmed, the next step is to select suitable development technologies. To choose the optimal one, analyze the available libraries, modules, toolboxes, and packages, and the simplicity of the language. Core Java and Python in particular are considered developer-friendly languages that are flexible enough to develop many kinds of NLP applications and systems.

Programming Languages for NLP

Overall, we are here to provide the best end-to-end research services in the Natural Language Processing research field, with an abundance of new NLP thesis ideas to help you develop modern research work. We also suggest suitable development platforms, tools, and technologies based on your project needs, and we support you in preparing a polished thesis. We guarantee your satisfaction through our smart solutions, so connect with us to learn more interesting NLP thesis topics to begin your PhD / MS study.



Title: Facilitating Opinion Diversity through Hybrid NLP Approaches

Abstract: Modern democracies face a critical issue of declining citizen participation in decision-making. Online discussion forums are an important avenue for enhancing citizen participation. This thesis proposal 1) identifies the challenges involved in facilitating large-scale online discussions with Natural Language Processing (NLP), 2) suggests solutions to these challenges by incorporating hybrid human-AI technologies, and 3) investigates what these technologies can reveal about individual perspectives in online discussions. We propose a three-layered hierarchy for representing perspectives that can be obtained by a mixture of human intelligence and large language models. We illustrate how these representations can draw insights into the diversity of perspectives and allow us to investigate interactions in online discussions.


NLP Master Thesis (Research Areas)

The term NLP stands for Natural Language Processing , a branch of artificial intelligence used for understanding and analyzing human language . Intelligent devices such as computers, smartphones, and other gadgets natively operate on binary or ASCII values, yet increasingly accept natural language inputs. Handling and understanding those language inputs and producing the corresponding outputs are the main objectives in implementing an NLP Master Thesis here.

An NLP process enables multilingual communication between humans and intelligent devices . Semantic analysis is one of the important approaches that has earned a significant place in recent developments of NLP technology . By the end of this article, you will be able to write your own NLP master thesis without hesitation. Let us begin the article with an overview of NLP and how it allows smart devices to perform.

Top 6 NLP Master Thesis Research Areas

Overview of Natural Language Processing

  • Language is identified by exploring text information and its insights
  • Algorithms are the pillars of NLP's virtualized processes

Intelligent computer systems are facilitated by NLP skills such as those mentioned below.

  • Understanding unstructured text collections
  • Mining meaningful inputs from text collections
  • Computing the responses of the process
  • Executing entire tasks

To conclude the overview of NLP , it is useful in the following tasks:

  • Inspecting the content and drawing insights (what is the core meaning of the content?)
  • Exploring the content and its perspectives (how, when, or why is it said?)
  • Recognizing the emotions contained in the message (what are the sentiments/emotions/feelings?)

The above are the skill sets that computer devices obtain when accompanied by NLP techniques. Beyond this, intelligent machines benefit from NLP in various ways. If you want further facts in this area, you can approach our experts at any time; our technical team offers assistance 24/7 and is trusted by many students and scholars.

NLP consists of two important components: NLG and NLU . In other words, these are the basic elements of NLP . Further explanation follows in the next passage, with nuts-and-bolts points.

Are you looking for an article on the NLP master thesis? Then this article is absolutely meant for you!!!

Components of NLP

  • NLG: sentence and text planning
  • NLU: text recognition; resolving referential, lexical, and syntactic ambiguity

Now we can see about the five categories of the NLP systems.

What are the Five Categories of Natural Language Processing (NLP) Systems?

  • Lexical (Structure) Analysis
  • Syntactic (Parsing) Analysis
  • Semantic (Contextual) Analysis
  • Discourse (Integration) Analysis
  • Pragmatic (Logical) Analysis

The above are the five important categories of natural language processing and their possible application areas. We hope the statements listed so far are clear. Some NLP application areas are mentioned below for better understanding.

  • Automated Translation
  • Speech Recognition
  • Optical Character Recognition

The aforesaid are the applications encompassed by NLP so far; they play a vital role in natural language processing . NLP itself involves several processing steps, which our experts have listed below for ease of understanding (an end-to-end sketch follows the list).

Processing Steps for NLP

  • Input: textual or numerical elements
  • Normalization
  • Text embedding
  • Model training, e.g. a feed-forward neural network, XGBoost, a support vector machine, or logistic regression
  • Performance analysis and evaluation
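
A hedged end-to-end sketch of these steps, with TF-IDF standing in for the embedding stage and logistic regression for the model stage; the toy texts and labels are placeholders for a real corpus.

```python
# Hedged sketch: normalization/embedding via TF-IDF, a linear model,
# and a held-out performance evaluation with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["great product", "terrible service", "loved it", "awful experience",
         "would buy again", "never again", "fantastic support", "broken on arrival"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]   # toy sentiment labels

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)                                  # training step
print(classification_report(y_te, clf.predict(X_te)))  # evaluation step
```

Swapping the final estimator for an SVM or XGBoost classifier changes only the `make_pipeline` call, which is why the steps above are usually framed as interchangeable stages.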

So far, we have discussed the fundamental steps of natural language processing and concentrated on the various aspects handled in NLP. As a matter of fact, our articles are published in top journals owing to their strong features.

While researching NLP, you may face some important issues that you may not be aware of. We know these research issues and their corresponding solutions from frequently conducting experiments. In this regard, we list several issues below for ease of understanding.

What are the Important Issues in NLP?

  • Semantic ambiguity: word/phrase modifications (prepositions such as on, with, by)
  • Lexical ambiguity: different usages of the same word (adverb, noun, and verb)
  • Syntactic ambiguity: vagueness in sentences (verb vs. noun readings)

Semantic and syntactic ambiguities are interconnected in nature. For better understanding, let's consider an example.

Ex: I saw a girl on the road with my sunglasses.

This can be understood either as the girl having my sunglasses with her or as me seeing the girl through my sunglasses. Hence the sentence is syntactically ambiguous. A dependency parse makes this attachment choice explicit, as the sketch below shows.
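
A minimal sketch, assuming spaCy with the en_core_web_sm model installed; the parser must commit to one attachment, which is exactly the ambiguity at issue.

```python
# Sketch: inspect where the parser attaches "with my sunglasses".
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I saw a girl on the road with my sunglasses.")

for tok in doc:
    print(f"{tok.text:12} {tok.dep_:8} <- {tok.head.text}")
# Whether "with" attaches to "saw" or to "girl" is the prepositional-phrase
# ambiguity discussed above; the parser has to pick one reading.
```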

We have also listed some other issues below for your added knowledge:

  • False-positive Handling
  • Continuous Conversations
  • Texts with Different Meaning
  • Limited memory and context understanding

Till now, we have discussed the essential areas covered by NLP . While doing projects and research, one should consider the algorithms used in the technology . You might get confused by the list of algorithms involved in NLP ; don't worry, we cover them in the upcoming sections. Before jumping into that area, let us first outline NLP algorithms and how they work.

Outline of NLP Algorithms

  • Transformation of strings to vectors
  • Phrase-context saving and corpus training
  • Application of probabilistic methods

This is the shortest possible overview of NLP algorithms . Since you may be curious about the details, the next section covers the algorithms and their workflow.

How do NLP Algorithms Work?

  • Training Data Outline
  • Acquisition of Training Data
  • Supervised Learning
  • Unsupervised Learning
  • Structure Forming
  • Data Authentication
  • Accuracy Evaluation

The above are the ways in which NLP algorithms do their work . Language structure is predicted using discrete optimization methods and techniques that recognize the structure of the input language. NLP is highly compatible with machine learning concepts, so these systems learn plenty of rules by examining various training datasets, for example books consisting of many pages and varied sentence constructions. The interpretation of statistical elements is central to NLP . For better understanding, we point out the NLP machine learning algorithms below.

  • Extraction of similarities from unstructured text (clustering)
  • Neighbor-voting-based data classification (e.g. k-nearest neighbors)
  • Probability-based data point classification
  • Probability-based class assignment
  • Identification of classification dividers (e.g. support vector machines)

The aforementioned are various algorithms used in NLP . Our technical crew is very familiar with these areas; handling the complexities that arise is easy for them, which results in effective guidance for students. We have delivered many successful projects and theses across the world, so if you want assistance with an NLP master thesis, you can approach our experts without hesitation. In this regard, let us discuss the research areas of NLP .

What are the Research Areas of NLP Master Thesis?

  • Opinion Mining & Emotion Analysis
  • Summarization of Text
  • Classification of Text
  • Extraction of Information
  • Email Spam Recognition

Itemized above are some of the research areas in NLP ; besides these, various other research areas come with future directions. NLP is one of the promising technologies that can yield the best results among projects. Below, we list the future directions of NLP for your better understanding of the concept.

What are the Future Directions of NLP?

  • Multi-modal Learning Techniques
  • NLP Resource-free Tasks
  • Common Sense & Knowledge Identification
  • NLP Models for Training

These are the future directions involved in natural language processing.  Libraries and frameworks play a key role in every underlying NLP process; each library is developed with specific objectives, and working with the various libraries available for natural language processing helps us apply the ones that fit best. Let us discuss the most commonly used libraries in NLP (a small word-vector sketch follows the list).

Popularly Used NLP Libraries

  • Huge data processing and streaming
  • Suits unsupervised deep learning methods
  • Effective text processing and documentation
  • Suits machine learning methods
  • Well-optimized frameworks
  • Suits neural-network model training
  • Assimilation of word vectors
  • Broad NLP coverage and support for numerous languages
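
As a small illustration of the word-vector capability mentioned above, here is a hedged Gensim Word2Vec sketch on a toy corpus; a real project would stream a large tokenized corpus instead.

```python
# Hedged sketch: training word vectors with Gensim's Word2Vec (Gensim 4 API).
from gensim.models import Word2Vec

corpus = [
    ["natural", "language", "processing", "analyzes", "text"],
    ["word", "vectors", "capture", "distributional", "meaning"],
    ["language", "models", "learn", "from", "text"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, epochs=50, seed=0)

print(model.wv["language"].shape)              # (50,)
print(model.wv.most_similar("language", topn=2))
```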

The aforementioned are the libraries generally used in NLP . Apart from these, various other libraries are used to enhance the performance of NLP. If you want more information on these areas, approach our technical team for brief explanations. Our researchers habitually experiment with natural language processing to improve on its current shortcomings. Being masters in NLP, we are highly trusted for technical projects, research, and above all thesis work.

In the following passage, we list the performance metrics used to evaluate natural language processing, classified into two categories: regression metrics and classification metrics. Shall we get into that phase? Come, let's cover one of the most important sections of this article (a short metrics sketch follows the list).

NLP Master Thesis Research Guidance

Performance Metrics for NLP

  • R² (coefficient of determination): coefficient that compares the model against predetermined baselines
  • RMSE (root mean squared error): square root of the mean squared difference between the actual and observed values
  • MSE (mean squared error): mean of the squared difference between the actual and observed values
  • MAE (mean absolute error): mean of the absolute difference between the actual and observed values
  • ROC curve area (ROC: receiver operating characteristic curve): shows the trade-off between specificity and recall
  • F1 score: harmonic mean of recall and precision
  • Specificity: rate of correctly identified negative instances among all negative instances
  • Recall: rate of correctly identified positive instances among all positive instances
  • Precision: rate of accurately classified instances among instances predicted positive
  • Confusion matrix: summarizes correct and incorrect classifications as an N×N matrix, where N is the number of classes
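
Most of these metrics are available off the shelf; the sketch below computes a few of them with scikit-learn on toy predictions.

```python
# Hedged sketch: computing several of the metrics above with scikit-learn.
from sklearn.metrics import (confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error, r2_score)

# Classification metrics on toy predictions
y_true, y_pred = [1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1]
print(confusion_matrix(y_true, y_pred))   # N x N matrix of (in)correct calls
print(f1_score(y_true, y_pred))           # harmonic mean of precision/recall

# Regression metrics on toy scores
t, p = [3.0, 2.5, 4.0], [2.8, 2.9, 3.7]
mse = mean_squared_error(t, p)
print(mse, mse ** 0.5)                    # MSE and its square root (RMSE)
print(mean_absolute_error(t, p), r2_score(t, p))
```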

      Itemized above are the performance metrics that determine NLP quality . So far, we have discussed all the possible areas covered in an NLP Master Thesis . If you face any challenges while writing the thesis, you can surely interact with our experts. We are always delighted to assist you!!!


LLM-based Tool for FAIR Data Assessment

FAIR data is one of the most important pillars of research data management. It refers to data or a dataset that is Findable, Accessible, Interoperable, and Reusable. The fundamental challenge for researchers is to assess the FAIRness of their research data using various platforms or tools. Most of these tools are built on established assessment frameworks such as the RDA FAIR Data Maturity Indicators , implemented either as a guideline or manual checklist of questions that the data owner fills in, or as an automatic assessment approach such as the one provided by the F-UJI Automated FAIR Data Assessment Tool .

However, FAIR data assessment is undergoing a transformative paradigm shift with the advent of Large Language Models (LLMs). Leveraging advanced natural language processing capabilities, LLMs have the potential to enhance information retrieval, knowledge synthesis, and hypothesis generation.

The goal of the thesis, as depicted in the figure below, is to leverage the benefits inherent in LLMs to develop a tool for FAIR data assessment.

(Figure: overview of the planned LLM-based FAIR data assessment tool)

The dataset for training the LLM will be selected from a data portal such as Kaggle ( https://www.kaggle.com/datasets ), Hugging Face ( https://huggingface.co/datasets ), or FAIRsharing.org ( https://fairsharing.org/FAIRsharing.cc3QN9 ). The main tasks of the thesis are:

  • Review existing literature on FAIR data assessment tools and LLMs
  • Find relevant datasets to compare classical FAIR data assessment methods with LLM-based methods
  • Implement an LLM-based FAIR data assessment tool
  • Evaluate "classical" FAIR data assessment methods against the LLM-based approach (performance and usability of the LLM FAIR data assessment tool)
  • Use the LLM to generate recommendations for improving the FAIRness of particular datasets (a prompting sketch follows this list)
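
As a rough illustration of the last two tasks, here is a minimal sketch, assuming an instruction-tuned open model served through the Hugging Face transformers pipeline; the model name and the metadata record are placeholders, not part of the original topic description, and a real evaluation would parse and validate the model's output.

```python
# Hedged sketch only: prompting an LLM to rate dataset metadata on FAIRness.
from transformers import pipeline

# Any instruction-tuned chat model works; this checkpoint is one example.
generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

metadata = """title: Ocean temperature records 1950-2020
license: CC-BY-4.0
identifier: doi:10.1234/example
access: https://example.org/data.csv"""      # placeholder metadata record

prompt = (
    "Rate the following dataset metadata on each FAIR dimension "
    "(Findable, Accessible, Interoperable, Reusable) from 0-2, "
    "with one short justification and one improvement suggestion "
    "per dimension:\n\n" + metadata
)

result = generator(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"])
```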

Related project

NFDI4DS: https://www.nfdi4datascience.de/

  • Raza, Shaina et al. “FAIR Enough: How Can We Develop and Assess a FAIR-Compliant Dataset for Large Language Models’ Training?” (2024). https://api.semanticscholar.org/CorpusID:267069102
  • Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3 , 160018 (2016). https://doi.org/10.1038/sdata.2016.18
  • FAIRification Process: https://www.go-fair.org/fair-principles/fairification-process/
  • Ahmad, Raia Abu et al. (2024). Toward FAIR Semantic Publishing of Research Dataset Metadata in the Open Research Knowledge Graph. https://api.semanticscholar.org/CorpusID:269137144
  • FAIR for AI: An Interdisciplinary and International Community Building Perspective – https://www.nature.com/articles/s41597-023-02298-6
  • FAIR AI models in high energy physics – https://iopscience.iop.org/article/10.1088/2632-2153/ad12e3/meta
  • Soiland-Reyes, S., Sefton, P., Leo, S., Castro, L. J., Weiland, C., & Van De Sompel, H. (2024). Practical webby FDOs with RO-Crate and FAIR Signposting: Experiences and lessons learned. International FAIR Digital Objects Implementation Summit 2024 . https://pure.manchester.ac.uk/ws/portalfiles/portal/300255290/TIB_FDO2024_signposting_ro_crate-11.pdf

Examples of existing FAIR data assessment tools:

  • F-UJI, EOSC FAIR EVA: https://fair.csic.es/en
  • AutoFAIR; FAIR Shake: https://fairshake.cloud/
  • FAIR-Checker: https://fair-checker.france-bioinformatique.fr/
  • The Evaluator: https://fairsharing.github.io/FAIR-Evaluator-FrontEnd/
  • DANS Self-Assessment Tool: https://satifyd.dans.knaw.nl/
  • EUDAT Fair Data Checklist: https://zenodo.org/records/1065991
  • https://github.com/FAIRMetrics/Metrics
Prerequisites:

  • Natural Language Processing (NLP)
  • Interest in working with Large Language Models
  • Some knowledge of LLM (pre-)training, tokenization, and fine-tuning techniques


Early iterations of the AI applications we interact with most today were built on traditional machine learning models. These models rely on learning algorithms that are developed and maintained by data scientists. In other words, traditional machine learning models need human intervention to process new information and perform any new task that falls outside their initial training.

For example, Apple made Siri a feature of its iOS in 2011. This early version of Siri was trained to understand a set of highly specific statements and requests. Human intervention was required to expand Siri’s knowledge base and functionality.

However, AI capabilities have been evolving steadily since breakthrough advances in  artificial neural networks  in 2012, which allow machines to learn from large amounts of data and loosely simulate how the human brain processes information.

Unlike basic machine learning models, deep learning models allow AI applications to learn how to perform new tasks that need human intelligence, engage in new behaviors and make decisions without human intervention. As a result, deep learning has enabled task automation, content generation, predictive maintenance and other capabilities across  industries .

Due to deep learning and other advancements, the field of AI remains in a constant and fast-paced state of flux. Our collective understanding of realized AI and theoretical AI continues to shift, meaning AI categories and AI terminology may differ (and overlap) from one source to the next. However, the types of AI can be largely understood by examining two encompassing categories: AI capabilities and AI functionalities.

1. Artificial Narrow AI

Artificial Narrow Intelligence, also known as Weak AI (what we refer to as Narrow AI), is the only type of AI that exists today. Any other form of AI is theoretical. It can be trained to perform a single or narrow task, often far faster and better than a human mind can.

However, it can’t perform outside of its defined task. Instead, it targets a single subset of cognitive abilities and advances in that spectrum. Siri, Amazon’s Alexa and IBM Watson are examples of Narrow AI. Even OpenAI’s ChatGPT is considered a form of Narrow AI because it’s limited to the single task of text-based chat.

2. General AI

Artificial General Intelligence (AGI), also known as  Strong AI , is today nothing more than a theoretical concept. AGI can use previous learnings and skills to accomplish new tasks in a different context without the need for human beings to train the underlying models. This ability allows AGI to learn and perform any intellectual task that a human being can.

3. Super AI

Super AI is commonly referred to as artificial superintelligence and, like AGI, is strictly theoretical. If ever realized, Super AI would think, reason, learn, make judgements and possess cognitive abilities that surpass those of human beings.

The applications possessing Super AI capabilities will have evolved beyond the point of understanding human sentiments and experiences to feel emotions, have needs and possess beliefs and desires of their own.

Underneath Narrow AI, one of the three types based on capabilities, there are two functional AI categories:

1. Reactive Machine AI

Reactive machines are AI systems with no memory and are designed to perform a very specific task. Since they can’t recollect previous outcomes or decisions, they only work with presently available data. Reactive AI stems from statistical math and can analyze vast amounts of data to produce a seemingly intelligent output.

Examples of Reactive Machine AI  

  • IBM Deep Blue: IBM’s chess-playing supercomputer AI beat chess grandmaster Garry Kasparov in the late 1990s by analyzing the pieces on the board and predicting the probable outcomes of each move.
  • The Netflix Recommendation Engine: Netflix’s viewing recommendations are powered by models that process data sets collected from viewing history to provide customers with content they’re most likely to enjoy.

2. Limited Memory AI

Unlike Reactive Machine AI, this form of AI can recall past events and outcomes and monitor specific objects or situations over time. Limited Memory AI can use past- and present-moment data to decide on a course of action most likely to help achieve a desired outcome.

However, while Limited Memory AI can use past data for a specific amount of time, it can’t retain that data in a library of past experiences to use over a long-term period. As it’s trained on more data over time, Limited Memory AI can improve in performance.

Examples of Limited Memory AI  

  • Generative AI: Generative AI tools such as ChatGPT, Bard and DeepAI rely on limited memory AI capabilities to predict the next word, phrase or visual element within the content it’s generating.
  • Virtual assistants and chatbots: Siri, Alexa, Google Assistant, Cortana and IBM Watson Assistant combine natural language processing (NLP) and Limited Memory AI to understand questions and requests, take appropriate actions and compose responses.
  • Self-driving cars: Autonomous vehicles use Limited Memory AI to understand the world around them in real-time and make informed decisions on when to apply speed, brake, make a turn, etc.

3. Theory of Mind AI

Theory of Mind AI is a functional class of AI that falls under General AI. Though an unrealized form of AI today, AI with Theory of Mind functionality would understand the thoughts and emotions of other entities, which would affect how it interacts with those around it. In theory, this would allow the AI to simulate human-like relationships.

Because Theory of Mind AI could infer human motives and reasoning, it would personalize its interactions with individuals based on their unique emotional needs and intentions. Theory of Mind AI would also be able to understand and contextualize artwork and essays, which today’s generative AI tools are unable to do.

Emotion AI is a theory of mind AI currently in development. AI researchers hope it will have the ability to analyze voices, images and other kinds of data to recognize, simulate, monitor and respond appropriately to humans on an emotional level. To date, Emotion AI is unable to understand and respond to human feelings.  

4. Self-Aware AI

Self-Aware AI is a functional AI class for applications that would possess Super AI capabilities. Like Theory of Mind AI, Self-Aware AI is strictly theoretical. If ever achieved, it would have the ability to understand its own internal conditions and traits along with human emotions and thoughts. It would also have its own set of emotions, needs and beliefs.


Computer vision

Narrow AI applications with  computer vision  can be trained to interpret and analyze the visual world. This allows intelligent machines to identify and classify objects within images and video footage.

Applications of computer vision include:

  • Image recognition and classification
  • Object detection
  • Object tracking
  • Facial recognition
  • Content-based image retrieval

Computer vision is critical for use cases that involve AI machines interacting and traversing the physical world around them. Examples include self-driving cars and machines navigating warehouses and other environments.

Robots in industrial settings can use Narrow AI to perform routine, repetitive tasks that involve materials handling, assembly and quality inspections. In healthcare, robots equipped with Narrow AI can assist surgeons in monitoring vitals and detecting potential issues during procedures.

Agricultural machines can engage in autonomous pruning, mowing, thinning, seeding and spraying. And smart home devices such as the iRobot Roomba can navigate a home's interior using computer vision and use data stored in memory to understand its progress.

Expert systems

Expert systems equipped with Narrow AI capabilities can be trained on a corpus to emulate the human decision-making process and apply expertise to solve complex problems. These systems can evaluate vast amounts of data to uncover trends and patterns to make decisions. They can also help businesses predict future events and understand why past events occurred.

IBM has pioneered AI from the very beginning, contributing breakthrough after breakthrough to the field. IBM most recently released a big upgrade to its cloud-based, generative AI platform known as watsonx.  IBM watsonx.ai  brings together new generative AI capabilities, powered by foundation models and traditional machine learning into a powerful studio spanning the entire AI lifecycle. With watsonx.ai, data scientists can build, train and deploy machine learning models in a single collaborative studio environment.



Biomaterials Science

Colon-targeted oral nanoliposomes loaded with psoralen alleviate DSS-induced ulcerative colitis


Oral administration, while convenient, often faces challenges due to the complexity of the digestive environment. In this study, we developed a nanoliposome (NLP) encapsulating psoralen (P) and coated it with chitosan (CH) and pectin (PT) to formulate PT/CH-P-NLPs. PT/CH-P-NLPs exhibit good biocompatibility, superior to liposomes loaded with psoralen alone and to free psoralen. After oral administration, PT/CH-P-NLPs remain stable in the stomach and small intestine, followed by a burst release of psoralen on reaching the slightly alkaline, gut-microbiota-rich colon segment. In DSS-induced ulcerative colitis in mice, PT/CH-P-NLPs significantly reduced inflammation, mitigated oxidative stress, protected the integrity of the colon mucosal barrier, and modulated the gut microbiota. In conclusion, the designed nanoliposomes demonstrate the effective application of psoralen in treating ulcerative colitis.

Graphical abstract: Colon-targeted oral nanoliposomes loaded with psoralen alleviate DSS-induced ulcerative colitis


L. Su, G. Song, T. Zhou, H. Tian, H. Xin, X. Zou, Y. Xu, X. Jin, S. Gui and X. Lu, Biomater. Sci. , 2024, Advance Article , DOI: 10.1039/D4BM00321G


    Oral administration, while convenient, but complex often faces challenges due to the complexity of the digestive environment. In this study, we developed a nanoliposome (NLP) encapsulating psoralen (P) and coated it with chitosan (CH) and pectin (PT) to formulate PT/CH-P-NLPs. PT/CH-P-NLPs exhibit good biocompatibi