Tutorial Series on NLP: Information Extraction Tasks

Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources.

In this series of blog posts I will walk through several tutorials covering what Information Extraction tasks consist of, and give you fundamental code samples that you can build on further.


What is Information extraction?

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP).

Some more Background….

The field of information extraction has its genesis in the natural language processing community, where the primary impetus came from competitions centered around the recognition of named entities, such as person and organization names, in news articles. As society became more data oriented, with easy online access to both structured and unstructured data, new applications of structure extraction came around. Now there is interest in converting our personal desktops to structured databases, turning the knowledge in scientific publications into structured records, and harnessing the Internet for structured fact-finding queries. Consequently, many different communities of researchers bring in techniques from machine learning, databases, information retrieval, and computational linguistics for various aspects of the information extraction problem.

Catalogue of Language Resources and Tools in Japan

There are several resources available, but it is a good idea to follow this catalogue for an exhaustive list of the resources we can refer to. Please follow the link.

PS: This is really one of the most resourceful pages I have ever found on Japanese text.

List of all Modules

Module 1 : Tagger Module

The tagger module performs the following tasks: text preprocessing, syntactic parsing using the Stanford Parser, sense tagging using a SuperSense tagger, and coreference resolution.
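To make the preprocessing and tagging step concrete, here is a minimal toy sketch of tokenization plus dictionary-based tagging. The lexicon and tag names are invented for illustration; the module described above would instead rely on the Stanford Parser and a trained SuperSense tagger.

```python
import re

# Toy lexicon mapping words to coarse part-of-speech tags; a real tagger
# (e.g. the Stanford pipeline named above) learns these from annotated data.
LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "mat": "NOUN", "dog": "NOUN",
    "sat": "VERB", "ran": "VERB",
    "on": "ADP",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tag(text):
    """Return (token, tag) pairs, falling back to 'X' for unknown words."""
    return [(tok, LEXICON.get(tok, "X")) for tok in tokenize(text)]
```

For example, `tag("The cat sat on the mat")` yields a list of (token, tag) pairs that the downstream modules can consume.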

Module 2: Fact Extraction Module

This module performs various syntactic transformations on sentences to extract factual information. The transformations are based on English syntactic rules.
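As a sketch of what one such rule-based transformation might look like, the toy function below handles a single English construction: an appositive of the form "Name, the description, ..." is transformed into the fact "Name is the description". The rule and function name are illustrative assumptions, not the module's actual rule set, which would cover many more constructions.

```python
import re

# One toy English rule: an appositive "<Name>, the <description>, ..."
# yields the fact "<Name> is the <description>".
APPOSITIVE = re.compile(r"^([A-Z][a-z]+(?: [A-Z][a-z]+)*), (the [^,]+),")

def extract_facts(sentence):
    """Apply the appositive rule and return any extracted fact strings."""
    m = APPOSITIVE.match(sentence)
    if m:
        return [f"{m.group(1)} is {m.group(2)}"]
    return []
```

`extract_facts("Ada Lovelace, the first programmer, wrote notes.")` returns the single fact "Ada Lovelace is the first programmer".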

Module 3: Entity Extraction Module

The entity extraction module extracts entities from the text. The entity types are based on WordNet senses; in total, there are 27 noun categories and 15 verb categories.
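A minimal sketch of the idea, assuming a tiny hand-built gazetteer: each surface form is mapped to a WordNet-style supersense label such as `noun.person` or `noun.location`. A real supersense tagger would assign these labels with a trained model rather than a lookup table, and the entries below are invented examples.

```python
# Toy gazetteer mapping surface forms to WordNet-style supersense labels.
SUPERSENSES = {
    "paris": "noun.location",
    "tokyo": "noun.location",
    "einstein": "noun.person",
    "run": "verb.motion",
}

def extract_entities(tokens):
    """Return (token, supersense) pairs for tokens found in the gazetteer."""
    return [(t, SUPERSENSES[t.lower()]) for t in tokens
            if t.lower() in SUPERSENSES]
```

`extract_entities(["Einstein", "visited", "Paris"])` picks out the person and the location while skipping the unlisted verb.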

Module 4: Relation Extraction Module

The relation extraction module extracts the (predicate, subject, object) triplets present in sentences. For complex sentences, more than one triplet can be present.
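The triplet idea can be sketched with a naive pattern matcher over tagged tokens: every NOUN-VERB-NOUN window becomes a triple, here emitted in (subject, predicate, object) order. This is a toy stand-in for real relation extraction, which would work over a dependency parse; the tag names match the toy tagger assumption used earlier in this post.

```python
def extract_triplets(tagged):
    """Extract (subject, predicate, object) triples from a (token, tag)
    sequence using a naive NOUN-VERB-NOUN window; a complex sentence
    can therefore yield several triples."""
    triplets = []
    for i in range(len(tagged) - 2):
        (s, st), (p, pt), (o, ot) = tagged[i], tagged[i + 1], tagged[i + 2]
        if st == "NOUN" and pt == "VERB" and ot == "NOUN":
            triplets.append((s, p, o))
    return triplets
```

`extract_triplets([("alice", "NOUN"), ("likes", "VERB"), ("bob", "NOUN")])` returns a single (alice, likes, bob) triple.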

Module 5: Sentiment Analysis Module

The first step in the construction of any sentiment analysis engine is to define clearly the type of problem which your sentiment analysis engine will target. SA comes in different flavors, each demanding different types of NLP techniques and resources. In fact, the synonymous term for SA, opinion mining, better reflects the broad nature of the task associated with sentiment analysis.

SubModule – Classification Type (Simple vs Aspect-based vs Comparative Sentiment Analysis)

Certain sentiment analysis scenarios demand the identification in text of the specific entity (object) and the aspect (attribute or feature) which the sentiment refers to. This requires applying information extraction (IE) techniques to the target text in order to identify these elements, adding a significant layer of complexity to the sentiment analysis process. This type of sentiment analysis, called aspect-based sentiment analysis, is commonly applied to online product reviews, in which target objects (a specific camera product, for example) and their different aspects (e.g. its luminosity capture) are classified according to a specific sentiment category or class.
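A toy sketch of the aspect-based idea: pair each known aspect term with the polarity of an adjacent opinion word. The lexicons and the adjacency heuristic are assumptions made for illustration; real systems extract aspects with IE techniques and use far richer sentiment models.

```python
# Toy lexicons; real aspect-based systems learn these from data.
ASPECTS = {"battery", "screen", "luminosity"}
POLARITY = {"great": "positive", "poor": "negative", "dim": "negative"}

def aspect_sentiments(tokens):
    """Pair each aspect term with the polarity of an adjacent opinion word."""
    results = []
    for i, tok in enumerate(tokens):
        if tok in ASPECTS:
            # look at the immediate neighbours for an opinion word
            for j in (i - 1, i + 1):
                if 0 <= j < len(tokens) and tokens[j] in POLARITY:
                    results.append((tok, POLARITY[tokens[j]]))
    return results
```

On the review fragment "great battery but dim screen", this yields (battery, positive) and (screen, negative), i.e. one sentiment per aspect rather than one per document.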

Other types of sentiment analysis target a general aggregate assessment of the polarity attached to an entity (a politician, artist or a brand), which is identified in a simple fashion. Usually this is the entry point for doing sentiment analysis and defines a coarse-grained type of sentiment analysis.

Another possible variation includes the presence of comparative opinions (versus regular opinions), i.e. outputting comparisons between different entities and aspects.

SubModule– Polarity Granularity

This consists of the types and granularities of classes which will be the target of the classification task. Typical class schemes vary from 3 classes (positive, neutral, negative) up to 5 classes (very positive, positive, neutral, negative, very negative). The meaning of the classes typically varies across different domains of discourse (bullish, neutral, and bearish for the financial domain; 5-star ratings for product reviews).
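The choice of granularity can be sketched as a simple mapping from a continuous polarity score to a class label. The score range [-1, 1] and the 0.1/0.6 thresholds are assumptions chosen for the example, not fixed values from any particular library.

```python
def to_class(score, scheme=3):
    """Map a polarity score in [-1, 1] to a 3- or 5-class label."""
    if scheme == 3:
        if score > 0.1:
            return "positive"
        if score < -0.1:
            return "negative"
        return "neutral"
    # 5-class scheme adds the "very" extremes
    if score > 0.6:
        return "very positive"
    if score > 0.1:
        return "positive"
    if score < -0.6:
        return "very negative"
    if score < -0.1:
        return "negative"
    return "neutral"
```

The same underlying score of 0.8 maps to "positive" in the 3-class scheme but "very positive" in the 5-class scheme, which is exactly the granularity decision this submodule must fix up front.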

SubModule– Discourse Granularity (Opinion Target)

Depending on the domain of discourse, the typical size of the text which needs to be analyzed can vary significantly, ranging from a tweet to a full text (a long product review, for example). Additionally, even for large corpora, it is possible to define different levels of analysis granularity. Typical levels are document level, sentence level, and entity/aspect level.

SubModule– Subjectivity Level

Depending on the type of analysis being aimed at, it can be useful to differentiate between subjective and objective types of sentences. While an objective sentence expresses factual information, a subjective sentence expresses personal opinions, beliefs, or feelings. This separation is not always clear, as some opinionated information can be communicated in a more factual style of discourse. For example, a technical comparative analysis between different product attributes will primarily target objective types of discourse, while general brand perception analyses may be more focused on subjective discourse types.
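A minimal sketch of subjectivity filtering, assuming a tiny hand-picked cue-word lexicon: a sentence is flagged subjective when it contains enough cue words. Real systems use large learned subjectivity lexicons or classifiers; the cue set and threshold here are invented for illustration.

```python
# Toy subjectivity cues; real systems use learned subjectivity lexicons.
SUBJECTIVE_CUES = {"think", "feel", "love", "hate", "believe", "awful", "great"}

def is_subjective(tokens, threshold=1):
    """Flag a sentence as subjective when it contains enough cue words."""
    hits = sum(1 for t in tokens if t.lower() in SUBJECTIVE_CUES)
    return hits >= threshold
```

"I think this is great" is flagged subjective, while "the battery lasts ten hours" passes as objective and could be routed away from the sentiment classifier.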

SubModule– Discourse Attributes (Formality, Language)

Additionally, the domain of discourse will define the level of language formality (presence of slang, abbreviations), which can impact the quality of sentiment analysis. Another factor to take into account is the set of target languages which will be addressed by the sentiment analysis, which can be mono-lingual (focused on a single language) or multi-lingual (targeting multiple languages).

Figure 1 summarizes the set of core categories which can be used to define the type of target sentiment analysis and indicates the level of complexity. Understanding these categories is fundamental to properly scoping the problem, as they deeply relate to the type and complexity of the NLP strategy which will need to be employed.

Module 6: Document Classification & Language Modeling Module

The Document Classification & Language Modeling Module automatically classifies text documents into one or more predefined categories and assigns a probability to each class.
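The "categories plus class probabilities" behaviour can be sketched with a from-scratch multinomial Naive Bayes classifier. This is one possible technique, chosen for the illustration because it is compact; the module itself may use a different model, and the training sentences below are invented.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes: assigns a document to one of the
    trained categories and reports a probability for each class."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for w in doc.lower().split():
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def predict_proba(self, doc):
        log_scores = {}
        n_docs = sum(self.class_counts.values())
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / n_docs)  # class prior
            for w in doc.lower().split():
                # Laplace smoothing keeps unseen words from zeroing the score
                score += math.log((self.word_counts[c][w] + 1)
                                  / (total + len(self.vocab)))
            log_scores[c] = score
        # convert log scores into normalised probabilities
        m = max(log_scores.values())
        exp = {c: math.exp(s - m) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}

    def predict(self, doc):
        probs = self.predict_proba(doc)
        return max(probs, key=probs.get)
```

After fitting on a couple of labelled documents, `predict` returns the most probable category and `predict_proba` the per-class probabilities the module description calls for.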

Module 7: Network Graph Module

Network of ambassadors.
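One common way to build such a network, assuming the entities per document come from the earlier extraction modules, is a co-occurrence graph: entities appearing in the same document are linked, and the edge weight counts their shared documents. This is a sketch of one plausible approach, not necessarily the module's actual design.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(docs):
    """Build a weighted undirected graph linking entities that appear in
    the same document; edge weight = number of shared documents."""
    graph = defaultdict(int)
    for entities in docs:
        # sort so each undirected edge has one canonical key
        for a, b in combinations(sorted(set(entities)), 2):
            graph[(a, b)] += 1
    return dict(graph)
```

The resulting edge dictionary plugs straight into a graph library for visualisation or centrality analysis of the extracted entities.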

In the next post I will list the available resources for each module, and in the post after that, the techniques I will use to develop each of these modules to perform our IE task.

Stay tuned :)
Please support this article if it helps you.

Data Scientist by profession and just lazy by nature.
