Relationship Extraction from Any Web Article using spaCy and Jupyter Notebook in 6 Steps
Introduction
Natural Language Processing (NLP) is a branch of Artificial Intelligence referring to the ability of a computer program to understand human language as it is spoken and written. A gentle overview of this field is given in this article: An Introduction to NLP.
Among the applications of NLP, there is a focus on Content Analysis for social media and web data mining, and one of the important aspects of Content Analysis is Relationship Extraction.
Relationship Extraction is the process of identifying relationships between different entities in a text. It involves identifying the entities in a sentence and then analysing the relations between them.
What is spaCy?
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. spaCy is designed specifically for production use and helps build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.
Another widely used library is NLTK, which is more research-focused: NLTK provides a range of algorithms to choose from, while spaCy simply uses the best algorithm currently available.
The following are among the features that spaCy offers:
- Tokenization: Segmenting text into words, punctuation marks, etc.
- Part-of-speech (POS) Tagging: Assigning word types to tokens, like verb or noun.
- Dependency Parsing: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
- Lemmatization: Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.
- Sentence Boundary Detection (SBD): Finding and segmenting individual sentences.
- Named Entity Recognition (NER): Labelling named “real-world” objects, like persons, companies or locations.
There are many other important features provided by spaCy, which can be further explored in the spaCy documentation.
In the next sections, we will go through the step-by-step process of Relationship Extraction from a website.
Note: If you are totally new to Jupyter Notebook and Python in general, you can visit Introduction to Jupyter for a quick introduction.
Step 1: Install & Import Dependencies
To install spaCy and download its small English model, use the following commands:
!pip install spacy
!python -m spacy download en_core_web_sm
It is also required to install pandas and bs4 (BeautifulSoup):
!pip install pandas
!pip install bs4
After installation, we need to import those libraries for use in our code.
Step 2: Choose your desired Web Content, Scrape it and Save it into CSV format
In my experiment, I selected an article from rigzone.com about the impact of Omicron on the global oil market. If you would like to use a different article, just replace the URL in the scraping code.
The rest of the code reads the webpage and saves only the text into the ‘parsed_text’ variable. The data stored in ‘parsed_text’ looks as below.
Then, we split the text into individual sentences stored in a list, saving the output in the variable ‘sentences’. The output should look like the snip below.
To save the data into CSV format, we make use of the following code.
The code above saves the data stored in ‘sentences’ into a CSV file named “article_text.csv”. The CSV file looks something like below.
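A minimal version of the CSV-writing step, using the standard-library `csv` module (the `sentences` list here is a tiny stand-in so the snippet runs on its own):

```python
import csv

# Stand-in for the list built in the scraping step.
sentences = ["Oil prices fell on Friday.", "", "Analysts expect a recovery."]

with open("article_text.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence"])          # header row
    for sentence in sentences:
        writer.writerow([sentence])        # one sentence per row
```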
Step 3: Manually Clean Data (if necessary) & Re-Import the CSV File
If you look closely, there are some blank lines that are unnecessary for further processing. At this stage, we will simply delete those blank lines manually. (This step could be automated, but I skipped that for now.) After cleaning, I saved the file as “article_text_clean.csv”. The data can then be re-imported using the following line of code.
Its output can be seen below.
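For completeness, here is a sketch that automates the blank-line removal with pandas (an alternative to the manual cleaning described above) and then re-imports the cleaned file. The stand-in DataFrame at the top only exists to make the snippet self-contained.

```python
import pandas as pd

# Stand-in for the file written in Step 2.
pd.DataFrame({"sentence": ["Oil prices fell.", "", "Demand recovered."]}).to_csv(
    "article_text.csv", index=False)

# Automated alternative to deleting blank rows by hand:
df = pd.read_csv("article_text.csv")
df = df.dropna(subset=["sentence"])            # empty cells are read as NaN
df = df[df["sentence"].str.strip() != ""]      # drop whitespace-only rows
df.to_csv("article_text_clean.csv", index=False)

# Re-import the cleaned file for the next steps.
candidate_sentences = pd.read_csv("article_text_clean.csv")
print(candidate_sentences.shape)
```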
Step 4: Get entity pairs
The next step is to get entity pairs using the following Code:
The function above can read any sentence and return two entities. For example, see the following usage.
We then run the function over all of the sentences imported from the CSV file earlier.
Step 5: Get Relations for the entities
Now that we have the entity pairs for the article, the next step is to get the relation for each pair. We make use of the following code.
The snippet below shows how this function is supposed to work:
Now we apply the function to the text imported from the CSV:
We can display a summary of the relations as below:
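The apply-and-summarize step can be sketched as follows; the stand-in `relations` list keeps the snippet self-contained (in the notebook it would come from mapping the relation function over the cleaned sentences).

```python
import pandas as pd

# In the notebook:
# relations = [get_relation(s) for s in candidate_sentences["sentence"]]
# Stand-in list so this snippet runs on its own:
relations = ["cost", "cost", "said", "expects", "cost"]

# Count how often each relation occurs, most frequent first.
summary = pd.Series(relations).value_counts()
print(summary)
```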
Step 6: Display Entity Relations into Graphs
The final step is to visualize the entity relations into a network graph.
To display everything (all entities and their relations), use the following code.
The output is shown in the image below:
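A sketch of the full-graph visualization with networkx and matplotlib. The triples in `kg_df` are invented stand-ins; in the notebook they would be built from the entity pairs (Step 4) and relations (Step 5).

```python
import matplotlib
matplotlib.use("Agg")               # render without a display
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# Stand-in triples; in practice: source/target from entity pairs,
# edge labels from the extracted relations.
kg_df = pd.DataFrame({
    "source": ["omicron", "omicron", "analysts"],
    "target": ["oil market", "prices", "recovery"],
    "edge":   ["cost", "pushed", "expect"],
})

# Build a directed multigraph from the triples and draw it.
G = nx.from_pandas_edgelist(kg_df, "source", "target",
                            edge_attr="edge", create_using=nx.MultiDiGraph())
plt.figure(figsize=(8, 6))
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color="skyblue", node_size=1500)
plt.savefig("graph_all.png")
```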
Sometimes it is not a good idea to display everything, because the visualization can become barely readable. This usually happens with large texts, which produce a large number of relations. An example of such an “overcrowded” visualization is shown below.
To mitigate this, we can “filter” which relation we want to display, using the following code.
In the code above, we filter on the relation “cost”. The graph (built from the Rigzone article) shows the output below.
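The filtering idea can be sketched as below: keep only the rows of the triple table whose relation matches the chosen value, then draw the graph exactly as before (again with invented stand-in triples).

```python
import matplotlib
matplotlib.use("Agg")               # render without a display
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

kg_df = pd.DataFrame({
    "source": ["omicron", "omicron", "analysts"],
    "target": ["oil market", "prices", "recovery"],
    "edge":   ["cost", "pushed", "expect"],
})

# Keep only edges whose relation matches the filter value.
subset = kg_df[kg_df["edge"] == "cost"]
G = nx.from_pandas_edgelist(subset, "source", "target",
                            edge_attr="edge", create_using=nx.MultiDiGraph())
plt.figure(figsize=(8, 6))
nx.draw(G, nx.spring_layout(G), with_labels=True,
        node_color="skyblue", node_size=1500)
plt.savefig("graph_cost.png")
```

Swapping `"cost"` for any other relation value re-plots the graph for that relation only.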
You can always change which relation the filter uses by replacing “cost” with another possible relation. Below is the output when I applied the filter to another dataset (Wiki_Sentences_V2.csv), using the relation “composed by”.
Conclusion
With the step-by-step guide above, I hope you can see the possible uses of relationship extraction. Obviously, much could be improved, both in the model’s text analysis and in the visualization itself. With those improvements, this technique could be applied in real business or academic settings.
Full Code
The full code (in Jupyter Notebook) can be downloaded from my GitHub page: https://github.com/hamiasmaiX/web-relationextraction
Acknowledgement
This article is part of the course requirements for Professor Anton Kolonin’s course, Application Aspects of Social Data Processing (Social Intelligence Technologies, or Social Computing), at Novosibirsk State University. (Anton’s Medium page: https://aigents.medium.com/)
References
- https://www.kaggle.com/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk
- https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/
- https://www.easytweaks.com/write-list-csv-python/
- https://www.rigzone.com/news/what_could_omicron_cost_global_oil_market-03-dec-2021-167199-article