About The Project
What if AI could tell you what your page is all about?
Today’s world is awash with unstructured information. Big data is not just a matter of raw volume; the real issue is usability. Conventional analytics focuses on structured data, and its methods are not suited to large volumes of unstructured text. Text analytics is a way to extract meaning from unstructured text in order to find patterns and transformations.
With that in mind, our team took on the challenge of using natural language processing (NLP) to analyse text data and then report its sentiment, key phrases and category.
Part 1: Getting raw text data
The Concept

Websites are built for human consumption, not for machines. Copying and pasting information from websites by hand is time-consuming, error-prone and does not scale. Web scraping gets data from a website by sending a request for the page, then applying algorithms that strip out the HTML markup and scripts to leave the raw text data.
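As a rough illustration of the idea above, the sketch below fetches a page and reduces it to plain text using only the Python standard library. It is not the project's actual scraper: the names `TextExtractor`, `html_to_text` and `fetch_text`, the set of skipped tags and the User-Agent header are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the contents of non-content tags."""

    # Assumed set of tags whose contents are never readable text.
    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop empty text nodes.
        text = " ".join(data.split())
        if self._skip_depth == 0 and text:
            self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, without markup or scripts."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_text(url, timeout=10):
    """Request a page and return its raw text (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

Calling `fetch_text` with a page URL would return that page's text as a single string. Keeping the parsing step in `html_to_text` separate from the network request makes it possible to check the cleaning logic against saved HTML.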
But it's not that easy.
The Challenges
Data accuracy is extremely important when extracting content from a webpage. When dealing with different types of websites, be ready to face pages that differ in both their HTML coding and their site structure.

Our Approach
The solution is constant monitoring and timely adjustment of the algorithm: in short, find the differences in the raw data, then act as required. After many attempts, we have identified a large number of exclusions to apply in the process. Nonetheless, this remains a continuous effort.
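One way to picture such exclusions is as a rule set that the extractor consults while walking the page, so that adjusting the algorithm means editing the rules rather than the parser. The sketch below is only an assumption about how this could look: the tag names and class tokens in `EXCLUDED_TAGS` and `EXCLUDED_CLASS_TOKENS` are hypothetical examples, not the exclusions the team actually identified, and the depth tracking assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Hypothetical exclusion rules; a real list would grow as new sites are seen.
EXCLUDED_TAGS = frozenset({"script", "style", "nav", "footer", "aside", "form"})
EXCLUDED_CLASS_TOKENS = frozenset({"cookie-banner", "popup", "breadcrumb"})


class FilteredTextExtractor(HTMLParser):
    """Collects text, skipping any element that matches an exclusion rule."""

    def __init__(self, excluded_tags=EXCLUDED_TAGS,
                 excluded_classes=EXCLUDED_CLASS_TOKENS):
        super().__init__(convert_charrefs=True)
        self.excluded_tags = set(excluded_tags)
        self.excluded_classes = set(excluded_classes)
        self._skip_depth = 0  # > 0 while inside an excluded element
        self._chunks = []

    def _is_excluded(self, tag, attrs):
        if tag in self.excluded_tags:
            return True
        class_tokens = (dict(attrs).get("class") or "").split()
        return any(token in self.excluded_classes for token in class_tokens)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside an excluded element, every nested element is skipped too.
        if self._skip_depth > 0 or self._is_excluded(tag, attrs):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if self._skip_depth == 0 and text:
            self._chunks.append(text)


def extract_text(html, **rules):
    """Return page text with the given (or default) exclusion rules applied."""
    parser = FilteredTextExtractor(**rules)
    parser.feed(html)
    parser.close()
    return " ".join(parser._chunks)
```

Comparing the output of `extract_text` under two rule sets for the same page is a simple way to see what a newly added exclusion removes. Pages with unclosed tags inside an excluded region would confuse the depth counter, which is one reason this kind of extraction needs ongoing adjustment.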
Part 2: Data Processing & Analysis
The Concept
The powerful pre-trained models of the Natural Language API let developers easily apply natural language understanding (NLU) to their applications.
We use the Google Natural Language API to classify a webpage into a set of categories, and Azure Text Analytics to obtain the page sentiment, key phrases and named entities (NER).

The Method
To give users a holistic view of a page content's quality, we covered four key areas in our implementation method:

  • Sentiment Analysis: Sentiment analysis estimates how positive, neutral or negative a text is. The Azure Text Analytics API returns confidence scores between 0 and 1 for positive, neutral and negative sentiment, both for each document and for each sentence within it.

  • Key Phrase Extraction: The API evaluates unstructured text and returns a list of key phrases. This capability is useful if you need to quickly identify the main points in a collection of documents.

  • Named Entity Recognition (NER): NER can identify and categorize entities in your text as people, places, organizations and quantities. Well-known entities are also recognized and linked to more information on the web.

  • Classify text into categories: The Google Cloud Natural Language API lets you classify text into categories. Using a database of 700+ categories, this API feature makes it easy to classify a large dataset of text.
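
To make the sentiment scores described above more concrete, the sketch below picks the strongest label for a document and for each of its sentences. `SAMPLE_DOCUMENT` is a hand-written stand-in whose field names are assumed to follow the JSON returned by the Azure sentiment endpoint; the numbers are invented for illustration and are not real API output. `dominant_label` and `summarize` are hypothetical helper names.

```python
# Hand-written sample, assumed to mirror the shape of one analysed document.
SAMPLE_DOCUMENT = {
    "id": "1",
    "sentiment": "mixed",
    "confidenceScores": {"positive": 0.55, "neutral": 0.10, "negative": 0.35},
    "sentences": [
        {
            "text": "The layout is clean.",
            "sentiment": "positive",
            "confidenceScores": {"positive": 0.97, "neutral": 0.02, "negative": 0.01},
        },
        {
            "text": "The page took too long to load.",
            "sentiment": "negative",
            "confidenceScores": {"positive": 0.03, "neutral": 0.09, "negative": 0.88},
        },
    ],
}


def dominant_label(scores):
    """Return the (label, score) pair with the highest confidence."""
    label = max(scores, key=scores.get)
    return label, scores[label]


def summarize(document):
    """Reduce a document's scores to one label per document and per sentence."""
    sentences = [
        (sentence["text"], *dominant_label(sentence["confidenceScores"]))
        for sentence in document["sentences"]
    ]
    return {
        "document": dominant_label(document["confidenceScores"]),
        "sentences": sentences,
    }


summary = summarize(SAMPLE_DOCUMENT)
```

Reading the per-sentence labels next to the document label shows why both levels are useful: a page can lean positive overall while still containing clearly negative sentences.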

The Process
The project team identified the steps to make data processing happen according to the flow below:
1. Enter URL
Extract raw text data from a webpage
2. Data Processing
Convert raw text data to machine-readable data
3. Evaluate
Get insights with Azure/Google NLP API
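The three steps above can be sketched as a small pipeline in which the extraction and analysis functions are passed in as parameters. Everything here is an assumption made for illustration: `PageReport`, `clean_text` and `analyse_page` are hypothetical names, the cleaning step (whitespace normalisation plus a length cap) is only one reading of "machine-readable", and `max_chars` is a placeholder that should be set from the limits documented by whichever NLP service is called.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PageReport:
    url: str
    text: str
    insights: Dict[str, object]


def clean_text(raw_text: str, max_chars: int = 5000) -> str:
    """Step 2: collapse whitespace and cap the length (assumed cleaning rules)."""
    text = re.sub(r"\s+", " ", raw_text).strip()
    return text[:max_chars]


def analyse_page(
    url: str,
    extract: Callable[[str], str],
    analysers: Dict[str, Callable[[str], object]],
    max_chars: int = 5000,
) -> PageReport:
    """Run the three steps for one URL and collect the results."""
    raw_text = extract(url)                  # Step 1: URL -> raw text
    text = clean_text(raw_text, max_chars)   # Step 2: raw text -> cleaned text
    # Step 3: each analyser wraps one NLP call (sentiment, key phrases, ...).
    insights = {name: analyse(text) for name, analyse in analysers.items()}
    return PageReport(url=url, text=text, insights=insights)
```

In use, `extract` would wrap the scraper and each entry in `analysers` would wrap one call to the Azure or Google API. Passing them in as plain functions keeps the flow testable with stand-ins, without network access or API keys.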
The Outcome
From collecting and processing data to training the AI method, we are able to provide preliminary insights that help users learn more about the quality of their content. By combining all the tools we discovered in this project, the team has successfully implemented an AI engine that auto-generates SEO keywords simply by running through the content and detecting the most relevant keywords, without human intervention. This feature has been officially rolled out and is available to all users in XTOPIA.
End Note
This project was completed within a four-week rapid research and development cycle. The potential of this AI implementation is beyond our imagination, and the team is constantly looking for ways to maximise the use of AI on the web. Got questions? Contact us directly at [email protected]

An AI initiative by
© 2022 XTOPIA SDN BHD (516045-V) All rights reserved