Home » SEO » Using Open Source Software for building SEO Tools

Using Open Source Software for building SEO Tools

Bob Scavilla
Published: March 31, 2022
SEO

Table of Contents

Add a header to begin generating the table of contents

A Live Demo of the Apache OpenNLP Project: NameFinder

URL to pull Names From:

Open source has many great projects for helping SEO Tool developers. Here I created a live demo of an AI Natural Language Processing toolkit along with Java sample code.

Source code can be downloaded from Github

Before starting my own digital marketing and analytics company back in 2011, I had a career in software development behind me. At the time, I had a lot of experience with search algorithms, data collection, storage, and processing. I had been using open source repositories for years and often used these projects as starting points for my software projects. One of my favorite things to do is write tools for my own company. I don’t have to write all of our SEO tools, there are some mature platforms that we use such as Screaming Frog and SEMRush that meet our needs just fine. But, to stay ahead of competition, I often build proprietary systems that allow us to produce something unique that offers high value to our clientele.

My favorite repo is Apache.org. This is a well-run organization which currently has over 300 top level projects. I am constantly working through the project demos to learn how a project may be useful to our SEO team and how it may be integrated into our tech stack. One of the projects that I’ve become interested in is OpenNLP. The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. There are many reasons why an SEO company needs to work with NLP, and we have several tools developed inhouse that work with natural language. What really interested me in the OpenNLP project was how easy it was to train models for different purposes.

Command Line Interfaced example for OpenNLP — CLI Text Annotation Tool

The OpenNLP toolkit has a command line interface and an API for application development. The command line interface is an excellent tool for familiarizing yourself with all of the toolkit’s features.

There are several components of the toolkit that perform different NLP functions such as Language Detection, Part Of Speech tagger, and Document Categorizer. For the purpose of this article I’m going to demonstrate the Name Finder function. The technical term for finding things like names, addresses and such is Named-Entity Recognition, abbreviated as NER. An entity in this case can be a person’s name, a location, time, phone number, address, a VIN for a vehicle, or really anything. The cool thing is, OpenNLP models can be trained to extract and label any types of entity.

The input box at this top of this page allows you to enter a URL, and the name entities on that page will be returned. This demo is using the demo model that comes with the toolkit, so it may not be perfect, but it’s close enough to get the idea. A little additional training would spruce up the results pretty good. The code samples are in java and intentionally kept simple for clarity.

I assume that you are able to create a java application and add the OpenNLP dependencies.

For maven:

				
							<dependency>
			<groupId>org.apache.opennlp</groupId>
			<artifactId>opennlp-tools</artifactId>
			<version>1.9.4</version>
		</dependency>

The models are read from disk. There are 3 models used for NameFinder:

en-token.bin
en-sent.bin
en-ner-person.bin

**If you are using Eclipse, these models would go into the resources folder To find names the text must be segmented into sentences and tokens. Models for these come with the toolkit and work well:

				
						private static String[] sentenceDetector(String text) {
		InputStream is = null;
		is = OpenNLP.class.getResourceAsStream("/en-sent.bin");
		String sentences[] = null;
		try {
			SentenceModel model = new SentenceModel(is);
			SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
			sentences = sentenceDetector.sentDetect(text);
		} catch (IOException e) {
			e.printStackTrace();
		}		
		return sentences;
	}

The outcome of the sentence detector is the sentence array holds a string for each sentence of the text. The tokenizer process is similar to the sentence detector. The outcome is that each word or punctuation mark is stored in a separate string.

				
						private static List TokenizedText(String[] sentenceText) {
	  // create a stream
		InputStream tokenM = null;
		tokenM = OpenNLP.class.getResourceAsStream("/en-token.bin");
		TokenizerModel model=null;
		try {
			model = new TokenizerModel(tokenM);
		} catch (IOException e) {
			e.printStackTrace();
		}
		Tokenizer tokenizer = new TokenizerME(model);
		
		List ts = new ArrayList();
		for(String sentence : sentenceText) {
			ts.add(tokenizer.tokenize(sentence));
		}
		return ts;
	}

Once the text is pre-processed, it’s ready for the NER model.

				
						private static void findNames(List ts) {
		InputStream is = null;
		is = OpenNLP.class.getResourceAsStream("/en-ner-person.bin");
		
		try {
			TokenNameFinderModel model = new TokenNameFinderModel(is);
			NameFinderME nameFinder = new NameFinderME(model);
			
			ts.forEach(docs -> {
				Span nameSpans[] = nameFinder.find(docs);
				for(Span s: nameSpans) {
					String entity=s.getType() + ": " + getCoveredText(s, docs);
					System.out.println(entity);
				}
			});
			
			nameFinder.clearAdaptiveData();

			
		} catch (IOException e) {
			e.printStackTrace();
		}
}

Bob Scavilla

Partner + CEO

Bob’s endorphin level rises when he is coding. Read More »

About FourFront

FourFront uses data to provide digital marketing and market research services. In our blog, our team of analysts, strategists, and engineers provides tips, insights, analysis, and commentary.

Keep In Touch

Learn about new articles by following us on social:

Using Open Source Software for building SEO Tools

Bob Scavilla

Leave a Reply Cancel reply

SEO

How Data-Driven Marketing Moves Strategy Forward

SEO Checklist for Your Website Redesign

What is SERP Volatility and Why Does it Matter?

How To Improve Lead Quality

Data Ethics in Marketing: How to Safeguard Consumer Data

All Categories

SEO

PPC

UX

DATA STORIES

COMPANY NEWS

CORE WEB VITALS

WEB ANALYTICS

GUESTS

Popular Posts

What is Sustainable Digital Marketing?

10 SEO and Digital Marketing Trends for 2025

Understanding and Optimizing for Generative Search

Local SERP Checker

10 SEO and Web Marketing Trends to Watch in 2024

About FourFront

Keep In Touch

What We Do

Who We Serve

Free Resources

About FourFront

Sign Up for Updates

Using Open Source Software for building SEO Tools

Leave a Reply Cancel reply

SEO

All Categories

Popular Posts

About FourFront

Keep In Touch

What We Do

Who We Serve

Free Resources

About FourFront

Sign Up for Updates

Submit a Request