Written by Michel Fortin. Originally published on seoplus.com.
- Weighing Keywords For Relevance
- TF-IDF is Still Important But Problematic
- Three Things Keywords Fail to Consider
- Switching From Keywords to Entities
- What’s The Idea Behind Entities?
- Why Tracking Keywords is Losing Ground
- Topics: The New Keywords
- Topical Research vs Keyword Research
- Topical Clusters and The Future of SEO
For ages, search engine optimization (SEO) was all about keywords. They’re a fundamental component of the information retrieval process. So it made perfect sense: search engines crawled, found, indexed, and classified content based on keywords.
Keywords are still important to the way SEO works, albeit to a lesser degree. At its most primitive level, setting all other signals aside, the key signal for ranking the relevance of a piece of content is how often a keyword is mentioned on a page.
This is called “TF” or “term frequency.”
The problem is that stop words like “the,” “a,” “and,” “in,” and “at” appear far more frequently than meaningful terms. So a second step is needed: comparing how often a term appears in one document with how rarely it appears across other documents.
This eliminates stop words from being classified as more relevant.
Called “inverse document frequency” or “IDF,” this measure looks at how many documents in a body of work (like a site’s blog, for example) contain a given term: the more widespread the term, the lower its weight. Since stop words appear in virtually every document, their weight drops toward zero, effectively eliminating them.
The combined formula is called TF-IDF, meaning “term frequency multiplied by inverse document frequency,” and it gives a keyword a certain numerical weight.
First, it weighs how many times a keyword is mentioned on a single page. Then, by weighing the keyword across all the other pages (and comparing it to the weight of other keywords), the page with the heaviest frequency will score — and therefore rank — higher.
The formula is applied to understand the importance of a given document within a group of documents — like the relevance of a page regarding a certain keyword compared to the relevance of all the other pages within a website, for example.
This is the reason why, for the longest time, SEO was not only focused on keywords but also, as a best practice, prescribed a primary keyword (or “focused keyword”) for each page. Many of the SEO tools and plugins that content creators use still follow this guideline.
Where TF-IDF is particularly beneficial is when it is used to compare the weight of a keyword to others within a larger body of work — like the entire Internet. It was the basis of how search engines would rank certain pages from multiple websites. It still is to some extent.
The formula is more complicated than what I’m explaining here, but fundamentally it’s simple: when searching for a keyword, the page with the heavier weight will rank higher.
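The core weighting can be sketched in a few lines of Python. This is a deliberately simplified toy (the sample corpus, the +1 smoothing, and the natural log are my own choices), not the formula any search engine actually runs:

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Toy TF-IDF: term frequency within one document, scaled by
    how rare the term is across the whole corpus."""
    words = doc.lower().split()
    tf = Counter(words)[term] / len(words)
    docs_with_term = sum(term in d.lower().split() for d in corpus)
    # +1 smoothing is a common convention to avoid division by zero
    idf = math.log(len(corpus) / (1 + docs_with_term))
    return tf * idf

corpus = [
    "the best handmade soap for dry skin",
    "the history of the printing press",
    "the top rated laundry detergent brands",
]
page = corpus[0]

# "soap" appears in only one document, "the" in all three, so the
# stop word scores at or below zero while "soap" keeps real weight.
print(tf_idf("soap", page, corpus))
print(tf_idf("the", page, corpus))
```

Even this toy version shows the key behaviour: the ubiquitous stop word is crushed by its IDF term, while the distinctive keyword retains a positive weight.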
Since its inception in 1972, TF-IDF has remained a standard process for retrieving information and weighing its relevance. In fact, today it’s used in machine learning and the building of artificial intelligence (AI) by giving sophisticated software a base to work from.
However, by itself, this formula creates a problem.
Back in the primitive days of the Internet, increasing the frequency of keywords within a page was one of the easiest and most effective ways to get high rankings. This, in itself, is not a problem.
Adding an extra keyword here or there doesn’t hurt things.
But over time, as more website owners caught on, it created an opportunity for abuse. It fueled a game of one-upmanship, where keyword-stuffed pages that made little sense filled search engine results. It significantly impacted the user experience.
Even though TF-IDF determines relevance, it is not a great indicator of quality — the quality of the content in which the keyword appears. This pushed search engines to evolve to understand better what pages mean and measure their relevance beyond keywords.
Consequently, major Google algorithm updates are reducing the importance of (or better said, the reliance on) keyword-driven content, making TF-IDF ironically less relevant.
Keywords are weighted for relevance. But relevance alone is not enough. There are three significant limitations with TF-IDF when it comes to keywords:
TF-IDF focuses only on exact keywords. It doesn’t consider keyword variations, semantically related keywords, or the relationships between keywords. Moreover, it fails to look at synonyms and words that are connected by theme or context.
For example, say your keyword is “soap.” You have bathing soap, dishwasher soap, laundry soap, cleaning soap, shaving soap, etc. You also have different types of soap, such as handmade soap, perfumed soap, baby soap, medicated soap, glycerin soap, etc.
That problem is compounded when you consider different forms of soap, like liquid shampoo, laundry detergent, shower gel, shaving cream, bubble bath, etc. There’s also “soap operas,” the TV show “Soap,” “to soap” (to flatter), and SOAP (Simple Object Access Protocol).
The possibilities are nearly endless.
Without considering the context, such as keyword variations, their position in the content, the interrelatedness between keywords, and the ways keywords fit within and connect to the rest of the page, keywords alone can be quite misleading.
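A tiny hypothetical example makes that blindness concrete. Pure term matching gives a synonym-only page zero credit, even though it covers the same topic (the pages and the scoring function below are invented for illustration):

```python
def term_overlap(query, doc):
    """Count how many query terms literally appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

query = "handmade soap"
page_a = "our handmade soap uses natural glycerin"
page_b = "our artisanal cleansing bar uses natural glycerin"

# The exact-match page scores 2; the synonym-only page scores 0,
# even though a human reader would call both pages the same topic.
print(term_overlap(query, page_a))
print(term_overlap(query, page_b))
```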
TF-IDF aims to determine a keyword’s relevance but fails to consider how meaningful that relevance is. What if another keyword is more relevant but appears less frequently? What if another page with different keywords offers more value based on the same keyword?
In other words, TF-IDF fails to consider not only how keywords fit within the context of the page but also within the rest of the site. As a result, other pages may use the same keywords less frequently, yet their content may be more relevant and better suited to the topic.
A better way to say it is, relevance doesn’t equal importance.
A keyword may be deemed relevant because it’s used more frequently, but it doesn’t mean it increases the value of the page it’s on. Other keywords, let alone other pages, are possibly more relevant, regardless of the keyword’s frequency or TF-IDF score.
For example, a page with the keyword “medicated soap” has a higher relevance score when compared to other pages. But a less relevant page might discuss antibacterial, antifungal, and antimicrobial soap in greater depth, which may be more topically relevant.
Similar to the previous limitation, other keywords and keyword variations (and related keywords that are connected but dissimilar) may be found on other pages that may be more relevant to the user than the document TF-IDF is analyzing.
Finally, there’s the most important part of the equation. In fact, it’s not even part of the TF-IDF equation at all. And that’s the user.
TF-IDF can provide some idea of what the page may be about. But it may be too general or too specific for what the user wants it for. Also, TF-IDF may be comparing it to completely different pages that may serve different audiences or achieve different goals.
All these other pages, regardless of their intended purpose, are lumped together in the equation. For instance, TF-IDF may compare a keyword on a blog post to the keyword on a shopping page, an FAQ, or a page targeting an entirely different industry.
In essence, what the page is supposed to do plays an important role and should be considered in determining its relevance. But looking at keywords alone doesn’t consider the different types of pages that may not align with the user’s search intent.
Granted, some longer keywords do provide an idea of search intent, such as questions or keyword qualifiers (e.g., “How much is Ivory soap?” or “best medicated soap for dry skin”).
But relying on TF-IDF alone, it would simply remove stop words, extract different keywords (e.g., “medicated soap” or “dry skin”), and compare them to other keywords on different and possibly irrelevant pages — such as wholesale soap sales or soap-making tutorials.
Luckily, frequency is only one of many metrics that go into weighing keywords. Plus, other ranking factors play a role in determining how relevant a certain keyword is.
But in recent years, machine learning and a process called “natural language processing” (NLP) have been changing the way we look at keywords. Keywords still matter, but newer algorithms look at and try to understand them at a deeper, more nuanced level.
To address the three drawbacks mentioned earlier, search engine software now aims to understand keywords by learning how they’re used in natural language. It does so by considering the keywords’ usage, context, and relationship to each other.
Doing so allows the software to determine the subtleties and complexities of a keyword. While they are still keywords technically, in the world of NLP, they’re called “entities.”
Entities are becoming increasingly important, particularly in digital marketing, because they change the way we think of SEO. By giving keywords meaning that can change based on various factors, we can’t simply optimize content based on keywords alone any longer.
The meaning of a keyword can vastly differ based on its context, usage, and position in the content (i.e., its surrounding text and other on-page elements). It can even completely alter the meaning of the passage or content itself, giving it a completely different context.
As the adage goes, which applies perfectly to today’s SEO:
Content, without context, is meaningless.
Entities are keywords that, depending on the context, mean something specific. They can be “names,” “types,” or “attributes.” They can relate to other ideas. By grouping them, they help uniquely identify a certain person, object, or event.
For example, “antibacterial soap” is an entity while “hand sanitizer” is another. The latter may not be a different type of soap like the former is, but they’re still related. So, depending on the context, both are different types of “disinfectant cleansers” (another entity).
To take this example further, in an article about “COVID-19” (also an entity), “antibacterial soap” has a different meaning because of the context. That’s why “antibacterial soap,” as a keyword, doesn’t mean much. But as an entity, it has meaning, importance, and purpose.
Rather than thinking of keywords as linear or on a spectrum, you can think of them appearing in a group of ideas related to each other, much like a hub-and-spoke wheel. (Google calls them “branches” and “nodes,” and maps them together into what it calls a “Knowledge Graph.”)
Take “head” and “shoulders.” They’re two different keywords. “Head and shoulders” is also a different keyword. But “Head & Shoulders” is a brand name. It’s an entity. Also, “dandruff,” “shampoo,” and “anti-dandruff shampoo” are entities, too — and related to each other.
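One way to picture the difference is to model entities as typed, connected records rather than flat strings. The sketch below uses plain Python dictionaries with made-up types and relations; it is not Google’s actual Knowledge Graph schema:

```python
# Illustrative only: the entity types and relations below are toy
# data, not a real Knowledge Graph representation.
entities = {
    "Head & Shoulders": {
        "type": "Brand",
        "related": ["anti-dandruff shampoo"],
    },
    "anti-dandruff shampoo": {
        "type": "Product category",
        "related": ["dandruff", "shampoo"],
    },
    "dandruff": {
        "type": "Condition",
        "related": ["anti-dandruff shampoo"],
    },
}

# Unlike the bare keyword "head and shoulders", the entity record
# keeps its meaning (a brand) and its links to related entities.
print(entities["Head & Shoulders"]["type"])
```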
Entities are far more complex than what I’m conveying in this article, and they have far more implications and potential applications than I’m able to describe appropriately.
The important thing to keep in mind is that search behaviour has changed, like it or not. Consequently, SEO has evolved and continues to evolve. Therefore, it stands to reason that the practice and practitioners of SEO, and the recipients of SEO services, must change along with it.
Before, search queries generally consisted of single or multiple words strung together. But search results didn’t take into account the complexity of the human language.
Search results were all over the place. Getting the answer you wanted was mostly a game of chance. (Speaking of which, it’s probably one of the reasons why Google introduced the “I’m Feeling Lucky” button as a means of bypassing all the search result pages or SERPs.)
In an attempt to get better results, users would add more keywords to their queries. But this would often backfire: Google would look at individual keywords within the query and offer varying results for each one. It would then rank everything, regardless of relatedness.
But since the introduction of entity-oriented search (a term coined by former Google researcher Krisztian Balog), keywords are becoming inherently meaningless. Or better said, focusing on keywords and their rankings has become a meaningless pursuit.
Keywords are still important, such as for conducting research. But optimizing content with specific keywords — and trying to rank for them — is becoming increasingly outdated.
Today, with digital assistants, intelligent devices, and voice search giving users the ability to ask long, complex, and nuanced questions, chasing specific keywords is pointless.
Previously, the process was to optimize content around a popular keyword. Either that or start with the keyword and write content around it. But both approaches neglect what the keyword means or how it fits within the rest of the content, much less the site.
These techniques are susceptible to abuse, which tends to make content either unreadable or unusable. But, more importantly, they ignore the most important aspect of the content: the reader.
Instead of keywords, focus on topics.
Using the previous example, the term “anti-dandruff shampoo” is more than just a keyword. It’s a topic. It may be an umbrella topic about dandruff control, or it may be a subtopic in an article discussing the different types of “shampoos.” Either way, context is key.
By focusing on topics, vying for specific keywords becomes irrelevant. There’s no longer the need to do backflips trying to force irregular, unnatural, and often misspelled keywords into content just for the sake of trying to rank for them because they’re popular.
For example, trying to fit “best covid soap toronto” in a sentence, as is, is unnecessary, let alone mind-numbingly difficult. It’s even worse if it’s repeated several times on the page.
While keyword research is still important, it’s better to understand what topics the user wants to know about, what topics are already covered (or not), and what topics to write about that will also provide all the information needed to boost search signals.
The process comes down to these essential steps:
- First, find a pain point the reader is experiencing, a question they’re asking, or a certain topic they are interested in — one they might be researching themselves.
- Look at the results that come up and compare. For example, research existing types of content that cover the topic (or how they fail to cover it adequately).
- Above all, create a goal for the content, which makes sense to both the reader and the website. Then cover the topic with both the user and that goal in mind.
- Finally, using the topic as a guide (rather than a specific keyword as a goal), include related keywords, which will appear naturally and effortlessly throughout the piece.
Of course, additional steps can help, but they’re not mandatory. For instance, select the most commonly searched keywords that fall under the topic’s umbrella. Then, incorporate these keywords into headings and subheadings, as well as the page’s HTML.
But if the topic content reflects what the reader is actually interested in and searching for, everything else will fall into place naturally, including the right keywords. All that remains from an SEO perspective is to make sure the content is structured properly.
The relationship between topics and content is what’s important. Some topics are larger and more encompassing than others. The others may be subtopics or related topics.
A piece of content may cover an umbrella topic in two ways. First, it might break it down into subtopics on a single page to make sure it covers the topic thoroughly. Or it may involve multiple pieces of content, where each one covers distinct subtopics linked together.
Similar to the map of nodes and branches mentioned above with entities and the Knowledge Graph, topical clusters are like wheels — with hubs and spokes, too.
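That wheel can be expressed as simple data. The sketch below (with invented topic names) shows a hub page linking out to its spoke pages and each spoke linking back, which is the internal-linking pattern a topical cluster implies:

```python
# A minimal hub-and-spoke topical cluster as plain data; the topic
# names are illustrative, not recommendations.
cluster = {
    "hub": "anti-dandruff shampoo",
    "spokes": [
        "what causes dandruff",
        "medicated vs natural shampoo",
        "how often should you wash your hair",
        "choosing a shampoo for a dry scalp",
    ],
}

def internal_links(cluster):
    """The hub links out to every spoke, and every spoke links back
    to the hub, forming the wheel described above."""
    out = [(cluster["hub"], s) for s in cluster["spokes"]]
    back = [(s, cluster["hub"]) for s in cluster["spokes"]]
    return out + back

for source, target in internal_links(cluster):
    print(f"{source} -> {target}")
```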
Before, keywords were grouped and organized according to categories or silos. While this might still be useful for structuring content, it’s linear and not how topics (and the relationships between them) tend to work. Think of a mindmap, for example.
Where old-school SEO used to be based on keywords and how popular they are (to the search engine), today’s SEO is based on topics and how valuable they are (to the reader).
The former forced writers to create content for search engines first and users last. Now, it’s not only flipped around but also streamlined because the search engine is like the user.
In other words, machine learning algorithms are helping search engines become more sophisticated, learning and understanding language like a human does. Therefore, it no longer makes sense to write for the search engines. It’s unnecessary.
It’s like trying to translate something that will end up getting translated back anyway. So this process is not only redundant, but it can also be detrimental as things can get lost in translation.
Ultimately, it’s better to write for the user. Focus on delighting them. Give them the best possible content and the best possible experience when consuming that content.
If you write for your audience, you’re writing for Google, too. Do this, and, in turn, you’ll send all the right search signals. You’ll include keywords, earn links, gain mentions, build authority, generate word of mouth, rank well, and drive traffic. Naturally.
That’s new-school SEO.