GEO & AI Search Engines: How AI Sees Content Quality

Explore how AI Search Engines evaluate Content Quality for Generative Engine Optimization. Learn the metrics that make your content rank higher.

Aug 12, 2024 - 10:00

When we were researching our 6 Generative Engine Optimization Strategies, some natural questions came up: Can AI Search Engines give a score to sites? What are the metrics?

Apparently we were doing something right: our pages were showing up as search results for marketing and Artificial Intelligence (AI) topics in Perplexity, ChatGPT, and Gemini. The result? Initially, I think it was just dumb luck, but we were on the right track.

What is Generative Engine Optimization?

Generative Engine Optimization (GEO) is all about making your content more visible and effective on AI-driven platforms. The goal? Ensure your content appears authoritative, trustworthy, and capable of answering questions conversationally.

GEO is different from SEO; while the two share similarities, they serve different purposes. SEO, or Search Engine Optimization, focuses on optimizing websites to rank higher in traditional search engine results pages (SERPs).

"Generative models, particularly those fine-tuned for specific tasks, play a crucial role in optimizing content visibility on AI-driven platforms by aligning content with user expectations" (Wang, Zhou, Wang, & Peng, 2024).

GEO, on the other hand, is focused on optimizing your content so it shows up in AI Search Engine outputs. The timeline for seeing results can vary. For example, non-internet-access tools like ChatGPT rely on periodic knowledge updates. Search-based tools, like Perplexity, update more frequently, sometimes even daily.

What Are the Criteria for GEO?

We broke down the criteria for GEO into 10 main categories: High-Quality Content, User Intent and Experience, Natural Language Processing (NLP) Optimization, Structured Data and Schema Markup, Voice Search Optimization, Mobile Optimization, Content Structure and Readability, Authority and Trustworthiness, Technical SEO, and finally Engagement & Social Signals. The categories overlap, but given how detailed the overall equation is, breaking it up this way makes it more digestible.

Now here’s the kicker. In the other article, we focused on traditional SEO strategies - measuring bounce rates, impressions, and engagement. But AI Search Engines don't measure these the same way.

"Generative Engines, in contrast to traditional search engines, remove the need to navigate to websites by directly providing a precise and comprehensive response" (Aggarwal et al., 2023).

Instead, AI Search Engine models rely on alternative metrics like Position-Adjusted Word Count and Subjective Impressions to assess the visibility and relevance of your content. These metrics are designed specifically for GEO, not for traditional SEO measures like bounce rate or CTR. Radford et al. (2019) highlighted the potential of language models to perform a variety of tasks without explicit supervision, making them valuable tools for GEO - which, in this context, means creating the pathways needed for success.

Below is an example of how AI Search Engines, like Perplexity, can calculate an engagement score without direct access to SEO tools.

\(\text{Engagement Score}_{\text{AI Search}} = \omega_1 \times \text{Content Quality} + \omega_2 \times \text{Query Refinement Rate} + \omega_3 \times \text{Link Reference Frequency} + \omega_4 \times \text{Sentiment Analysis}\)

For this article, we are zeroing in on the High-Quality Content aspect of GEO. What makes your content high-quality or low-quality in the eyes of AI? What equations do AI Search Engines use to calculate those metrics? And most importantly, how can we, as marketers, game that process?

High-Quality Content: AI Perception and Evaluation

To understand how AI Search Engines perceive content for GEO, we need to break down the criteria used by AI-driven models and algorithms. AI grades high-quality content based on the following metrics:

  1. Relevance and User Intent
  2. Comprehensiveness
  3. Accuracy and Freshness
  4. Readability and User Experience

Since there is a lot of information, we'll break each part down into an explanation of what it means in context, examine the relevant equations and math, and finally discuss what makes a good score. At the very end of the article, we'll look at a live example to see how it scores and what we can take away from it.

"The quality of content is increasingly being determined by how well it integrates natural language processing techniques, particularly in how it engages with user queries." (Vinutha & Padma, 2023).

Things to Keep in Mind:

You’ll see mentions of “the keyword”. The AI Search Engine grabs it from your article title, the meta title, and the meta description. It identifies the primary keywords and checks whether the article aligns with them. Finally, it verifies that the body content uses these keywords in a relevant and user-focused manner.

A “good” score differs between metrics because each one looks at a different aspect of content quality. Some metrics, like accuracy, need to be more precise in the context of user intent, while others, like relevance, allow for more flexibility.

Relevance and User Intent

Relevance is about how closely content aligns with the search query or topic at hand. It’s all about making sure that the content directly addresses what the user is looking for. Intent, as the name suggests, refers to what the user hopes to achieve with their search. We can generally break down intent into three categories:

  • Informational: The user is looking for information, like “How to write a blog post.”
  • Navigational: The user is looking to find a specific website or page, for instance “Amazon customer support.”
  • Transactional: The user is looking to take a specific action, such as “buy dancing shoes for large feet online.”
For example, if a user is looking for “best coffee shops to work in Boston,” a relevant and intent-matching piece of content would be a guide that lists top-rated coffee shops in Boston, with reviews, addresses, and contact details. This is both relevant to the search query and satisfies the user intent of finding a place to work.

As highlighted by Liang-Ching and Kuei-Hu (2023), "[It] allows for a structured assessment of keyword relevance, which is critical for aligning content with user intent in GEO." Separately, Asai et al. (2021) explores how different query types influence the effectiveness of content retrieval, a crucial consideration in optimizing for generative engines.

How AI Search Engines look at Relevance

The Relevance Score equation looks at how well a piece of content aligns with the target keywords and user intent. Here’s a breakdown of the formula:

\(\text{Relevance Score} = \sum_{i=1}^{n} (\text{Keyword Density}_i \times \text{TF-IDF}_i \times \text{Semantic Relevance}_i)\)

  • Keyword Density: How often a specific keyword appears in the content relative to the total word count. High density signals relevance, but overdoing it can get the content marked as spam.
  • TF-IDF (Term Frequency-Inverse Document Frequency): This metric compares how often a keyword appears in the document against how often it appears across the web. Keywords that appear frequently in a document but rarely elsewhere are likely more relevant.
  • Semantic Relevance: Using Natural Language Processing (NLP), this measures how well the content contextually aligns with the keyword. For example, if the long-tail keyword is “Sleeping at work,” the content should naturally discuss strategies, trends, and practices rather than just mentioning the keyword.

Keyword Density

Keyword density is the percentage of times a keyword appears on a webpage relative to the total number of words on the page. The formula is: \( \text{Keyword Density} = \left( \frac{\text{Number of Times Keyword Appears}}{\text{Total Number of Words}} \right) \times 100 \)

Example: In the article "I ❤️ Perplexity", the Keyword Density score is 1.02%. The primary keyword “Perplexity” appears 4 times in 394 words, so the calculation is 4/394 × 100 ≈ 1.02%.

The optimal range is a keyword density of 1-3%. More than 3% starts to look spammy, while less than 1% can make the content seem less authoritative.
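To make the arithmetic concrete, here's a minimal Python sketch of the density check (assuming a single-word keyword and simple whitespace tokenization; real engines tokenize far more carefully):

```python
def keyword_density(text: str, keyword: str) -> float:
    """Keyword density: (keyword occurrences / total words) * 100."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words) * 100

# Roughly the "I ❤️ Perplexity" example: 4 mentions spread across ~392 words
sample = ("perplexity " + "filler " * 97) * 4
density = keyword_density(sample, "perplexity")
print(f"{density:.2f}%")  # ~1.02% - inside the 1-3% sweet spot
```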

"By leveraging structured approaches to keyword assessment, marketers can set new standards in content creation, particularly in emerging industries where content strategies are still evolving." (Liang-Ching & Kuei-Hu, 2023).

TF-IDF

TF-IDF is a statistical measure used to evaluate the importance of a word relative to a collection of documents (corpus). It identifies how relevant a keyword is to a specific document within a larger set. It uses an equation like:

TF-IDF = TF x IDF

Breaking it down:

  • TF stands for Term Frequency. It’s the number of times a keyword appears in a document, normalized by the document’s length. The formula is:

\(\text{TF} = \frac{\text{Number of Times Term Appears}}{\text{Total Number of Terms}}\)

  • Inverse Document Frequency (IDF): It shows how common or rare a keyword is across all documents.

\(\text{IDF} = \log\left(\frac{\text{Total Number of Documents}}{\text{Number of Documents Containing the Term}}\right)\)
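Putting the two halves together, here's a minimal sketch of TF-IDF over a toy corpus (illustrative only; production systems use tuned variants such as BM25):

```python
import math

def tf(term: str, doc: list[str]) -> float:
    """Term Frequency: occurrences of the term over total terms in the document."""
    return doc.count(term) / len(doc)

def idf(term: str, corpus: list[list[str]]) -> float:
    """Inverse Document Frequency: log(total docs / docs containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

corpus = [
    "generative engine optimization for marketers".split(),
    "traditional search engine optimization basics".split(),
    "how ai search engines rank content".split(),
]
term = "generative"
# High TF-IDF: the term is frequent in this document but rare in the corpus
print(tf(term, corpus[0]) * idf(term, corpus))
```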

"A comprehensive approach to content creation should incorporate semantic text analysis to cover all relevant subtopics and provide a thorough exploration of the subject matter." (Evans, Latifi, Ahsan, & Haider, 2024).

Semantic Relevance

Semantic Relevance refers to how well the content of a document aligns with the meaning and context of a search query. It often uses advanced NLP techniques like BERT (Bidirectional Encoder Representations from Transformers).

So how do AI Search Engines determine semantic relevance?

  • Contextual matching: Models like BERT help understand the context in which keywords are used.
  • Similarity scores: Calculates the cosine similarity between the vectors of the query and the document content. The formula is:

\(\text{Cosine Similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\|B\|}\)
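As a quick sketch, here's the cosine similarity computed with NumPy over stand-in embedding vectors (a real pipeline would get these vectors from a model like BERT):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (||A|| * ||B||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" for a query and a document
query_vec = np.array([0.12, 0.87, 0.33, 0.45])
doc_vec = np.array([0.10, 0.80, 0.40, 0.52])
print(f"{cosine_similarity(query_vec, doc_vec):.3f}")  # near 1.0 = semantically close
```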

"BERT is a highly versatile tool for a wide range of NLP applications, improving the performance of many benchmarks by over 20%" (Devlin, Chang, Lee, & Toutanova, 2018).
[Image: Live image of a Marketer controlling the GEO output]

What is a Good Score?

As mentioned, relevance is closely tied to user intent. A slightly lower relevance score can still be acceptable if the content partially matches what the user is looking for. However, too low a score might indicate a complete mismatch, which would significantly reduce the content’s usefulness.

Considering keyword and semantic flexibility, AI Search Engine models can understand variations in language and synonyms, meaning that relevance isn’t always about exact matches. This flexibility translates into a weighted curve, with the components weighted differently (Keyword Density and Semantic Relevance at 30% each, TF-IDF at 40%) to calculate a final score.

| Type     | Score    | Density              | TF-IDF    | Relevance  |
|----------|----------|----------------------|-----------|------------|
| Optimal  | 100%     | 3%                   | 2.5       | 1.0        |
| Good     | 75 - 99% | 2% - 3%              | 2.0 - 2.5 | 0.9 - 1.0  |
| Moderate | 50 - 74% | 1% - 1.9% or 3% - 4% | 1.5 - 1.9 | 0.7 - 0.89 |
| Poor     | < 50%    | < 1% or > 4%         | < 1.5     | < 0.7      |

The weighted approach helps AI determine the relevance of your content more flexibly, allowing for variations in language and intent.
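Here's a hypothetical sketch of that weighted calculation. The 30/40/30 weights come from the article above, but the linear normalization against the table's "Optimal" row is our own assumption, since the exact curve isn't published:

```python
def relevance_score(density_pct: float, tf_idf: float, semantic: float) -> float:
    """Weighted relevance on a 0-100 scale.

    Normalizes each component against the table's 'Optimal' row
    (3% density, 2.5 TF-IDF, 1.0 semantic) -- an assumption on our part.
    """
    density_norm = min(density_pct / 3.0, 1.0)
    tf_idf_norm = min(tf_idf / 2.5, 1.0)
    semantic_norm = min(semantic, 1.0)
    return (0.30 * density_norm + 0.40 * tf_idf_norm + 0.30 * semantic_norm) * 100

print(f"{relevance_score(2.0, 1.5, 0.9):.0f}%")  # ~71%, 'Moderate' territory per the table
```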

Comprehensiveness

Comprehensiveness is about how thoroughly a piece of content covers a topic. It goes beyond just addressing the main query; it also provides detailed information about related subtopics, potential follow-up questions, and offers in-depth insights.

Example: If you’re writing about “digital marketing strategies,” a comprehensive article would cover SEO, content marketing, social media, email campaigns, and analytics. It might also include case studies, tools for implementation, and trends to watch. The goal is to leave the reader with a deep understanding of the topic, covering everything they might need to know.

How can AI say "This is Comprehensive"?

The Comprehensiveness Score is a metric designed to assess how thoroughly your content covers its topic. It evaluates both the breadth and depth of the work and uses the following equation:

\(\text{Comprehensiveness Score} = \sum_{j=1}^{m} (\text{Coverage of Subtopics}_j \times \text{Content Length}_j)\)

Similarly to before, let’s look into the breakdown of the equation:

  1. Coverage of Subtopics: This measures how well the content covers all relevant subtopics related to the main topic. High coverage gives a broad understanding of the topic.
  2. Content Length: This refers to the overall length of the content, which indirectly reflects its depth. Longer content often (but not always) indicates a more detailed exploration of the topic.

How Does AI Determine All the Relevant Subtopics?

AI Search Engines determine “relevant subtopics” for a topic by analyzing a large corpus of content, identifying common themes, and using machine learning models trained on user interactions, search queries, and expert-authored content. The equation would come to be something like this:

\(\text{Coverage of Subtopics} = \frac{\text{Number of Covered Subtopics}}{\text{Total Number of Relevant Subtopics}}\)

In general, the steps involved would be:

  1. Corpus Analysis: AI models scan and categorize a vast amount of content to identify recurring subtopics related to a primary topic.
  2. NLP: Techniques like topic modeling and semantic clustering help AI Search Engines break down the primary topic into logical subtopics based on linguistic patterns and contextual relevance.
  3. Search Query Data: AI algorithms analyze search queries and user behavior data to then determine which subtopics users commonly associate with the main topic. This helps in identifying subtopics that users find important.
  4. Expert Content: AI Search Engines often reference high-authority sources, such as academic papers or industry expert articles, to identify critical subtopics. These sources are then used as benchmarks - literally becoming trendsetters.

This approach allows AI to dynamically adjust what counts as “all relevant subtopics” based on the landscape of each field. Even if the field or topic is new, as more content is written and the field grows, AI Search Engines refine and expand their criteria, rewarding trendsetters.

"Neural networks have proven effective in optimizing keyword relevance by learning from vast amounts of web search data" (Serrano, 2019).

What about Content Length?

Content length is calculated from the total number of words in a piece of content. But how do we get a score from that? As with subtopics, it comes down to comparing an expected length against the actual length.

So for instance, if the expected comprehensive length for a topic is 3,000 words, and the actual length is 2,000 words, then the normalized content length score might be:

\(\text{Content Length Score} = \frac{L_{\text{actual}}}{L_{\text{expected}}} = \frac{2,000}{3,000} = 0.67\)

Where does the expected length come from? It is derived from the article type and the average length that type of article is expected to have.
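Here's a minimal sketch of the full Comprehensiveness Score, combining subtopic coverage with the normalized length (capping the length ratio at 1.0 is our assumption):

```python
def comprehensiveness_score(covered: int, relevant: int,
                            actual_words: int, expected_words: int) -> float:
    """Coverage of Subtopics x Content Length, on a 0-100 scale."""
    coverage = covered / relevant                     # e.g. 8 / 10 = 0.8
    length = min(actual_words / expected_words, 1.0)  # e.g. 2000 / 3000 ≈ 0.67
    return coverage * length * 100

# 8 of 10 relevant subtopics covered, 2,000 words against an expected 3,000
print(f"{comprehensiveness_score(8, 10, 2000, 3000):.0f}%")  # ~53%
```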


What is a Good Score?

Comprehensiveness reflects how thoroughly a topic is covered. Content that is only moderately comprehensive might miss key aspects of the topic, leading to lower user satisfaction. Because of that, this metric has a higher threshold, demanding a more thorough treatment of the subject.

| Type     | Score    | Subtopics Coverage | Content Length vs. Expected Length |
|----------|----------|--------------------|------------------------------------|
| Optimal  | 100%     | 90% - 100%         | ≥ 100%                             |
| Good     | 80 - 99% | 75% - 89%          | 80% - 99%                          |
| Moderate | 60 - 79% | 50% - 74%          | 60% - 79%                          |
| Poor     | < 60%    | < 50%              | < 60%                              |
Tip: When citing, use sources with a comprehensiveness score of 80% or higher, as they cover the topic more thoroughly. This will reflect better on your accuracy and citations!

Accuracy and Freshness

Accuracy is about how correct and reliable the information provided is. Content should be fact-checked, well-researched, and supported by credible sources.

Freshness is about how up-to-date content is. Fresh content is regularly updated to reflect the latest developments, trends, or data.

Example: If you are creating a blog post about “Local food banks”, accuracy means including correct details like locations and working hours. Freshness is just as important here - the information must be current to make sure the content is still valid and useful.

"Maintaining accuracy and freshness in content is critical, especially in rapidly evolving fields, to ensure ongoing relevance and authority." (Vinutha & Padma, 2023).

What's the Equation for Accurate and Fresh?

The Accuracy and Freshness metric evaluates the reliability and timeliness of content. This metric combines several factors - backlink quality, reference count, and update frequency - to give an overall authority score.

Authority Score = (Backlink Quality + Reference Count + Update Frequency)

Let’s break it down:

  • Backlink Quality: Measures the credibility and authority of websites that link back to the content.
  • Reference Count: How many authoritative sources are cited within your content.
  • Update Frequency: How often the content is updated to reflect new information or trends. Regularly updated content is viewed as more relevant and trustworthy.

Each of these components is typically scaled from 0 to 1, with 1 representing the most authoritative sites, citations, and updates.

Backlink Quality

AI Search Engines tend to use algorithms like PageRank, TrustRank, or more complex AI-driven models to analyze the quantity, quality, and context of backlinks.

This process usually involves crawling web pages to gather links and then evaluating the linking domains’ reputation using metrics like Domain Authority (DA) or Page Authority (PA). It can be represented by the following equation:

\(\text{PR}(A) = (1-d) + d \times \sum_{i=1}^{n} \frac{\text{PR}(B_i)}{L(B_i)}\)

  • PR(A): The PageRank of your page, representing the backlink quality.
  • d: Damping factor, usually set to 0.85, representing the probability that a user continues clicking on links.
  • PR(B_i): The PageRank of the pages linking to your page.
  • L(B_i): The number of outbound links on page B_i.
Note: Higher-PR pages linking to your content contribute more to its backlink quality, while a larger number of outbound links on the linking page dilutes the contribution of each individual link.
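For intuition, here's a toy iterative PageRank in Python following the formula above (a tiny illustrative graph, nothing like a production crawler):

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85,
             iterations: int = 20) -> dict[str, float]:
    """Iterative PageRank: PR(A) = (1 - d) + d * sum(PR(B_i) / L(B_i))."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(
                pr[src] / len(targets)
                for src, targets in links.items() if page in targets
            )
            for page in links
        }
    return pr

# Toy link graph: each key lists the pages it links out to
graph = {
    "your-page": [],
    "blog-a": ["your-page"],
    "blog-b": ["your-page", "blog-a"],
}
print(pagerank(graph)["your-page"])  # ~0.40, boosted by two inbound links
```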

Reference Count

AI Search Engine models analyze the text for citations and then cross-reference them with known databases of reputable sources (like academic journals or industry publications). The presence of multiple citations from recognized sources boosts your score. Brin and Page (1998) emphasize the importance of citations in enhancing the visibility and credibility of content, a principle that remains relevant in the context of GEO.

\(\text{Reference Count Score} = \sum_{i=1}^{n} (\text{Reference Value}_i \times \text{Relevance Weight}_i)\)

What does this mean for you?

  • Citation Relevance: This is how relevant the cited source is to the content’s topic. This can be further weighted based on the context and the matching key phrases between the content and the cited source. Simply put, avoid adding random citations - each one should be relevant to your topic.
  • Source Authority: Evaluates the credibility and authority of the source being cited. Higher authority sources contribute more to the score.

The process for finding this is similar to the calculation for Comprehensiveness: NLP analysis, cross-referencing, and scoring.

Note: Did you use to write your papers at 2 am before an 8 am deadline, go to Wikipedia to grab sources about the topic you just bullsh... struggled to write, and then try to staple them onto your essay? Don't do that here. It will mark your scores down.


How to Get a Perfect Score of 1?

  • High Citation Relevance: Every citation must fully align with the content. For example, if you’re writing an article about Mental Health: When to Use AI vs Seeking Human Help, all references should be directly related to AI applications in healthcare, and the citations should include keywords and phrases that are central to the topic. Full relevance means the citations score 1.0.
  • Source Authority: All your sources should be top-tier and from recognized authorities. Each source would then also score 1.0 for high authority.

Update Frequency

AI Search Engines use web crawlers to track changes in content over time by comparing snapshots of web pages from different dates. Content that shows regular updates, particularly in fast-changing fields, is given a higher freshness score. The calculation can be expressed as:

\(\text{Update Frequency Score} = \sum_{i=1}^{n} \left(\text{Change Significance}_i \times \frac{1}{\text{Time Since Last Update}_i}\right)\)

  • Change Significance: This measures the impact of the update. Larger updates (e.g., adding new sections or significant content revisions) have higher significance scores, while minor edits (e.g., fixing typos) have lower scores.
  • Time Since Last Update: The time elapsed since the content last changed. The more recent the update, the higher the score - the relationship is inversely proportional.

Given how we've been unpacking these equations, you can see there's room for deeper exploration - a "Rabbit Hole" opportunity. For instance, Change Significance in AI models can be represented as:

\(\text{Change Significance}_i = (\text{Content Impact}_i \times \text{Section Weight}_i) + (\text{Frequency of Change}_i \times \text{Importance of Change}_i)\)

  • Content Impact: This is the extent of the update as mentioned above. Significant changes like adding a new section or updating key information score higher. It’s further calculated by the Change Volume times the Change Type Weight:

\(\text{Content Impact}_i = \text{Change Volume}_i \times \text{Change Type Weight}_i\)

  • Section Weight: Different parts of a page carry different weights. Changes in the main body text are weighted more heavily than changes in the footer.

    Section Importance evaluates the significance of a given section on the page; it is divided by the total importance, which is the sum of the importance scores of all sections.

\(\text{Section Weight}_i = \frac{\text{Section Importance}_i}{\text{Total Importance of All Sections}}\)

  • Frequency of Change: This is how often similar updates have occurred. Frequent but small updates carry less significance than a single, substantial update. This is calculated by:

\(\text{Frequency of Change}_i = \frac{\text{Number of Changes Over Time}_i}{\text{Time Period}_i}\)

  • Importance of Change: This evaluates the criticality of the update. Changes that affect user experiences or address things like security updates tend to score higher. This is shown by:

\(\text{Importance of Change}_i = \text{Criticality Weight}_i \times \text{User Engagement Impact}_i\)
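Pulling those pieces together, here's a hypothetical sketch of how the freshness math could compose (the variable names and sample values are ours, purely for illustration):

```python
def change_significance(change_volume: float, change_type_weight: float,
                        section_weight: float, change_frequency: float,
                        change_importance: float) -> float:
    """(Content Impact x Section Weight) + (Frequency of Change x Importance of Change)."""
    content_impact = change_volume * change_type_weight
    return content_impact * section_weight + change_frequency * change_importance

def update_frequency_score(changes: list[tuple[float, float]]) -> float:
    """Sum of Change Significance x (1 / days since that update)."""
    return sum(significance / max(days, 1.0) for significance, days in changes)

# A major rewrite of the main body a week ago, plus a typo fix yesterday
major = change_significance(0.9, 1.0, 0.5, 0.2, 0.8)  # ~0.61
minor = change_significance(0.1, 0.2, 0.1, 0.9, 0.1)  # ~0.09
print(round(update_frequency_score([(major, 7), (minor, 1)]), 2))  # ~0.18
```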

What is a Good Score?

Accuracy is important, especially in fields where misinformation can have serious consequences. Freshness also plays a big role in making sure content is up-to-date and reliable. Because AI Search Engines often choose content from authoritative sources, the threshold for accuracy and freshness is set higher, so that only the most reliable and current content is deemed “good”.

| Type     | Score    | Backlink Quality | Reference Count | Update Frequency |
|----------|----------|------------------|-----------------|------------------|
| Optimal  | 100%     | 1.0              | 1.0             | 1.0              |
| Good     | 75 - 99% | 0.8 - 0.99       | 0.8 - 0.99      | 0.8 - 0.99       |
| Moderate | 60 - 74% | 0.6 - 0.79       | 0.6 - 0.79      | 0.6 - 0.79       |
| Poor     | < 60%    | < 0.6            | < 0.6           | < 0.6            |

Example: If 5/5 of the backlinks you use are perfect (1.0), your references come in at 4/5 (0.8), and you score 5/5 (1.0) on update frequency, you would have an authority score of 1.0 + 0.8 + 1.0 = 2.8. The normalized score would be:

\(\text{Normalized Score} = \frac{2.8}{3} \times 100 = 93.3\%\)


Readability and User Experience

Readability is about how easy it is for the audience to read and understand your content. This includes, but is not limited to, use of clear language, appropriate sentence structure, and the organization of content.

User Experience covers how well the content engages the reader, with the ease of navigation, the presence of multimedia elements, and overall design and layout of the page.

Example: An article about Unleashing the Power of BAB Framework with ChatGPT should be easy to read, using simple language to break down complex concepts. Readability involves using bullet points, headings for different sections, and short paragraphs to avoid overwhelming the reader. For user experience, the article could include infographics or videos to boost engagement, making the content more accessible and useful.

"The hierarchical modeling of texts, focusing on both syntactical and semantic structures, is essential in improving the readability and user experience of content." (Liu & Bian, 2024).

Making Equations Readable

AI Search Engines have interesting ways to estimate what engagement is like for a piece of content. The equation behind Readability and User Experience comes down to the following:

Readability Score = (Flesch Reading Ease Score + Engagement Metrics)

Which leads to the question: “Hey, AI doesn’t typically have access to SEO tools like Google Analytics to see engagement metrics. What does it do instead?” It turns out they use a different equation to estimate engagement. I’ll break down how AI Search Engines like Perplexity can do it, versus models like ChatGPT.

"Traditional SEO methods are not directly applicable to Generative Engines... new techniques are needed" (Aggarwal et al., 2023).

Let’s first break down the Flesch Reading Ease before getting to the Engagement Metrics.

Flesch Reading Ease Metric

This metric measures how easy a text is to read. Scores range from 0 to 100, with higher scores indicating easier readability.

\(\text{Flesch Reading Ease} = 206.835 - 1.015 \times \left(\frac{\text{Total Words}}{\text{Total Sentences}}\right) - 84.6 \times \left(\frac{\text{Total Syllables}}{\text{Total Words}}\right)\)

So what goes into this equation?

  1. Constant 206.835: This baseline constant starts the score at a high point before the complexity factors are subtracted, keeping the final score roughly within the 0 to 100 range.
  2. Sentence Length Component: This uses a coefficient of 1.015, which is derived from empirical studies to adjust how much sentence length affects readability.
  3. Syllable Complexity Component: This uses a coefficient of 84.6, which like the earlier coefficient, is based on research and adjusts how much word complexity impacts the readability score.
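Here's a rough, self-contained implementation in Python (the syllable counter is a crude vowel-group heuristic of our own, so expect small deviations from dedicated readability tools):

```python
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: count vowel groups, dropping a silent trailing 'e'."""
    word = word.lower()
    if word.endswith("e") and len(word) > 2:
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

# Very simple text can score above 100; dense jargon can dip below 0
print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 1))
```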

Flesch Reading Ease Score Cheat Sheet:

The Flesch Reading Ease Score has the following breakdown:

  • 0 - 29: Very Confusing. This is suitable for a postgraduate level, where the text is dense and difficult to understand without specialized knowledge.
  • 30 - 49: Difficult. Suitable for a college level and requires more advanced reading skills.
  • 50 - 59: Fairly Difficult. Suitable for a 10th to a 12th grade level.
  • 60 - 69: Standard. Suitable for an 8th to 9th grade level. It’s comprehensible to the average reader.
  • 70 - 79: Fairly Easy. Suitable for a 7th grade level.
  • 80 - 89: Easy. Suitable for a 6th grade level.
  • 90 - 100: Very Easy. Suitable for a 5th grade level. Text is easily understood by children aged 11 or younger.

Engagement Metrics

As mentioned, AI Search Engines don’t typically have SEO-tool access. As a result, they use alternative methods to estimate engagement for a piece of content. These methods include:

  1. User Interaction Data: AI models can estimate engagement from indirect signals such as how often links are clicked, content-sharing patterns, and time-on-content inferred from the model’s interaction history.
  2. Content Popularity Signals: AI Search Engines analyze social media shares, comments, or backlinks as indicators of content engagement. Think of the social plugins from the early 2000s that showed like and share counts - AI can read those.
  3. Behavioral Patterns: By analyzing user queries and follow-up actions, AI Search Engines can estimate which content resonates more with users.
  4. Content Structure and Quality: AI might infer engagement based on how well content is structured, its readability, and other signals that typically mean higher engagement.

These methods allow AI Search Engine models to make reasonable estimates of engagement without direct access to SEO tools. This leads us down two different paths: on one side, live AI Search Engine tools like Perplexity; on the other, AI models like ChatGPT. The following are simplified equations for how each type of tool can estimate engagement.

Note: ω denotes weights that can be adjusted based on the importance of each factor.

AI Search Engines like Perplexity

\(\text{Engagement Score}_{\text{AI Search}} = \omega_1 \times \text{Content Quality} + \omega_2 \times \text{Query Refinement Rate} + \omega_3 \times \text{Link Reference Frequency} + \omega_4 \times \text{Sentiment Analysis}\)

  • Content Quality: Measures the clarity, organization, and readability of content.
  • Query Refinement Rate: Lower refinement rates indicate higher initial content relevance.
  • Link Reference Frequency: Counts how often content is referenced or linked.
  • Sentiment Analysis: Analyzes user feedback and comments for positive or negative sentiment.
AI Models like ChatGPT

\(\text{Engagement Score}_{\text{AI Model}} = \frac{L+S}{\omega_1 \times \text{Content Structure}} + \omega_2 \times \text{User Feedback Analysis} + \omega_3 \times \text{Contextual Relevance}\)

  • Content Structure: Analyzes how well content is organized and easy to follow
  • User Feedback Analysis: Assesses sentiment and frequency of any available user interactions (e.g., likes and comments)
  • Contextual Relevance: Evaluates how well the content matches user intent across various contexts.
  • L: Average sentence length.
  • S: Syllable count per word.
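Both formulas boil down to a weighted combination of normalized signals. Here's a generic sketch of the Perplexity-style version (the signal values and weights below are invented for illustration; per the note above, the weights are tunable):

```python
def engagement_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of 0-1 engagement signals; weights should sum to 1.0."""
    return sum(weights[name] * value for name, value in signals.items())

# Hypothetical 0-1 signal estimates for a single article
# ("query_refinement" here is 1 - refinement rate, so higher = better)
signals = {"content_quality": 0.85, "query_refinement": 0.70,
           "link_references": 0.60, "sentiment": 0.90}
weights = {"content_quality": 0.4, "query_refinement": 0.2,
           "link_references": 0.2, "sentiment": 0.2}
print(f"{engagement_score(signals, weights):.2f}")  # 0.78
```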

So what is a Good Score?

Readability and user experience directly affect how easily users can consume content. That said, there is some tolerance for variance because different people have different preferences for complexity and content structure.

For example, scientific papers might be lower on readability, but are still high-quality for their intended audience.

| Type     | Score    | Flesch Reading Ease | Engagement |
|----------|----------|---------------------|------------|
| Optimal  | 100%     | 1.0                 | 1.0        |
| Good     | 80 - 99% | 0.8 - 0.99          | 0.8 - 0.99 |
| Moderate | 60 - 79% | 0.6 - 0.79          | 0.6 - 0.79 |
| Poor     | < 60%    | < 0.6               | < 0.6      |

The normalized scores for the Flesch Reading Ease and Engagement are added together, divided by 2, and then multiplied by 100 to determine the final score.

Example Analysis

Let’s look at an example of how content can be judged High-Quality or Low-Quality. I’m going to use an article we wrote, 6 Generative Engine Optimization Strategies for Marketers in 2024, for this example.

  1. Relevance and User Intent - Focuses on relevant keywords, and matches informational intent.
    Keyword Density: 0.02 (2%)
    TF-IDF: 1.5
    Semantic Relevance: 0.9
    Normalized Score: 58% or basically a Moderate score
  2. Comprehensiveness - Covers multiple aspects, including keyword strategies, structured data, and content updates.
    Subtopics covered: 8/10
    Content Length: 2000 words (normalized to scale of 1)
    Normalized Score: 80% is a good score
  3. Accuracy and Freshness - cites authoritative sources with factual up-to-date information. Recent publications noted.
    Backlink Quality: 1.0
    Reference Count: .8
    Update Frequency: 1.0
    Normalized Score: 93.3% is a good score
  4. Readability and User Experience - Uses clear language, bullet points, subheadings. Uses multimedia elements and a clean layout
    Flesch Reading Ease Score: 70/100 (.7)
    Engagement metrics: .8
    Normalized Score: 75% is a moderate score

So what actionable steps can I take? To improve readability, I can simplify the language and include more multimedia elements. For comprehensiveness, I can expand on subtopics, adding case studies or more examples over time - which would also boost the freshness score. As for relevance, I could optimize keyword use while avoiding keyword stuffing. I was graded at 2%, which is 40 keyword mentions in 2,000 words. If the optimal share of primary keywords for an AI Search Engine is 3%, then 20 more keyword mentions would improve the relevance score.

Takeaways

Our research into why we succeeded in getting our content used as an authoritative source by AI Search Engines led to some surprising results - and it leaves room for potential abuse. So let's do a quick recap, since it was a lot of information, before explaining how someone can take advantage of all this.

  1. Relevance and User Intent: There is a balance between keyword density and semantic relevance. Even small adjustments in keyword use can significantly impact your score. AI Search Engines find the primary keywords by looking at your meta title and meta description, then comparing them to your actual title.
  2. Comprehensiveness: You want to cover all relevant subtopics, adding case studies, examples, and other elements to make your content stand out.
  3. Accuracy and Freshness: Regular updates and solid sources are essential for a higher score. They make your content more valuable to your audience and help it rank better in AI-driven search results.
  4. Readability and User Experience: The average reader's comprehension corresponds to a Flesch Reading Ease score of 60 to 69. That's like 8th to 9th grade.

Beyond giving you a better understanding of how AI Search Engines judge Quality, this can be taken advantage of in at least two ways:

  • If you're in a new industry, you can create more subtopics than people would expect and set that as the standard. This is especially true if there isn't much content out there yet. If person B then tries to create content about Topic A and misses a few of those subtopics, their score will take a hit.
  • Or take content length. Person A and Person B have reached optimal scores on everything - they are seeing eye to eye, all the checks and boxes match - except for content length. The longer piece would be seen as more authoritative and therefore rank higher.

That all said, this article covers just one aspect of the metrics behind Generative Engine Optimization: Content Quality. There are 9 more categories to go through, each with its own equations, breakdowns, and takeaways for marketers.

We'll be releasing detailed guides on these aspects over the next few weeks. Also, we'll be launching a free tool that does the math and equations for you to analyze your content and give actionable steps to improve.

So if you're interested in staying updated, hit that subscribe button to receive notifications when the new content drops.

References

  1. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9. https://api.semanticscholar.org/CorpusID:213123106
  2. Asai, A., Yu, X. V., Kasai, J., & Hajishirzi, H. (2021). One question answering model for many languages with cross-lingual dense passage retrieval. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 5967-5982. https://api.semanticscholar.org/CorpusID:256868474
  3. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
  4. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117. https://api.semanticscholar.org/CorpusID:7587743
  5. Nakano, R., Kim, B., McCann, B., Radford, A., Sastry, G., Mishkin, P., & Sutskever, I. (2022). WebGPT: Browser-assisted question-answering with human feedback. OpenAI. https://api.semanticscholar.org/CorpusID:247594830
  6. Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2023). GEO: Generative Engine Optimization. arXiv. Retrieved from https://arxiv.org/pdf/2311.09735.pdf
  7. Zhao, F., Li, Y., Hou, J., & Bai, L. (2022). Improving question answering over incomplete knowledge graphs with relation prediction. Neural Computing & Applications, 18. https://doi.org/10.1007/s00521-021-06726-8
  8. Serrano, W. (2019). Neural Networks in Big Data and Web Search. MDPI Data, 4(1), 7. https://doi.org/10.3390/data4010007
  9. Luan, Y., He, L., Ostendorf, M., & Hajishirzi, H. (2018). Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. arXiv. https://arxiv.org/abs/1808.09602
  10. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. https://arxiv.org/abs/1810.04805
  11. Papagiannis, N. (2020). Effective SEO and content marketing: The ultimate guide for maximizing free web traffic. Wiley.
  12. Vinutha, M. S., & Padma, M. C. (2023). Insights into search engine optimization using natural language processing and machine learning. International Journal of Advanced Computer Science and Applications, 14(2). https://doi.org/10.14569/IJACSA.2023.0140211
  13. Evans, M. C., Latifi, M., Ahsan, M., & Haider, J. (2024). Leveraging semantic text analysis to improve the performance of transformer-based relation extraction. Information, 15(2), 91. https://doi.org/10.3390/info15020091
  14. Liang-Ching, C., & Kuei-Hu, C. (2023). An extended AHP-based corpus assessment approach for handling keyword ranking of NLP: An example of COVID-19 corpus data. Axioms, 12(8), 740. https://doi.org/10.3390/axioms12080740
  15. Wang, Y., Zhou, L., Wang, Y., & Peng, Z. (2024). Leveraging pretrained language models for enhanced entity matching: A comprehensive study of fine-tuning and prompt learning paradigms. International Journal of Intelligent Systems, 2024. https://doi.org/10.1155/2024/1941221
  16. Liu, F., & Bian, Q. (2024). Hierarchical model rule based NLP for semantic training representation using multi level structures. Informatica, 48(7), 29-38. https://doi.org/10.31449/inf.v48i7.5347
  17. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
  18. Joachims, T. (2002). Optimizing search engines using clickthrough data. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 133-142. https://doi.org/10.1145/775047.775067
  19. Wang, J., & Zhu, J. (2017). A study on the evaluation of comprehensiveness in web search. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 355-364. https://doi.org/10.1145/3077136.3080758
  20. Robertson, S. E., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
  21. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1028.
  22. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
  23. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 3111-3119.
  24. Princeton University. (2022). Generative Engine Optimization. Retrieved from http://arks.princeton.edu/ark:/88435/dsp011z40kx12c
  25. Taherdoost, H., & Madanchian, M. (2023). Artificial intelligence and knowledge management: Impacts, benefits, and implementation. Computers, 12(4), 72. https://doi.org/10.3390/computers12040072
  26. Taherdoost, H., & Madanchian, M. (2023). Artificial intelligence and sentiment analysis: A review in competitive research. Computers, 12(2), 37. https://doi.org/10.3390/computers12020037
  27. Sabir, A., Ali, H. A., & Aljabery, M. A. (2024). ChatGPT tweets sentiment analysis using machine learning and data classification. Informatica, 48(7), 103-112. https://doi.org/10.31449/inf.v48i7.5535