HTML to Text for SEO

Updated: May 2026

Converting HTML to plain text is a core technique in SEO content auditing. It lets you measure keyword density on actual visible content, calculate the text-to-code ratio, detect thin content, and prepare pages for readability scoring — all based on what search engine text parsers extract, not the raw HTML source.

Why SEO Analysis Requires Plain Text, Not Raw HTML

Search engines do not rank HTML. They extract text from HTML, tokenise it, and rank the resulting word sequences. When you analyse a page's keyword distribution using raw HTML, you measure the wrong thing — you include class names, attribute values, JSON-LD data, and JavaScript code as if they were readable content, which inflates word counts and distorts keyword frequencies.

By stripping the page to plain text first, you analyse something much closer to the text a search engine's parser works from. Keyword density calculations, readability scores, and word counts then reflect the visible content rather than the markup around it.
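
As a rough illustration of the stripping step, the sketch below uses Python's standard-library html.parser to collect text nodes while skipping script, style, noscript, and template elements. The class and function names are hypothetical, and this is not how the Flowfiles tool or any search engine is implemented; it only shows the general idea under those assumptions.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of non-visible elements."""

    # Assumption: these tags never contribute readable content.
    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Attribute values and tag names never reach this method,
        # so class names and IDs are excluded automatically.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent block elements do not fuse,
        # then collapse whitespace runs. This can split a word that is
        # wrapped mid-word in inline tags, which is rare in practice.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Hidden elements (for example, content concealed with CSS) are not detected by this approach, so the output is an approximation of visible text rather than an exact match.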

Text-to-HTML Ratio: What It Is and Why It Matters

The text-to-HTML ratio is the proportion of a page's raw source that consists of visible text, as opposed to markup. A page with 1,000 characters of readable text and 9,000 characters of HTML markup has a 10% text-to-HTML ratio.

Text-to-HTML ratio is not a confirmed ranking factor, and Google representatives have said publicly that it is not used for ranking. It is still a useful diagnostic. Very low ratios (under 10%) suggest that a page is mostly code and navigation with little actual content, which is one symptom of a "thin content" page. No official threshold exists, but pages with higher ratios tend to carry richer, more substantive content and less markup bloat.

Common causes of low text-to-HTML ratios:

Pages built with JavaScript frameworks (React, Angular, Vue) that embed large inline scripts or serialised state
Base64-encoded images embedded in the HTML source
Excessive inline CSS or style attributes from WYSIWYG editors
Multiple tracking scripts and analytics libraries in the <head>
Sparse content pages (category pages, tag archives) with little body text

The Flowfiles HTML tag stripper shows you both input and output character counts, making it easy to calculate the ratio: output characters ÷ input characters.
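
The same division can be written as a small helper. This is a minimal sketch; the function name and arguments are hypothetical, and it assumes you already have both the raw source and the stripped text as strings.

```python
def text_to_html_ratio(html_source: str, visible_text: str) -> float:
    """Return visible text as a percentage of the raw source length."""
    if not html_source:
        # Avoid dividing by zero on an empty document.
        return 0.0
    return 100 * len(visible_text) / len(html_source)
```

With the earlier example of 1,000 characters of text in a 10,000-character source, the helper returns 10.0.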

Keyword Density Analysis on Plain Text

Keyword density is the percentage of a page's words that are a specific keyword or phrase. The formula is:

Keyword density = (keyword occurrences ÷ total word count) × 100

To calculate this accurately:

Strip the HTML to plain text using the Flowfiles tool.
Paste the plain text into a word counter (the Flowfiles Word Counter shows keyword density).
Search for your target keyword to count its occurrences.
Divide by total word count and multiply by 100.

Measuring density on raw HTML produces misleading numbers because class names, IDs, and attribute values that happen to contain your keyword inflate the count. On plain text, every occurrence is a genuine in-content use.
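
The formula above can be sketched in a few lines of Python. The tokeniser here is a deliberately simple assumption (lowercase letters, digits, and apostrophes), and the function names are hypothetical; a dedicated word counter may split words differently and report slightly different totals.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_density(text: str, keyword: str) -> float:
    """Keyword density = (keyword occurrences / total word count) * 100."""
    words = tokenize(text)
    phrase = tokenize(keyword)
    if not words or not phrase:
        return 0.0
    n = len(phrase)
    # Slide a window of the phrase length over the text and count exact matches.
    occurrences = sum(
        1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase
    )
    return 100 * occurrences / len(words)
```

For multi-word phrases this sketch still divides by the total word count, matching the formula as stated; some tools normalise phrases differently.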

Readability Scoring for SEO

Readability scores (Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog, SMOG) estimate how easy a text is to read from sentence length and word complexity, usually measured by syllable count. All of these formulas require plain text input: they count words, sentences, and syllables, which have no meaning in HTML markup.

Google has not confirmed readability scores as a ranking factor, but content that ranks well is often also easy to read. High-ranking pages on competitive queries tend to score well on readability metrics, which likely reflects that they are written for humans rather than stuffed with keywords for bots.

Strip your page HTML, paste the result into a readability scorer, and check whether the score matches your audience's reading level. A technical product documentation page can afford a higher grade level; a consumer-facing blog post should target Flesch-Kincaid grade 8 or lower.
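
For a sense of what a readability scorer computes, the sketch below applies the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas to plain text. The syllable counter is a crude vowel-group heuristic and an assumption of this example, so scores will differ somewhat from tools that use pronunciation dictionaries. All function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    # Heuristic: count vowel groups, then drop a trailing silent "e".
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str):
    """Return Flesch Reading Ease and Flesch-Kincaid Grade Level, or None."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        "reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "grade_level": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }
```

Running this on raw HTML would treat tag names and attribute values as words and would find almost no sentence boundaries, which is why the text must be stripped first.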

Thin Content Detection

Thin content, meaning pages with insufficient or low-quality text, was the target of Google's Panda algorithm, which has since been folded into the core ranking systems, and it remains a focus of core updates. Detecting thin content at scale requires extracting the visible text of each page and measuring its word count. Pages with fewer than 300 words of body content (excluding navigation, footer, and boilerplate) are candidates for consolidation, expansion, or removal. The 300-word figure is a common rule of thumb, not an official threshold.

The Flowfiles tool shows word count in real time. Strip the page HTML, subtract the approximate word count of boilerplate (header nav, footer, sidebars), and compare the remainder to your minimum content threshold.
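
When auditing many pages, the subtraction and comparison can be scripted. The sketch below assumes you already have the stripped text for each URL and a single estimated boilerplate word count for the site template; the function name, the input shape, and the 300-word default are illustrative assumptions.

```python
def find_thin_pages(pages: dict[str, str], boilerplate_words: int,
                    min_body_words: int = 300) -> list[tuple[str, int]]:
    """Return (url, body_word_count) for pages below the threshold.

    pages maps each URL to its stripped plain text.
    boilerplate_words is the estimated word count of the shared
    header, footer, and sidebar text.
    """
    thin = []
    for url, text in pages.items():
        # Whitespace splitting is a rough word count, good enough for triage.
        body_words = max(len(text.split()) - boilerplate_words, 0)
        if body_words < min_body_words:
            thin.append((url, body_words))
    # Thinnest pages first, so the worst candidates surface at the top.
    return sorted(thin, key=lambda item: item[1])
```

A flat boilerplate estimate is only an approximation; templates that vary by page type would need a separate estimate per template.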

Preparing Content for Structured Data Extraction

Some SEO tools extract named entities (people, places, organisations, products) from page content for knowledge graph enrichment. These entity extractors require clean text input. Feeding raw HTML causes the extractor to identify HTML tag names and attribute values as entities, polluting the output.

Strip your HTML first, then run the plain text through your entity extractor (for example the Google Natural Language API, spaCy, or an OpenAI model) for more accurate results.

Frequently Asked Questions

Should I include navigation text when measuring keyword density?

No. Navigation menus, footers, and sidebars are usually templated content shared across the site. Including them distorts the density figure, because they add words, and sometimes keyword occurrences, that are not part of the page's own content. Strip the full page, then manually remove boilerplate sections from the output before calculating density on the body content.

Does Google read JSON-LD schema as content?

No. JSON-LD embedded in <script type="application/ld+json"> blocks is parsed as structured data, not as page content for ranking. The Flowfiles tool excludes script blocks from the output, so your word count reflects only readable content.

What is a good text-to-HTML ratio for SEO?

There is no universally accepted target, but ratios above 25–30% are generally considered healthy for content pages. For e-commerce category pages, 15–20% is common. Very low ratios (under 10%) combined with sparse body text are worth investigating for thin content issues.