Stop Words in Text Analysis — What They Are and When to Filter Them

Updated: May 2026

Stop words are the most frequent words in any language — and the least informative for content analysis. Understanding when and how to filter them is essential for getting meaningful results from any word frequency tool.

What are stop words?

Stop words are common words that function as grammatical connectors rather than semantic carriers. They appear so frequently in every text that including them in a frequency analysis floods the results with noise, obscuring the words that actually convey meaning.

In English, the core stop words are articles (a, an, the), prepositions (in, on, at, of, to, by, for, with), conjunctions (and, or, but, so, yet), pronouns (I, you, he, she, it, we, they, this, that), auxiliary verbs (is, are, was, were, be, been, have, had, do, does, did), and other high-frequency function words (not, also, only, just, very, much, many, some, any).

The term comes from information retrieval research of the 1950s and 1960s. Early retrieval and database systems removed stop words because indexing them was computationally expensive and analytically useless: you cannot distinguish documents from one another by the presence of "the".

Common English stop words — a reference list

The following words appear in most standard English stop word lists used for text analysis and natural language processing:

a · about · above · after · again · all · also · an · and · any · are · aren't · as · at · be · been · being · both · but · by · can · can't · cannot · could · couldn't · did · didn't · do · does · doesn't · doing · don't · down · during · each · few · for · from · further · get · got · had · hadn't · has · hasn't · have · haven't · having · he · he'd · he'll · he's · her · here · hers · herself · him · himself · his · how · i · i'd · i'll · i'm · i've · if · in · into · is · isn't · it · it's · its · itself · just · let's · me · more · most · mustn't · my · myself · no · nor · not · of · off · on · once · only · or · other · ought · our · ours · ourselves · out · own · same · shan't · she · she'd · she'll · she's · should · shouldn't · so · some · such · than · that · that's · the · their · theirs · them · themselves · then · there · there's · these · they · they'd · they'll · they're · they've · this · those · through · to · too · under · until · up · us · very · was · wasn't · we · we'd · we'll · we're · we've · were · weren't · what · what's · when · when's · where · where's · which · while · who · who's · whom · why · why's · will · with · won't · would · wouldn't · you · you'd · you'll · you're · you've · your · yours · yourself · yourselves
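As a rough illustration of what a stop word filter does to a frequency table, here is a minimal Python sketch. The `STOP_WORDS` set is only a small subset of the list above, and the names `tokenize` and `word_frequencies` are hypothetical helpers chosen for this example, not part of any particular tool.

```python
import re
from collections import Counter

# A small subset of the reference list, for illustration only.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "from",
    "in", "is", "it", "its", "of", "on", "or", "that", "the", "this", "to",
    "was", "were", "with",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def word_frequencies(text, filter_stop_words=True):
    """Count word tokens, optionally skipping stop words."""
    tokens = tokenize(text)
    if filter_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

sample = "The cat sat on the mat, and the dog sat by the door."

# Unfiltered: "the" leads the table.
print(word_frequencies(sample, filter_stop_words=False).most_common(3))
# Filtered: content words such as "sat" rise to the top.
print(word_frequencies(sample).most_common(3))
```

The tokenizer here only handles plain ASCII letters and straight apostrophes; real text with curly quotes or accented characters would need a more careful pattern.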

Why stop words dominate frequency counts

In typical English text, stop words account for 40–60% of all word tokens. A 1,000-word article may contain around 450 stop word instances. Without filtering, the top 20 positions in a frequency table will be occupied almost entirely by function words, and "the" alone typically appears 50–80 times per 1,000 words.
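To check the share of stop words in your own text, a sketch along these lines may help. The function name `stop_word_share` and the short `STOP_WORDS` set are assumptions made for this example; a fuller list will give a higher and more realistic ratio.

```python
import re

# Illustrative subset only; swap in a complete list for real measurements.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "from",
    "in", "is", "it", "its", "of", "on", "or", "that", "the", "this", "to",
    "was", "were", "with",
}

def stop_word_share(text):
    """Return the fraction of word tokens that are stop words (0.0 if empty)."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    if not tokens:
        return 0.0
    stop_count = sum(1 for t in tokens if t in STOP_WORDS)
    return stop_count / len(tokens)

sample = "The cat sat on the mat, and the dog sat by the door."
print(f"{stop_word_share(sample):.0%} of tokens are stop words")
```

In the sample sentence, 7 of the 13 tokens are stop words, which sits inside the 40–60% range described above even for a single short sentence.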

This dominance is not an accident of English. It reflects the structure of language itself: grammatical connectors are needed in nearly every sentence, while a given content word appears only where its topic comes up. This is Zipf's law in action: a small number of word types appear with extremely high frequency, while the vast majority appear rarely.
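In its idealized form, Zipf's law says a word's frequency is roughly proportional to 1 divided by its rank, so rank multiplied by count stays approximately constant. The sketch below shows that relationship on synthetic tokens built to fit the pattern exactly. Real text only approximates it, and `rank_frequency` is a hypothetical helper name.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count, rank * count) rows, most frequent first."""
    counts = Counter(tokens).most_common()
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(counts, start=1)
    ]

# Synthetic tokens built to follow an idealized Zipf distribution:
# the word at rank r appears 60 / r times.
synthetic = (
    ["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15 + ["in"] * 12
)

for row in rank_frequency(synthetic):
    print(row)  # the last value in each row is rank * count
```

Running the same function on the tokens of a long real document should show the product drifting rather than holding constant, but the steep drop from the first few ranks is usually easy to see.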

Stop word ratios vary by genre. Conversational text and fiction, which lean heavily on pronouns and auxiliary verbs, tend toward higher stop word density than noun-heavy academic or legal writing. This means comparing stop word percentages across genres requires calibration.

When to filter stop words — and when not to

Filtering stop words is the right choice for most content and SEO analysis tasks. But there are specific contexts where keeping them is analytically important:

  • Filter stop words when: you want to identify the topics, concepts and key terms in a document, audit keyword density, or compare vocabulary across documents on the same subject.
  • Keep stop words when: you are studying writing style, sentence structure, or authorial voice. Stop word ratios and patterns are distinctive to individual writers, and forensic linguists use them for authorship attribution. Also keep them when working with very short texts (tweets, headlines, product names), where function words carry structural meaning.
  • Keep stop words when: you are searching for specific phrases that include them. "To be or not to be" is a meaningful phrase where every word matters. The Flowfiles tool filters stop words from frequency analysis but does not remove them from your original text.
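The phrase case can be sketched in a few lines of Python: counting runs of consecutive tokens only works if stop words stay in the token stream. The helpers `phrase_counts` and `count_phrase` are hypothetical names for this example and do not describe how any particular tool handles phrases.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def phrase_counts(text, n):
    """Count every run of n consecutive tokens, stop words included."""
    tokens = tokenize(text)
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

def count_phrase(text, phrase):
    """Count occurrences of an exact phrase, matching on whole tokens."""
    words = tokenize(phrase)
    if not words:
        return 0
    return phrase_counts(text, len(words))[" ".join(words)]

line = "To be, or not to be, that is the question."
print(count_phrase(line, "to be or not to be"))
print(count_phrase(line, "to be"))
```

If the stop words were stripped before counting, the only tokens left in this line would be content words such as "question", and the phrase could not be matched at all.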

Domain-specific stop words: customizing your filter

Standard stop word lists are built for general-purpose analysis. In specialized domains, there are additional high-frequency words that function like stop words for your specific context — they appear everywhere but convey no differential information.

Examples by domain:

  • Legal documents: "pursuant", "whereas", "herein", "thereof", "notwithstanding" — present in every contract but not diagnostic of its subject.
  • Academic writing: "research", "study", "results", "data", "analysis" — in a literature review, these words appear so broadly that they don't help distinguish one paper from another.
  • Marketing copy: "free", "best", "great", "easy", "powerful" — subjective adjectives that appear in virtually every piece of marketing and reveal nothing specific.
  • Customer support tickets: "please", "thank", "hello", "team", "issue" — structural filler that appears in every ticket regardless of its actual problem.

The Flowfiles counter lets you add custom stop words in a comma-separated field. This makes domain-specific filtering fast and repeatable without needing to configure a separate tool or write code.
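If you do want to reproduce this kind of filter in code, one possible approach is to parse the comma-separated entries and merge them with a base list. This is a sketch under assumed names (`BASE_STOP_WORDS`, `parse_custom_stop_words`, `build_stop_words`), not a description of how the Flowfiles counter is implemented.

```python
# Illustrative base list; a real one would be much longer.
BASE_STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def parse_custom_stop_words(field):
    """Split a comma-separated string into a set of lowercase words."""
    return {part.strip().lower() for part in field.split(",") if part.strip()}

def build_stop_words(custom_field, base=BASE_STOP_WORDS):
    """Combine the base list with user-supplied, domain-specific entries."""
    return set(base) | parse_custom_stop_words(custom_field)

# Stray spaces, mixed case and a trailing comma are tolerated.
legal_stop_words = build_stop_words("pursuant, whereas, Herein, thereof,")

tokens = ["pursuant", "to", "the", "lease", "herein",
          "tenant", "shall", "pay", "rent"]
kept = [t for t in tokens if t not in legal_stop_words]
print(kept)
```

Saving the comma-separated string alongside a project makes the same domain filter easy to reapply to later documents.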

Stop words in search engines: a brief history

Early search engines such as AltaVista, and Google in its first years, aggressively filtered stop words from queries and document indexes to save storage and improve speed. A search for "to be or not to be" might have been processed as little more than "not".

Modern search engines have moved away from blanket stop word removal. Google now indexes function words because they carry meaning in combination with other terms — "how to" versus "what is", "best for" versus "best in". Query understanding is sophisticated enough to parse grammatical structure, not just count content words.

For content optimization purposes, this evolution means you should write naturally rather than artificially avoiding stop words. What search engines want — and what readers want — is coherent, readable prose. Stop word filtering in analysis tools is about data clarity, not about telling you to write without them.