User:Gnomingstuff/AI experiment

From Wikipedia, the free encyclopedia

Loosely following "Why Does ChatGPT Delve So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models" by Tom S. Juzek and Zina B. Ward: comparing human-written Wikipedia articles with probably-AI-written ones to see which words have spiked in usage.

Methodology

Sourcing text

"AI" text is selected via the following methods:

  • Drafts: Articles in draftspace, written by a single contributor in consecutive edits, containing the words "large language model" (which catches the AfC decline tag).
  • Articles by a single contributor, confirmed to be AI-generated via disclosures.
  • Articles tagged as AI-generated, where the initial version of the article was AI-generated.
  • Text from userpages/sandboxes that seems unambiguously intended as article drafts.

All articles are manually reviewed by me to make sure the tag isn't bullshit. The earliest "complete" version was used in all cases.

"Human" articles are selected from:

  • Random non-stub articles
  • Articles tagged with peacock, promotional, or essay tags, as AI writing often has this tone by default
  • Articles that contain the aforementioned AI "focal words," to counteract over-weighting on them (and because we don't need extra evidence AI uses them)
  • The ~3-5 articles I primarily wrote, because why not

Only article versions prior to mid-2022 are used, to be near-certain the text isn't AI.

Finally, the text includes excerpts from articles where:

  • The article contains a diff adding several uninterrupted paragraphs of new AI-generated text, where the original text predates 2022 (minor copyedits don't count unless they substantially transform the text).
  • A pre-2022 version of the article contains a comparable passage of text in length, prose density, and subject matter.

Processing text

All text is sorted into folders by category, to make including/excluding types of text easier in the future. The creation date is appended to the filename, in anticipation of running this on only AI articles from certain years/spans of LLM release dates.

The wikitext of the AI and human articles is lightly cleaned to remove AfC boilerplate and some non-indicative syntax. No other manipulation was done, such as removing punctuation, normalizing capitalization, or tokenizing beyond splitting on whitespace. This causes some data issues (see Limitations) but is deliberate, as there are known wiki-syntactical differences between AI and human articles.
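
A minimal sketch of that cleaning pass (the regexes and the AfC template pattern are illustrative assumptions, not the exact ones used):

```python
import re

def clean_wikitext(text: str) -> list[str]:
    """Lightly clean wikitext, then split on whitespace only.

    Punctuation, capitalization, and wiki markup such as [[Dog]]
    are deliberately left intact, so [[Dog]] and Dog remain
    distinct tokens.
    """
    # Strip AfC submission templates (illustrative pattern only)
    text = re.sub(r"\{\{AfC submission[^}]*\}\}", " ", text)
    # Strip HTML comments, which AfC tooling often inserts
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.DOTALL)
    # Tokenize by whitespace alone; no other normalization
    return text.split()

tokens = clean_wikitext("{{AfC submission|d=v}} The [[Dog]] barked.")
# tokens == ["The", "[[Dog]]", "barked."]
```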

Then, the text of the AI and human articles is compared with code similar to the code in the original study, extended to also analyze two-, three-, and four-word phrases. Non-indicative or coincidental syntax (e.g., "2025", "'|access-date=November'") is excluded from results.

Code

Adapted from brute_force_div_py on the study authors' GitHub. Full list of excluded tokens here.

Code not guaranteed to be efficient, Pythonic, or good.

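
In outline, the comparison does something like the following. This is a simplified sketch of the frequency-ratio idea, not the adapted brute_force_div_py script itself; the smoothing constant and ranking are my own assumptions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-word phrases (n=1 gives single tokens)."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overrepresented(ai_tokens, human_tokens, n=1, smooth=1.0):
    """Rank n-grams by how much more frequent they are in the AI corpus
    than in the human corpus. Add-one style smoothing keeps phrases
    absent from one corpus from dividing by zero."""
    ai, hu = ngram_counts(ai_tokens, n), ngram_counts(human_tokens, n)
    ai_total, hu_total = sum(ai.values()), sum(hu.values())

    def ratio(gram):
        ai_rate = (ai[gram] + smooth) / (ai_total + smooth)
        hu_rate = (hu[gram] + smooth) / (hu_total + smooth)
        return ai_rate / hu_rate

    return sorted(set(ai) | set(hu), key=ratio, reverse=True)

# Toy corpora for illustration
ai = "the city serves as a vibrant testament to history".split()
hu = "the city has a long history".split()
ranked = overrepresented(ai, hu, n=2)
```

The real run does this over ~2,000,000 tokens per side and for n = 1 through 4, with the excluded-token list filtered out of the results.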

Results

See here (on a separate page, as the list is long).

Note: This is a somewhat small dataset so far (~2,000,000 tokens each for human and AI text), so there are undoubtedly many flukes. Nevertheless, you will see some old friends.

Only the top of each list is shown. The bottom half (words more common in human text) is not currently useful data as there are too many false negatives and syntax coincidences. Sorting by AI frequency may be most indicative.

More granular comparisons

These results compare smaller slices of the dataset against one another:

AI snippets vs. human snippets

A smaller dataset composed of excerpts rather than entire articles. This includes both AI-copyedited paragraphs versus their older counterparts, and entirely new AI sections versus human sections roughly comparable in length and substance.

Results here.

Newer AI text vs. older AI text

Text from newer chatbots (August 8, 2025 and later) versus text from older chatbots (January 1, 2023 to August 7, 2025). The cutoff is chosen to be one day after the release of GPT-5.

Results here.
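
Since the creation date is appended to each filename (see Processing text), the newer/older split can be a simple filename filter. A sketch, where the exact naming scheme is an assumption for illustration:

```python
from datetime import date

# Cutoff: one day after GPT-5's release (August 7, 2025)
GPT5_CUTOFF = date(2025, 8, 8)

def created_on(filename: str) -> date:
    """Parse the appended creation date.

    Assumes names like 'drafts/Some_Article_2025-09-14.txt';
    the naming scheme here is illustrative, not the actual one.
    """
    stem = filename.rsplit(".", 1)[0]
    y, m, d = stem.rsplit("_", 1)[1].split("-")
    return date(int(y), int(m), int(d))

def is_newer_chatbot_era(filename: str) -> bool:
    return created_on(filename) >= GPT5_CUTOFF
```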

Newer AI text vs. human text

The newer chatbot dataset compared to the full human dataset.

Results here.

Older AI text vs. human text

The older chatbot dataset compared to the full human dataset.

Results here.

Grokipedia vs. human Wikipedia

Comedy is now legal on Wikipedia.

Results here.

Limitations

Out the wazoo:

  • I am not a statistician, a computational linguist, or any kind of academic at all.
  • Wikipedia articles cover a much broader range of subject matter and vocabulary than research abstracts, making overlap between texts less likely and meaning any one article may disproportionately influence the results.
  • The training set is small compared to the original study.
  • The training set is not unbiased:
      • The use of draft articles in particular introduces many issues:
          • Since drafts are usually deleted after 6 months, most of the samples are from 2025, which over-represents newer chatbots.
          • AI-tagged drafts are almost all rejected AfC submissions, and thus may not represent AI-generated text that "passes" better.
          • Drafts likely over-represent common subject matter that gets rejected for non-notability, which means associated nouns will be over-represented. For example, the occurrence of the word "AI" is artificially high, in part because many deleted AfC drafts are about AI startups, and because AI is talked about more in 2025 than in 2022.
          • Since "clean" AfC submissions by humans (i.e., not abandoned userspace drafts) are scarce due to the 2022 cutoff, text may be listed as characteristic of AI that is actually just characteristic of following AfC guidelines, or of AI generating text specifically with the AfC process in mind.
      • Most of the AI-tagged samples are new articles, and thus do not represent AI text added to subjects that already had articles.
      • All of the AI-tagged samples are text that wasn't deleted, and thus may under-represent AI-generated text that is older or more obvious, since such text is more likely to have been noticed and removed.
      • Many of the tagged articles were tagged by me, which means I may have over-focused on certain tells or subjects.
      • All the human articles were either manually chosen or curated.
  • Most human articles have accumulated much more wiki-syntax, templating, and such than the newer AI articles, which means the human token set may be functionally much smaller (as more of the tokens are irrelevant). It also means some words may be under-represented in the human text, since [[Dog]] and Dog are two different tokens.
  • Wiki-syntax has changed over time and differs depending on how articles were created. For example, the Wikipedia:Article Wizard imposes its own formatting and has probably changed over the years. Some of this is distinctive enough to be cleaned out, but not all of it.

Policy note

I believe this is acceptable per WP:NOTALAB: "Research that analyzes articles, talk pages, or other content on Wikipedia is not typically controversial, since all of Wikipedia is open and freely usable."

For fun: The words/phrases with the largest decreases

Don't take this seriously: it is statistically not useful and may be explained by wiki boilerplate, dataset over- or under-representation, or other coincidences.

  • One word: Also,; apparently; bought; decided; got; hence; households; however; huge; kinds; knows; median; normally; poopoo; probably; probability; quite; reproduction; said,; Sometimes,; stopped; Then,; thrown; till; whilst; worst
  • Two words: A few; and when; any of; are given; be of; but there; came into; do with; few of; following year; granted to; have no; in fact; is able; is called; many different; more of; or not; per cent; results of; right of; saying that; so as; started to; the great; the place; There were; to deal; went to; were able; will have
  • Three words: all of the; all over the; be able to; consists of a; had to be; in charge of; in order to; in the year; is a very; is the only; It has a; large number of; made up of; more and more; rest of the; result of the; set up a; similar to the; that there is; the help of; to go to; would be the
  • Four words: a part of the; as a result of; at the beginning of; be one of the; can also be used; for the first time; in addition to; in an effort to; in the presence of; is also known as; is located on the; on the basis of; the rest of the; There is also a; under the age of
