User:Gnomingstuff/AI experiment
From Wikipedia, the free encyclopedia
Loosely following "Why Does ChatGPT Delve So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models" by Tom S. Juzek and Zina B. Ward: comparing human-written wiki articles with probably-AI-written ones to see which words have spiked in usage.
Methodology
Sourcing text
"AI" text is selected via the following methods:
- Drafts: articles in draftspace, written by one contributor in consecutive edits, containing the words "large language model" (which catches the AfC decline tag)
- Articles by one contributor, confirmed to be AI-generated via disclosures
- Articles tagged as AI-generated, where the initial version of the article was AI-generated
- Text from userpages/sandboxes that seems unambiguously intended as article drafts
All articles are manually reviewed by me to make sure the tag isn't bullshit. The earliest "complete" version is used in all cases.
"Human" articles are selected from:
- Random non-stub articles
- Articles tagged with peacock, promotional, or essay tags, as AI writing often has this tone by default
- Articles that contain the aforementioned AI "focal words," to counteract over-weighting on them (and because we don't need extra evidence AI uses them)
- The ~3-5 articles I primarily wrote, because why not
Only article versions prior to mid-2022 are used, to be near-certain the text isn't AI.
Finally, the text includes excerpts from articles where:
- The article contains a diff adding several uninterrupted paragraphs of new AI-generated text, and the original text predates 2022 (minor copyedits don't count unless they substantially transform the text).
- A pre-2022 version of the article contains a passage comparable in length, prose density, and subject matter.
Processing text
All text is sorted into folders by category, to make including/excluding types of text easier in the future. The creation date is appended to the filename, in anticipation of running this on only AI articles from certain years/spans of LLM release dates.
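Since the creation date rides along in the filename, slicing the corpus by LLM era can be done with a small filter. Below is a sketch under assumptions: the `created_after` helper is hypothetical, and the date is assumed to be appended as a final M-D-YY token (matching the style of the model folder names); the actual filename convention may differ.

```python
from datetime import datetime

GPT_5_RELEASE = datetime(2025, 8, 7)

def created_after(filename, cutoff):
    # Assumed format: date appended as the last space-separated token
    # of the stem, e.g. 'Some article 8-8-25.txt' (hypothetical)
    stem = filename.rsplit('.', 1)[0]
    token = stem.split()[-1]
    try:
        created = datetime.strptime(token, '%m-%d-%y')
    except ValueError:
        return False  # no parseable date: exclude by default
    return created > cutoff
```

Files could then be routed into a "newer chatbot" slice with `created_after(name, GPT_5_RELEASE)`.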
The wikitext of the AI and human articles is lightly cleaned to remove AfC boilerplate and some non-indicative syntax. No other manipulation was done, such as removing punctuation, normalizing capitalization, or tokenizing beyond splitting on whitespace. This causes some data issues (see Limitations) but is deliberate, as there are known wiki-syntactical differences between AI and human articles.
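For illustration, a light cleaning pass of this kind might look like the sketch below. The `clean_wikitext` name and the specific patterns are hypothetical, not the ones actually used; everything else (punctuation, capitalization, brackets) passes through untouched, matching the no-manipulation rule above.

```python
import re

def clean_wikitext(text):
    # Hypothetical pattern: drop non-nested AfC templates like {{AfC submission|...}}
    text = re.sub(r'\{\{AfC[^}]*\}\}', '', text, flags=re.IGNORECASE)
    # Drop HTML comments left behind by AfC tooling
    text = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)
    # Collapse runs of blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()
```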
Then, the text of the AI and human articles is compared with code similar to that in the original study, extended to also analyze two-, three-, and four-word phrases. Non-indicative or coincidental syntax, e.g. "2025" or "'|access-date=November'", is excluded from results.
Code
Adapted from brute_force_div.py on the study authors' GitHub. Full list of excluded tokens here.
code not guaranteed efficient, pythonic, or good
Extended content
# [GS 10-15-25 - not my comment] in parts AI written
from scipy.stats import chi2_contingency
from itertools import tee, count
from banned import banned_words, banned_dates, banned_syntax, banned_phrases, banned_date_phrases, banned_flukes, banned_dubious
import glob, os
import cProfile

# Occurrences per X
NORMALIZATION_THRESHOLD = 1000000

GPT_4O_PATH = 'GPT-4o 5-13-24'
O1_PATH = 'OpenAI o1 12-5-24'
GPT_5_PATH = 'GPT-5 8-7-25'
GPT_5_1_PATH = 'GPT-5-1 11-12-25'

# https://napsterinblue.github.io/notes/python/internals/itertoolbs_sliding_window/
def sliding_window(iterable, n=2):
    iterables = tee(iterable, n)
    for iterable, num_skipped in zip(iterables, count()):
        for _ in range(num_skipped):
            next(iterable, None)
    return zip(*iterables)

def calculate_frequency(freq_text, freq_text_two_words, freq_text_three_words, freq_text_four_words, paths):
    total_words = 0
    total_twowords = 0
    total_threewords = 0
    total_fourwords = 0
    for path in paths:
        for filename in glob.glob(os.path.join(path, '*.txt')):
            with open(os.path.join(os.getcwd(), filename), 'r', encoding='utf-8') as f:
                lines = f.readlines()
            for line in lines:
                line = line.strip()
                # optimization, more so for 2/3/4-word phrases
                if line == '':
                    continue
                tokens = line.split()
                total_words = count_words(freq_text, tokens, total_words, filename)
                # each n-gram size accumulates its own running total
                total_twowords = count_phrases(freq_text_two_words, tokens, total_twowords, 2, filename)
                total_threewords = count_phrases(freq_text_three_words, tokens, total_threewords, 3)
                total_fourwords = count_phrases(freq_text_four_words, tokens, total_fourwords, 4)
    return total_words, total_twowords, total_threewords, total_fourwords

def count_words(freq_dict, tokens, total_words, filename=None):
    for token in tokens:
        if token in banned_words or token in banned_dates or token in banned_syntax or token in banned_flukes or token in banned_dubious:
            continue
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
        total_words += 1
    return total_words

def count_phrases(freq_dict, tokens, total_words, phrase_length, filename=None):
    for window in sliding_window(tokens, phrase_length):
        phrase = ' '.join(window)
        if phrase in banned_syntax or phrase in banned_phrases or phrase in banned_date_phrases or phrase in banned_dubious:
            continue
        if phrase in freq_dict:
            freq_dict[phrase] += 1
        else:
            freq_dict[phrase] = 1
        total_words += 1
    return total_words

def normalize_word_frequencies(freq_dict_ai, freq_dict_human, total_words_ai, total_words_human, norm_threshold, freq_threshold):
    normalized_pruned_ai_list = {k: normalize_token(v, total_words_ai, norm_threshold) for k, v in freq_dict_ai.items() if normalize_token(v, total_words_ai, norm_threshold) >= freq_threshold}
    normalized_pruned_human_list = {k: normalize_token(v, total_words_human, norm_threshold) for k, v in freq_dict_human.items() if normalize_token(v, total_words_human, norm_threshold) >= freq_threshold}
    return normalized_pruned_ai_list, normalized_pruned_human_list

def normalize_word_frequencies_old(freq_dict_ai, freq_dict_human, total_words_ai, total_words_human, threshold):
    for token in freq_dict_ai:
        freq_dict_ai[token] = freq_dict_ai[token] / total_words_ai * threshold
    for token in freq_dict_human:
        freq_dict_human[token] = freq_dict_human[token] / total_words_human * threshold

def normalize_token(value, total, threshold):
    return value / total * threshold

def chi_sq_test_for_significance(word, freq_human_text, freq_ai_text, total_words_human_text, total_words_ai_text):
    data = [[freq_human_text[word], total_words_human_text], [freq_ai_text[word], total_words_ai_text]]
    chi2, p_value, dof, expected = chi2_contingency(data)
    return p_value < 0.05

def compare_corpora():
    base_paths = ['./corpus/ai/draft', './corpus/ai/confirmed', './corpus/ai/snippets',
                  './corpus/ai/tagged', './corpus/ai/sandbox', './corpus/ai/userpage']
    #base_paths = ['./corpus/ai/draft']
    paths_grok = ['./corpus/ai/grokipedia']
    # WP:IAPEP8
    paths_post_5 = [path + '/' + subpath for path in base_paths for subpath in [GPT_5_PATH, GPT_5_1_PATH]]
    paths_post_4 = [path + '/' + subpath for path in base_paths for subpath in [GPT_4O_PATH, O1_PATH]]
    paths = base_paths + paths_post_4 + paths_post_5
    freq_ai_text = dict()
    freq_ai_text_two_words = dict()
    freq_ai_text_three_words = dict()
    freq_ai_text_four_words = dict()
    total_words_ai_text, total_twowords_ai_text, total_threewords_ai_text, total_fourwords_ai_text = calculate_frequency(freq_ai_text, freq_ai_text_two_words, freq_ai_text_three_words, freq_ai_text_four_words, paths)
    paths = ['./corpus/human/random',
             './corpus/human/own',
             './corpus/human/essay',
             './corpus/human/peacock',
             './corpus/human/promo',
             './corpus/human/snippets',
             './corpus/human/draft',
             './corpus/human/keywords/underscore',
             './corpus/human/keywords/delve',
             './corpus/human/keywords/highlight',
             './corpus/human/keywords/emphasizing',
             './corpus/human/keywords/fostering',
             './corpus/human/keywords/align',
             './corpus/human/keywords/pivotal',
             './corpus/human/keywords/interplay',
             './corpus/human/keywords/enduring',
             './corpus/human/keywords/testament']
    #paths = ['./corpus/human/draft', './corpus/human/promo']
    paths_grok = ['./corpus/human/antigrok']
    freq_human_text = dict()
    freq_human_text_two_words = dict()
    freq_human_text_three_words = dict()
    freq_human_text_four_words = dict()
    total_words_human_text, total_twowords_human_text, total_threewords_human_text, total_fourwords_human_text = calculate_frequency(freq_human_text, freq_human_text_two_words, freq_human_text_three_words, freq_human_text_four_words, paths)
    print(total_words_ai_text, total_words_human_text)
    freq_ai_text, freq_human_text = normalize_word_frequencies(freq_ai_text, freq_human_text, total_words_ai_text, total_words_human_text, NORMALIZATION_THRESHOLD, 1)
    freq_ai_text_two_words, freq_human_text_two_words = normalize_word_frequencies(freq_ai_text_two_words, freq_human_text_two_words, total_twowords_ai_text, total_twowords_human_text, NORMALIZATION_THRESHOLD, 1)
    freq_ai_text_three_words, freq_human_text_three_words = normalize_word_frequencies(freq_ai_text_three_words, freq_human_text_three_words, total_threewords_ai_text, total_threewords_human_text, NORMALIZATION_THRESHOLD, 1.25)
    freq_ai_text_four_words, freq_human_text_four_words = normalize_word_frequencies(freq_ai_text_four_words, freq_human_text_four_words, total_fourwords_ai_text, total_fourwords_human_text, NORMALIZATION_THRESHOLD, 1.25)
    print("Normalized/pruned")
    freq_dict = dict()
    freq_dict_two_words = dict()
    freq_dict_three_words = dict()
    freq_dict_four_words = dict()
    for word in freq_ai_text:
        if word in freq_human_text:
            increase = ((freq_ai_text[word] - freq_human_text[word]) / freq_human_text[word]) * 100
            if increase > 500 or increase < -85: # manageable list please
                freq_dict[word] = increase
    freq_dict = dict(sorted(freq_dict.items(), key=lambda item: item[1], reverse=True))
    for phrase in freq_ai_text_two_words:
        if phrase in freq_human_text_two_words:
            increase = ((freq_ai_text_two_words[phrase] - freq_human_text_two_words[phrase]) / freq_human_text_two_words[phrase]) * 100
            if increase > 500 or increase < -80: # manageable list please
                freq_dict_two_words[phrase] = increase
    freq_dict_two_words = dict(sorted(freq_dict_two_words.items(), key=lambda item: item[1], reverse=True))
    for phrase in freq_ai_text_three_words:
        if phrase in freq_human_text_three_words:
            increase = ((freq_ai_text_three_words[phrase] - freq_human_text_three_words[phrase]) / freq_human_text_three_words[phrase]) * 100
            if increase > 500 or increase < -80: # manageable list please
                freq_dict_three_words[phrase] = increase
    freq_dict_three_words = dict(sorted(freq_dict_three_words.items(), key=lambda item: item[1], reverse=True))
    for phrase in freq_ai_text_four_words:
        if phrase in freq_human_text_four_words:
            increase = ((freq_ai_text_four_words[phrase] - freq_human_text_four_words[phrase]) / freq_human_text_four_words[phrase]) * 100
            if increase > 200 or increase < -80: # manageable list please
                freq_dict_four_words[phrase] = increase
    freq_dict_four_words = dict(sorted(freq_dict_four_words.items(), key=lambda item: item[1], reverse=True))
    output_file = open('./change_reversed.txt', 'w', encoding='utf-8') # brute_force_divergence_human_vs_ai-summary-based-abstracts
    output_file.write("{{Collapse top}}\n{|class=\"wikitable sortable\"\n! Word\n! Change\n! Human\n! AI\n! Chi square signif.\n|-\n")
    # raw wikitable markup is unreadable to man
    output_file_readable = open('./change_reversed.tsv', 'w', encoding='utf-8') # brute_force_divergence_human_vs_ai-summary-based-abstracts
    output_file_readable.write("word\tchange (%)\thuman\tai\tsignificant\n")
    for word, increase in list(freq_dict.items()):
        result_is_significant = chi_sq_test_for_significance(word, freq_human_text, freq_ai_text, total_words_human_text, total_words_ai_text)
        if word in freq_ai_text:
            output_file_readable.write(f"{word}\t{increase:.2f}\t{freq_human_text[word]:.2f}\t{freq_ai_text[word]:.2f}\t{result_is_significant}\n")
            if increase > 625:
                # Only use the big jumps on the wikitable
                output_file.write(f"| <nowiki>{word}</nowiki>\n| {increase:.2f}\n| {freq_human_text[word]:.2f}\n| {freq_ai_text[word]:.2f}\n| {result_is_significant}\n|-\n")
    output_file_readable.close()
    output_file.write('|}\n{{Collapse bottom}}')
    output_file.close()
    output_file_twowords = open('./change_reversed_twowords.txt', 'w', encoding='utf-8')
    output_file_twowords.write("{{Collapse top}}\n{|class=\"wikitable sortable\"\n! Word\n! Change\n! Human\n! AI\n! Chi square signif.\n|-\n")
    output_file_twowords_readable = open('./change_reversed_twowords.tsv', 'w', encoding='utf-8')
    output_file_twowords_readable.write("word\tchange (%)\thuman\tai\tsignificant\n")
    for word, increase in list(freq_dict_two_words.items()):
        result_is_significant = chi_sq_test_for_significance(word, freq_human_text_two_words, freq_ai_text_two_words, total_twowords_human_text, total_twowords_ai_text)
        if word in freq_ai_text_two_words:
            output_file_twowords_readable.write(f"{word}\t{increase:.2f}\t{freq_human_text_two_words[word]:.2f}\t{freq_ai_text_two_words[word]:.2f}\t{result_is_significant}\n")
            if increase > 700:
                output_file_twowords.write(f"| <nowiki>{word}</nowiki>\n| {increase:.2f}\n| {freq_human_text_two_words[word]:.2f}\n| {freq_ai_text_two_words[word]:.2f}\n| {result_is_significant}\n|-\n")
    output_file_twowords_readable.close()
    output_file_twowords.write('|}\n{{Collapse bottom}}')
    output_file_twowords.close()
    output_file_threewords = open('./change_reversed_threewords.txt', 'w', encoding='utf-8')
    output_file_threewords.write("{{Collapse top}}\n{|class=\"wikitable sortable\"\n! Word\n! Change\n! Human\n! AI\n! Chi square signif.\n|-\n")
    output_file_threewords_readable = open('./change_reversed_threewords.tsv', 'w', encoding='utf-8')
    output_file_threewords_readable.write("word\tchange (%)\thuman\tai\tsignificant\n")
    for word, increase in list(freq_dict_three_words.items()):
        result_is_significant = chi_sq_test_for_significance(word, freq_human_text_three_words, freq_ai_text_three_words, total_threewords_human_text, total_threewords_ai_text)
        if word in freq_ai_text_three_words:
            output_file_threewords_readable.write(f"{word}\t{increase:.2f}\t{freq_human_text_three_words[word]:.2f}\t{freq_ai_text_three_words[word]:.2f}\t{result_is_significant}\n")
            if increase > 400:
                output_file_threewords.write(f"| <nowiki>{word}</nowiki>\n| {increase:.2f}\n| {freq_human_text_three_words[word]:.2f}\n| {freq_ai_text_three_words[word]:.2f}\n| {result_is_significant}\n|-\n")
    output_file_threewords_readable.close()
    output_file_threewords.write('|}\n{{Collapse bottom}}')
    output_file_threewords.close()
    output_file_fourwords = open('./change_reversed_fourwords.txt', 'w', encoding='utf-8')
    output_file_fourwords.write("{{Collapse top}}\n{|class=\"wikitable sortable\"\n! Word\n! Change\n! Human\n! AI\n! Chi square signif.\n|-\n")
    output_file_fourwords_readable = open('./change_reversed_fourwords.tsv', 'w', encoding='utf-8')
    output_file_fourwords_readable.write("word\tchange (%)\thuman\tai\tsignificant\n")
    for word, increase in list(freq_dict_four_words.items()):
        result_is_significant = chi_sq_test_for_significance(word, freq_human_text_four_words, freq_ai_text_four_words, total_fourwords_human_text, total_fourwords_ai_text)
        if word in freq_ai_text_four_words:
            output_file_fourwords_readable.write(f"{word}\t{increase:.2f}\t{freq_human_text_four_words[word]:.2f}\t{freq_ai_text_four_words[word]:.2f}\t{result_is_significant}\n")
            if increase > 300:
                output_file_fourwords.write(f"| <nowiki>{word}</nowiki>\n| {increase:.2f}\n| {freq_human_text_four_words[word]:.2f}\n| {freq_ai_text_four_words[word]:.2f}\n| {result_is_significant}\n|-\n")
    output_file_fourwords_readable.close()
    output_file_fourwords.write('|}\n{{Collapse bottom}}')
    output_file_fourwords.close()
    freq_ai_text = None
    freq_human_text = None
    freq_ai_text_two_words = None
    freq_human_text_two_words = None
    freq_ai_text_three_words = None
    freq_human_text_three_words = None
    freq_ai_text_four_words = None
    freq_human_text_four_words = None

compare_corpora()
#cProfile.run('compare_corpora()')
Results
See here (on a separate page, as the list is long).
Note: This is a somewhat small dataset so far (roughly 2,000,000 tokens each for human and AI), so there are undoubtedly many flukes. Nevertheless, you will see some old friends.
Only the top of each list is shown. The bottom half (words more common in human text) is not currently useful data, as there are too many false negatives and syntax coincidences. Sorting by AI frequency may be most indicative.
More granular comparisons
These results compare smaller slices of the dataset against one another:
AI snippets vs. human snippets
A smaller dataset composed of excerpts rather than entire articles. This includes both AI-copyedited paragraphs versus their older counterparts, and entirely new AI sections versus human sections roughly comparable in length and substance.
Results here.
Newer AI text vs. older AI text
Text from newer chatbots (August 8, 2025 and later) versus text from older chatbots (January 1, 2023 to August 7, 2025). The cutoff is one day after the release of GPT-5.
Results here.
Newer AI text vs. human text
The newer chatbot dataset compared to the full human dataset.
Results here
Older AI text vs. human text
The older chatbot dataset compared to the full human dataset.
Results here
Grokipedia vs. human Wikipedia
Comedy is now legal on Wikipedia.
Results here
Limitations
Out the wazoo:
- I am not a statistician, a computational linguist, or any kind of academic at all.
- Wikipedia articles cover a much broader range of subject matter and verbiage than research abstracts, making overlap less likely and meaning any one article may overwhelmingly influence the results.
- The dataset is small compared to the original study's.
- The dataset is not unbiased:
- The use of draft articles in particular introduces many issues:
- Since drafts are usually deleted after 6 months, most of the samples are from 2025, which over-represents newer chatbots.
- AI-tagged drafts are almost all rejected AfC submissions, and thus may not represent AI-generated text that "passes" better.
- Drafts likely over-represent common subject matter that gets rejected for non-notability, which means associated nouns will be over-represented. For example, the occurrence of the word "AI" is artificially high in part because many deleted AfC drafts are about AI startups, and because AI is talked about more in 2025 than in 2022.
- Since "clean" AfC submissions by humans (i.e., not abandoned userspace drafts) are scarce due to the 2022 cutoff, text may be listed as characteristic of AI that is actually just characteristic of following AfC guidelines, or of AI generating text specifically with the AfC process in mind.
- Most of the AI-tagged samples are new articles, and thus do not represent subjects that already had articles.
- All of the AI-tagged samples are text that wasn't deleted, and thus may not represent older or more obvious AI-generated text, which is more likely to have been noticed and removed.
- Many of the tagged articles were tagged by me, which means I may have over-focused on certain tells or subjects.
- All the human articles were either manually chosen or curated.
- Most human articles have accumulated much more wiki-syntax, templating, and such than newer articles, which means the human token set may be functionally much smaller (as more of the tokens are irrelevant). It also means some words may be under-represented in the human text, since [[Dog]] and Dog are two different tokens.
- Wiki-syntax has changed over time and differs depending on how articles were created. For example, the Wikipedia:Article Wizard imposes its own formatting and has probably changed over the years. Some of this syntax is distinctive enough to be cleaned out, but not all.
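The token-mismatch caveat above is easy to demonstrate: because the pipeline only splits on whitespace, wiki-linked, capitalized, and punctuation-bearing variants of the same word all count separately.

```python
# Whitespace-only tokenization, as in count_words
counts = {}
for token in "The [[Dog]] chased the dog".split():
    counts[token] = counts.get(token, 0) + 1
# '[[Dog]]' and 'dog' end up as distinct keys, as do 'The' and 'the'
```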
Policy note
I believe this is acceptable to do per WP:NOTALAB: Research that analyzes articles, talk pages, or other content on Wikipedia is not typically controversial, since all of Wikipedia is open and freely usable.
For fun: The words/phrases with the largest decreases
Don't take this seriously; it is statistically not useful and may be explained by wiki boilerplate, dataset over- or under-representation, or other coincidences.
- One word: Also,; apparently; bought; decided; got; hence; households; however; huge; kinds; knows; median; normally; poopoo; probably; probability; quite; reproduction; said,; Sometimes,; stopped; Then,; thrown; till; whilst; worst
- Two words: A few; and when; any of; are given; be of; but there; came into; do with; few of; following year; granted to; have no; in fact; is able; is called; many different; more of; or not; per cent; results of; right of; saying that; so as; started to; the great; the place; There were; to deal; went to; were able; will have
- Three words: all of the; all over the; be able to; consists of a; had to be; in charge of; in order to; in the year; is a very; is the only; It has a; large number of; made up of; more and more; rest of the; result of the; set up a; similar to the; that there is; the help of; to go to; would be the
- Four words: a part of the; as a result of; at the beginning of; be one of the; can also be used; for the first time; in addition to; in an effort to; in the presence of; is also known as; is located on the; on the basis of; the rest of the; There is also a; under the age of