Wikipedia talk:WikiProject AI Tools

From Wikipedia, the free encyclopedia

Looking for participants in a GenAI factuality study

Hi. I’m working with a team from Columbia University, funded by a Wikimedia Foundation rapid grant. We are seeking Wikipedia editors who are willing to participate in a study on GenAI reliability, with a commitment of 10 - 20 hours between mid-December and mid-January 2026, and a symbolic stipend to compensate you for your time.

The Research Project. Our goal is to find out whether a Wikipedia-inspired fact-checking process can increase the reliability of chatbots responding to queries related to Wikipedia’s content. The study uses open-source language models and frameworks, and our full results will be shared openly, with the aim of finding better methods, inspired by the well-established and highly successful practices of Wikimedia projects, for addressing AI hallucinations.

Please note that this project is a '''pure and contained experiment''' for analyzing how close large language models are to editor-level factuality. We don’t plan on implementing any live tools at the moment.

The Task. Participants will be asked to fact-check an AI-generated response to a general knowledge question. This will be done by checking whether each claim in a paragraph-long response is supported by the provided sources (each paragraph will be supported by up to 3 citations, and each citation’s text is up to a few paragraphs long).

Each participant will be asked to fact-check about 50 samples, with flexibility to do a bit more or less according to your availability. We recognize that this will be a demanding task, which is why we’re offering a stipend to those willing to make the time. The amount of the stipend is based on the number of samples fact-checked.

Privacy & Security. If you choose to participate, we’re open to either crediting your efforts in our paper, or maintaining your full anonymity, whichever you prefer.

We adhere to the Wikimedia Foundation’s privacy policy. Participants may be asked to provide basic demographic information for research purposes, which will be completely discarded after the research concludes in early 2026.

Participation. All Wikipedia editors are eligible to participate. For methodological purposes, we may prioritize editors with expertise in specific subject matters, more Wikimedia project editing experience, or a focus and interest in fact-checking. If interested, please take a few minutes to submit the form! (Qualtrics external link). If you’re not comfortable filling out an external form, you may send the answers to me directly via EmailUser.

Happy to share the research proposal or answer any questions! –Abbad (talk) 00:27, 26 November 2025 (UTC).

@عباد ديرانية Well that seems like a major waste of time. The page you linked to says We'll build an experimental AI assistant for readers that exclusively draws answer from Wikipedia pages, and integrates an explicit and novel fact-checking step into its architecture that's inspired by Wikipedia's own fact-checking process by editors. and This assistant is not intended for public use but only as a time-bound experiment, which will be used for rigorous testing and evaluation of this model's reliability compared to Wikipedia's baseline of reliable information and using open source large language models (LLMs) as fact-checkers that can provide a reliable paraphrasing of Wikipedia's content
  1. it won't be able to differentiate between its training data and the Wikipedia pages it's supposed to use as sources.
  2. current LLM technology can't reliably paraphrase or summarize content
  3. training models requires copyright infringement on a massive scale, or it will be inferior to alternatives, which already have an established install base and a trillion dollars; kinda difficult to compete with.
  4. doesn't it make more sense to actually check the sources and verify if they support the claim made in the article, instead of having yet another chatbot which can do something any chatbot can do, but worse?
  5. 2300 dollars is not enough to achieve something meaningful.
  6. sample size is tiny
  7. moderate agreement is a very very low bar
  8. We'll consider this a success if more than two thirds of respondents support further experimentation in the future. Makes no sense; of course 100% will support further experimentation, I do too, but not down this dead-end street. Having people support further experimentation does not mean this was a good idea.
  9. It will just be another lossy unreliable vague layer between users and reliable sources, like Wikipedia often is. We need less of that (e.g. by using the |quote= parameter), not more.
  10. This sounds like "I want to use AI, let me invent a usecase" not "I have a problem, let me fix it with whatever the best tool for the job is".
  11. It is unclear what the results will be used for. The output will just be some numbers, which are meaningless by themselves.
  12. It is unclear what an explicit and novel fact-checking step into its architecture that's inspired by Wikipedia's own fact-checking process by editors means. Using MiniCheck isn't novel, and "We'll ask an AI model to check the work of an AI model" leads to diminishing returns. If MiniCheck can do verification, why can't the original model incorporate fact-checking? The root problem is that the base model generates facts, half-truths and nonsense. Instead of trying to sort fact from fiction, the goal should be to create a model that can verify its own output during generation, but that is far outside the scope of the WMF.
  13. A binary metric (true/false) is clearly inadequate when checking if the paraphrasing is any good. A good summary doesn't leave out important facts; yet the proposal only measures pure falsehoods instead of omissions of important stuff, distortions, cherry picking, loss of nuance, synthesis et cetera. Pure hallucinations are a minority of the mistakes an LLM makes, but according to the proposal they're the only ones being measured.
  14. We already had this same discussion, for example over at Wikipedia:Village_pump_(technical)/Archive_221#Simple_summaries:_editor_survey_and_2-week_mobile_study. So when the response was universally negative, and we already know why this can't work, why try again?
  15. Why ask for volunteers and WMF money when Wikipedia doesn't benefit from the results? Why ask Wikipedians, who have a lot of stuff to do, to volunteer to do stuff that doesn't help Wikipedia? It's not like the AI companies will improve their products based on the results, and one can't improve Wikipedia based on the outcome, so who benefits?
  16. The proposal says We'll build an experimental AI assistant and if that was true testing it would make sense. But it also says the plan is to just mash some pre-existing stuff together. If so, why ask volunteers to check how good or bad Llama and MiniCheck are? Shouldn't Meta Platforms employees test Llama? Shouldn't Mistral AI SAS employees test Mixtral? These are commercial companies who can surely hire some people to test their stuff, if they wanted. If there is no plan to add anything new that should improve performance, why bother testing? One datapoint is no datapoint. I already know the outcome: current AI tech is not as good as humans, especially not the nerdy type who edits Wikipedia, and attempts to quantify the difference are pointless because they are just a weighted random number generator one could build a narrative around. In order to make it slightly less meaningless you'd have to keep doing it with each new version and track performance over time, but that would only help AI companies, not Wikipedia.
You can't measure success by comparing this chatbot against commercially available chatbots. The correct baseline is Wikipedia itself, which anyone can access already and read what it actually says.
Showing that this chatbot produces fewer errors than commercial LLMs only proves that it is slightly less bad than commercial LLMs, not that it is a good approach to deliver Wikipedia content to users.
If any hallucinations or distortions are added by the chatbot, then it is worse than just reading Wikipedia yourself.
The interesting variable is how many hallucinations/misrepresentations/distortions are added compared to just reading the Wikipedia articles; how the chatbot compares to commercial LLMs is irrelevant to us.
I may be stupid but I don't get it. Polygnotus (talk) 00:49, 26 November 2025 (UTC)
@DSaroyan (WMF) and FElgueretly-WMF: Please explain why this is a good idea. Which technical experts has the Review team consulted? It would be nice to hear from them as well. It is also unclear to me how a Rapid grant can be awarded to a project that is ineligible: Applications to complete proposed research related to the Wikimedia movement are not eligible. Please review the Wikimedia Research Fund for these funding opportunities. --meta:Grants:Project/Rapid#Eligibility_requirements Thanks, Polygnotus (talk) 01:06, 26 November 2025 (UTC)
This was also posted over at Wikipedia:Village_pump_(miscellaneous)#Looking_for_participants_in_a_GenAI_factuality_study. Doubleposting is generally discouraged because it wastes people's time. Polygnotus (talk) 03:29, 26 November 2025 (UTC)
@Polygnotus I appreciate the thoughtful critique. To what I interpret as your main point: yes, any hallucinations are bad. However, LLMs are already prevalent in industry and academia, as you must know, and from our daily observations, their use almost completely lacks any sense of responsibility towards reliability. Honestly, Wikipedia itself, as a tertiary source, shouldn't even be the ideal baseline for factuality, but we recognize that research is an incremental endeavour, so our approach is to start by introducing a methodical way to improve over the status quo of LLM usage. Realistically, we can't even expect LLMs to improve without such experiments. Please note that because Wikipedia is our chatbot's source, it is effectively a baseline for this study as well.
In-line responses:
  1. Points 1-3: We examined the differentiation between retrieval and training data in depth when scoping our research, and we have two considerations: A. From our literature review, we're aware of methods that aim exactly to differentiate when an LLM's answer is grounded in the provided context versus its training data. If our resources allow, we do aim to implement the methodology from this paper in drawing this differentiation. However, this is a challenging setup, and our team is 100% volunteer-based (or more like 90%; we had a small budget planned for some team members, but with fiscal sponsorship + paying evaluators + computing, we now expect a surplus of only a couple hundred USD), so even with the humble grant we may not be able to go that far. B. The eventual purpose of this study is to evaluate the factuality of LLMs in practice. Whether they make errors due to their training data, architecture, or Wikipedia-grounded context, it's eventually an error.
  2. Re: Point 4: 100% agreed, and honestly my original idea was to build something exactly like the Source Verification tool using the MiniCheck model, which is open source, very lightweight, and has shown impressive accuracy in dozens of experiments that I did with it. My fellow researchers recommended a RAG approach because it has much more impact on the irresponsible use of chatbots in the industry, which is true. Also, because I've now discovered that the Source Verification tool exists, I'm not sure if this approach is any different. I do still hope to run a methodical experiment, once we're done with this project, by: A. Extracting the full text of some citations (e.g. a book), B. Extracting instances where they're cited on Wikipedia pages, C. Running the full text + cited phrases through MiniCheck to see how accurate it is. I believe the results could be impressive.
  3. Re: Points 5 - 6: Indeed! That's why all the researchers are 100% volunteers. We're doing what we can with our budget, but we also understand that the community may not support pouring larger resources into experimental research at this point.
  4. Re: Point 7: This is almost exclusively the annotation baseline from other LLM research we ran across. I'll do more homework on this, but please feel free to advise if you're aware of alternatives.
  5. Re: Point 8: This is a goal to determine the success of the grant itself, so it needed to be experiment-tied, and a user-testing goal seemed appropriate. You're right, though, and I'm open to revising it. I'm hesitant to set a specific goal for factuality improvement because we won't know, obviously, until we conduct the experiment.
  6. Re: Point 9: While I don't disagree, lossy middle layers are not only a reality but a necessity. As you mention, Wikipedia itself is a mediator of information, simply because most people lack the depth of knowledge and/or the time to digest information directly from secondary sources. LLMs, as far as we know, are here to stay, and this is a debate about that reality rather than about how it can be improved.
  7. Re: Points 10-11: This is clearly a huge use case, which is literally why we opted for it (over, as I mentioned above, what could be personally more interesting to me in terms of a tool to fact-check Wikipedia sources). For example, my company, which is not special in any way, pays for easily hundreds of millions of LLM queries a month, mainly to power chatbots. As of now, the vast majority of these chatbots on the internet barely make any attempt at truth-seeking that's analogous to what we're proposing. The results from our study have the basic purpose of proving or disproving that the approach we're trying can have an impact on factuality. In case it does, that's an improvement on the status quo that will affect millions of users.
  8. Re: Point 13: Yes, strictly speaking, this is a factuality-centered study. Other aspects would fall under a summarizing task.
  9. Re: Points 14 - 16: This is very intentionally designed as an experiment on how existing tools like MiniCheck work. MiniCheck has already been developed, but how do we know if it's doing its job well? The fact that these LLMs have been developed by labs has little to do with who's using them, which extends to researchers, educators, non-profits, and even Wikimedians. However, the commercial labs obviously don't care that much about how factual their models are in an academic sense, and have done little work in this avenue (otherwise, we would have seen way fewer hallucinations). We're volunteering our time for this because we feel it's a critical under-researched area, and you're free to think it's worth or not worth your own time. Because this is such a small study, the impact won't be astronomical, but we believe it can be very significant for Wikipedia contributors, because our results will show how effective MiniCheck can be as a fact-checker. This will be evidence of whether or not it's usable for the Source Verification tool, rather than the simple fact that it exists. Did anyone else systematically test whether the fact-checking framework of that tool is consistent and usable?
TBC - there are lots of good points here, I'll come back for the rest as soon as I have the chance! Answered --Abbad (talk) 21:24, 26 November 2025 (UTC).
@عباد ديرانية ReDeEP looks cool but if I were you I would completely ignore Mixtral and stick to LLaMA. I do not think ReDeEP will be able to fix the problem that the model will mix training data and Wikipedia content.
Please correct me if I am wrong, but if I am reading between the lines I think we mostly agree on the facts (although I would recommend using a different tactic).
While LLM factuality is interesting (and annoyingly under researched by the guys with the big bucks), most Wikipedians are always gonna be more interested in using MiniCheck to determine if a claim in a Wikipedia article is supported by the source (or not).
We Wikipedians are a very simple people of humble peasant farmers like myself who just want results; not an academic study.
So while you do your thing, can you please allow others to use MiniCheck as well? You already know exactly how I want to use it.
Adding "MiniCheck was correct" and "MiniCheck was wrong" buttons is not very complicated.
If we can show the masses practical results, it is much much easier to get them to volunteer/contribute/whatever.
That way we have both academic validation and real-world testing, which benefits both.
I do not agree that our results will show how effective MiniCheck can be as a fact-checker because that is not what is being tested (and you wouldn't need such a complex pipeline to test just that).
Testing whether a complex AI pipeline produces fewer (or filters out more) hallucinations than the base model is interesting, but not relevant to Wikipedia.
I think the study needs to benefit Wikipedia, not just use it as a testbed, before you should be able to get WMF money or Wikipedia volunteers. And I don't really see it doing that at the moment. Polygnotus (talk) 07:15, 27 November 2025 (UTC)
~500 responses total need evaluation.
At least 300 of those need ≥3 evaluators.
Let's say the remaining 200 get one evaluation each.
So at least 1100 evaluation tasks.
I don't agree that a simple true/false evaluation will lead to meaningful results (point 13 above), but let's assume it is fine.
Each participant will be asked to fact-check about 50 samples (according to your comment above) so you need about 22 people.
Your comment talks about a commitment of 10 - 20 hours in mid December. So 220 - 440 hours of volunteer time? Assuming an 8-hour work day we are talking 1.25-2.5 workdays per person, and between 27.5 and 55 8-hour days of work sequentially... I am not sure why evaluating 50 samples should take 10-20 hours (12-24 minutes per evaluation for a simple yes/no on a short bit of text??).
The budget talks about 10 evaluators doing 100 responses each in 5 hours, so 3 minutes per evaluation. That is 100 evaluations short and doubles the workload per person. So if the budget allows for 5 hours per evaluator, why ask for 10-20 hours? The budget is $1000 for 10 people doing 100 responses each, so that is $1 per evaluation.
The form says The rate will be 100 USD / 30 fact-checked samples, with payment prorated according to the completed samples. but there is only 1000 dollars in the budget allocated so that does not compute. You can only buy 300 evaluations for that money, but you need 1100 evaluations. That is $3.33 per evaluation.
Did an LLM come up with these numbers? Is the plan to pay people 33% of what was promised to them, or to run out of money after 300 evaluations? What will happen if someone did 50 samples and wants the $166.67 that was promised to them?
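The arithmetic above can be laid out explicitly. This is just a restatement of the numbers already quoted in this thread (the ~500 responses, the $1000 budget line, and the $100 / 30 samples rate from the form), not new information:

```python
# Back-of-the-envelope check of the evaluation workload and budget,
# using only the figures quoted in this discussion.
responses_total = 500
triple_evaluated = 300                      # need >= 3 evaluators each
single_evaluated = responses_total - triple_evaluated

evaluations_needed = triple_evaluated * 3 + single_evaluated   # 1100 tasks

# Budget page: $1000 for 10 evaluators doing 100 responses in 5 hours each.
budget_rate = 1000 / (10 * 100)             # dollars per evaluation
budget_minutes = 5 * 60 / 100               # minutes per evaluation

# Recruitment form: $100 per 30 fact-checked samples.
form_rate = 100 / 30                        # ~= $3.33 per evaluation
affordable = 1000 / form_rate               # evaluations $1000 can buy

print(evaluations_needed)                   # 1100
print(budget_rate, budget_minutes)          # 1.0  3.0
print(round(form_rate, 2), round(affordable))  # 3.33  300
```

So at the form's rate, the $1000 line buys roughly 300 of the 1100 evaluations needed, which is the gap being pointed out.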
I find it extremely difficult to outsource items on my todolist, both irl and on Wikipedia. Finding 22 Wikipedians who are willing to spend a significant amount of time doing a very boring task that does not benefit Wikipedia is gonna be real hard. I don't think a symbolic stipend is gonna do much to motivate em.
In summary, the study as proposed won't work. But installing MiniCheck somewhere and giving me an API endpoint and credentials is a good idea. Polygnotus (talk) 08:09, 27 November 2025 (UTC)
@Polygnotus The discrepancies in the numbers are because we decided to increase the pay for evaluators as much as possible, at the cost of minimizing any share we take (practically none at this point). As you rightfully say, we realized that this is a difficult and boring task, and therefore thought it appropriate to increase the amount to at least 3 USD per sample, thus a total of roughly $100 / 30 samples. Indeed, this will reduce the total number of samples we can analyze, but that's better than unappreciated labor. We will increase the evaluation share to $1,600, and will thus be able to fund about 500 examples. I admit that the numbers got a bit jumbled (my fault; if LLMs were used I may have gotten them more in line!).
I find it a bit confusing that you agree this is a symbolic stipend and barely enough motivation, but also disagree that this hard evaluation will take 12 - 24 minutes. Anyhow, the hourly ranges are very rough, and I stretched them to be extra safe.
Re: MiniCheck, I'm more than happy to collaborate if you want. MiniCheck is available through HuggingFace, we already have the subscription (which is a negligible 9 USD / month). It's pretty easy to grant access, if all you need is the API, I'll be in touch. The hard part is actually evaluating the results, and methodically checking if they work as well as we'd like them to.
We're already committed to a chatbot experiment for this round of funding, so we do need to proceed with our current methodology in principle. I'm quite happy, though, to work together on a study dedicated to MiniCheck as a standalone (as I transparently mentioned, it is what I'm personally interested in as well!). If I manage to squeeze in any other funding, I'm also happy to make that a proper study, if it's of interest to you --Abbad (talk) 22:05, 27 November 2025 (UTC).
@عباد ديرانية Yeah it looks like the plan evolved over time, like all good plans do. The good news is that Wikipedians are usually pretty good with intrinsic motivation.
disagreed that making this hard evaluating will take 12 - 24 minutes. According to the budget it will take only 3 minutes. And the comment near the top of this page says paragraph-long response is supported by the provided sources (each paragraph will be supported by up to 3 citations, the text of each citation is up to a few paragraphs). and you only want a true/false. It seems very very unlikely that it would take me anywhere near 24 minutes on average to read 3x a few paragraphs and decide if they support a paragraph-long LLM response. 3 minutes on average sounds more realistic although it may be too short. I think the number will be somewhere in between. I would probably ask Claude to find the relevant text in those sources, which would speed up the human part of the equation.
it is what I'm personally interested in as well!) Exactly, so you understand why I am far more excited about playing around with MiniCheck. One of my, probably many, flaws is that I am unfit for academia. Although I am very curious if and how well the ReDeEP approach actually works (or you know, SEReDeEP, if we wanna stay up to date). Polygnotus (talk) 23:18, 27 November 2025 (UTC)
This WikiProject is just getting started (8 members already) and this page doesn't get that many pageviews. Our conversation may be confusing to potential volunteers so I unhatted (incorrect terminology but whatever) the doublepost. You know where to find me for the MiniCheck stuff. Good luck! Polygnotus (talk) 01:41, 28 November 2025 (UTC)
This study appears to have the goal of encouraging the use of LLMs, based on 'fact-checking' using Wikipedia as a source. Given that Wikipedia makes it entirely clear that it does not consider itself a reliable source, the study is clearly ill-thought-out, or at best engaging in wishful thinking. Furthermore, any encouragement of this misleading LLM use can only make things worse for Wikipedia itself, as it faces a deluge of LLM-generated garbage produced by a technology which routinely hallucinates (as has been demonstrated to be mathematically inherent in such software), engages in synthesis contrary to Wikipedia policy, and mangles source citations to the extent that, even if they originate from something genuine (and meeting Wikipedia sourcing policy, which LLM citations routinely don't), the amount of effort required to find the actual source is totally disproportionate to their utility. I would advise anyone contemplating engaging with this study to question whether it is in the interest of Wikipedia's contributors, and perhaps more importantly its readers, to do so. AndyTheGrump (talk)  Preceding undated comment added 03:57, November 26, 2025 (UTC)

Checking offline sources

Hi @Polygnotus, I've tried your AI Source Verification tool and it works really well for online sources. Of course, in many content areas the majority of sources are paper books, so it'd be nice if the tool supported offline sources too in some way. Have you planned anything to make this possible? The simplest approach would be to allow the user to paste the text (I actually built a toy standalone app using this approach) or upload a source. Any other ideas on how it could be tackled? My assumption is that many editors would be able to access offline sources, whether using the Wikipedia Library, Google Books or some other digital library. Alaexis¿question? 11:20, 29 November 2025 (UTC)

@Alaexis Hiya! That is a good idea, and since you forgot to copyright it I will immediately steal it.
It might also partially solve the paywall problem.
I am currently playing around with User:Polygnotus/CitationVerification.
Where can we find this standalone app of yours? Polygnotus (talk) 11:40, 29 November 2025 (UTC)
No worries at all, happy to suggest improvements :) My app is here - I've just added BYOK, hopefully it hasn't caused any issues. It's very much a beta version.
I think that an addon works much better for Wikipedia editors. I had in mind a different target audience - readers rather than editors - hence a standalone app.
One thing I couldn't find a good solution for is multiple references supporting a single claim. As far as I can see your tool also looks at each reference individually which will produce false positives if a source supports only a part of the claim. Alaexis¿question? 12:26, 29 November 2025 (UTC)
@Alaexis Ah that's really cool! I've just added BYOK, hopefully it hasn't caused any issues. It works fine over here. Perhaps you can add it to Wikipedia:WikiProject AI Tools?
It must be possible to deal with 2 refs together supporting 1 claim, but I haven't looked at it yet. Thanks for sharing! Polygnotus (talk) 12:40, 29 November 2025 (UTC)

Repository of prompts?

The thread immediately above led me to inspect User:Polygnotus/Scripts/AI Source Verification.js to see what prompts they were giving to the APIs. Separately, I've been finessing my instructions for a "Wikipedia research assistant" for initial sanity checks, hosted by Kagi. Maybe this project could have a page for sharing or workshopping examples like this. ClaudineChionh (she/her · talk · email · global) 13:23, 29 November 2025 (UTC)

That's a good idea. The prompt I used for my citation checker app can be found here. Alaexis¿question? 14:02, 29 November 2025 (UTC)
Personally I find it helpful to provide examples of request-response pairs though I'm not sure if it would work for your use case. Alaexis¿question? 14:03, 29 November 2025 (UTC)
Oh yes, my file is more like a default set of instructions as a starting point. I can dig into my chat history for more specific examples of prompts and responses. ClaudineChionh (she/her · talk · email · global) 00:10, 30 November 2025 (UTC)
I've been thinking about having an API platform that could cache AI outputs used to review specific revisions (in case multiple editors send the same AI query, e.g. when patrolling recent changes) and simplify development workflow for new tools, and that could be a helpful use for it!
Beyond that, we don't have a guide yet, and "prompting tricks" would definitely be an essential part of it: feel free to start it! Chaotic Enby (talk · contribs) 20:35, 5 December 2025 (UTC)
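The caching idea above could look something like this minimal sketch. Every name here is hypothetical (there is no such platform yet), and `query_model` stands in for whatever LLM backend the platform would actually call; the point is only that keying on the revision ID plus the prompt lets two patrollers reviewing the same revision share one AI response:

```python
import hashlib

# Hypothetical in-memory cache keyed by (revision ID, prompt).
# A real platform would use a shared store with expiry instead of a dict.
_cache: dict[str, str] = {}

def _key(rev_id: int, prompt: str) -> str:
    """Stable cache key for one (revision, prompt) pair."""
    return hashlib.sha256(f"{rev_id}\n{prompt}".encode()).hexdigest()

def cached_review(rev_id: int, prompt: str, query_model) -> str:
    """Return the AI output for this (revision, prompt) pair,
    calling query_model(prompt) only on a cache miss."""
    k = _key(rev_id, prompt)
    if k not in _cache:
        _cache[k] = query_model(prompt)
    return _cache[k]
```

For example, if two editors send the same "check the references" query about revision 12345, the model is queried once and the second request is served from the cache.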

Generating the infobox from the article text?

Has there been any exploration into using AI tools to generate the infobox from the article's existing text? Whatisbetter (talk) 11:24, 2 December 2025 (UTC)

@Whatisbetter, nothing I'm aware of but it should be pretty straightforward. What kind of articles do you have in mind? Alaexis¿question? 22:35, 2 December 2025 (UTC)
There are absolutely no circumstances where it would be appropriate to use AI to "generate the infobox from the article's existing text". AI (or at least LLMs, which are presumably what is being referred to) cannot be trusted. They synthesize. They 'cite' things that don't remotely support the text they are cited for. They routinely hallucinate. Wikipedia content (including that in infoboxes) needs to be written by contributors who can ensure that it is correct per a valid source, and who are prepared to take responsibility for doing so. If you want LLM-generated content, look elsewhere. AndyTheGrump (talk) 22:42, 2 December 2025 (UTC)
@AndyTheGrump is correct: current AI technology is unable to summarize a text, or to find the interesting bits. Crafting infoboxes will remain a human task for the foreseeable future. Polygnotus (talk) 04:05, 3 December 2025 (UTC)
@AndyTheGrump, @Polygnotus That's just simply not true. A simple prompt of "1. research and understand how wikipedia infoboxes work. 2. read the article "1234" and tell me if there's any information already present in the article that could be used in the infobox. do not pull from any other resources." And then the HUMAN editor reviews the results.
It's really that simple. --skarz (talk) 13:45, 6 March 2026 (UTC)
@Skarz do not pull from any other resources. Commercially available AI models cannot do that, currently. They don't differentiate between user input and the dataset they are trained on. Polygnotus (talk) 14:17, 6 March 2026 (UTC)
@Polygnotus That's funny, ChatGPT did exactly what I asked it to (accurately) right in this conversation.
Where are you even getting your information from???
--skarz (talk) 14:20, 6 March 2026 (UTC)
@Skarz God told me in a dream. jk, check out my userspace. I am not exactly opposed to LLMs, but I think it is important to understand their limitations. Polygnotus (talk) 14:23, 6 March 2026 (UTC)
@AndyTheGrump, you're right about the hallucinations and other issues. I do not suggest or condone any violations of the policy. However, I think that it's possible to use LLMs to generate a draft which would have to be checked by a human editor.
Also, a few approaches have been suggested to control hallucinations. Here's one that might work, though I haven't tried it myself: AI Driven Citation: Controlling Hallucinations With Concrete Sources. It's suggested by Gavin Mendel-Gleason, who is working with Peter Turchin. Alaexis¿question? 06:54, 3 December 2025 (UTC)
@Alaexis Have you tried MiniCheck? Polygnotus (talk) 06:57, 3 December 2025 (UTC)
@Polygnotus, not yet, how do I run it? Alaexis¿question? 15:14, 3 December 2025 (UTC)
@Alaexis
If you want to run MiniCheck on your own computer, then the answer is this, but this is a binary yes/no.
https://www.bespokelabs.ai/bespoke-minicheck gives out free API keys.
The answer to the question depends very much on how nerdy you are.
How familiar are you with Python? Do you want to run it on your own pc?
It is usually easier to just use their free API. I don't know what Operating System you use (*nix, MacOS, Windows) but usually if you Google "run python script" with the name of your operating system it should provide instructions.
The code on that page is pretty outdated btw. Polygnotus (talk) 15:23, 3 December 2025 (UTC)
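For anyone curious what the claim-vs-source check looks like in code, here is a minimal sketch of the verification loop. The `entailment_score` function below is a deliberately dumb placeholder so the sketch runs without a model download; in practice you would replace its body with a call to MiniCheck (via the `minicheck` package or the Bespoke Labs API, whose exact interfaces aren't reproduced here), so treat every name in this sketch as an assumption rather than the real API:

```python
def entailment_score(document: str, claim: str) -> float:
    """Placeholder for MiniCheck: should return the probability that
    `document` supports `claim`. Substitute a real model call here.
    The word-overlap heuristic below exists only to make the sketch
    runnable; a real fact-checker does far more than this."""
    doc_words = set(document.lower().split())
    claim_words = claim.lower().split()
    if not claim_words:
        return 0.0
    hits = sum(word in doc_words for word in claim_words)
    return hits / len(claim_words)

def is_supported(document: str, claim: str, threshold: float = 0.5) -> bool:
    """Collapse the score into the binary true/false label the study uses."""
    return entailment_score(document, claim) >= threshold

source = "Paris is the capital of France and sits on the Seine."
print(is_supported(source, "Paris is the capital of France"))  # True
print(is_supported(source, "Berlin is in Germany"))            # False
```

The structure (score each claim against its cited source, then threshold to yes/no) is the part that matters; swapping the placeholder for a real MiniCheck call is what the actual evaluation would hinge on.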
Thanks @Polygnotus, I reviewed the docs and it seems pretty straightforward, though I'm not sure I'm going to use it right now - they claim that it's only marginally better than Claude Sonnet 3.5. Alaexis¿question? 10:18, 5 December 2025 (UTC)
@Alaexis In my extremely limited testing I can't really tell if it's better or worse than Claude, but Claude is clearly better for our purposes because it can (I am intentionally using a word incorrectly here) explain its thinking. Polygnotus (talk) 10:31, 5 December 2025 (UTC)
I have found that populating an infobox from Wikidata works OK (when there is data available). Major manual editing and checking will be required, but it beats starting from scratch and actually reading the Wikidata format in the infobox instructions. I never tried to use the text as a starting point, but expect it to work too. The modern high-end engines (I mostly use Google Gemini 3) do quite a decent job in very unexpected areas; searching for information in a structured fashion is one of them. I haven't seen an outright hallucination for months. Ask a random question, and you might get a random answer. Ask when the term Net load was coined and you will get a wrong answer, but one that points you to quite solid WP:RS, so a human could make the same mistake too - I wouldn't count this type of error as a hallucination. Викидим (talk) 08:25, 3 December 2025 (UTC)
I take it you are aware that since Wikidata isn't WP:RS, you can only use it indirectly, where it actually cites a valid source? Anyway, your 'Major manual editing and checking will be required' comment points to what is likely to be a major issue with AI-assisted infobox generation, given how Wikipedia currently operates in practice - far too many people will simply assume that the AI has got it right, and not check it. AndyTheGrump (talk) 11:27, 3 December 2025 (UTC)
To me, a Wikidata item is like an article in a foreign language - a source for translation that should be checked. AI helps to navigate the quite complex set of infobox templates, each with its own parameter quirks. AI does not do a good job populating these fields, but it is an OK way to get to a starting point. Викидим (talk) 17:19, 3 December 2025 (UTC)
Example of using AI to fix text, including adding an infobox: before / after / changes / history. Manual checking and rework were necessary - but I would never have attempted this repair without AI assistance (it would be too much hassle). Викидим (talk) 21:31, 4 December 2025 (UTC)

Discussion at Wikipedia:Village pump (idea lab) § Scope of AI tool use

 You are invited to join the discussion at Wikipedia:Village pump (idea lab) § Scope of AI tool use, which is within the scope of this WikiProject. Cf. the previous discussion about whether generating an infobox from an article text would be acceptable. Chaotic Enby (talk · contribs) 20:37, 5 December 2025 (UTC)

Invite template

@Chaotic Enby It would be useful to have an invite template that can be posted on user talkpages to invite people. Polygnotus (talk) 08:47, 5 January 2026 (UTC)

Good idea, I'll work on it! Chaotic Enby (talk · contribs) 09:06, 5 January 2026 (UTC)
@Chaotic Enby It may be a good idea to add User:Overandoutnerd/Scripts/articleSummary and invite Overandoutnerd using the template. Polygnotus (talk) 19:13, 4 February 2026 (UTC)
Sent them the invite, great catch! Chaotic Enby (talk · contribs) 20:26, 4 February 2026 (UTC)
Maybe we should just make some insource: links with intitle:.js that search the user namespace for relevant key words like Gemini and Claude et cetera https://en.wikipedia.org/w/index.php?search=insource%3A%22Gemini%22+intitle%3A%22.js%22&title=Special%3ASearch&profile=advanced&fulltext=1&ns2=1 Polygnotus (talk) 20:41, 4 February 2026 (UTC)
Great idea! Chaotic Enby (talk · contribs) 20:55, 4 February 2026 (UTC)

[Research] Preliminary analysis of AI-assisted translation workflows

Note: To keep the conversation organized, I have primarily posted this at the Village Pump. I encourage any questions or discussions to take place directly there.

I’m sharing the results of a recent study conducted by the Open Knowledge Association (OKA), supported by Wikimedia CH, on using Large Language Models (LLMs) for article translation. We analyzed 119 articles across 10 language pairs to see how AI output holds up against Mainspace standards.

Selected findings:

  • LLMs were found to be significantly better than traditional tools at retaining Wikicode and templates, simplifying the "wikification" process.
  • 26% of human edits fixed issues already present in the source article (e.g., dead links), showing that the process improves the original content too.
  • Human editors modified about 27% of the AI-generated text to reach publication quality.
  • We found a ~5.6% critical error rate (distortions or omissions). This confirms that "blind" AI publication is not suitable; human oversight is essential.
  • Claude and ChatGPT led in prose quality, while Gemini showed a risk of omitting text. Grok was the most responsive to structural formatting commands.

Acknowledging limitations: We consider these findings a "first look" rather than a definitive conclusion. The study has several limitations, including:

  • Subjectivity: Error categorization is inherently dependent on individual editor judgment.
  • Non-blind testing: Editors knew which models they were using, which likely influenced their prompting strategies.
  • Sample size: While we processed over 400,000 words, the data for specific model comparisons across all 10 language pairs is insufficient.

Our goal is to provide some data for the community as we collectively figure out the best way to handle these tools. The full report, including the error taxonomy and raw data logs, is available on Meta. 7804j (talk) 21:00, 20 January 2026 (UTC)

LLMs benchmarking

I've done some benchmarking which may be of interest to the builders of AI tools here. I wanted to know how well various LLMs can handle source verification: checking whether a given source supports the claim it's attached to (User:Alaexis/AI_Source_Verification). I tested a few open-source models hosted by PublicAI and Claude Sonnet 4.5 as a SOTA model.

No surprises - Claude was the best, but the difference in performance was not huge. I had 16 "not supported/partially supported" cases in my dataset, and different models found between 7 and 12 of them (still too few for statistically significant comparisons of specificity) while maintaining decent false positive rates (<15% for Claude, <20% for the next best model). The full results can be found here.

Some of the "not supported" cases are quite interesting (e.g. citation 32 - try to figure out for yourselves what's wrong with it). This exercise wasn't meant to measure the rate of inaccurate citations, but it does make you think about all those inaccuracies lurking out there. Alaexis¿question? 20:45, 27 January 2026 (UTC)
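For anyone reproducing this kind of benchmark, the two headline numbers (how many true problems a model catches, and how often it flags good citations) fall out of labelled (truth, prediction) pairs. A generic sketch, not the script used for the results above:

```python
def rates(pairs):
    """Compute (recall, false positive rate) from labelled results.

    pairs: iterable of (truth, pred) booleans, where True means
    'citation does not support the claim' (i.e. a flagged problem).
    Assumes both problem and non-problem cases are present.
    """
    tp = sum(1 for t, p in pairs if t and p)          # problems caught
    fn = sum(1 for t, p in pairs if t and not p)      # problems missed
    fp = sum(1 for t, p in pairs if not t and p)      # good citations flagged
    tn = sum(1 for t, p in pairs if not t and not p)  # good citations passed
    recall = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return recall, false_positive_rate
```

With 16 true problems, a model that finds 12 of them has a recall of 0.75; the false positive rate is computed separately over the supported citations.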

Notice: Planning help pages for AI workflows

We need about 600,000 archive URLs replaced; see Wikipedia:Archive.today guidance. One of the slow bits is checking to see whether another archiving website contains the relevant content (or, e.g., if it just contains a mostly empty page or an error message). Is this something that some AI tools could help with?

See also Wikipedia talk:Archive.today guidance#New stand-alone tool to detect (and remove) Archive.today and related links. WhatamIdoing (talk) 06:01, 23 February 2026 (UTC)

Is this something that some AI tools could help with? Short version: No. Long version: Fuck no. But also, somehow, yes and maybe.
Note that Wikipedia:Archive.today_guidance#How_you_can_help is possibly a bad idea, in the sense that humans are messy; I would do as much as possible botwise and then use humans for the difficult stuff.
There is a dump of archived URLs, but it's about six months out of date.
And CitationVerification should of course be used for a lot more than just a project like this, but it could be a start.
See User:Polygnotus/CitationVerification and that one WMF guy on Phabricator who made something similar. T414816
But the homie @DVRTed: is cooking up something magic. Also pinging the homie @Alaexis: who is also doing work in this area. Polygnotus (talk) 06:56, 23 February 2026 (UTC)
A part that AI might be helpful with is assessing the content at the alternative archive. Someone could write a bot to fetch the content from the other archive site and post it to a GPT-5-like model. That would cost some money, but I bet the task could be done with an older, less-expensive model.
I think that assessment is a good role for AI because the consequences for being wrong are pretty low. I’ll read the project page. Dw31415 (talk) 13:57, 23 February 2026 (UTC)
@Dw31415 Problem is, most people don't use |quote=. And 600,000 times a large amount of tokens is a large number, allegedly. I bet the task could be done with an older, less-expensive model. Those generally suck at "which string fully supports this claim, if any?", and if you don't know for sure that you found a string that fully supports the claim, you can't tell for sure whether you've successfully replaced the archive when no |quote= is present (there are quite a few cases where an archive was added long after a claim was added to Wikipedia, in which time the source could have changed). The other approach is taking whatever they've archived (except the offending code) and putting it elsewhere, but that also has various problems (provenance, scraping services not accepting stuff you scraped for them, scraping probably being against a ToS somewhere, and perhaps even unethical, yada yada). Polygnotus (talk) 14:38, 23 February 2026 (UTC)
@Dw31415, @Polygnotus I think that the best way to check whether the task could be done with an older, less-expensive model is to try. PublicAI hosts a few open-source models, and they allocated enough tokens to run an experiment (e.g., would the AI make the same decisions as the editors who replaced archive.today links, whether using Netha's tool or not). I compared the performance of open-source and SOTA models for citation verification, which is a similar problem, and the open-source models weren't that far behind.
@WhatamIdoing, I'm not entirely sure the AI is necessary here. Netha's tool somehow manages to filter out invalid snapshots (though take this with a grain of salt, my sample size is N=1). It seems like simpler heuristics would suffice. Alaexis¿question? 20:06, 23 February 2026 (UTC)
I'm not sure that it's necessary, either, but having used Netha's tool (twice), the process looks like:
  1. Open the tool and search for something (e.g., an article)
  2. Open the Wikipedia article in a new tab to figure out what the source is supposed to be supporting
  3. Open the suggested archive.org link to figure out whether the link has relevant content
and I was thinking: what if you could see all of that on one screen? WhatamIdoing (talk) 21:18, 23 February 2026 (UTC)
To be honest, that's not how I understood the process. I think it's enough to verify that the archive.org snapshot isn't empty or an error page and is identical to the archive.today snapshot. Are you sure that we need to re-check citations?
Source Verifier can help check citation accuracy, but it's a user script for manual verification of individual citations. If someone is going to build a tool where you could see all of that on one screen, I'd be happy to share my experience, code and contacts. Alaexis¿question? 22:11, 23 February 2026 (UTC)
I'm not interested in going to the archive.today websites, so I'm not checking against those, but I think that would work for people who are willing to do that. OTOH, when I've set out to fix a few links, I not infrequently find that the original source was weak, or the article significantly out of date. For example, Netha's tool flagged a source in Tracheal intubation. I fixed the archiving problem very easily thanks to her tool, but the source itself is a random website from 1998, which is nowhere near the MEDRS ideal. WhatamIdoing (talk) 22:29, 23 February 2026 (UTC)
If all we need to do is check whether the main contents of two web pages match, we don't need an LLM for that - you can extract body content with Beautiful Soup (HTML parser) and run a file comparison algorithm. I put a few more details about this concept here: Wikipedia talk:Archive.today guidance#Bot for checking for identical text. Dreamyshade (talk) 01:55, 28 February 2026 (UTC)
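A stdlib-only sketch of that comparison idea (using html.parser in place of Beautiful Soup so there is no external dependency, and difflib as the comparison algorithm; the similarity threshold you'd act on is a judgment call):

```python
import difflib
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def visible_text(html: str) -> str:
    """Strip markup, keeping only the human-visible text."""
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.parts)

def similarity(html_a: str, html_b: str) -> float:
    """Ratio in [0, 1]; values near 1.0 suggest the snapshots match,
    while an error page or empty snapshot scores much lower."""
    return difflib.SequenceMatcher(
        None, visible_text(html_a), visible_text(html_b)).ratio()
```

This would catch the "mostly empty page or error message" case mentioned earlier in the thread, since an error page shares little visible text with the real content.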
Doesn't really make sense to mention this here since it has nothing to do with AI (and I don't really plan on adding that), but I was pinged here, so this is what I'm currently working on: User:DVRTed/ArchiveBuster. — DVRTed (Talk) 21:33, 24 February 2026 (UTC)
I'd been planning something similar, so it's good to see this! Would you say your script is ready for others to use? ClaudineChionh (she/her · talk · email · global) 22:19, 24 February 2026 (UTC)
I wouldn't consider it "stable" by my own standard just yet. I'll make a sandbox to cover some template params edge-cases I can think of and finalize the script documentation (oh, the horror!). Bug reports would be very helpful, if there are any. — DVRTed (Talk) 22:31, 24 February 2026 (UTC)
I just replaced 42 archive.today links; this took me a bit more than 2 hours. Even if it was just me who'd replace the same amount of links daily with the same delay, we'd be done in about 3 years, 9 months and 14 days.
Don't try to shove AI into everything. sapphaline (talk) 14:48, 24 February 2026 (UTC)
@Sapphaline How much do you charge per hour? How many hours per day and days per week do you work? Polygnotus (talk) 14:50, 24 February 2026 (UTC)
"Even if it was just me who'd replace the same amount of links daily with the same delay", i.e. if I would replace 42 links every day with the same 2 hour delay. sapphaline (talk) 14:52, 24 February 2026 (UTC)
@Sapphaline 3 years, 9 months, 14 days ≈ 1,384 days = 33,216 hours. At 42 links per ~2.25 hours, that's roughly 42 × (33,216 / 2.25) ≈ 620,000 links.
So now we just need to know how much you charge. Our budget is ~250 million USD. Polygnotus (talk) 14:54, 24 February 2026 (UTC)
@Sapphaline, how many of those removals 'stuck'? I've seen one report of IABot replacing them. WhatamIdoing (talk) 18:06, 24 February 2026 (UTC)
What do you mean by "stuck"? sapphaline (talk) 18:27, 24 February 2026 (UTC)
Remain unreverted. WhatamIdoing (talk) 18:32, 24 February 2026 (UTC)
Remaining archive.today links? None. sapphaline (talk) 18:33, 24 February 2026 (UTC)
Nice work @DVRTed! Any sense of how long it would take 100 volunteers to use your tool to replace the 600k links?
I think a Bot could still be very useful, potentially as:
A) A complement to ArchiveBuster (and/or the GitHub one). Maybe something like Wikipedia:FRS: a bot that assigns a list of articles to volunteers and posts the list of links to the volunteers' talk pages.
B) A bot that uses an AI pipeline to evaluate the alternatives and just replace the links.
C) A Toolforge-hosted interstitial webpage: a bot that replaces the archive.today links with a link to an interstitial page, which warns the reader about archive.today and links to instructions on how to volunteer to fix it.
I’m not free to code until Sunday at the earliest. @Poly, your thoughts? Dw31415 (talk) 08:09, 25 February 2026 (UTC)
@Dw31415 Since having a large list of articles and assigning parts of it to volunteers, who should then be able to mark them as completed, is a recurring need, we should probably build something for it.
I was talking with DVRTed about how it would be cool to query the archive.org api for when a particular URL was scraped so that you can then send the user the most likely option. Polygnotus (talk) 08:19, 25 February 2026 (UTC)
Any ideas on marking it done? Is that really needed? As an alternative, “VolunteerAssignmentBot” could just say, “these have been assigned to you for 5 days” and then assign them to someone else if the pages still meet the search criteria. The volunteer could reply “more please” and the bot could assign more. Just brainstorming Dw31415 (talk) 08:51, 25 February 2026 (UTC)
@Dw31415 Well I would like it to be re-usable for other tasks. And with some tasks you can actually check if someone did the work using software (are all links to archive.is|today gone), but with others you cannot. For example with typofixing it is possible that they did check but that it was a false positive. Polygnotus (talk) 08:54, 25 February 2026 (UTC)
I guess the bot could post a table to a user subpage. The user could edit the table. Or the bot could create a topic on their talk page with one reply per page. The user could reply to each message with “done” Dw31415 (talk) 09:03, 25 February 2026 (UTC)
You'll be able to use Category:CS1 maint: deprecated archival service soon. WhatamIdoing (talk) 21:41, 25 February 2026 (UTC)
I was also thinking of presenting a link to the Wayback Machine with the best timestamp (close to archive-date if possible, otherwise access-date if that's available). Once the date is parsed it should be simple to send a request to the Wayback API with the best timestamp. ClaudineChionh (she/her · talk · email · global) 08:49, 25 February 2026 (UTC)
@ClaudineChionh User:Polygnotus/tmp/Archive.js Polygnotus (talk) 07:28, 26 February 2026 (UTC)
Worth mentioning: that is exactly how IABot works. You can examine retrieveArchive() in https://github.com/internetarchive/internetarchivebot/blob/master/app/src/Core/APII.php and see that:
Pass 1: Find the Wayback snapshot closest BEFORE the citation's access date
Pass 2: If nothing found, find the closest AFTER the access date
This means if a citation says |access-date=2019-03-15, IABot asks the Wayback API for the snapshot nearest to March 15, 2019 theoretically preserving the temporal context of the citation.
--skarz (talk) 13:43, 27 February 2026 (UTC)
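The same closest-snapshot lookup can be sketched against the public Wayback Machine availability API, which takes a target URL plus a timestamp and returns the single nearest snapshot (it does not expose IABot's separate before/after passes; the JSON field names below are from that API):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"

def availability_url(target: str, access_date: str) -> str:
    """Build an availability query; access_date is YYYYMMDD, e.g. the
    citation's |access-date=, so the snapshot preserves temporal context."""
    return API + "?" + urlencode({"url": target, "timestamp": access_date})

def closest_snapshot(response: dict):
    """Extract the closest usable snapshot URL from an availability
    response, or None if nothing was archived (or only error captures)."""
    snap = response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available") and snap.get("status") == "200":
        return snap["url"]
    return None

# Example (network call left commented out):
# data = json.load(urlopen(availability_url("http://example.com", "20190315")))
# print(closest_snapshot(data))
```

A bot would still want to verify the returned snapshot's content, since "available" only means a capture exists, not that it captured the right page.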
Need a proxy server on toolforge to access the archive.org API. Anybody know the correct venue to make a request for this? (cc @Novem Linguae?) — DVRTed (Talk) 11:36, 26 February 2026 (UTC)
@DVRTed Well my script above works fine. It uses my VPS. I can put the proxy on Toolforge if you insist, but it makes no technical difference. Polygnotus (talk) 11:38, 26 February 2026 (UTC)
Yea, that'd be great; primarily because it'd be an open source server. I guess I'll insist. ;) — DVRTed (Talk) 11:42, 26 February 2026 (UTC)
@DVRTed https://en.wikipedia.org/wiki/User:Polygnotus/Data/ArchiveProxy Polygnotus (talk) 11:44, 26 February 2026 (UTC)
Also made a cloudflare worker thing:
Polygnotus (talk) 20:25, 27 February 2026 (UTC)
@WhatamIdoing Is this still worth exploring? Do you have any examples? Is validating archive.org links enough to be helpful? Dw31415 (talk) 04:05, 4 March 2026 (UTC)
You may be interested in another tool I am working on - an Electron app where you can enter the name of an article and it allows you to compare cited archived pages on archive.today and its equivalent on the Wayback Machine side-by-side visually. I posted more details and a screenshot at https://en.wikipedia.org/wiki/Wikipedia_talk:Archive.today_guidance#Archival_Comparison_Tool
Let me know what you think! --skarz (talk) 15:25, 4 March 2026 (UTC)

How I use Claude Code for RC patrol

Just wanted to share something I've been experimenting with: how to use Claude Code to patrol recent changes. The short version is that I have a python script that grabs recent changes with various filters applied and stores the diffs as .json files; then Claude analyzes them and sorts them into priorities for review. Claude does not make any changes, it just contextually analyzes each change. I used up most of my session limit writing and modifying the code; I don't think this is too token-intensive, because most of your time would theoretically be spent actually going and verifying/correcting the suspicious edits.

I actually had Claude turn the action into its own plugin, so I can just input /analyze-diffs and it will run automatically. Here's an example of the output:

Wikipedia Vandalism Review — 2026-03-05

  47 diffs analyzed. 3 confirmed, 11 suspicious, 33 OK.

  ---
  [FLAG] CONFIRMED / REVERT-WORTHY

  ---
  007 — Novoozerne | Editor: ~2026-14297-92 | No summary
  - Changed the settlement name from "Novoozerne" to "Маратий-сити" (Maratiy-siti) in the infobox name and native_name fields.
  - Clear vandalism. A Ukrainian settlement's name was replaced with what appears to be a nonsense/invented name. Temp user, zero explanation.

  ---
  027 — Peter Pan (2003 film) | Editor: ~2026-29295-2 | Summary: "Zack Gauthier"
  - Appended ", Mr. & Mrs. Darling's grandchildren." to every single Lost Boy actor entry (Slightly, Tootles, Curly, Nibs, the Twins).
  - Factually wrong: the Lost Boys are feral children in Neverland, not the Darlings' grandchildren. The Darling children are Wendy, John, and Michael. Summary is meaningless. Temp user.

  ---
  031 — Fives (sport) | Editor: ~2026-14169-17 | No summary
  - Introduced deliberate grammatical corruption into the hatnote: "the French town" → "the Frenchs town", "Fives, Nord" → "Fives, Nords", "the Irish sportsman" → "the Irishs sportsmans(GAA Handball)".
  - Textbook vandalism. Pluralized proper nouns incorrectly, inserted junk text. Temp user, no summary.

  ---
  [FLAG] SUSPICIOUS — Needs Human Review

  ---
  001 — Christianity and other religions | Editor: Augy31 | Summary: "References."
  - Appended a bare external URL from Crisis Magazine (a traditionalist Catholic publication) directly to paragraph text about Buddhist-Christian relations in Sri Lanka.
  - Not cited in proper <ref> format — just tacked onto the end of a sentence. The source appears ideologically motivated (article title: "Buddha's Fist: Persecuting Christians in Sri Lanka"), potentially a
  POV citation in an article that already notes Christian extremism in South Korea. Registered user, not clear vandalism, but questionable sourcing practice.

  ---
  008 — Al Sherrod Lambert | Editor: ~2026-14244-01 | Summary: section header only
  - Changed "backing vocals" to "vocals" for Kelly Rowland and Beyoncé.
  - Changed wikilink for "Billboard Hot 100" to point to [[Billboard Hot Gospel charts|Billboard Hot 100]] — linking the Hot 100 text to the Gospel charts article. This is a factual error embedded in a
  wikilink. Temp user.

--skarz (talk) 18:49, 5 March 2026 (UTC)

@Skarz Hiya, sounds interesting, do you have a github/lab somewhere I can look at? Thanks, Polygnotus (talk) 01:05, 6 March 2026 (UTC)
It's just one file: https://gist.github.com/comaeclipse/81917a63c8ae9d7585a2e277d71956d0 If you provide that to any agentic LLM with the explicit instruction to run python diffchecker.py --rc --recent 50 and then analyze the diffs itself (not writing another script to use regex or something), it should perform well. For instance, with Claude Code I instruct it to use its Read and Glob tools to analyze each .json file. skarz (talk) 14:39, 6 March 2026 (UTC)
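For anyone wanting to build something similar, the fetch-and-store step can be sketched with the standard MediaWiki Action API (the parameter names below are from the real list=recentchanges and action=compare modules, but skarz's actual script may be structured differently):

```python
import json
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def rc_query(limit: int = 50) -> str:
    """URL listing recent mainspace edits, with the revision ids
    needed to request diffs afterwards."""
    return API + "?" + urlencode({
        "action": "query", "list": "recentchanges",
        "rcnamespace": 0, "rctype": "edit",
        "rcprop": "title|ids|user|comment|timestamp",
        "rclimit": limit, "format": "json",
    })

def diff_query(old_revid: int, revid: int) -> str:
    """URL for the diff between two revisions of a page."""
    return API + "?" + urlencode({
        "action": "compare", "fromrev": old_revid,
        "torev": revid, "format": "json",
    })

def save_changes(changes: list, outdir: str = "diffs") -> None:
    """Dump each change dict to its own .json file so an agent can
    read them one at a time."""
    Path(outdir).mkdir(exist_ok=True)
    for c in changes:
        Path(outdir, f"{c['revid']}.json").write_text(
            json.dumps(c, indent=2))
```

A real script should also set a descriptive User-Agent header and respect the API's rate-limit etiquette.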
@skarz Heya, I've actually been working on something similar. Some things I would suggest: don't tell it that temp users are more likely to vandalize (it will just flag every temp user, and I assume you are doing this based on the "temp user" responses), and make sure it knows to AGF - like, you really need to drill the AGF part into its skull. LuniZunie(talk) 03:00, 6 March 2026 (UTC)
Thanks for the advice! skarz (talk) 14:39, 6 March 2026 (UTC)

AI scripts not working anymore following CSP changes

There was a security incident on March 5, 2026. Following the incident, user scripts were first disabled but are now accessible again. However, it seems that the Content Security Policy has changed, see Wikipedia_talk:User_scripts#CSP_restrictions and Wikipedia:Village_pump_(technical)#Today's_outage_—_user_scripts_are_disabled. As a result, it seems that it's not possible anymore to communicate with various LLM providers. For example, trying the js code await fetch("https://api.openai.com/v1/chat/completions") now results in the error message "Refused to connect because it violates the document's Content Security Policy."

For my scripts, this affects WikiChatbot, SpellGrammarSuggestions, and SpellGrammarSuggestionsList. I would be curious to hear whether there are any feasible workarounds and whether anyone knows about planned changes to the new CSP rules to address this problem. @Alaexis and Polygnotus: I would be interested in your thoughts. Phlsph7 (talk) 10:39, 8 March 2026 (UTC)

@Phlsph7 Some thoughts:
  • It was claimed that they were testing how many scripts would be affected by the new CSP. But importing a large number of random scripts isn't how you do that. You just search for scripts that contain the string 'http' (and then perhaps filter out jsdelivr.net etc.).
  • They needed to take decisive action for the optics, but this doesn't really protect us.
  • An attacker would always prefer local scripts and not having to deal with CSP. This wasn't an attacker but just some script someone had laying around.
  • Therefore CSP could only be useful if it prevented data exfiltration, but it does not.
  • The WMF says they now whitelist stuff, but when I asked to whitelist my VPS the response was: "we're not going to allowlist IP addresses, for instance, which change hands even more fluidly than domain names", so they are making up the rules as they go along. They do not yet actually have a procedure to whitelist stuff.
  • Me buying a domain for 5 bucks a year does not enhance safety. Nor does me throwing the same code on a Cloudflare Worker.
  • We are not allowed to put proxies on Toolforge, which makes sense (wikitech:Wikitech:Cloud_Services_Terms_of_use#4.5_Using_WMCS_as_a_network_proxy). But we need to be able to contact external services. And there are proxies on Toolforge.
  • Hosting tools on Toolforge is a huge barrier to entry for most script developers. Instead of writing a bit of JS you need to jump through many bureaucratic hoops and know many things unrelated to JS development.
  • Hosting tools on Toolforge may not even be possible in some cases, e.g. with code you don't want to opensource or data you don't necessarily "own". For example I extracted all possible words from a Hunspell dictionary but I probably don't own that data in any meaningful sense.
  • I think this was a knee-jerk reaction to a silly mistake that does not help protect us.
  • I may be able to work around it, but many others may not. And lots of people won't even bother. I think this is a huge blow to script development.
  • I was told that what matters is not if you mess up once in a while, because we all do, but how you react to it when you do. I don't think this was a good way to deal with it.
Polygnotus (talk) 10:52, 8 March 2026 (UTC)
As you probably saw, we have allowlisted the requested domains, along with others that users reported were broken. Our goal in the CSP we deployed (which was based on an analysis of CSP reports over the last few months) was to minimize breakage to existing scripts. If you see other breakage of existing scripts, please do open a phab task for us to review.
Also, a couple points I want to make in response:
  • Just searching for links in the text of user scripts is not sufficient. Code can load other code that loads other code, and script URLs can be obfuscated (even benevolently). That was why script execution was being captured - though, as we've tried to be clear about, it absolutely should not have been done in the environment/account/method in which it was done. It was not a good mistake to make, and we are sorry for the disruption.
  • The user script involved in last week's incident did in fact ping third-party domains, two of which would have been stopped by the CSP we have now. But that was not the driving motivator, as we are not responding just to the behavior in that particular script - there are many other kinds of attacks possible in user scripts (including things involved in past real-world attacks on our users) that are made worse by being able to ping arbitrary third-party domains on the internet. CSP does not stop all malicious script behavior, but it is a significant and effective security control.
I recognize that new sudden restrictions are frustrating, especially when they cause some breakage and happen in the circumstances that they did. We too would certainly have preferred something more orderly and publicly explained before deploying it. But an enforcing CSP like the kind we have deployed (and are updating to avoid breakage) was planned, and is needed as a foundational step toward securing the user script system. EMill-WMF (talk) 01:21, 10 March 2026 (UTC)
@Phlsph7, I very much agree with u:Polygnotus's sentiment.
Practically speaking, adding a comment here would make it more likely that they whitelist api.openai.com sooner rather than later. Alaexis¿question? 11:59, 8 March 2026 (UTC)
Another solution is to use a publicai-hosted open-source model via my proxy which has been whitelisted (https://github.com/alex-o-748/public-ai-proxy). Assuming that these models work for your use case and the number of requests is reasonable, this should work.
Long-term, the proper solution would be to have a "call LLM of your choice as a Wikipedia user" primitive available as a service somewhere where every developer can use it without jumping through hoops. Alaexis¿question? 12:05, 8 March 2026 (UTC)
It is often a mistake to focus too much on the actions of one person while forgetting to look at what institutional changes should be made to prevent stuff like this in the future.
In my world it makes sense to enforce the principle of least privilege. You want to do something that could be dangerous? You have to jump through a hoop. And you can't just do all your day to day activities with a privileged account. No sudo bash-ing.
I do not understand why there is no rate limiting on destructive API actions.
They are now talking about implementing having to re-enter your password before being able to edit a site's most important files. That is a good idea. Polygnotus (talk) 12:34, 8 March 2026 (UTC)
Thanks for the feedback and the helpful links; I left a comment at the phabricator discussion. I agree that this is frustrating. While I understand that there are security concerns, this change seems to be an overreaction. LLMs have a lot of potential for helping with Wikipedia, but the difficulties of having each user bring their own API key, and now of not even being able to access the API endpoints, are not particularly innovation-friendly. Phlsph7 (talk) 13:23, 8 March 2026 (UTC)

Just a short update: the CSP rules were adjusted again to allow communication with some of the main LLM providers (api.anthropic.com, api.openai.com, api.publicai.co), see here. Phlsph7 (talk) 10:07, 10 March 2026 (UTC)

AI aid for citations

I just stumbled upon a use case for AI tools: improving citations. I entered the search "isbn Maggiore, Michele (2008). Gravitational Waves: Volume 1: Theory and Experiments." and the AI summary started:

  • The ISBN for Michele Maggiore's Gravitational Waves: Volume 1: Theory and Experiments (2007/2008, Oxford University Press) is 978-0198570745 (Hardcover). The book is also associated with the ISBN 978-0191717666 for the eBook version.

With the ISBN a citation can be created (the tooling for that is not 100% but very good). Of course I can do this manually, but integration with a tool like Wikipedia:ProveIt or even a bot would be great. Sorry if this is a duplicate suggestion. Johnjbarton (talk) 16:38, 11 March 2026 (UTC)
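As a sketch of what such an integration could do, here is the ISBN-to-citation step using the Google Books volumes API as an illustrative metadata source (this is not how ProveIt works internally, and the template rendering below is deliberately minimal):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def lookup_isbn(isbn: str) -> dict:
    """Fetch volume metadata for an ISBN from the Google Books API."""
    url = ("https://www.googleapis.com/books/v1/volumes?"
           + urlencode({"q": f"isbn:{isbn}"}))
    data = json.load(urlopen(url))
    return data["items"][0]["volumeInfo"]

def cite_book(info: dict, isbn: str) -> str:
    """Render a minimal {{cite book}} template from volume metadata;
    a real tool would also split author names and add more fields."""
    parts = ["{{cite book"]
    if info.get("authors"):
        parts.append("|author=" + "; ".join(info["authors"]))
    parts.append("|title=" + info.get("title", ""))
    if info.get("publisher"):
        parts.append("|publisher=" + info["publisher"])
    if info.get("publishedDate"):
        parts.append("|date=" + info["publishedDate"][:4])
    parts.append("|isbn=" + isbn)
    return " ".join(parts) + "}}"

# Usage (network call left commented out):
# print(cite_book(lookup_isbn("9780198570745"), "978-0198570745"))
```

Any automated version would still need human review, since metadata services sometimes return the wrong edition for an ISBN.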
