User:Mill 1/Project Finding The Forgotten Few
From Wikipedia, the free encyclopedia
TL;DR
I created a software application that cross-references the obituary archives of The New York Times and The Guardian, identifies individuals who are mentioned in both, and checks whether they already have a biography article on Wikipedia. Reasoning: obituaries are great sources for bios, and if someone's obituary is published in both newspapers then by definition that person warrants an article here.
The application, called 'Wikipedia Biography Creator', works great (source code). Given a specific month (e.g. January 2001), it identifies potential candidates who meet the criteria (an obituary in both newspapers but no bio yet). After manually checking that a candidate indeed has no article here, I proceed to create the actual draft for the new bio article.
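The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the actual source code of 'Wikipedia Biography Creator': the function names `normalize_name` and `find_candidates` are hypothetical, and the normalization rule (lower-casing, stripping accents and punctuation) is an assumption about how names from two newspapers might be compared.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lower-case a name, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = "".join(
        ch if ch.isalnum() else " " for ch in without_accents.lower()
    )
    return " ".join(cleaned.split())


def find_candidates(nyt_names, guardian_names, existing_titles):
    """Return names found in both archives that have no article yet.

    All three arguments are plain lists of strings; how they are obtained
    (archive APIs, a Wikipedia lookup) is outside this sketch.
    """
    guardian_keys = {normalize_name(n) for n in guardian_names}
    existing_keys = {normalize_name(t) for t in existing_titles}
    candidates = []
    for name in nyt_names:
        key = normalize_name(name)
        if key in guardian_keys and key not in existing_keys:
            candidates.append(name)
    return candidates
```

Exact matching on normalized names will miss spelling variants and can match two different people with the same name, which is one reason the manual check remains necessary.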
The application spotted 41 candidates; I created 34 biographies based on that. A downside was that the news archive of The Guardian only starts in January 1999. I also noticed that candidates were more abundant in the earlier years. During my work here it had already struck me that the online newspaper The Independent had a lot of (early) obituaries. However, it does not have a news archive (yet). Using some ICT techniques I managed to create an 'obituary archive' for The Independent myself, going back to July 1992. I used that source in my application to look for earlier candidates, which resulted in eight more articles.
Using ChatGPT
After creating dozens of biographies manually, I decided to investigate whether I could get an LLM to do the writing for me. Obviously I checked what Wikipedia had to say about it (later). I learned that although an AI chatbot should be used with great caution and its output rigorously scrutinized, using one was not forbidden (as of September 2025).
After much experimenting I thought I had defined an AI prompt that would serve my purpose. It would set up the draft in wiki markup (maximum of 300 words), using the URLs of the two newspaper obituaries as sources. Initial checks of the output generated by ChatGPT looked good. So after applying some corrections and additions I decided to publish my first bio: Jacques Fauvet. Subsequently I created eight more articles this way. During that time I realized that AI hallucinations were present in every draft and were hard to spot because of their credible nature. Finding them cost me a lot of time, and I started thinking that it would be faster to return to manual writing. It was about that time that I was alerted by Wikipedian NicheSports that the extent of the hallucinations was even larger than I had realised.
While investigating what went wrong, I did some further research on how to prevent these issues. New trials indicated that the following approach eliminates hallucinations when writing draft articles using ChatGPT:
- In the ChatGPT personalisation settings turn off 'Memory' and 'Web Search'
- Delete the previous chat
- Create a new chat. Use the following prompt template:
Per article the prompt differs in two ways:
1. The provided two texts containing the source citations
2. The maximum number of words (based on the combined length of the obituaries, implying importance)
You are a Wikipedia editor. Write a neutral biography in encyclopedic tone in valid English Wikitext.
Important: only use the information provided in following two texts to answer the question.
Name of biography article: Mary Bodne
TEXT 1 START
url: https://www.theguardian.com/news/2000/mar/20/guardianobituaries
Source: The Guardian
Author: Mark Krupnick
Date: 20 Mar 2000 02.45 CET
Obituary:
Title=Mary Bodne
Body=
[Text body of the obituary]
TEXT 1 END
TEXT 2 START
url: https://www.nytimes.com/2000/03/02/nyregion/mary-bodne-ex-owner-of-algonquin-hotel-dies-at-93.html
Source: The New York Times
Author: Douglas Martin
Date: March 2, 2000
Obituary:
Title=Mary Bodne, Ex-Owner of Algonquin Hotel, Dies at 93
Body=
[Text body of the obituary]
TEXT 2 END
Use and merge content from these texts paraphrased (no copying wording!).
Every (set of) statements should end with a reference to the source.
Two sources exist: The Guardian (<ref name="guardian">) and the NYTimes (<ref name="nyt">), both {{cite news}}, access-date=[today, d mmmm yyyy].
Other requirements:
- About 250 words maximum.
- Use an Infobox when appropriate.
- If Infobox: if possible use {{death date and age|...}}, date format: follow opening sentence. d mmmm yyyy means {{death date and age|...|df=yes}}.
- Add appropriate subsections, e.g.: == Early life == and == Later life and death ==.
- Add appropriate internal wiki links (e.g., [[Nigeria]], [[Nancy, France|Nancy]], ''[[Le Monde]]'').
- References section with == References == and {{reflist}}.
- Add property url-access=subscription regarding the NYTimes reference.
- Add {{Short description|[short description]}} at the top.
- Add appropriate Wikipedia categories at the end.
- Between the references and categories add the following two:
{{Authority control}}
{{DEFAULTSORT:LASTNAME(S), FIRSTNAME(S)}}
- Output only valid Wiki markup for a new article draft.
And again: you should ONLY respond to my question, given information provided in the two texts.
The actual prompt that created the draft for Mary Bodne can be found here. Copy the output generated by the LLM into a draft page to start the verification process.
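Filling in the template per article could be automated along these lines. This is only a sketch under assumptions: the `Obituary` class and the functions `format_source` and `build_prompt` are hypothetical names, the fixed requirement lines are abbreviated, and the word limit is passed in by hand because no formula for deriving it from the obituary lengths is given here.

```python
from dataclasses import dataclass


@dataclass
class Obituary:
    url: str
    source: str
    author: str
    date: str
    title: str
    body: str


def format_source(index: int, obit: Obituary) -> str:
    """Render one obituary as a delimited TEXT block, as in the template."""
    return "\n".join([
        f"TEXT {index} START",
        f"url: {obit.url}",
        f"Source: {obit.source}",
        f"Author: {obit.author}",
        f"Date: {obit.date}",
        "Obituary:",
        f"Title={obit.title}",
        "Body=",
        obit.body,
        f"TEXT {index} END",
    ])


def build_prompt(article_name: str, first: Obituary, second: Obituary,
                 max_words: int) -> str:
    """Assemble the per-article prompt from the two variable parts."""
    parts = [
        "You are a Wikipedia editor. Write a neutral biography in "
        "encyclopedic tone in valid English Wikitext.",
        "Important: only use the information provided in following two "
        "texts to answer the question.",
        f"Name of biography article: {article_name}",
        format_source(1, first),
        format_source(2, second),
        "Use and merge content from these texts paraphrased "
        "(no copying wording!).",
        "Other requirements:",
        f"- About {max_words} words maximum.",
        # The remaining fixed requirement lines of the template would be
        # appended here unchanged; they are omitted to keep the sketch short.
    ]
    return "\n".join(parts)
```

Keeping the source texts inside explicit START/END markers makes it easy to confirm, before sending, that exactly two obituaries and nothing else ended up in the prompt.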
Before publication
- Start by reading the two sources carefully.
- Check every statement in the output for accuracy, puffery and existence in the source text(s).
- Check the first and last lines of the output particularly for LLM editorialization.
- Combine duplicate citations (e.g. multiple consecutive statements citing the same source).
- Add missing internal wiki links.
- Correct non-existent categories, if any.
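Part of this checklist can be supported by a small script. The sketch below only flags places where consecutive citations within one paragraph point to the same named reference, so that they can be combined by hand; it does not rewrite the draft. The function name `find_citation_runs` and the regular expression are assumptions for illustration, and the pattern only recognises references written as `<ref name="...">`, the form the prompt asks for.

```python
import re
from itertools import groupby

# Matches both the full form <ref name="x">...</ref> and the short
# form <ref name="x" />, capturing the reference name.
REF_PATTERN = re.compile(
    r'<ref\s+name\s*=\s*"([^"]+)"\s*(?:/>|>.*?</ref>)', re.DOTALL
)


def find_citation_runs(wikitext: str):
    """Return (paragraph_number, ref_name, run_length) for repeated refs.

    A run is two or more consecutive citations of the same named
    reference inside one paragraph (paragraphs are split on blank lines).
    """
    runs = []
    paragraphs = re.split(r"\n\s*\n", wikitext)
    for number, paragraph in enumerate(paragraphs, start=1):
        names = REF_PATTERN.findall(paragraph)
        for name, group in groupby(names):
            length = len(list(group))
            if length > 1:
                runs.append((number, name, length))
    return runs
```

The script deliberately leaves the decision to the editor: whether two adjacent statements can share one citation depends on whether both are really supported by that source.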
After publication
- Create the corresponding Talk page.
- Add wiki links pointing to our newly created article (surname!).
- Update the Results section in this page.
Inception
Some online news media have created APIs that let computer applications request information from them. Two of these are free, publicly available news archives: the NYTimes API and The Guardian Open Platform. I used the NYTimes API extensively in a previous private project, applying it to generate close to 7,000 citations in numerous list pages.
The NYTimes News Archive dates back to 1851. The news archive of the Guardian only starts in January 1999. That's why I used that month as a starting point to find candidates.
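As an illustration of how a month of data might be requested from both services, the sketch below only builds the request URLs. The helper names are hypothetical, and the endpoint paths, the parameter names and the Guardian tag `tone/obituaries` reflect my understanding of the public APIs; they should be checked against the current documentation of each service. Both require a personal API key.

```python
import calendar
from urllib.parse import urlencode


def nyt_archive_url(year: int, month: int, api_key: str) -> str:
    """URL for one month of the NYTimes Archive API (assumed path layout)."""
    base = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
    return base + "?" + urlencode({"api-key": api_key})


def guardian_search_url(year: int, month: int, api_key: str,
                        page: int = 1) -> str:
    """URL for one page of Guardian obituaries published in a given month."""
    last_day = calendar.monthrange(year, month)[1]
    params = {
        "tag": "tone/obituaries",  # assumed tag for obituary articles
        "from-date": f"{year:04d}-{month:02d}-01",
        "to-date": f"{year:04d}-{month:02d}-{last_day:02d}",
        "page-size": 50,
        "page": page,
        "api-key": api_key,
    }
    return "https://content.guardianapis.com/search?" + urlencode(params)
```

For example, `guardian_search_url(1999, 1, "MY_KEY")` would cover January 1999, the first month available in the Guardian archive. The NYTimes archive response contains all articles of the month, so obituaries would still have to be filtered out on the client side.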
TODO
Results: NYTimes-The Guardian
I processed the months January 1999 – May 2025 to discover deceased individuals who meet the criteria of this project. The red links and new redirects I leave for someone else to create.
Results: NYTimes-The Independent
I processed the months July 1992 – December 2009 to discover deceased individuals who meet the criteria of this project. The red links and new redirects I leave for someone else to create.
| ID | Month of Death | Deceased | Ref. The Independent | Ref. NYTimes | Remarks |
|---|---|---|---|---|---|
| 1 | July 1992 | Pierre Uri[4] | reference | reference | |
| 2 | February 1993 | Michel Renault | reference | reference | |
| 3 | February 1993 | Adina Blady-Szwajger[5] | reference | reference | |
| 4 | March 1993 | Warren Ellsworth | reference | reference | |
| 5 | March 1994 | Kenneth Neill Cameron | reference | reference | |
| 6 | January 1995 | Elaine Greene | reference | reference | |
| 7 | March 1995 | Constance Morrow Morgan | reference | reference | |
| 8 | December 1995 | Nina Verchinina | reference | reference | |
| 9 | September 1996 | Rose Isabel Williams | reference | reference | |
| 10 | January 1998 | Jack Grimm | reference | reference | |
| 11 | August 1998 | Francesco Crucitti[6] | reference | reference | |
About this project
Every year about 50 million people die. Of those, roughly 10,000 end up with a biography on the English wiki. For the 2010s I found only eight individuals who met the criteria of this project. I think this says something about the maturity of Wikipedia. It seems I am truly mopping up the last forgotten notable few.
