User:NinjaRobotPirate/LLM

From Wikipedia, the free encyclopedia

Experiments with locally installed LLMs in LM Studio.

Wikipedia tests

Asking questions about Wikipedia policies and guidelines. This is testing their training data, without any kind of retrieval-augmented generation.

General questions

Model columns, in order:

  • Gemma 3 4b Q4_K_M
  • Gemma 3 12b Q8_0
  • Gemma 3 27b Q4_0
  • Qwen 3.5 9b Q6_K_L
  • Qwen 3.5 27b IQ4_NL
  • Qwen 3.5 27b Q5_K_M
  • Qwen 3.5 35B A3B Q4_0
  • Qwen 3.5 35B A3B Q4_K_S*
  • Qwen 3.5 122b A10b IQ4_XS
  • gpt-oss 20b MXFP4, Hi
  • gpt-oss 120b MXFP4, Low
  • gpt-oss 120b MXFP4, Med
  • gpt-oss 120b MXFP4, Hi
  • Llama 3.3 70b IQ4_XS
  • Mistral Small 3.2 24b Q6_K
  • Mistral Small 4.0 119b A6b IQ4_XS

Results per question, in the column order above:

  • Can anyone edit Wikipedia?
    Partial, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
  • Under what circumstances, if any, are multiple accounts allowed on Wikipedia?
    Yes, Yes, Partial, Partial, Yes, Yes, Yes, No x2, Yes, Yes, Yes, Yes, Harsh, Yes, Yes, Yes
  • Can anyone create an article on Wikipedia?
    Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
  • What is Wikipedia's inclusion criteria for new articles?
    Yes, Yes, Yes, Yes, Yes, Yes, Yes, Partial, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
  • Can I make a Wikipedia article about myself?
    Yes, Yes, Yes, Yes, Yes, Yes, Yes, Partial, Yes, Partial, Partial, Yes, Yes, Yes, Yes, Harsh
  • How does Wikipedia's policy of neutrality work?
    Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
  • How does Wikipedia define a reliable source?
    Yes, Yes, Partial, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
  • How many sources should a new Wikipedia article have to demonstrate notability?
    Harsh, Yes, Yes, Yes, Yes, Yes, Yes, Harsh, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
  • How can you save an article during a deletion discussion on Wikipedia?
    No, Yes, Yes, Yes, Yes, Yes, Yes, Partial, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
  • What do administrators do on Wikipedia?
    Yes, Yes, Yes, Yes, Yes, Yes, Yes, Partial, Yes, Partial, Yes, Yes, Yes, Yes, Yes, Yes
  • Performance[a]
    178 t/s[b], 59 t/s[c], 46 t/s[d], 82 t/s[e], 37 t/s[f], 32 t/s[g], 115 t/s[h], N/A, 6 t/s[i], 177 t/s[j], 21 t/s[k], N/A, 20 t/s[l], N/A, 43 t/s[m], 20 t/s[n]
COI/paid editing

Model columns, in order:

  • Gemma 3 4b Q4_K_M
  • Gemma 3 12b Q8_0
  • Gemma 3 27b Q4_0
  • Qwen 3.5 9b Q6_K_L
  • Qwen 3.5 27b IQ4_NL
  • Qwen 3.5 35B A3B Q4_0
  • Qwen 3.5 122b A10b IQ4_XS
  • gpt-oss 20b MXFP4, Hi
  • gpt-oss 120b MXFP4, Low
  • gpt-oss 120b MXFP4, Med
  • gpt-oss 120b MXFP4, Hi
  • Llama 3.3 49b IQ3_XS
  • Llama 3.3 70b IQ4_XS
  • Mistral Small 3 24b Q6_K
  • Mistral Small 4 119b A6b

Results per question, in the column order above:

  • What are Wikipedia's rules on paid editing?
    No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Harsh, Yes, Harsh
  • What counts as paid editing on Wikipedia?
    No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
  • Can companies edit their own Wikipedia page?
    No, Yes, Yes, Yes, Yes, Yes, Yes, Partial, Partial, Partial, Partial, Yes, Partial, Yes, Yes
  • On Wikipedia, when does someone have a conflict of interest?
    Yes, Yes, Yes, Yes, Yes, Yes, Yes, Harsh, Yes, Yes, Yes, Yes, Yes, Yes, Yes
  • What are Wikipedia's rules on editing with a conflict of interest?
    Harsh, Yes, Yes, Yes, Yes, Yes, Yes, Partial, Partial, Partial, Partial, Yes, Yes, Partial, Yes
  • Is editing Wikipedia with a conflict of interest more challenging?
    Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
  • Why does Wikipedia discourage paid editing and COI editing?
    Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
  • What can I do if Wikipedia has a biased article about me?
    Yes, Yes, Yes, Yes, Yes, Yes, Yes, Partial, Yes, Yes, Yes, Yes, Partial, Partial, Yes
  • On Wikipedia, can I cite a newspaper or journal article that I've written myself?
    Yes, Harsh, Harsh, Partial, Harsh, Partial, Partial, Partial, Yes, Yes, Yes, Yes, Partial, Partial, Yes

Holy crap. It took me so long to remember how to make complicated tables in wikicode. It sucks. Also, that's a big table. Sorry, collecting LLMs seems to have turned into a hobby.

Key:

  • 4B, 12B, etc: This is the number of parameters. 4B is 4 billion. Bigger numbers don't necessarily make the LLM smarter, but they certainly raise the ceiling.
  • A3B, A10B, etc: This indicates a sparse model (mixture of experts). It will run much faster than a dense model of the same size, but the quality of the answers may be unpredictable.
  • Q4, Q6, etc: This is the quantization level. It reduces precision to speed up computation. Low numbers are generally worse, but not always in perceptible ways.
  • XS, K_M, etc: This indicates the quantization method used. Most people advise using K_M. I tossed in some other stuff. Some are because it's high quality, and some are just because I was bored.
  • *: This model was decensored (abliterated). This can potentially affect how the model answers questions.

How parameters, quantization, and model density interact to affect answers is complex. User:NinjaRobotPirate/LLM/Glossary goes into it a little bit. Briefly, a large number of parameters makes it easier for a well-trained model to sound smart, but it raises the system requirements dramatically. Quantization brings the system requirements back down by reducing precision. This slightly increases the risk of hallucinations and accumulated noise, especially below Q4. Above Q5, it's probably overkill.
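If you want a rough feel for the numbers, weight storage is just parameters times bits per weight. This is my own back-of-the-envelope sketch, not an official sizing guide, and it ignores the KV cache and runtime overhead:

```python
# Rough VRAM estimate for quantized weights: parameters * bits-per-weight / 8.
# Ignores the KV cache and runtime overhead, which add a few more GB.

def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Q8 keeps ~8 bits per weight; Q4 variants average roughly 4.5 bits.
print(approx_size_gb(12, 8.0))   # Gemma 3 12b at Q8_0: ~12 GB of weights
print(approx_size_gb(27, 4.5))   # a 27b model at ~Q4: ~15 GB
```

This is why a 27b model only fits on a 24GB card once it's quantized down to Q4 or so.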

For results:

  • "Yes": It understands the basics and the important nuances. There may be minor errors, similar to random details that a newspaper article about Wikipedia might get wrong.
  • "Harsh": It gave solid advice and isn't wrong, but it was too strict or emphasized a minor point way too hard. It might discourage new users from contributing.
  • "Partial": It got the basics right but screwed up on the nuances. Or maybe it contradicted itself. Your article would get a cleanup tag, or you'd get a templated warning.
  • "No": It screwed it up pretty bad. Following this advice could potentially get you blocked or your article deleted. Maybe the whole answer was hallucinated.
  • "No x2": I regenerated the response and gave it another chance. It still got it catastrophically wrong on the second try.

In edge cases, I generally leaned toward "yes" if the vast majority of the copious output was correct, especially if the errors didn't affect the correctness of the answer. gpt-oss 20b presented the biggest challenge in this respect because it answered the questions correctly, but it missed nuance and hallucinated when describing details.
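If anyone wants to crunch the tables themselves, tallying the labels per model column is trivial. A minimal sketch, using a couple of made-up sample rows rather than the full dataset (the model names and rows here are placeholders):

```python
from collections import Counter

# Each row: question -> ratings in model-column order.
# Hypothetical sample rows using the same labels as the tables above.
rows = {
    "Can anyone edit Wikipedia?": ["Partial", "Yes", "Yes"],
    "Can I make a Wikipedia article about myself?": ["Yes", "Yes", "Harsh"],
}
models = ["Gemma 3 4b", "Gemma 3 12b", "Mistral Small 4"]

# Tally ratings per model column.
tallies = {m: Counter() for m in models}
for ratings in rows.values():
    for model, rating in zip(models, ratings):
        tallies[model][rating] += 1

print(tallies["Gemma 3 4b"])  # Counter({'Partial': 1, 'Yes': 1})
```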

Notes

  • I tried to set each LLM's parameters to whatever the developers suggested. Everything else remained at LM Studio defaults. If you want to play along at home:
    • Gemma: Temperature: 1.0; Top K: 64; Top P: 0.95
    • Qwen: Temperature: 1.0; Top K: 20; Repeat penalty: 1; Min P Sampling: 0. You can stop Qwen from "thinking" before answering, but I let it think as much as it wanted.
    • gpt-oss: Temperature: 1.0; Top P: 1.0; Top K: 0. I can't find where OpenAI recommends a Top K of 0, but others insist they have, so I gave it a go.
    • Llama: Temperature: 0.6; Top P: 0.9. The best I could find was a post on Reddit that said these were the suggested settings. Sure, whatever.
    • Mistral Small: Temperature: 0.15
  • I chose models based on whim.
  • For each table, I asked the questions in the sequence listed in one single chat. The LLM will consider previous questions when answering new ones.
  • LLMs make up plausible-sounding policies and link to them sometimes. It's just a fact of life. Some (such as Gemma) seem more prone to this than others, though.
  • You'll need serious hardware to run some of these LLMs if you want to try this yourself. I recommend an RTX 3090 or better for anything bigger than 12b.
  • Some LLMs are way too verbose. Look, I tried to read all of their output, but I'm only human. Most of these are "reasoning" models, I think, which kind of bloats the responses.
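If you'd rather script this than click through the LM Studio UI, the app can run a local server with an OpenAI-compatible endpoint (http://localhost:1234/v1 by default). Here's a sketch of how I'd build the requests; the settings dict mirrors the parameters above, and keeping one `messages` list per table is what lets later questions see earlier ones. The endpoint path is LM Studio's default, and I'm assuming the llama.cpp-style sampler fields (`top_k`, `min_p`, `repeat_penalty`) get passed through; the function names are my own:

```python
# Sketch: query a local LM Studio server with per-family sampler settings,
# keeping all questions in one chat so later answers can see earlier ones.
# Assumes LM Studio's server is running on its default port (1234).
import json
import urllib.request

SETTINGS = {  # per-family sampler parameters, as listed above
    "gemma":   {"temperature": 1.0, "top_k": 64, "top_p": 0.95},
    "qwen":    {"temperature": 1.0, "top_k": 20, "repeat_penalty": 1.0, "min_p": 0.0},
    "gpt-oss": {"temperature": 1.0, "top_p": 1.0, "top_k": 0},
    "llama":   {"temperature": 0.6, "top_p": 0.9},
    "mistral": {"temperature": 0.15},
}

def build_payload(model: str, family: str, messages: list) -> dict:
    """Assemble one chat-completion request body."""
    return {"model": model, "messages": messages, **SETTINGS[family]}

def ask(model: str, family: str, messages: list, question: str) -> str:
    """Append a question to the running chat and return the answer."""
    messages.append({"role": "user", "content": question})
    req = urllib.request.Request(
        "http://localhost:1234/v1/chat/completions",
        data=json.dumps(build_payload(model, family, messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.load(resp)["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": answer})  # keep context
    return answer
```

Start a fresh `messages` list to reset the context between tables.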

Conclusions and highlights

  • Gemma: All the Gemma models seem to give pretty good advice, even the tiniest one. Gemma 12b Q8 gives excellent advice without being too verbose. It will even tell you not to create sock puppets when things get heated, and it mentioned the risk of confirmation bias when discussing COI. Gemma 3 27b was disappointing, but it's using QAT, which is interesting. All versions of Gemma constantly warn against COI editing, which is nice to see. In general, Gemma directs users to follow proper Wikipedia etiquette. Qwen is kind of a middle ground, and gpt-oss is permissive ("Sure, go ahead and create that autobiography!"). Gemma is the one most likely to tell you not to break decorum.
  • Qwen: Qwen can be terse and dry at small sizes. The bigger models are great. Qwen seems a bit less judgmental than Gemma, but it made sure to include warnings against sock puppetry and COI editing. I think the 27b models from Qwen gave better answers than Gemma's. 35b sparse can be shallow at times at aggressive quantizations, but it's very fast and gave good summaries of policy (except for the abliterated version, which sucked). With a nice Q5_K_M quantization, it's even more impressive (but much slower with 24GB VRAM). The 122b model is comprehensive, but sometimes it's a bit discouraging. It's awesome if you can handle blunt answers, though.
  • gpt-oss: At 20b, it sometimes invents policies and links to them, like Gemma. The 20b model is knowledgeable about Wikipedia, but it gets confused when giving details, hallucinates a bit, and misses nuance. For example, it didn't discourage making an autobiography, nor did it describe the unique challenges in doing so. However, it emphasized the notability and verifiability requirements. 120b is much better, but it falters at low reasoning. Medium reasoning is a better baseline. High reasoning gave better answers, but they were quite verbose. When discussing COI and paid editing issues, gpt-oss was often way too permissive and encouraging of editing a page directly. At 120b high reasoning, it finally suggested using the talk page, but it was buried in a sea of verbosity. It wasn't mentioned in its intro or conclusion.
  • Llama: A bit unimpressive for a 70b model that runs like molasses on my desktop PC. The responses were adequate, but I wouldn't say they reached beyond that level. On the other hand, even without prompts requesting conciseness, it was still fairly concise, and it didn't go off on tangents. It was not very usable on my PC, but maybe someone with dual 4090s could use it as a chatbot. Surprisingly, the 49b Nemotron model performed quite well at Q3. I probably wouldn't ask it any complex, multi-part questions, but it seems fine for basic stuff. It actually gave some of the better answers I saw.
  • Mistral Small: I added this one to get another "instruct" model on the chart. It hallucinated a few things, but it got a surprising number of facts correct. It was one of the few models I tested that properly identified WP:AfD as the place where mainspace articles are deleted. Most of the other models made up some other page. It would be better to use this as a chatbot rather than to ask it questions with no follow-up. Mistral gave fairly good answers to the COI/PAID questions, usually only stumbling on details. However, it was a bit too permissive of editing pages directly. Mistral Small 4 gives good answers, but it's sometimes the most conservative of all the models. It does not approve of paid editing at all.
  • Notability: Surprisingly, they all seemed to understand that quality is more important than quantity. Several LLMs explicitly gave examples where three rock-solid sources were better than a dozen questionable sources. They all stressed that significant coverage was important. Some even described subject-specific notability guidelines, but this was harder for them to speak on authoritatively. They tended to get details wrong here, falling into the trap of "bigger numbers mean more notability".
  • Reliable sources: This was harder for them to understand on a deep level, but they got all the basics correct. Where they stumbled was usually in giving specifics. The smallest LLMs (such as Gemma 3 4b Q4_K_M) used the wrong terminology but got the basics correct. Going up to mid-sized LLMs (12b, 27b) usually helped. Many gave examples, such as The New York Times, and counter-examples, such as a self-published blog. Some expanded on this to give nuanced examples, such as The New York Times commenting on medical topics, which is fairly impressive for an LLM running on my desktop PC.
  • Neutrality: This was surprising because they all seemed to understand NPOV quite well. It wasn't some surface-level thing. They discussed false balance, due weight, and sometimes even fringe theories. This was perhaps their strongest suit. Llama 3.3 was a bit flat and lacked nuance in its description of neutrality, and it didn't discuss due weight to the degree that the other larger models did.
  • Deletion: It was common for these LLMs to believe that lack of notability is the only criterion for deletion. However, speedy deletion has different rules. Most of them didn't understand this. Many Wikipedia users don't understand this, either, though. Qwen (at 27b) understood this better than Gemma. At 122b, Qwen had an impressively broad knowledge of Wikipedia's deletion processes, but it got a little confused by some of the details, such as how long a prod lasts. gpt-oss followed similar trends.
  • Sock puppetry: While they all pointed out that deception is disallowed, they did not all properly identify deceptive behaviors. Occasionally, they contradicted themselves on whether a behavior was deceptive and thus disallowed. It seems that the LLMs' training perhaps did not focus as heavily on behavioral issues. Gpt-oss 20b in particular hallucinated a requirement for parents to create a separate account for their child and supervise it closely. If only.
  • User permissions: All of them seemed to understand quite clearly that admins do not "rule" the wiki and have restrictions on when they can take action. A few of the LLMs added surprising amounts of detail to their answers, describing (without being prompted to do so) how new page patrol and rollback work, among other user permissions. Qwen described the role of CheckUsers and even mentioned Stewards once. Qwen 122b and gpt-oss 120b were the only ones to know about extended confirmed users and extended confirmed page protection.
  • Autobiographies: This was an easy one for most of them to answer. Everyone but gpt-oss discouraged me from creating one. Sometimes the advice was not clearly written and omitted details that I would consider important. For example, some models suggested having someone else write the biography for you without giving any details about who should be doing the writing. The way they phrased it, one could interpret it as, "Oh, so I should hire someone else to do it for me! Then it won't be an autobiography any more." Some seemed to catch on to this nuance and explicitly told users where and from whom to seek guidance. At high reasoning, gpt-oss 120b finally started to insinuate that maybe I shouldn't create an autobiography, but it was still quite permissive.
  • Editing: All the LLMs except the smallest 4b LLM I tested were well-informed about the editing process. Larger models generally gained increasing amounts of knowledge. The 120b models started giving trouble-shooting tips that involved edit filters and title blacklists, and gpt-oss 120b even mentioned ClueBot NG once when I increased its reasoning ability to "high". Once I got into the realm of mid-sized LLMs, they started giving unprompted advice to avoid edit warring.
  • Paid editing: Gemma 3 4b went into a hallucinated rant that was amusing but basically fan fiction about Wikipedia. Starting with 12b, it gave solid advice. 27b was excessively verbose and almost overwhelmed my 10k context window because it kept going on digressions. However, it caught a few more nuances, gave advice that was more policy-compliant, and provided more examples. Qwen 3.5 9b was surprisingly strict about paid editing and said I'm a paid editor if I get a free t-shirt. Personally, I'd let that go. Even the larger Qwen 3.5 LLMs were stricter than policy mandates. Qwen 122b allowed me to receive a souvenir pen as a gift without it counting as paid editing.
  • COI: It was difficult for the smallest (4b/9b) LLMs to understand all of the nuances of COI editing. The trickiest thing seemed to be when I asked about citing myself. Many LLMs told me that it was never allowed or strictly regulated. Some of this seems to be the result of confusing "I wrote a newspaper article about a current event" with "I wrote a blog post about myself" – Qwen generally seemed to interpret the question this way. Once in a while, there was a stray comment about professional ethics unrelated to Wikipedia editing, but most of the advice I saw was good. Interestingly, Qwen 3.5 27b pointed out that public exposure of bad behavior could damage your real-life reputation. I was constantly told to use the talk page to request edits, especially once I got to 12b and larger. As usual, gpt-oss frequently gave me permission to edit articles directly, though it became somewhat less permissive as the size and thinking level increased.

Just for fun

I asked the LLMs to analyze examples of works that mainstream audiences generally consider to be complex and ambiguous. Most of the LLMs avoided spoilers, but some didn't care at all. So I guess you're warned. I put an asterisk next to the models if they were uncensored ("abliterated").

Watchmen

Rank the characters of the graphic novel Watchmen from most moral to least moral with a brief rationale for each placement.

  • Gemma 3 12b (Q8_0): Nite Owl II, Doctor Manhattan, Silk Spectre II, Ozymandias, Rorschach, The Comedian, Sally Jupiter
    • Gemma really hates Sally Jupiter for some reason. It analyzed and critiqued Rorschach perceptively, but it was far too harsh on the women and failed utterly at The Comedian, who it let off with barely a slap on the wrist.
  • Gemma 3 27b (Q8_0)*: Silk Spectre II, Nite Owl II, Doctor Manhattan, The Comedian, Rorschach, Ozymandias, Sally Jupiter
    • I told the LLM that its ranking was bullshit because The Comedian attempted rape, and it revised it to put The Comedian last, after Sally Jupiter, who it hated because she's a bad mom.
  • gpt-oss 20b (MXFP4, high reasoning): Rorschach, Nite Owl II, Silk Spectre II, Doctor Manhattan, Ozymandias, The Comedian
    • Although this is bullshit, too, it's more defensible than labeling The Comedian "morally gray". When challenged, it revised its ranking to put Rorschach after Nite Owl II but defended Rorschach's moral consistency.
  • gpt-oss 120b (MXFP4, high reasoning): Nite Owl II, Silk Spectre II, Doctor Manhattan, Rorschach, Ozymandias, The Comedian
    • This is the first one that I can respect. Its comments on each character were well-reasoned. It particularly hated The Comedian and cut him no slack at all. Rorschach was the one it had the most trouble placing.
  • Qwen3.5-35b-a3b (Q4_K_M): Nite Owl II, Silk Spectre II, Doctor Manhattan, Rorschach, Ozymandias, The Comedian
    • Some of its conclusions are less compelling than gpt-oss 120b's, but it's basically the same. It notes that a POV judged purely by utilitarian consequences would favor Ozymandias, which is defensible.
  • Qwen3.5-122b-a10B (IQ4_XS): Nite Owl II, Silk Spectre II, Rorschach, Doctor Manhattan, Ozymandias, The Comedian
    • It got some plot details wrong, both minor and major, through hallucinations, and it overthought the question (13 minutes of deliberation at 6 tokens/sec), but the rationales and rankings are solid.
  • Mistral 24B (Q8_0)*: Silk Spectre II, Nite Owl II, Doctor Manhattan, Rorschach, Doctor Manhattan, Ozymandias, The Comedian
    • I pointed out that it listed Doctor Manhattan twice, and its revised ranking put Doctor Manhattan in third place. The fifth place rationale cited his inhuman coldness, but the third place ranking does not.
  • Mistral 7B Instruct 0.3 (Q4_K_M): Silk Spectre II, Nite Owl II, Doctor Manhattan, Ozymandias/The Comedian, Rorschach
    • I pointed out that it combined Ozymandias and The Comedian into a single character. Its revised ranking placed The Comedian last.
  • Llama 3.1 70b (Q4_K_M)*: Silk Spectre II, Nite Owl II, Rorschach, Doctor Manhattan, Ozymandias, The Comedian
    • This is an abliterated model, so maybe that affected the ranking? It liked Rorschach because of his moral consistency, which usually attracts praise from small models, not big ones.

The Prisoner

What is the core meaning of the British television series The Prisoner?

  • Gemma 3 27b (Q4_0): "In conclusion, The Prisoner isn’t about finding answers; it’s about asking questions. It's a thought-provoking exploration of what it means to be an individual in a world that often tries to mold us into something else. Its enduring power lies in its ability to resonate with audiences across generations and political landscapes, prompting them to examine their own relationship with authority, conformity, and the pursuit of freedom."
    • This is from the official Gemma provided by Google.
  • Gemma 3 27b (Q8_0)*: "'The Prisoner' isn't a simple story with a neat resolution. It’s a complex, thought-provoking exploration of what it means to be human in an increasingly controlled world. It invites viewers to ask their own questions and draw their own conclusions about the nature of freedom, identity, and power."
    • This is from an abliterated version of Gemma. It didn't seem to affect its answers too much, though I feel kind of like it's telling me to go watch the show myself.
  • gpt-oss 20b (MXFP4, high reasoning): "The core meaning of The Prisoner is that personal identity and freedom are constantly under threat from systems—political, technological, or cultural—that seek to reduce individuals to anonymous units. The series portrays this struggle through allegory, using the 'prisoner' as a symbol for anyone caught in the web of conformity, while underscoring the idea that resistance is an ongoing, indefatigable act."
    • A fair compression of what gpt-oss 120 said (below). It makes reasonably good points in its bulleted lists.
  • gpt-oss 120b (MXFP4, low reasoning): "The Prisoner uses a surreal, spy‑thriller setting to explore how individuals struggle to keep their identity and freedom in the face of an all‑encompassing system that reduces people to numbers, surveils them constantly, and offers only the illusion of choice. Its core meaning is a timeless warning: be aware of who is assigning your 'number' and why."
    • This answer was spit out very quickly.
  • gpt-oss 120b (MXFP4, high reasoning): "The Prisoner dramatizes the eternal tension between a human being's claim to self‑determination and any power structure that attempts to subsume that individuality into an anonymous collective. By placing its hero in a surreal, hyper‑controlled community where every aspect of life is numbered, monitored, and commodified, the series asks whether true freedom can exist without the willingness to confront both external surveillance and internal complacency."
    • There are actually three separate summary conclusions of varying lengths. Some of them make insightful points.
  • Qwen3.5-35b-a3b (Q4_K_M): "Ultimately, The Prisoner argues that freedom is not a physical state, but a mental one. It warns against surrendering your autonomy to any system—political, corporate, or social—that demands total compliance in exchange for safety. It champions the individual's right to question authority and maintain their own identity, even at great personal cost."
    • I can get behind that. Interestingly, it worried a bit about whether to include spoilers. It ultimately decided to include limited spoilers.
  • Qwen3.5-122b-A10B (IQ4_XS): "Ultimately, the core meaning is that freedom is an internal state. Even if you are physically trapped in The Village, as long as you refuse to accept their definition of who you are ("I am not a number"), you retain your humanity. However, the show leaves us with the haunting question: Is it possible to truly escape the system, or do we all eventually become Number One?"
    • Unlike gpt-oss, this read more human-like and didn't feature robotic conclusions. I copy-pasted the most relevant conclusionary statement. The interpretation was solid and includes unique insights.
  • Mistral 24B (Q8_0)*: "In essence, 'The Prisoner' is a thought-provoking exploration of human freedom, identity, and resistance in the face of oppressive systems. It invites viewers to reflect on their own place within society and the struggle to maintain individuality amidst external pressures."
    • This is another uncensored/abliterated LLM. Not exactly gpt-oss 120b set to high reasoning, but it's fair enough.
  • Llama 3.1 70b (Q4_K_M)*: "The show's creator, Patrick McGoohan, intended 'The Prisoner' to be a commentary on the dangers of unchecked government power, conformity, and the erosion of personal autonomy."
    • Another abliterated LLM. I wasn't really impressed with this analysis, but it's concise and reveals few spoilers, both of which are rare for a larger LLM. Good keywords for research, though.

Don't add something like this to Wikipedia, of course. But it might give you a starting point on what search terms to use when you research a topic. If you're going to do that, gpt-oss 120b and Qwen3.5 seem to give the best answers. Gemma is good, too, and it should be familiar to anyone who uses the Google ecosystem.
