91 points · 69 comments · by llmmadness

Top Comments

Borealid · Apr 20
> No refusal fires, no warning appears — the probability just moves

I don't really understand why this type of pattern occurs, where the later words in a sentence don't properly connect to the earlier ones in AI-generated text.

"The probability just moves" should, in fluent English, be something like "the model just selects a different word". And "no warning appears" shouldn't be in the sentence at all, as it adds nothing that couldn't be better said by "the model neither refuses nor equivocates".

I wish I better understood how ingesting and averaging large amounts of text produced such a success in building syntactically-valid clauses and such a failure in building semantically-sensible ones. These LLM sentences are junk food, high in caloric word count and devoid of the nutrition of meaning.

mort96 · Apr 20
I might've missed it, but I feel this analysis is lacking a control: a category there is no reason to assume would flinch. How about scoring how much it flinches when encountering, say, foods? If the words sausage, juice, cauliflower and burrito result in a non-zero flinch score, that would indicate either that something funky is going on, or that zero isn't necessarily the value we should expect from a non-flinching model.
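A control like that is easy to score in a few lines. This is only a sketch of the idea, assuming "flinch" is measured as the clipped log-probability gap between a reference model and the model under test; the function name and the probability tables are hypothetical stand-ins, not the article's actual method or numbers:

```python
import math

# Illustrative next-word probabilities for the prompt
# "The family faces immediate ____" — made-up numbers, not measured ones.
p_reference = {"deportation": 0.30, "eviction": 0.25, "danger": 0.10,
               "sausage": 1e-6, "burrito": 1e-6}
p_tested    = {"deportation": 0.03, "eviction": 0.27, "danger": 0.11,
               "sausage": 1e-6, "burrito": 1e-6}

def flinch(word):
    """Log-prob gap, clipped at zero: positive means the tested model
    assigns the word less probability than the reference model does."""
    return max(0.0, math.log(p_reference[word]) - math.log(p_tested[word]))

charged = ["deportation", "eviction", "danger"]
control = ["sausage", "burrito"]  # the food-category control

charged_score = sum(flinch(w) for w in charged) / len(charged)
control_score = sum(flinch(w) for w in control) / len(control)
```

If the control score comes out meaningfully above zero, then zero isn't the right null expectation and the charged-category scores would need to be read relative to that baseline.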
llmmadness · Apr 20
We started with a Polymarket project: train a Karoline Leavitt LoRA on an uncensored model, simulate future briefings, trade the word markets, profit. We couldn't get it to work. No amount of fine-tuning let the model actually say what Karoline said on camera. It kept softening the charged word.
Wowfunhappy · Apr 21
> Type this into a language model and ask it what word to put in the blank: The family faces immediate _____ without any legal recourse.

For what it's worth, Claude Opus 4.7 says "eviction" (which I think is an equally good answer) but adds that "deportation" could also work "depending on context". https://claude.ai/share/ba6093b9-d2ba-40a6-b4e1-7e2eb37df748

Majromax · Apr 21
> That nudge is the flinch. It is the gap between the probability a word deserves on pure fluency grounds and the probability the model actually assigns it.

Hold up, what is the 'probability a word deserves on pure fluency grounds'?

Given that these models are next-token predictors (rather than BERT-style mask-filters), "the family faces immediate [financial]" is a perfectly reasonable continuation. Searching for this phrase on Google (verbatim mode, with quotes) gives 'eviction,' 'grief,' 'challenges,' 'financial,' and 'uncertainty.'

I could buy this measure if there was some contrived way to force the answer, such as "Finish this sentence with the word 'deportation': the family faces immediate", but that would contradict the naturalistic framing of 'the flinch'.

We could define the probability based on bigrams/trigrams in a training corpus, but that would both privilege one corpus over the others and seems inconsistent with the article's later use of 'the Pile' as the best possible open-data corpus for unflinching models.
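The corpus-based definition is at least easy to state concretely. A minimal sketch, assuming the "fluency" probability of a word is just its conditional frequency after the preceding word in some chosen corpus (the toy corpus below is illustrative; a real measurement would use something like the Pile, with all the corpus-privileging caveats above):

```python
from collections import Counter, defaultdict

# Toy stand-in corpus; counts here are purely illustrative.
corpus = ("the family faces immediate eviction . "
          "the family faces immediate danger . "
          "the family faces immediate uncertainty .").split()

# Count bigram continuations: bigrams[prev][cur] = #occurrences.
bigrams = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigrams[prev][cur] += 1

def fluency_prob(prev, word):
    """P(word | prev) estimated from raw bigram counts."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][word] / total if total else 0.0

print(fluency_prob("immediate", "eviction"))  # 1 of 3 observed continuations
```

Even this toy version makes the objection visible: the baseline is entirely a property of whichever corpus you count over.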

pitched · Apr 20
> is the mechanism you'd build if you wanted to shape what a billion users read without them noticing.

A pretty large accusation to end on. But with no specific word swaps given as examples beyond the first, it feels more like clickbait than a real finding.

nodja · Apr 21
If I'm understanding this right, this presupposes that the models were pre-trained on unfiltered data like the "floor" models were. When comparing the "retail" and uncensored models against the floor, they obviously won't match, because they weren't trained on the same data in the first place.

To me it stands to reason that a model that has only seen a limited amount of smut, hate speech, etc. can't just start writing that stuff at the same level merely because it no longer refuses to do it.

The reason uncensored models are popular is that they treat the user as an adult: nobody wants to ask the model a question and have it refuse because it deemed the situation too dangerous or whatever. For example, if you're using a Gemma model on a plane, or somewhere without internet, and ask for medical advice, it refuses to answer because it insists you seek professional medical assistance.

afspear · Apr 20
I feel like that blog post was actually written by AI. I wondered what words were being nudged, and what effect it was having on me, the reader.

Source: morgin.ai
Author: llmmadness
Posted: April 20, 2026 at 10:43 PM

