Just finished reading Saha et al. arXiv 2506.07001 on adversarial paraphrasing for AI detector evasion.
Key claim: detector-guided paraphrasing with RoBERTa as the reward signal reduces TPR by an average of 87.88% across Binoculars, Fast-DetectGPT, Ghostbuster, RADAR, and GPTZero. Universal and training-free.
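Note to self on reading that number: my summary above does not record whether the reduction is relative (fraction of the original TPR lost) or absolute (percentage points). A minimal sketch of both readings follows. The function names and the TPR values are hypothetical, chosen only to illustrate the arithmetic, and are not taken from the paper.

```python
from statistics import mean


def relative_reduction(tpr_before: float, tpr_after: float) -> float:
    """Percent of the original true-positive rate that was lost."""
    if tpr_before <= 0:
        raise ValueError("tpr_before must be positive")
    return (tpr_before - tpr_after) / tpr_before * 100


def absolute_reduction(tpr_before: float, tpr_after: float) -> float:
    """Drop in true-positive rate, in percentage points."""
    return (tpr_before - tpr_after) * 100


# Hypothetical (before, after) TPR pairs for two unnamed detectors.
pairs = [(0.80, 0.10), (0.50, 0.05)]

# Averaging per-detector reductions, under each reading of "reduction".
avg_relative = mean(relative_reduction(b, a) for b, a in pairs)
avg_absolute = mean(absolute_reduction(b, a) for b, a in pairs)
```

The two readings can differ a lot when baseline TPR is low, so it is worth checking which one the paper reports before comparing against other work.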
What surprised me: the approach works even on detectors that were trained with adversarial examples baked in. This suggests the signal a detector keys on covers a much narrower region than the space of texts a generator can produce.
Open questions:
- Does this generalize to detectors using surprisal variance (DivEye 2509.18880)?
- Multi-LLM round-robin generation: would mixing 3-4 models in one pipeline lower detection rates even further?
- Token-level homoglyph substitution (SilverSpeak) is easy to detect with Unicode normalization plus confusable-character checks, but adversarial paraphrasing leaves no such forensic signal.
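
On the homoglyph point, a rough detector-side sketch using only the Python standard library. One caveat: NFKC normalization folds compatibility characters (fullwidth letters, mathematical alphanumerics) but does not map cross-script lookalikes such as Cyrillic "а" to Latin "a". The sketch therefore also adds a mixed-script check per word, which is my own assumption and not something taken from the SilverSpeak paper. The helper names `script_of` and `suspicious_words` are hypothetical.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Rough script label: the first word of the Unicode character name."""
    if not ch.isalpha():
        return ""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else ""


def suspicious_words(text: str) -> list[str]:
    """Flag words that mix scripts or that change under NFKC normalization."""
    flagged = []
    for word in text.split():
        # Collect the script labels of all alphabetic characters in the word.
        scripts = {script_of(ch) for ch in word} - {""}
        # NFKC rewrites compatibility forms; any change is a signal.
        changed = unicodedata.normalize("NFKC", word) != word
        if len(scripts) > 1 or changed:
            flagged.append(word)
    return flagged


# "p\u0430ypal" contains CYRILLIC SMALL LETTER A in place of Latin "a".
hits = suspicious_words("p\u0430ypal login")
```

This is a heuristic: it will also flag legitimate mixed-script words and decomposed accented text, so a production check would more likely use the Unicode confusables data than character names.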