Academics focused on artificial intelligence have taken to using generative AI to help them review their peers' machine learning work.
A group of researchers from Stanford University, NEC Labs America, and UC Santa Barbara recently analyzed the peer reviews of papers submitted to leading AI conferences, including ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023.
The authors – Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, Daniel A McFarland, and James Y Zou – reported their findings in a paper titled “Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews.”
They undertook the study based on the public interest in, and discussion of, large language models that dominated technical discourse last year.
The difficulty of distinguishing between human- and machine-written text, together with the reported rise of AI-generated news websites, led the authors to conclude that there's an urgent need for ways to evaluate real-world datasets that contain some indeterminate amount of AI-authored content.
Sometimes AI authorship stands out – as in a paper from Radiology Case Reports entitled “Successful management of an Iatrogenic portal vein and hepatic artery injury in a 4-month-old female patient: A case report and literature review.”
This jumbled passage is a bit of a giveaway: “In summary, the management of bilateral iatrogenic I’m very sorry, but I don’t have access to real-time information or patient-specific data, as I am an AI language model.”
But the distinction isn't always obvious, and past attempts to develop an automated way to sort human-written text from robo-prose have not gone well. OpenAI, for example, introduced an AI Text Classifier for that purpose in January 2023, only to shutter it six months later “due to its low rate of accuracy.”
Nonetheless, Liang et al. contend that focusing on the use of adjectives in a text – rather than trying to assess entire documents, paragraphs, or sentences – leads to more reliable results.
The authors took two sets of data, or corpora – one written by humans, the other by machines. They used these two bodies of text to evaluate the evaluations – the peer reviews of AI conference papers – for the frequency of specific adjectives.
“[A]ll of our calculations depend only on the adjectives contained in each document,” they explained. “We found this vocabulary choice to exhibit greater stability than using other parts of speech such as adverbs, verbs, nouns, or all possible tokens.”
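To make that concrete, here is a minimal Python sketch of the corpus-level idea – not the authors' actual pipeline: tag each review with NLTK's part-of-speech tagger, keep only the adjectives, and normalize the counts into a frequency distribution per corpus. The toy corpora below are invented for illustration.

# A minimal sketch of the corpus-level approach, not the authors' exact
# pipeline. Assumes NLTK with its tokenizer and tagger models downloaded,
# e.g. nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
from collections import Counter

import nltk

ADJ_TAGS = {"JJ", "JJR", "JJS"}  # Penn Treebank adjective tags

def adjective_frequencies(documents):
    """Return a normalized adjective frequency distribution for a corpus."""
    counts = Counter()
    for doc in documents:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(doc)):
            if tag in ADJ_TAGS:
                counts[word.lower()] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Invented toy corpora, purely for illustration.
human_reviews = ["The experiments are thorough but the novelty is limited."]
llm_reviews = ["This commendable, innovative paper offers a comprehensive evaluation."]

p_human = adjective_frequencies(human_reviews)
p_llm = adjective_frequencies(llm_reviews)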
It turns out LLMs tend to employ adjectives like “commendable,” “innovative,” and “comprehensive” more frequently than human authors. And such statistical differences in word usage have allowed the boffins to identify reviews of papers where LLM assistance is deemed likely.
Figure: word cloud of the top 100 adjectives in LLM feedback, with font size indicating frequency.
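One simple way to surface those machine-favored words – continuing the hypothetical sketch above – is to rank adjectives by how over-represented they are in the LLM corpus relative to the human one. The smoothing floor here is an assumption to avoid dividing by zero for words the human corpus never uses.

def top_llm_favored(p_human, p_llm, n=10, floor=1e-6):
    """Rank adjectives by how over-represented they are in LLM text."""
    ratios = {
        word: freq / max(p_human.get(word, 0.0), floor)
        for word, freq in p_llm.items()
    }
    return sorted(ratios, key=ratios.get, reverse=True)[:n]

print(top_llm_favored(p_human, p_llm))  # e.g. ['commendable', 'innovative', ...]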
“Our results suggest that between 6.5 percent and 16.9 percent of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates,” the authors argued, noting that reviews of work in the scientific journal Nature do not exhibit signs of mechanized assistance.
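That corpus-level percentage follows from treating the observed adjective distribution in the reviews as a blend of the human and LLM distributions, then finding the mixing weight that best explains the counts. Here is a hedged sketch of the idea, with a plain grid search standing in for whatever optimizer the paper actually uses:

import math

def log_likelihood(alpha, observed_counts, p_human, p_llm, eps=1e-9):
    """Log-likelihood of adjective counts under the mixture
    alpha * P_llm + (1 - alpha) * P_human."""
    ll = 0.0
    for word, count in observed_counts.items():
        p = alpha * p_llm.get(word, eps) + (1 - alpha) * p_human.get(word, eps)
        ll += count * math.log(p)
    return ll

def estimate_llm_fraction(observed_counts, p_human, p_llm, steps=200):
    """Grid-search the mixing weight alpha that maximizes the likelihood.

    observed_counts: a Counter of adjectives from the reviews under
    scrutiny, produced the same way as in adjective_frequencies() above.
    """
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: log_likelihood(a, observed_counts, p_human, p_llm))

Fed the adjective counts from a batch of reviews, the estimator returns a single corpus-level fraction, which matches the paper's framing: it says nothing definitive about any individual review.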
Several factors appear to be correlated with greater LLM usage. One is an approaching deadline: The authors found a small but consistent increase in apparent LLM usage for reviews submitted three days or less before the deadline.
The researchers emphasized that their intention was not to pass judgment on the use of AI writing assistance, nor to claim that any of the papers they evaluated were written completely by an AI model. But they argued the scientific community needs to be more transparent about the use of LLMs.
And they contended that such practices potentially deprive those whose work is being reviewed of diverse feedback from experts. What’s more, AI feedback risks a homogenization effect that skews toward AI model biases and away from meaningful insight. ®