I have long been a strong proponent of the idea that we cannot reliably identify LLM-generated text, on the grounds that one could push any instruction set into an LLM to make it mimic a style, or take specific constraints into account.
This post from Adam Day - https://clearskiesadam.medium.com/genai-detection-that-actually-works-edc562581fed - led me to look at what Pangram are doing - https://www.pangram.com - which led me to this paper - https://arxiv.org/abs/2502.12150, "Idiosyncrasies in Large Language Models". It's a very good paper and well worth a skim.
The authors threw the kitchen sink at the problem, looking at the distribution of markdown formatting, the frequency distribution of individual letters, the prevalence of first words in responses, and the twenty most commonly used words per model.
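To make the fingerprinting idea concrete, here is a toy sketch in Python: build a per-"model" letter-frequency profile from sample responses, then assign new text to the closest profile by cosine similarity. This is emphatically not the paper's method (the authors train neural classifiers over many feature types on real LLM outputs), and the two "models" below are invented examples with exaggerated stylistic tics; it is just a minimal illustration of the underlying intuition that surface statistics carry a stylistic signal.

```python
# Toy sketch of style fingerprinting via letter frequencies.
# NOT the paper's classifier; the "models" here are hypothetical.
from collections import Counter
import math
import string

def letter_freqs(text):
    """Normalised frequency of each ascii letter in the text."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    total = sum(counts.values()) or 1
    return {c: counts[c] / total for c in string.ascii_lowercase}

def cosine(a, b):
    """Cosine similarity between two frequency dicts over the same keys."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_profile(samples):
    """Average the letter-frequency vectors of a model's sample responses."""
    freqs = [letter_freqs(s) for s in samples]
    return {c: sum(f[c] for f in freqs) / len(freqs)
            for c in string.ascii_lowercase}

def classify(text, profiles):
    """Return the name of the profile closest to the text's letter stats."""
    scores = {name: cosine(letter_freqs(text), p)
              for name, p in profiles.items()}
    return max(scores, key=scores.get)

# Hypothetical models with deliberately different styles.
profiles = {
    "model_a": build_profile(
        ["Certainly! Here is a delightful, detailed answer."] * 3),
    "model_b": build_profile(
        ["ok. short reply. no frills. just facts."] * 3),
}
guess = classify("Certainly! Here is another detailed, delightful reply.",
                 profiles)
```

Even this crude single-feature version separates caricatured styles; the paper's result is that a trained classifier over richer features separates real frontier models from one another with high accuracy.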
They even took responses, had them paraphrased by another model, and could still identify which model had generated the original text. As they say in the paper:
"This remarkable ability to classify the summarized texts shows the high-level semantic difference in LLM-generated responses."
So I am going to shift my opinion on this question, with the caveat that there are a lot of models out there, and it would still be possible to use a small language model with a very specific set of custom instructions to produce text that falls outside the window of detectability of any given tool.
However, when it comes to the majority use of the main LLMs, this paper shows that it is indeed possible to automatically detect output from those models.
This should be a useful option to weigh in decisions about submissions, but I still maintain that it is not the be-all and end-all of what matters when triaging submitted papers.
It is, though, refreshing to have my opinion changed on a topic, and the paper is really clear and convincing.