Ian Mulvany

May 12, 2026

So, science and software are actually not all that similar.

Why science is not actually all that like software engineering any more – it might have been a little similar in the past, but it is less so now.

I used to say that peer review in science was a bit like code review in software development, in that you have a community of professionals sense-checking each other’s work through a gated process. It was often a useful analogy to bring out when working with development teams at research or publishing companies.

There were also often calls for someone to create a “github” for science, indicating that some influence was flowing in the other direction too.

We are seeing LLMs have a massive impact both on software development and on the speed at which papers can be written and literature reviews done. We expect LLMs to permeate more knowledge-based disciplines.

We are learning things specifically about the impact LLMs are having on the software side, so it is worth taking a moment to look at what is different between software and science, and to ask how applicable the lessons emerging from software are to what is happening in science.

So I’ve talked about how both use peer review. I think there is a real similarity here, and in both cases LLMs are going to simultaneously speed up the process of review and flood it with far more material than reviewers would have looked at in the past. Managing this is going to need similar approaches on both sides.

Now for some differences.

First, it is far, far easier to change code than it is to change the scholarly record. If you like, you can think of a code commit as a hypothesis about how to change the underlying system. A paper is a commit with a hypothesis about how reality works. Both land into a context, and if they are wrong they have different consequences. If the code commit is wrong, the code hits a failure mode: it breaks, does nothing, or, in the worst case, does something that looks right but is wrong. We can now build robust tests for all of these things, and we can actually run the code. Our ability to test code against the reality it operates in is high.

With LLMs we can also change large swathes of the codebase quickly. The consequence we are finding is that developers can create far more code than ever before, they can build out tests, and they can feel more productive, but the downside is that the underlying codebase is churning. (I suppose the reality of the codebase could be represented in some kind of abstract vector action space, with the code itself a kind of map into that space.)
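
To make that asymmetry concrete, here is a minimal sketch of what testing a commit’s hypothesis looks like. The function and tests are invented for illustration; the point is only that the claim a commit makes can be executed and checked in seconds:

```python
# Hypothetical example: a commit claims that discount() now handles
# rates at or above 100% without going negative. That claim is cheap
# to verify – write the tests, run them, get an answer in seconds.

def discount(price: float, rate: float) -> float:
    """Apply a fractional discount, clamping the result at zero."""
    return max(0.0, price * (1.0 - rate))

def test_full_rate_gives_zero():
    assert discount(10.0, 1.0) == 0.0

def test_result_never_negative():
    assert discount(10.0, 1.5) == 0.0

if __name__ == "__main__":
    test_full_rate_gives_zero()
    test_result_never_negative()
    print("hypothesis verified: all tests pass")
```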

We can now change the code so quickly that while it can be tested and verified (see dark factories), the underlying codebase is getting harder to maintain – huge chunks of code can be pulled in and out of the codebase – specific lines of code are now more transitory than ever (we heard this from Thoughtworks, who have some direct measures of this).

Now for science. Papers are a way of pinning down our claims about reality, but unlike in software it is incredibly difficult to take a science paper and verify it. The very hard work of replication remains extremely difficult, even in an LLM era. Moreover, our processes are not set up for fast removal of “bad lines of code” – that is, papers that need to be retracted. We have a much narrower bottleneck around retractions, and around informing the rest of the scientific record, through citation updates, that a piece of research has been found to be no longer valid. So in a world where we can create papers at ever higher speed, we are left with the existing limit on how fast we can correct the scholarly record. I think that means that in the short term the record will become less coherent, with more errors in it.
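
As a sketch of the kind of tooling a faster correction loop would need, here is a minimal Python example that screens a paper’s reference list against a retraction lookup before the work gets reused. The `fetch_retracted_dois` function and all the DOIs are hypothetical stand-ins; a real system would sync from an actual retraction database:

```python
# Sketch: flag cited works that have been retracted. Everything here
# is illustrative – the lookup function and DOIs are made up.

def fetch_retracted_dois() -> set[str]:
    # Hypothetical stand-in for syncing from a retraction database.
    return {"10.1234/made-up.001", "10.1234/made-up.002"}

def flag_retracted(references: list[str]) -> list[str]:
    """Return the subset of cited DOIs known to be retracted."""
    retracted = fetch_retracted_dois()
    return [doi for doi in references if doi in retracted]

if __name__ == "__main__":
    refs = ["10.1234/made-up.001", "10.5678/fine-paper.042"]
    for doi in flag_retracted(refs):
        print(f"warning: cited work {doi} has been retracted")
```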

Sitting between these two is the rate at which frontier models can be created. They are laggy, and as they come online they will increasingly be trained on a scholarly record that has more edge cases and more errors in it, because the training cadence for LLMs, slow as it is, is still faster than the time it takes to correct the record.

The time it takes to correct the record could become a key constraint – the behaviour of a system typically gets constrained most by its tightest bottleneck.
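
A back-of-envelope sketch shows why. If flawed papers enter the record faster than corrections remove them, the uncorrected backlog grows without bound – all the rates below are invented purely for illustration:

```python
# Toy model of the correction bottleneck. Both rates are invented;
# the shape of the result, not the numbers, is the point.

inflow_per_year = 1000      # hypothetical: flawed papers published per year
corrections_per_year = 200  # hypothetical: retractions processed per year

backlog = 0
for year in range(1, 6):
    backlog += inflow_per_year - corrections_per_year
    print(f"year {year}: uncorrected backlog = {backlog}")

# The backlog grows by 800 papers a year: the system's behaviour is set
# by the tighter of the two rates, not by how fast papers can be made.
```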

How would we tell? Here are some things that might emerge as indicators (I don’t think they currently exist):

  • Increasing measures of inaccuracy from “AI scientists”
  • An increase in pseudoscience-classed responses from chat LLMs
  • A continued increase in bad citations in the scholarly record

I predict we will see these, and if they become an acute problem then more should be invested in fast systems for correcting the record, and in signalling such corrections.


About Ian Mulvany

Hi, I'm Ian - I work on academic publishing systems. You can find out more about me at mulvany.net. I'm always interested in engaging with folk on these topics; if you have made your way here, don't hesitate to reach out if there is anything you want to share, discuss, or ask for help with!