Scientific code quality

I was recently talking to a friend pursuing a physics Ph.D. about bad scientific code quality and the possible reasons behind it. We came to the conclusion that it's most likely equal parts culture, motivation, and experience. What follows is that chat, distilled.

Looking over some scientific code written in Python with a friend doing his Ph.D., I couldn't help noticing how bad its quality was. This in turn led to a discussion about coding culture and a biased scientific stance on coding.

The coding culture

One possible reason for horrible code is that academic code tends, for the most part, to be throwaway code. In academia you write chunks of code (sometimes fairly big chunks) to test a hypothesis or to prove that some algorithm works, but rarely if ever will anyone, most likely including the person who wrote it, read or modify it again. That in turn means it doesn't make any sense to optimize for maintainability, weed out bugs, or use proper engineering practices such as code review, documentation, tests, CI/CD flows, etc.

This coding culture goes hand in hand with...

Scientific upbringing

Maths (and by extension physics and other hard sciences) is terse, so it's natural for someone heavily steeped in math to write code that resembles it (I also think one reason Python was adopted is that it looks more like math than other languages do). That means terribly unreadable code for anyone not intimately acquainted with the problem space, so you'll see lots of single-character variables in functions that look like submissions to a code golf competition.

To give even more context to terse variable names: sometimes they are certainly cultural (e.g. fhat = ... for Fourier coefficients), but more often than not scientists are using variable names taken straight out of a paper. This makes sense, as it makes it easier to follow the code in parallel with the paper. There's little chance you'll read a paper about a quantity x_ij and name it something else; there's usually just no point. Even if another name read more smoothly in the code alone, you would lose the readability that comes from matching the paper.

Of course, when the quantity is something physical and simple, using a descriptive name is not a big deal, like temperature instead of T, but often these variables represent some really random thing that would be silly to name. I mean, if we're talking about a second-order mixed partial, there's no getting around it: it's getting named f_xy, code reviewers be damned.
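To make the f_xy case concrete, here is a minimal sketch of what such code might look like. The test function f, the step size h, and the helper name mixed_partial are all made up for illustration and don't come from any particular paper; the point is only that the result naturally ends up in a variable called f_xy, with a comment tying it back to the formula.

```python
import math


def f(x, y):
    # Hypothetical test function, chosen only because its mixed
    # partial is known in closed form: f(x, y) = sin(x) * exp(y)
    return math.sin(x) * math.exp(y)


def mixed_partial(f, x, y, h=1e-4):
    # Central-difference estimate of d^2 f / (dx dy) at (x, y):
    #   f_xy ~ [f(x+h, y+h) - f(x+h, y-h)
    #           - f(x-h, y+h) + f(x-h, y-h)] / (4 h^2)
    # h is an assumed step size, not a tuned value.
    return (
        f(x + h, y + h) - f(x + h, y - h)
        - f(x - h, y + h) + f(x - h, y - h)
    ) / (4 * h * h)


x, y = 0.5, 0.2

# Named after the notation, exactly as it would appear on paper.
f_xy = mixed_partial(f, x, y)

# Analytic value for this particular f: cos(x) * exp(y)
f_xy_exact = math.cos(x) * math.exp(y)

print(f_xy, f_xy_exact)
```

Calling the result something like second_order_mixed_partial_derivative would arguably make the line harder to check against the math, not easier. The comment carrying the formula is probably what does most of the work for a reader who doesn't have the paper open.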

This means that once you know the underlying math, a lot of scientific programming is pretty straightforward. There usually aren't a lot of interconnected moving parts, so scientists never really need to discover the design patterns, standards, and best practices common in the software world, which takes us back to the coding culture.