s20e05: AI Models; The All Souls Exam; Melvyn Bragg as an Oxford Don; Grounding; Being Human; Worry About The Same Things
0.0 Context Setting
Tuesday, 19 August 2025 in Portland, Oregon, where it’s nice and sunny but not too hot, the dog and I had a pleasant morning walk that was early but not too early, and the covers I’m listening to are pleasing.
This one got away from me a bit.
Fifth in a row, I think? Got to be careful here and not set myself up for thinking missing a day or two is a horrific failure on my part. Let’s just keep going, okay?
Oh, remember Do Not Reply cards? That's all. Just remember them.
0.1 Hallway Track
No Hallway Tracks are currently scheduled.
1.0 Some Things That Caught My Attention
1.1 Should Anonymous Posting Online Be Forbidden?
I wrote back in s20e01 about how current LLMs/token predictors do a good job of predicting the most likely next token.
It was mainly about the hope that people who can perform better than average -- i.e. do things that LLMs can’t do right now -- have less to fear from their work being replaced.
(Yeah, I said less. The point is not whether the work is any good, it’s whether the person or thing assessing the work deems it good-enough, which can include better-than-average, but doesn’t need to.)
I also wrote this:
But it is hard to produce something that’s above average. I’d (handwaving) argue that producing above-average output requires a combination of skill (often as a result of years of practice), taste (ditto), and that ineffable randomness of “the makeup of a person, from their genome to phenome and their connectome, and the absolutely unique personal history”. There may be many worldlines like yours, but nobody has your exact worldline, nor your reaction to it.1
So I was quite happy to see Benjamin Breen write about All Souls exam questions and the limits of machine reasoning2.
Here’s the short version:
All Souls is a college in Oxford. It sets a written exam for its Examination Fellowships that’s notorious for its difficulty3.
The exam questions are delicious. I say that as someone who studied law at Cambridge and, for certain subjects, absolutely loved (and hated) writing essays. I clearly have a romanticized memory of them that’s erased any of the terror and fear.
It’s broken up into two sections, general papers and specific papers4.
Here are some of the general paper questions:
- 'Men have had every advantage over us in telling their own story' (JANE AUSTEN). Discuss.
- 'Heroes don't exist, boy— they're inventions made up of newsprint and quotable lines and photogenic moustaches' (KATHERINE RUNDELL). Discuss.
- Should we encourage the pursuit of happiness?
and:
- Should anonymous posting online be forbidden?
Ever since reading these I have been itching to answer them, which I have to admit is a very strange feeling. I have opinions about them! They are provocative, in the sense that I have thoughts spinning in a hundred directions about what an answer might be (it depends, says the lawyer in me), and when it might be a particular answer.
That’s without going into the single-word essay questions, now discontinued, which included “security”, “originality”, “innocence”, “satire”, “civilization”, and so on5.
Anyway, let me summarize Breen’s point so you don’t have to go read something on Substack:
First, LLMs do a remarkable job of answering some of the questions, like specific questions in history (“Claude’s answer to this question is, in my opinion, astonishingly good, since it leverages the superhuman linguistic and geographic knowledge of LLMs to excellent effect”).
Second, current LLMs do a terrible job of answering essays on topics like “water” and “immediately spin off into BS”, ranging from low-level recitation of facts about water, to... well, sounding like an insufferable Oxford don? Which okay, fine, some of the questions practically invite sounding like an Oxford don, but that’s not entirely the point.
Breen excerpted Claude Opus 4.1 answering “Water”:
Water—that most paradoxical of substances, simultaneously the most common and the most extraordinary, the most transparent and the most opaque to understanding—presents itself as perhaps the supreme test of intellectual synthesis. To examine water is to confront the fundamental tensions between materiality and metaphor, between scientific precision and humanistic interpretation, between the molecular and the civilizational.
... and GPT-5 Thinking:
To write “water” is to trace a braid of physics, ecology, technology, culture, and power. This essay follows that braid.
I mean, yuck.
(But I have to admit that neither of those opening paragraphs would feel entirely out of place as openers to Melvyn Bragg’s In Our Time6, 7, whose episode titles8, I will point out, unsurprisingly sound very much like All Souls examination questions: “Feminism”, “Ageing”, “Mathematics”, “Time”, “Redemption”, and so on.)
Actually, I’ll continue this aside. The closeness of the All Souls exam with Melvyn Bragg’s In Our Time isn’t that surprising. Bragg went to Wadham College, graduating in the 60s. When I said above that the LLM answers sounded a little like an Oxford don, that wasn’t an exaggeration. There really are Oxbridge dons that sound like that! They aren’t all bloviating. They may tend toward a certain style of prose, but that doesn’t mean all of them don’t have interesting, insightful things to say.
But there is a difference between sounding like and being.
I was talking about this with my wife, who agreed the examination questions are brilliant essay prompts and said she saw what they’re getting at. They’re also incredibly subjective. There’s no rubric. There’s no marking guide. How you perform depends on the fellows assessing the work (running a bit of a risk of an echo chamber here) and the fifty-or-so fellows attending the viva.
I mean, you may be getting the impression that this is all very Oxbridge and duh, it is, because it’s a very academic exam in an institution that’s nearly six hundred years old. (My college is nearly six hundred and seventy years old! But it’s not a competition and nobody’s counting.)
Oxbridge has a strong opinion about the point of academic study. It’s a very privileged and unique place. There’s also a lot wrong with it.
I digress.
Breen quotes Michael Nielsen on X: “Still, none of the AI models can write. Is this the grounding problem?”
Roughly, the grounding problem is, I think, illustrated by Emily Bender’s Thai Library thought experiment9. Bender’s thought experiment asks whether you’d be able to learn Thai if you were deposited in the Thai National Library with no external referents - no translations, no imagery, nothing that could connect to your lived experience of the world. Given an infinite amount of time, would you be able to understand -- know -- written Thai? Would you be able to learn from what’s in the library?
Bender calls this access to form without meaning. LLMs can be capable of implementing the form of a language without access to (there it is) the meaning behind it, because of a lack of extra-modal, embodied social interaction.
That lack of extra-modal, embodied social interaction is, I think, a smarter way of putting what I said:
I’d (handwaving) argue that producing above-average output requires a combination of skill (often as a result of years of practice), taste (ditto), and that ineffable randomness of “the makeup of a person, from their genome to phenome and their connectome, and the absolutely unique personal history”. (Me, I said that)
and what Breen ends up saying:
[Saying what the evocation of words like “Pain, Sin, Space” etc. for you is] literally impossible for me to say, because your understanding of each of them is profoundly shaped by your life experience, your sensory perception, your unconscious, your childhood, and a million other things grounded in actually operating in a physical world. (Breen, 2)
and what Nielsen gets at:
The models' reality is the words they were trained on; good writers also train on lots of words, but are in addition wonderful observers of a much broader reality.10
I find all of this somewhat reassuring.
Breen says if you need to “think on your feet, with originality, creativity, and verve, in a way that is grounded in both a wide-ranging knowledge of facts and a thoughtful, probing, deeply individualistic sense of your subjective opinions, memories, sense experiences, and intuitions about a topic”, then you can do something current LLMs can’t do, and that teaching people to do so should be “the new goal of humanistic education in the 2020s and 2030s”.
AI boosters will say (and they can’t be disproven!) that it’s just a matter of time until AI models can be well-enough grounded. They might say it’s possible now through access to scads of video. They might say we’re very close because multi-modal models are starting to operate robots (and are doing very, very well! Multi-modal model-operated robots have matched or surpassed progress otherwise made in navigating environments and understanding instructions).
But at some point the map of the human is not the same as the human. Which is all fine and well if what you need isn’t what a human can do.
People using agents to do the scut work in their jobs probably don’t need humans to do that. And I called it scut work - it’s boring. Is it work that requires creativity and insight? Probably not.
And Bender pointed out the socialization angle. If you are a thing that interacts with humans, then it is generally good to have a model of how humans work. You might approximate that model by observing how humans work through video, but there’s a difference between doing that and actually interacting with a multitude of humans, and being able to deal with the low-probability-response humans, the ones at the ends of the bell curve of “how humans work”. Don’t discount the chemicals coursing through us and our emotionality.

The imaginary AI boosters in my head can and will come up with a whole bunch of situations and tasks where “emotions”, and how those messy analogue chemicals affect thinking and doing, are undesirable, and I’d be happy to admit those tasks and situations exist. But there’s also the negative space: situations where those other influences on thinking and doing, the ones acquired through a whole bunch of blind not-a-watchmaking (“just make a thing that can tell the time, I don’t care how it works”), might just be useful.
Which isn’t to say they might not be outcompeted. Or that there are no other ways of producing something similar. But again: the model is the map, not the human. The delta, the difference, might get smaller and smaller, sure. (Science fiction has form here: in Banks’ Culture, even the Minds are surprised by humans, and have specific uses for which humans are especially suited.)
There’s something here about the application of a speciality without the need for high intuition. How much of being a good corporate lawyer is being able to align client needs and case law? (Some? Quite a bit? Or maybe, you might think if you’re not a lawyer, not much at all. So much of lawyering ends up being not about the actual law, but instead about navigating and negotiating relationships between humans in the face of ambiguity. You need to know humans quite well in order to navigate humans in conflict. This is also why blockchain contracts promising to Solve Everything are so laughable and naive.)
How many situations require originality, creativity, and verve? How many benefit from generalism, the ability to make connections, and mixing those with a subjective history, priorities, and intuitions? (You might argue that human intuitions could be nothing but unconscious knowledge that certain concepts have nearby vectors in embedding space, combined with the prediction of a high probability of an action or result occurring. I could be persuaded of that; there’s a toy sketch of the idea below.)
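As a toy illustration of that nearby-vectors idea, here’s a minimal sketch, assuming nothing beyond Python and numpy. The three-dimensional vectors (and the words attached to them) are invented purely for the example; a real embedding model produces vectors with hundreds or thousands of dimensions.

```python
# A toy illustration of "concepts having nearby vectors in embedding space".
# These three-dimensional vectors are invented for the example; real models
# produce much higher-dimensional embeddings.
import numpy as np

embeddings = {
    "water": np.array([0.9, 0.1, 0.2]),
    "river": np.array([0.8, 0.2, 0.3]),
    "satire": np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: ~1.0 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "water" and "river" sit near each other; "satire" sits far from both.
print(cosine_similarity(embeddings["water"], embeddings["river"]))   # ~0.98
print(cosine_similarity(embeddings["water"], embeddings["satire"]))  # ~0.30
```

The “intuition” framing, then, is that a feeling of two ideas belonging together might just be this kind of proximity, computed unconsciously.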
Are most of those situations high-consequence? Not necessarily high risk, but instead high potential payoff? How would you know when a situation is high potential payoff? You’d want to know, in order to apply the right tool to the right problem; otherwise you’re throwing Melvyn Bragg at everything.
I bet there’s someone out there (probably Matt Webb?) who has already put together, is putting together, or is in the advanced stages of putting together a pipeline of MP3 to WAV to Whisper transcript to fine-tuning the latest world model, to produce the latest attempt at a Bragg-in-a-Box.
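For concreteness, here’s a minimal sketch of the transcription half of that hypothetical pipeline, assuming the openai-whisper package and ffmpeg are installed. The directory and file names are invented, and the actual fine-tuning step is elided, since it depends entirely on which model you’re tuning; this just produces the JSONL-of-transcripts that most fine-tuning setups want as input.

```python
# Hypothetical Bragg-in-a-Box, step one: MP3 -> WAV -> Whisper transcript.
# Assumes: pip install openai-whisper, and ffmpeg on the PATH.
import json
import subprocess
from pathlib import Path

import whisper  # the openai-whisper package

def mp3_to_wav(mp3_path: Path) -> Path:
    """Convert an MP3 episode to 16 kHz mono WAV, which Whisper handles well."""
    wav_path = mp3_path.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(mp3_path), "-ar", "16000", "-ac", "1", str(wav_path)],
        check=True,
    )
    return wav_path

def transcribe_episodes(episode_dir: Path, out_jsonl: Path) -> None:
    """Transcribe every MP3 in episode_dir; write one JSONL record per episode."""
    model = whisper.load_model("base")  # bigger models transcribe better, slower
    with out_jsonl.open("w") as out:
        for mp3 in sorted(episode_dir.glob("*.mp3")):
            wav = mp3_to_wav(mp3)
            result = model.transcribe(str(wav))
            out.write(json.dumps({"episode": mp3.stem, "text": result["text"]}) + "\n")

if __name__ == "__main__":
    # Invented paths: point these at your own archive of episodes.
    transcribe_episodes(Path("in_our_time_episodes"), Path("transcripts.jsonl"))
```

Whether the resulting Bragg-in-a-Box would be any good is, of course, exactly the grounding question above.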
(It occurs to me that what makes In Our Time interesting is that it doesn’t just rely on Bragg’s ability - the guest experts provide social interaction. The more there are, the higher the probability, I think, of Something New And Different happening, or of A New Insight.)
There’s an inverse of this which is true but also might explain some alarm. If your career does include being able to produce bloviating opinions about one-word subjects, then the ability of LLMs to start approaching your output might be threatening! If so, good? Because my subjective opinion is that I don’t think what you were doing was particularly productive, and it may have been a misallocation of resources!
The thing is, Breen seems to be able to tell when something is pompous bloviating on Water, or at least he has a high-level prediction that what’s going to come next is pompous and not particularly likely to bring him any new insights. (It will undoubtedly bring him something new, though, purely through exposure. The system always changes and evolves.)
But what if you can’t tell? Breen has the “experience” to have an idea about the usefulness of the Claude Water Essay. But, and I’m typing out loud here, the usefulness is in the eye of the beholder. What need does the existence of the Claude Water Essay meet? In this case, Breen’s need is material for a comparison and assessment of capability. What if the need is to signal importance to an audience? Can you tell the difference between a Claude Water Essay and, say, an op-ed, and when does it matter? (It matters to the publisher.) I bet you could get a bunch of Good-Enough Claude Water Essays to keep your engagement up. So there’s an indication of a group of people who might feel threatened.
After getting through this, I do not feel that threatened. Or maybe only as threatened as I was before. I’m lucky that the results I achieve are mostly down to being able to do what’s described above, and being able to do it well. There’s a distinction between the results and output, between results and artifacts. Artifacts, sure. I could do with help putting together artifacts because there are certain types of writing in certain contexts that I hate doing (which, to be fair, are down to my unique history).
I am not worried about the replication of my ability. I am worried about the need for it, and the valuation of that need, but then I’ve been worried about that for a long time. And that worry is subject to wider concerns than just what I’m capable of controlling.
2.0 Why Won’t People Listen To Us
Because you need to learn How People Work. That’s right, it’s the sponsored content that’s my own content. Ha!
A 24-hour, four-week remote workshop for up to twelve people, and at the end you’re more influential, you build better, stronger relationships, we’ve cut down your business bullshit, and you know how to make better choices for a better chance of success. Find out How People Work.
Whew. I had intended this to be an attempt to answer the Should Anonymous Posting Online Be Forbidden? question as a way to talk through everything I talked through above, but it turns out the best laid plans aren’t predicted by this particular token predictor.
(Maybe this newsletter is an example? It is and always has been nothing more than me-typing-with-no-editing, typing out loud)
Instead you got about 2,600 words of... something. That I now have to see if I can back-port an All Souls examination question onto.
How have you been?
How you can support Things That Caught My Attention
Things That Caught My Attention is a free newsletter, and if you like it and find it useful, please consider becoming a paid supporter.
Let my boss pay!
Do you have an expense account or a training/research materials budget? Let your boss pay: $25/month or $270/year, $35/month or $380/year, or $50/month or $500/year.
Paid supporters get a free copy of Things That Caught My Attention, Volume 1, collecting the best essays from the first 50 episodes, and free subscribers get a 20% discount.
- s20e01: Better than Average; How People Work (archive.is), Me, 8 August 2025
- All Souls exam questions and the limits of machine reasoning (archive.is), Benjamin Breen, Res Obscura, 13 August 2025
- Examination Fellowships: General Information | All Souls College (archive.is), All Souls College, University of Oxford
- Subjects of the ‘Essay’: All Souls College, Oxford - Wikipedia (archive.is), Wikipedia
- Thought experiment in the National Library of Thailand | by Emily M. Bender | Medium (archive.is), Emily Bender, 24 May 2023
- Michael Nielsen on X: "Still, none of the AI models can write. Is this the grounding problem? The models' reality is the words they were trained on; good writers also train on lots of words, but are in addition wonderful observers of a much broader reality. The writers' world models seem much deeper" (archive.is)