s19e04: Just Enough World Model; Actually, it’s about Federalism
0.0 Context Setting
It’s Monday, October 28, 2024, and I’m writing this in Portland, Oregon, where just five miles away a ballot box was set on fire, destroying over 200 ballots. So the election is off to a great start.
0.1 Events
Nothing to report at this time.
1.0 Some Things That Caught My Attention
Two things today, one that I thought was going to be short, and another that I really tried to keep short.
1.1 Just Enough World Model
Here, follow my train of thought, which to be honest is a bit of a “greatest hits of stuff that has stuck in Dan’s brain over the decades”1:
- For starters, Douglas Adams’ 1990 BBC documentary Hyperland2. I did a rewatch of Hyperland back in 20143.
- General Magic4, Magic Cap5, and Telescript.6
- The high-level concepts of embodiment, Bayesian reasoning, predictive processing11 as popularised by Andy Clark12, and Karl Friston’s related free-energy principle13 as global theories for how brains think and do
- Large Language Models as a concept to glom onto is a distraction, when what’s really important or interesting is the statistical modeling of token streams7 (i.e. orthogonal agreement with the concept of stochastic parrots8)
- Videoscraping, a method of using a screen recording with a token stream modeler to produce structured data from unstructured video, as demonstrated by Simon Willison9.
- Letting a token stream modeler drive browsers, as demonstrated in Simon Willison’s early use of Anthropic’s Claude 3.5 API, Computer Use10
I tried to figure out if there was a good order to put those things that have stuck around in my head other than the order in which I encountered them, but it turns out that I think it works in terms of building... not an argument, but at least a thing that I think is interesting?
World model is a specific term of art for “a description of an environment so that it (the model) can be used by something else to do things based on that description”.
Despite the name, a world model doesn’t have to be of the world, like, “the world that you and I inhabit and experience”, it can also mean the world of a specific environment or domain, like “law”. Or “what non-cancerous and cancerous cells look like in samples”.
To be really useful though, as useful as possible, you’d probably want a description of as much as possible, because anything we’re interested in is necessarily a thing that exists in, uh, our world.
Large language models -- the parrot part of stochastic parrots, the bit of things like ChatGPT that generates text -- are those models of the world, trained on as much text as their makers can get their hands on.
Sure, okay. But you teased me at the beginning of this with the section header “just enough world model”. I’ll try and be quicker, but there’s still a bunch of foundational concepts, I think.
The way “AI” systems work right now is that you give them a glob of stuff (a whole bunch of copyrighted text that you haven’t licensed, for example), and then you “do a maths on it” so that you can usefully compare words. (And you’ll have your own meaning for “usefully”)
(Actually, it’s not even words. It’s “tokens”, which are fragments of text, which don’t necessarily always have to 1:1 correspond to a word in human language).
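If you want to poke at this yourself, here’s a tiny sketch using OpenAI’s tiktoken library (one real tokenizer among many; other models slice text up differently):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the tokenizer used by several OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Hyperland was surprisingly prescient")
print(tokens)  # a list of integers, one per token

# Tokens don't map 1:1 to words -- watch how the text gets sliced up
for t in tokens:
    print(t, enc.decode_single_token_bytes(t))
```

Run that and a rare word like “Hyperland” will probably get chopped into more than one token, while common words get one each.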
So again, the smart thing is the maths that you do so you can compare those words (again, actually, “tokens”).
Wait. Think of the practice of affinity mapping and how word clouds look (but not how they’re made!). Imagine I give you 100 random words and tell you to group similar words together. That’s super easy for you to do, though you might get frustrated because a bunch of words end up super close to each other and, practically, you don’t have enough space to put them all together. The word-cloud part on top of that is that you might end up with a group of words like queen prince king castle crown heir princess arthur charles elizabeth throne and so on, and on top of that group you might whack a really big word like “royalty” that describes all of them together.
But royalty only describes all of those words together in one particular context, right?
Because say I also gave you a bunch of words like ravi amelia john nyla frank luis bruno and so on, which you recognize as first names. You might want to put those close to where you put arthur charles elizabeth, and then whack “first names” on top of those.
There’s a lot of contexts where words can have different meanings.
If I flew you up to the International Space Station, or somewhere else with very low to zero gravity, then I could give you all those words and you could put them together in 3D space, right? That would give you one more dimension in which to arrange and group the words I give you.
Three dimensions to arrange the words might not be enough, though. Unfortunately it’s relatively hard for us to think of and work in more than three dimensions. That’s a lot of maths.
Wait, one day we cast a spell on sand and figured out how to make it count really really quickly for us.
Maybe a hundred dimensions would do it. That would be better.
No, wait. Maybe a thousand.
No, wait. Maybe ten thousand.
No, wait. Maybe a million.
No, wait. Maybe five hundred million.
No, wait. Make it five billion.
No, wait. Make it seven billion.
No, wait. Make it seventy five billion.
No, wait. Make it one trillion.
No, wait. Make it one-and-a-half trillion.
So now you’ve got as many words as you’ve been able to get your hands on, and one-and-a-half trillion ways you’ve calculated how they relate to each other (how “near” they are to each other), and then how they might be grouped.
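If you want to squint at what “near” means here, a toy sketch with numpy, using made-up four-dimensional vectors (real models learn these values in training, across far more dimensions; nothing below came from a real model):

```python
import numpy as np

# Entirely made-up 4-dimensional "embeddings", for illustration only
words = {
    "queen":  np.array([0.9, 0.8, 0.1, 0.0]),
    "king":   np.array([0.9, 0.7, 0.2, 0.1]),
    "castle": np.array([0.6, 0.9, 0.1, 0.2]),
    "amelia": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    # 1.0 means "pointing the same way"; near 0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(words["queen"], words["king"]))    # high: "near"
print(cosine_similarity(words["queen"], words["amelia"]))  # low: "far"
```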
Let’s say that now you’ve grouped all those words, you also do some maths to statistically predict what the next word might be in a given context.
Ta-da, now you have a large language model.
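A comically small version of that last step, assuming nothing fancier than counting which word follows which in a toy corpus (a real model replaces the counting with all those trillions of learned values, but the “predict the next thing” shape is the same):

```python
import random
from collections import Counter, defaultdict

corpus = "the queen sat on the throne and the queen wore the crown".split()

# "Training": count which word follows which
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    # Sample the next word in proportion to how often it followed before
    counts = following[word]
    return random.choices(list(counts), weights=list(counts.values()))[0]

print(predict_next("the"))    # "queen" half the time -- it followed "the" twice
print(predict_next("queen"))  # "sat" or "wore", 50/50
```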
Do that again, but for pictures.
Do that again, but for, say, how proteins fold.
Do that again, but for, say, music.
And then for all the other kinds of music you can think of that you can also get your hands on.
And then do that again, but for video. Which is even harder, because now you’re also dealing with where a thing is -- like a word -- in time, not just space. Just like with pictures, you probably also want to label as many of those things as possible with words.
Combine all of those together.
Do it again, but for people talking.
All the while, use maths on those models to get them to predict the next token, and then, for the maths or values that get the prediction more right, use those more.
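That “use those more” step, reduced to its silliest possible form: one knob, one right answer (real training nudges billions of knobs at once, but the nudge is the same idea):

```python
# One "knob" (a weight) that turns an input into a prediction
weight = 0.1
x, target = 2.0, 10.0  # we want the input 2.0 to predict 10.0
learning_rate = 0.1

for step in range(20):
    prediction = weight * x
    error = prediction - target
    # Nudge the knob in whichever direction made the prediction less wrong
    weight -= learning_rate * error * x

print(weight)  # ~5.0, the value that gets the prediction most right
```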
Now you have a model that, based on what you trained it on, has a bunch of video, audio, and text tokens arranged by, say, similarity, to best produce the next prediction in a stream.
That model is your world model.
Remember, this is all token prediction.
Stochastic parrots are another way of describing “super fancy autocomplete”: your prompt -- the thing you “tell” ChatGPT, or the picture you draw, or whatever -- is the first part, and then the autocomplete is a predicted answer.
Sam Altman thinks that if he uses enough data to create all these models, then the thing that uses the model to predict the next tokens will be “super-intelligent” and we’ll all be nerd raptured.
Ugh, I totally did not intend this to be talking and thinking out loud about the statistical modeling of token streams.
The last thing in my list of “stuff that stuck in my head” at the beginning of this piece was Claude’s new feature of “computer use”.
More or less, you can think of Anthropic having trained a model on “what happens when you use a computer”, where in that case “use a computer” includes “one with a web browser”.
So “computer use” is a (somewhat disguised?) version of “let an AI use a computer”.
(There’s some specifics here in terms of what that computer is able to do -- it’s a containerized Ubuntu image with a browser that you should be able to sandbox, right?)
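For the curious, the shape of that API at launch looked roughly like this -- a sketch based on Anthropic’s beta documentation in October 2024; the model name, tool type, and beta flag below are theirs but may well have changed since, so check the current docs:

```python
# pip install anthropic -- a sketch of the October 2024 computer use beta
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",  # the beta "computer" tool
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    }],
    messages=[{"role": "user", "content": "Open the browser and check the weather."}],
    betas=["computer-use-2024-10-22"],
)

# The model replies with tool_use blocks ("take a screenshot", "click at
# x,y", "type this") -- your code executes those inside the sandboxed VM,
# sends the results back, and loops until the task is done.
for block in response.content:
    print(block)
```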
So now I’m finally at this section’s title.
What’s just enough world model to drive a web browser to do just enough things to be useful for just enough people?
Now remember, nowhere here have I said anything about “understanding” or “consciousness” or “what is it like to be a bat”. This is all statistical prediction based on, very roughly, the similarity of a bunch of “things”, whether those things are text, video, image, or sound.
The only similarity that has been computed, though, is based on the training data. So maybe it’s easier now to understand why a bunch of smart people are totally concerned about what goes into a model.
You don’t need to understand something to predict what might happen next. Computing that is statistics.
People were kind of super surprised at how effective large language models were because they got better at predicting the “correct” stream of tokens for questions like “if Alice has three balls and Bob has seven balls but nine years ago he had sixteen balls, how many balls do Alice and Bob have together?”
(spoiler: right now, they’re not good at correctly completing the answer to that question because the predictions don’t take into account you throwing in a wildcard irrelevant fact)
Anyway.
There’s a whole thing in Enterprise Computing that’s a lot to do with modernizing super old systems called Robotic Process Automation. It’s essentially replacing a human who enters information and presses buttons with, well, a robot (a script!) that will do it for you. You know: click here, put this value in, do these checkboxes, copy that value out there, and so on.
You could, though, try using something that can statistically model tokens to do that instead. (I hope you are fucking smart enough to use it to write a script that can be executed, rather than burning FLOPs every single time.)
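Something like this pattern, say. A hypothetical sketch -- `ask_model_to_write_script` stands in for whatever token stream modeler you’re pointing at the task; the point is that you pay for the model once, not on every run:

```python
from pathlib import Path
import subprocess

SCRIPT = Path("do_the_form_filling.py")
TASK = "log in, copy the invoice total into the spreadsheet, tick the boxes"

def ask_model_to_write_script(task: str) -> str:
    # Hypothetical: call your token stream modeler of choice and ask it to
    # write a plain automation script (Selenium, Playwright, whatever) for
    # this task. Expensive, so we only ever want to do it once.
    raise NotImplementedError

if not SCRIPT.exists():
    # Burn the FLOPs exactly once to get an executable script...
    SCRIPT.write_text(ask_model_to_write_script(TASK))

# ...then replay the cheap, deterministic script every time after that
subprocess.run(["python", str(SCRIPT)], check=True)
```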
I’m genuinely interested in the minimum amount of modeling needed to do “general purpose” web browsing in the least bad way for enough people to start using it.
This approach is exactly the kind of thing the Rabbit R1 was claiming to do with models for each application: the combination of a general model plus a specific model trained on a specific application if needed, and all of that “driving” a phone or a computer and pretending to be a person. Or, you know. Being a bot.
Simon’s example above of videoscraping a bunch of data from his emails so he can use it as structured data feels (perhaps inaccurately!) in my head like some sort of P=NPish problem.
Say you use one of these just-enough models to do your computer/phone using for you for a task:
- it does it faster than you would
- but you still need to check it
Will it always take longer for you to check the answer than it would for you to complete the task in the first place? Your answer is probably going to depend on the nature of the task -- for some jobs you might be happy with some margin of error and it’d be good enough.
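Back-of-the-envelope, the break-even looks something like this (made-up numbers, obviously):

```python
# Made-up numbers: does delegating a task plus checking it beat doing it?
do_it_yourself = 10.0   # minutes to do the task by hand
check_the_answer = 3.0  # minutes to verify whatever the model did
error_rate = 0.1        # how often it's wrong and you redo it yourself

# Expected cost of delegating: you always check, you sometimes redo
delegated = check_the_answer + error_rate * do_it_yourself

print(delegated)                   # 4.0 minutes on average
print(delegated < do_it_yourself)  # True: worth delegating, at these numbers
```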
But thinking aloud, what about that nutso Rabbit example of “book me a plane ticket and a hotel”: the margin or tolerance for error there was that you had a person who genuinely didn’t give a fuck, or didn’t need to care, if the plane ticket and hotel cost a couple hundred dollars more, so long as it was still in their budget (if they set one!). And in that particular example, it would save them time, I bet! So it comes down to trust.
What sort of knowledge does something need about the world -- and I’m not even sure here if I mean “knowledge” -- for you to trust it for certain jobs? What about for “most” jobs, apart from ones you’d still choose to do yourself?
The other part of this is whether “just” the training data of all the text you can get your hands on and all the video and images and sound you can get your hands on (without asking for permission, because you’re a tech leader on a mission) is enough.
If it is enough, then that’s really interesting. I have a suspicion that it will be enough for many people for most tasks -- but say, I don’t know, at a 90-95% success rate or truthiness or whatever.
You might think that type of training data isn’t enough if you’re one of the people who thinks embodiment -- having a physical presence in the physical world -- is important for intelligence and navigating this world. Not just reading about it or watching it or listening to it, but the immediate feedback of navigating and doing things in the world and learning from what happens afterwards.
Now there is a class of problem where lots of people do happen to agree that embodiment is critical: driving a car. Tesla is (attempting to?) train its full self-driving on the data of, I think, every single Tesla out there, because that’s the telemetry all those cameras are feeding back. If you drive a Tesla you’re both mapping the world and delivering the data for a model for driving. Nobody thinks they can create a self-driving car just by training it on whatever videos they can find on YouTube or wherever of people driving cars.
The bet that Altman and company are making is that the type of training data they have is enough to be able to make inferences about things you haven’t seen examples of. Machine learning scientists and engineers get super excited when they achieve things like one-shot learning, where a model is able to make predictions based on just one example. This is great! It’s certainly better than, say, babies (we think), who pay a lot of attention to what’s going on out there and need reminding. Much like adults.
But I suppose my point here is that I don’t think you can make a good enough model of the world just from disembodied information, but I’m also kind of worried that you can? What I’m talking about here is the ability of a model to infer behavior and understand things like intent: the “do what I mean, not what I say” problem that Mickey encounters in The Sorcerer’s Apprentice, the source of “be careful what you wish for”, and the overly-literal cautionary tale for children, Amelia Bedelia.
So. In conclusion, the Oxford English Dictionary defines “world model” as --
just kidding, this isn’t that kind of phoned-in essay.
I’m genuinely interested in what a model is able to infer just through the artifacts and information we have produced about the world we live in, without actually being present in that world. It’s the ultimate “yes actually I am an expert despite never having done the thing because I read about it”.
1.2 Actually, it’s about Federalism
It’s a truth universally acknowledged that someone living in the United States will get annoyed at some sort of government computer system and wonder why they are, nearly universally, so terrible.
So they might ask: well, self, why is this so terrible?
And they might also ask: why is it so hard to make something that works?
And I really do think the answer is the entire point of the United States in the first place: federalism.
There’s that xkcd cartoon14 about having 14 competing standards and people reasonably thinking that you should have one standard that incorporates them all because that would be better. Well done. Now you have 15 standards. The other 14 are still there.
The U.S. wants standards. Everyone wants their own standard.
- Lots of people think it’s bad for the Federal government to set a standard
- The whole point of states is to let groups of people decide what their standards should be too, from “should women be allowed to control their own bodies” to “here’s what counts as a disabled or veteran-owned business”
- Counties get to decide their own standards, too.
- And cities.
- And school boards. Or districts.
- And, and, and, and.
I mean, philosophically, this is a valid point of view!
I think this is the first time I’ve specifically tied the concept of federalism to a bunch of stuff that software engineers will understand. The systems of government are complicated because everyone wants their own rules. LA County wants its own rules separate from San Francisco! And in America: why not?! That’s the point of this country.
Competing standards are encouraged here, seen as an ideological good, and they are why everything is complicated and why getting things to talk to each other is hard, which is pretty much a lot of what government has to do.
You did it to yourselves.
OK, that’s it. Over 3,200 words this time! I’m sorry.
How are you doing? I’m doing... okay.
Best,
Dan
How you can support Things That Caught My Attention
Things That Caught My Attention is a free newsletter, and if you like it and find it useful, please consider becoming a paid supporter.
Let my boss pay!
Do you have an expense account or a training/research materials budget? Let your boss pay: $25/month or $270/year, $35/month or $380/year, or $50/month or $500/year.
Paid supporters get a free copy of Things That Caught My Attention, Volume 1, collecting the best essays from the first 50 episodes, and free subscribers get a 20% discount.
1. Holy shit I feel old ↩
3. Episode Twenty Three: The Difficulty (archive.is), me, Things That Caught My Attention, 24 February 2014 ↩
6. Jim White presentations (archive.is) (Telescript White Papers #1-4) ↩
7. A quote from Andrej Karpathy (archive.is), Andrej Karpathy via Simon Willison ↩
8. On the Dangers of Stochastic Parrots, Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (archive.is), Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Margaret Mitchell, 2021 ↩
9. Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent (archive.is), Simon Willison, 17 October 2024 ↩
10. Initial explorations of Anthropic’s new Computer Use capability (archive.is), Simon Willison, 22 October 2024 ↩
11. Surfing Uncertainty: Prediction, Action, and the Embodied Mind, reviewed by Michael Rescorla, Notre Dame Philosophical Reviews (archive.is), 2017 ↩
12. Surfing Uncertainty: Prediction, Action, and the Embodied Mind, Oxford Academic (archive.is), Andy Clark, 2015 ↩
13. The free-energy principle: a unified brain theory?, Nature Reviews Neuroscience (archive.is), Karl Friston, 2010 ↩