What if everything interesting in LLMs is already an AI Safety problem? or Security as a way to box in LLM behavior
I spend most of my time thinking about how to conceptualize LLMs, understand their behavioral dynamics, and measure what they are capable of. I want to find clear ways of thinking about LLMs: do instruction-tuned LLMs just have low diversity, or are they really more consistent than base LLMs? I want to know how LLMs propagate information and adjust to it: if an LLM is finetuned on something about the genetics of all birds, will it be able to answer for a specific bird? What if it needs to condition on some phenotype of that specific bird to know how it should express this new information? I want to know what LLMs can do that humans would recognize as interesting: could we just automate the paperback romance novel industry right now with current models and some carefully designed inference algorithms?
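As a toy illustration of the bird question, here is a sketch of how one might probe it. This is not a description of any actual experiment: the model name, the finetuned checkpoint path, and the probe sentence below are placeholders. The idea is to compare the log-probability a model assigns to a member-level statement (“sparrows carry gene X”) before and after finetuning on category-level statements (“all birds carry gene X”), using the standard Hugging Face transformers API.

```python
# Minimal sketch, assuming a Hugging Face causal LM setup; names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_logprob(model, tokenizer, text):
    """Average log-probability per predicted token that the model assigns to `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return -out.loss.item()  # loss is the mean negative log-likelihood

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # placeholder base model
base = AutoModelForCausalLM.from_pretrained("gpt2")
# Hypothetical checkpoint finetuned on category-level statements like
# "All birds carry gene X in their mitochondrial DNA."
tuned = AutoModelForCausalLM.from_pretrained("./finetuned-on-bird-genetics")

probe = "Sparrows carry gene X in their mitochondrial DNA."  # member-level probe
print("base: ", avg_logprob(base, tokenizer, probe))
print("tuned:", avg_logprob(tuned, tokenizer, probe))
```

If the finetuned model’s score rises on member-level probes it never saw during training, the general fact has propagated; if it only rises on paraphrases of the training sentences, it hasn’t.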
I'm not really an alignment or AI safety person, but a few weeks ago I went to the Bay Area Alignment Workshop hosted by FAR.AI, who very kindly invited me. One thing I found interesting is that AI Safety seems to be escaping its original echo chamber, which focused on very specific kinds of existential risk, even if it might be in the process of forming a new one. (We’ll have to wait and see.)
What feels different this time is the significant outward pressure to keep things open, because everyone is worried about AI in one way or another these days. At the workshop (~160 people), I noticed that the mix of differing opinions allowed for more professed uncertainty and exploratory thinking, much more so than at many of the other events I've attended over the last couple of years. Academic conferences, for all their professed love of the pursuit of truth, are often full of opinions that are held too strongly given the evidence available.
I also observed more traditional security thinking being applied to AI safety at the workshop—identifying unintended information channels, trying to define what contextual integrity would mean in LLMs, and so on. My own thinking about how security approaches might apply to NLP has shifted over the years, and I'm increasingly interested in exploring parallels like viewing LLM interpretability through the lens of traditions such as cryptanalysis.
While benchmarks have been crucial for progress (and we need more of them), we need better ways to construct and validate them. To facilitate that, I think we need frameworks for discussing problems that aren't yet fully formalized.
This is where I found the AI Safety discussions particularly interesting. Consider how NLP, and ML applied to language more broadly, has recently evolved: concepts that might have been dismissed as too informal a few years ago, such as “alignment”, are now part of regular technical discussions. While this might seem like a decrease in rigor, I would argue it’s likely necessary for making progress on problems where we haven’t yet had the chance to familiarize ourselves with the facts of the matter.
Early physicists and biologists had the advantage that they implicitly knew a great deal about the physical and biological worlds. Linnaeus named many of the branches of the tree of life before we knew its roots. But we are empirically bottlenecked: we need to know more about generative models before we can theorize about them properly. In the meantime, having a little “give” in our vocabulary is ok and even good—as long as we admit we’re being fuzzy.
I noticed that while many talks focused on catastrophic risk scenarios like Rogue AGI, the informal discussions often centered on more immediate, practical concerns: How do these models actually behave? What patterns emerge in their interactions? What drives their responses?
This might be connected to the fact that incentive structures varied significantly across participants—academics, large companies, small startups, grantmakers. The combined effect was that it felt like a special moment where AI Safety might become a meeting ground for people with different stakes in AI, if the folks already in the community maintain the kind of openness I saw on display. Folks across these spaces are all worried about these systems—possibly to different degrees and from different angles—but with an understanding that communication with a broad audience is now required for progress to be useful.
AI Safety's expansion has created space for less formal theorizing—not by lowering standards, but by acknowledging the need to start with rough concepts and refine them. This tension feels productive. As more researchers enter the field and AI's impact becomes more tangible, there's growing acceptance that both rigorous foundations and reasoning about partially formed ideas can contribute to fleshing out (a) what's going on and (b) what we want to happen. This contrasts with the status quo, which I think has been far too engineering-focused. (Answering “What increases efficiency?” does not directly describe capabilities, even if it sometimes elucidates them!) I also think many of these communities have been surprisingly complacent, often insisting on working only with classical, well-defined tasks that no longer address the actual uses of this incredibly new technology or the questions we have about it.
The combination of theory-building about LLMs and security mindset seems particularly promising because it encourages thinking at the right level of abstraction. Instead of just solving isolated technical problems, it pushes toward understanding the equilibrium of a complex game: How do multiple actors interact? What drives their behavior? What invariants hold across different scenarios? What are the fundamental limits of oversight?
Many researchers have been studying these things for a while—but it feels like a moment where everyone can suddenly see the relevance and buy into this way of viewing things. Safety provides not only a natural meeting point to talk about these problems, but also an environment that forces us to put skin in the game when it comes to our interpretations and conceptualizations of LLMs; to reason about them not just in toy settings but in a rough and interactive real world with tangible risks. I'm starting to believe that thinking about things in security terms might be the path towards a behavioral theory of LLMs that's actually useful, because “boxing in” systems safely is the right level of abstraction for this new animal we’ve discovered through gradient descent. For me, personally, that's a huge update.