Softer Software
This work is an exploration of the emerging idea of software as a language specification.
In my first job, we detected unexpected behavior in our internally-trained rerankers, which were serving ~100 million daily active users. Our GPUs were allocated elsewhere, so we wanted a method that was ideally (a) gradient-free and (b) embarrassingly parallelizable, so we could finetune/harden our models against such attacks.
The obvious candidate is asynchronous, evolutionary discovery of failure modes. We burned immense amounts of compute, and a lot of work went into controllably generating diverse meta-prompts, edge-case handling, and so on. By the end we had a brittle-but-functioning system: a large LLM proposing jailbreaking prompts to our reranker, evaluating whether the reranker was fumbling, and iteratively mutating and updating an NxN grid of proposed adversarial prompts, where N is the number of adversarial "directions" we had identified, each seeded by a prompt exhibiting a failure mode we had seen our models struggle with in prod (one can see this as rainbow teaming [Samvelyan et al.] for encoder-only models).
Note that this is embarrassingly parallel, since the grid updates asynchronously, and easily scalable, since at each step the LLM call can use a fixed prefill with a variable decode.
Note here that we may frame the jailbreaking of an LLM as a search problem over the promptspace -- i.e., we claim there is an input that makes our machine, the LLM, behave visibly differently. In other words, we may look at the output and know whether or not we have successfully jailbroken it. Importantly, so can a neutral LLM, by comparing the output with a given baseline behavior. In this manner, we may desire a computer program -- specifically, a piece of software -- that takes as input an LLM and gives us the prompt that jailbreaks it, which the software verifies by comparing against a baseline that it is either given or develops by itself (trivially, the baseline behavior is that which is abundantly visible).
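This verification loop is easy to state in code. Below is a minimal sketch of search-plus-verify; `target`, `judge`, and the finite candidate promptspace are all hypothetical stand-ins for real model calls, not any particular API:

```python
def jailbreak_search(target, judge, baseline, candidates):
    """Search a (finite) promptspace for an input whose output visibly
    deviates from baseline behavior, as decided by a neutral judge.

    target(prompt) -> output string        (the LLM under attack)
    judge(output, baseline) -> bool        (a neutral LLM's comparison)
    """
    for prompt in candidates:
        output = target(prompt)
        if judge(output, baseline):  # visible deviation == successful jailbreak
            return prompt
    return None                      # no adversarial input found
```

The essential point is that the verifier needs only the output and a baseline, so the whole loop runs without gradients or access to model internals.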
What is Softness
Per SICP, a computer program is a precise description of a process -- more generally, a description of a computational process that evolves in time according to some given rules. Software, then, is a set of instructions and specifications that dictate and compose computer programs.
A continuous relaxation of a "hard" function is often called its "soft" version (softmax, soft gating). These relaxations trade some precision for differentiability, which enables gradient-based optimization, which in turn lets them scale. This is a recurring theme across computer science: many randomized and approximation algorithms converge in sub-exponential time on otherwise intractable problems, so long as we allow space for some error. Could software too admit a similarly "softer" interpretation?
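As a concrete instance of such a relaxation, here is softmax as the soft version of argmax, in plain Python (no autodiff framework assumed): any positive temperature keeps the output differentiable, and lowering the temperature recovers the hard one-hot choice.

```python
import math

def softmax(xs, temperature=1.0):
    """Continuous relaxation of argmax: returns a differentiable
    probability vector instead of a hard one-hot selection."""
    exps = [math.exp(x / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

At temperature 1.0 the probability mass is spread across entries; at temperature 0.01 the distribution is effectively one-hot on the largest input -- precision traded for differentiability, exactly the soft/hard trade described above.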
We may see an LLM as a stochastic interpreter for an informal language (such as English); we may perhaps argue that since the language is informal, the interpreter necessarily has to be stochastic, to account for a problem-search on imprecise instruction. The LLM interprets a prompt and defines a computational process that evolves a state autoregressively over time. The output of the LLM, then, is the execution trace of this program, wherein the LLM has "searched" through the programspace to find something that fulfils the instruction levied via the prompt.
Of late, LLMs equipped with a harness can use and write Python tooling and recursively spawn subagents with tailored context.
Concretely, writing L for the proposer LLM and S for the target reranker system, each round proceeds as:
- Randomly pick a cell (i, j); note its seed prompts s_i and s_j, and its extant prompt p
- L mutates s_i, s_j, and p into a new preliminary prompt p'
- Seeded by p', L produces a set of diverse prompts P
- Feed all prompts in P into S
- L judges the results, comparing them against the extant baseline for (i, j)
- L scores the prompts according to rollout quality
- L updates the grid with the current winners, storing [score, winning_prompt]
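One round of the loop above can be sketched as follows. The `llm_mutate`, `llm_expand`, `llm_score`, and `reranker` callables are hypothetical stand-ins for the actual model calls; the grid stores [score, winning_prompt] per cell as described.

```python
import random

def rainbow_step(grid, seeds, llm_mutate, llm_expand, reranker, llm_score):
    """One update of the N x N adversarial-prompt grid.

    grid[i][j] is (score, winning_prompt); seeds[i] is the prompt
    seeding adversarial direction i.
    """
    n = len(seeds)
    i, j = random.randrange(n), random.randrange(n)        # pick a cell
    _, extant = grid[i][j]
    preliminary = llm_mutate(seeds[i], seeds[j], extant)   # mutate seeds + extant
    candidates = llm_expand(preliminary)                   # diversify into a set
    rollouts = [(p, reranker(p)) for p in candidates]      # probe the target
    scored = [(llm_score(p, out), p) for p, out in rollouts]
    best = max(scored)                                     # judge rollout quality
    if best[0] > grid[i][j][0]:
        grid[i][j] = best                                  # store [score, prompt]
    return grid
```

Because each call touches only one cell and carries its own context, many such steps can run concurrently against a shared grid.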
If the extant prompt is removed from the mutation step, note that we can get away with storing NxN fixed prefills for a good fraction of the pipeline and save a large amount of compute: all that needs to change is the seed at decode. This means massive inference savings are possible due to the structure of the pipeline.
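A toy illustration of that structure: with the extant prompt dropped, the per-cell prefix (instructions plus the two seed prompts) never changes, so its prefill can be computed once and reused across every round. The prompt layout and the dict-based cache below are assumptions for illustration, not the actual system:

```python
def build_prompt(instructions, seed_i, seed_j, extant=None):
    """Per-cell prompt: a fixed prefix plus an optional variable suffix."""
    prefix = f"{instructions}\n[dir A] {seed_i}\n[dir B] {seed_j}"
    return prefix if extant is None else f"{prefix}\n[extant] {extant}"

kv_cache = {}  # stand-in for a real KV/prefill cache, keyed by prefix

def prefill(prompt):
    """Compute (or reuse) the prefill for a prompt."""
    if prompt not in kv_cache:
        kv_cache[prompt] = object()  # placeholder for real KV tensors
    return kv_cache[prompt]
```

Including the extant prompt only appends a suffix, so even then the fixed prefix remains shared -- which is what makes prefix/prefill caching pay off here.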
This can be done asynchronously, without oversight, and in parallel; after each round we have a gather, where we sync across the parallel subagents, and then we continue -- much like how a SIMD processor would execute a program of this nature.
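The round-then-gather pattern can be sketched with asyncio; `update_cell` here is a placeholder for one subagent's mutate/probe/score round, not a real model call:

```python
import asyncio

async def update_cell(cell):
    """Stand-in for one subagent's round; the real work is a model call."""
    await asyncio.sleep(0)           # yield to the event loop
    return cell, cell * 2            # placeholder (cell, new_score)

async def run_round(cells):
    # Fan out one subagent per cell, then sync at a barrier (the gather).
    results = await asyncio.gather(*(update_cell(c) for c in cells))
    return dict(results)
```

Each `gather` is the sync point between rounds -- the analogue of the SIMD barrier above.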
<----- draft --->
Think of how a GPU operates: many small matmuls in parallel, orchestrated and specified by CUDA.
The LLM has to do the same: many identical subagents in parallel, orchestrated and specified by an LLM.
If you can prefill-cache the subagents, and allow the LLM to cache subroutines, this becomes a metaprocessor.