srush 4 days ago

Full blog is here: https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling...

Happy to answer any questions about these methods.

  • d4rkp4ttern 2 days ago

    The big question is whether or not o3 is using any type of “meta-generation” algorithm at inference time, i.e. are there multiple invocations of LLM generation at all, or does it generate an insanely long reasoning trace in a single autoregressive stream that somehow implicitly has search-like behavior? In other words, is the search-like behavior learned entirely in post-training and only implicitly exhibited at inference time, or is it explicitly done at inference time?

    Given the enormous compute costs of o3, my speculation has been that search is explicit, but I’ve seen this post from Nathan Lambert, for example, that speculates (in the context of o1) that it’s possible for search to be entirely “baked into” a single-stream roll-out (which would depend on significant long-context innovations):

    https://www.interconnects.ai/p/openais-o1-using-search-was-a...

    If true, this would be extremely interesting.

  • dinp 4 days ago

    Great work! When I use models like o1, they work better than sonnet and 4o for tasks that require some thinking, but the output is often very verbose. Is it possible to get the best of both worlds: the thinking takes place, resulting in better performance, but the output is as straightforward to work with as sonnet's and 4o's? Did you observe similar behaviour with the 1B and 3B models? How does the model behaviour change when used for normal tasks that don't require thinking?

    Also, how well do these models work for extracting structured output? E.g. perform OCR on some handwritten text with math, convert it to HTML, format the formulas correctly, etc. Single-shot prompting doesn't work well with such problems, but splitting the steps into consecutive API calls works well.

    • srush 4 days ago

      That's a good point. We don't see that in our experiments because it's all in the math domain. However, for OpenAI it's plausible that training for o1 might conflict with standard instruction training, leading to a less human-preferred output style.

    • amitness 3 days ago

      OpenAI recommends using o1 to generate the verbose plan and then chaining the verbose output to a cheaper model (e.g. gpt-4o-mini) to convert it into structured data / function calls / a summary, etc. They call it the planner-executor pattern. [1]

      [1] https://vimeo.com/showcase/11333741/video/1018737829
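      A minimal sketch of that chaining, assuming the standard `openai` Python client (the model names and prompts here are just placeholders):

      ```python
      from openai import OpenAI

      client = OpenAI()

      # 1) Planner: the expensive reasoning model produces a verbose plan.
      plan = client.chat.completions.create(
          model="o1",  # placeholder model name
          messages=[{"role": "user", "content": "Plan the steps to convert this handwritten math page to HTML: ..."}],
      ).choices[0].message.content

      # 2) Executor: a cheaper model follows the plan and emits structured output.
      result = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content": "Follow the plan and return only JSON."},
              {"role": "user", "content": plan},
          ],
          response_format={"type": "json_object"},
      ).choices[0].message.content
      ```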

    • dimitry12 4 days ago

      In this paper and HF's replication, the model used to produce solutions to MATH problems is off-the-shelf. It is induced to produce step-by-step CoT-style solutions by few-shot ICL prompts or by instructions.

      Yes, the search process (beam search or best-of-N) does produce verbose traces, because there is branching involved when sampling "thoughts" from the base model. These branched traces (including incomplete "abandoned" branches) can be shown to the user or hidden if the approach is deployed as-is.

  • mccoyb 4 days ago

    In the blog post, learned verifiers are mentioned. Are these learned offline using data, and is the intent to learn a scoring heuristic to help the search?

    • dimitry12 4 days ago

      The verifier is trained with soft values of reward-to-go for each solution prefix, obtained from Monte Carlo rollouts of step-by-step solutions sampled from the "base" model.

      In other words: 1) sample step-by-step solutions from the "base" model; 2) do it at non-zero temperature so that you can get multiple continuations from each solution prefix; 3) use the MATH labels to decide whether a full solution (leaf/terminal node in a MC rollout) has reward `1` or `0`; 4) roll up these rewards to calculate the reward-to-go for each intermediate step.
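      A rough sketch of steps 1-4, with `sample_steps` (step-by-step sampling from the base model) and `is_correct` (the MATH answer check) as hypothetical stand-ins:

      ```python
      from collections import defaultdict

      def reward_to_go_labels(problem, n_rollouts=16, temperature=0.8):
          # prefix -> terminal rewards of all rollouts that passed through it
          returns = defaultdict(list)
          for _ in range(n_rollouts):
              steps = sample_steps(problem, temperature=temperature)  # list of "thought" lines
              reward = 1.0 if is_correct(problem, steps) else 0.0     # terminal label from the MATH answer key
              for i in range(1, len(steps) + 1):
                  prefix = "\n".join(steps[:i])
                  returns[prefix].append(reward)
          # Soft value of each prefix = mean reward-to-go over its rollouts.
          return {prefix: sum(rs) / len(rs) for prefix, rs in returns.items()}
      ```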

      Yes, a verifier trained in this manner can be used to score solution prefixes (as a process verifier) or a full solution (as an outcome verifier).

      In the original paper (https://arxiv.org/abs/2408.03314) they fine-tune a fresh verifier. HF's replication uses an off-the-shelf verifier based on another paper: https://arxiv.org/abs/2312.08935

  • OakNinja 4 days ago

    Excellent and interesting post!

    Minor gripe - the best-of-n | beam search illustration is not compatible with red-green color blindness. I literally cannot see the difference between the Rejected and the Selected dots, even if I zoom in.

    • srush 4 days ago

      Thanks for the feedback, and not minor. Sorry about that.

mentalgear 4 days ago

Happy to see the term "inference-time compute" being used primarily nowadays - it's a much more precise and appropriate term than the unwieldy "test-time compute" that OpenAI used when they thought they had "invented" scaling inference-time compute.

boroboro4 4 days ago

What's the point of such inference-time compute if the verifier is an 8B model itself? Am I missing something?

  • dimitry12 4 days ago

    I believe this is a valid point: HF's replication indeed uses a larger off-the-shelf model as the verifier.

    In contrast, in the original paper, the verifier is a fine-tune of the exact same base model that is used to sample step-by-step solutions (the "solver").

    • boroboro4 4 days ago

      Using a different 1B model as the verifier makes sense, yes. Using a Llama 8B finetune as the verifier, and then comparing inference-time-scaled 1B against plain 8B, makes little sense to me.

      Using a 3B model with an 8B verifier against a 70B model would make sense too. That said, their performance barely crossed the 70B line at 256 samples, which is 256*(8+3)/70 ≈ 40 times more computationally expensive than running the 70B model as-is.
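      For reference, the rough arithmetic behind that factor, assuming cost scales linearly with parameter count times number of generations:

      ```python
      search = 256 * (3 + 8)    # 256 samples through a 3B solver plus an 8B verifier
      baseline = 1 * 70         # a single pass through the 70B model
      print(search / baseline)  # ~40.2
      ```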

      • dimitry12 4 days ago

        "1B solver + 8B verifier + search" beating 0-shot 70B is nice, agree.

        "1B solver + 8B verifier + search" beating 1B-0-shot or 1B-majority as baselines isn't illustrative imo. In other words, by using larger verifier, HF's replication fails to establish a "fair" baseline. Still an awesome blog and release/repository from HF's group - I love it!

    • zackangelo 4 days ago

      Where did you see that? I thought they used an 8B model for their reward model?

      > To guide our search strategies, we used RLHFlow/Llama3.1-8B-PRM-Deepseek-Data, an 8B reward model that has been trained using process supervision

bilsbie 4 days ago

Eli5?

  • srush 4 days ago

    For problems that require multi-step reasoning, standard LLMs seem to be stuck. The field is increasingly interested in models like o1 that output many "guesses" to find the right one. Currently open-source does not know how to do this, but we are reimplementing several possible directions to try. This replicates one important path using search and a verifier model.

  • dimitry12 4 days ago

    To spend more compute at inference time, at least two simple approaches are readily available:

    1) make the model output a full solution, step by step, then induce it to revise the solution - repeat this as many times as you have token budget for. You can do this via prompting alone (see Reflexion, for example), or you can fine-tune the model to do it. The paper explores fine-tuning the base model to turn it into a self-revision model.

    2) sample step-by-step (one "thought" sentence per line) solutions from the model, and do it at non-zero temperature to be able to sample multiple next steps. Then use a verifier model to choose between next-step candidates and prefer to continue the rollout of the more promising branches of "thoughts". There are many, many methods for exploring such a tree once you can score intermediate nodes (beam search is an almost 50-year-old algorithm!).
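    A bare-bones sketch of option 2, with `sample_next_step` (the base model sampled at non-zero temperature) and `score_prefix` (the verifier/PRM) as hypothetical stand-ins:

    ```python
    def beam_search(problem, beam_width=4, expansions=4, max_steps=40):
        beams = [""]  # partial step-by-step solutions, one "thought" per line
        for _ in range(max_steps):
            candidates = []
            for prefix in beams:
                if prefix.endswith("[EOS]"):  # finished solutions carry over unchanged
                    candidates.append(prefix)
                    continue
                for _ in range(expansions):  # sample several candidate next steps
                    step = sample_next_step(problem, prefix, temperature=0.8)
                    candidates.append((prefix + "\n" + step).strip())
            # Keep only the branches the verifier considers most promising.
            beams = sorted(candidates, key=lambda p: score_prefix(problem, p), reverse=True)[:beam_width]
            if all(p.endswith("[EOS]") for p in beams):
                break
        return beams[0]  # highest-scoring (ideally completed) solution
    ```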

  • loudmax 4 days ago

    If I understand correctly, Hugging Face is exploring approaches to tuning the output quality of a given model by tuning how long to let it run.

    Normally when you run an LLM, you set your prompt and whatever tunable parameters, and the LLM software (e.g. llama.cpp) spits out tokens at whatever rate it can. If you want higher quality, you run a bigger model (though you're limited by the amount of memory you have available). If you want higher speed, you run a smaller model. Hugging Face seems to be looking at ways to make this tradeoff without switching between different models.

    • yeahwhatever10 4 days ago

      So I can get LLM results from an SLM if I run it long enough?

      • vlovich123 4 days ago

        They show Llama 3.2 1B with chain-of-thought outperforming Llama 3.1 8B, and 3.2 3B outperforming 3.1 70B. It's less clear whether inference is actually faster for CoT 3B using 256x generations vs. 70B if you have enough RAM. Basically a classic RAM/compute trade-off.

        • dimitry12 4 days ago

          From a practical standpoint, scaling test-time compute does enable datacenter-scale performance on the edge. I cannot feasibly run a 70B model on my iPhone, but I can run a 3B one, even if it takes a lot of time to produce a solution comparable to the 70B's 0-shot output.

          I think it *is* an unlock.

      • beardedwizard 4 days ago

        I struggle with this idea of "run it long enough", or another description I have heard, "give the model time to think". It's not a thing - it takes as long as it takes. What I'm taking away from this is two things:

        1. The reason for generalizations like 'long enough' and 'think more' is apparently that the methods are somewhat obscure.
        2. Those methods are being explored by Hugging Face to make them less obscure.

        Am I getting that right? I have been struggling to see past the metaphors and understand exactly what additional computation is being done - and here I read that it's something like multiple guesses being fed back in and chosen among, which means it's just multiple inferences in series that are all related to solving one problem.