> I assumed all of the models were doing that, using at least Web Search tools.
Sometimes. The other week I was asking ChatGPT about the UK PM, and had to stop the generation early because it started ~"Prime Minister Rishi Sunak…"
The unreliability is also why techniques as simple as "ask five times and take a majority vote over its own answers" boost performance. Or "thinking" modes, which are approximately just replacing the end-of-sequence token with "Wait." and continuing for ten more rounds.
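The voting trick (often called self-consistency) is simple enough to sketch in a few lines. Here `ask` is a placeholder for whatever LLM call you're using; the helper just samples it repeatedly and returns the most common answer:

```python
from collections import Counter

def ask_with_vote(ask, prompt, n=5):
    # Sample the model n times on the same prompt.
    answers = [ask(prompt) for _ in range(n)]
    # Majority vote: the most frequent answer wins.
    return Counter(answers).most_common(1)[0][0]
```

This only helps when errors are uncorrelated enough that the correct answer is the plurality; in practice the answers usually need normalizing (whitespace, casing) before counting.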
My hunch is that Grok's other model performed top-3 because of its access to Tweets, which are a sentiment-analysis gold mine for ticker symbols.