An AI version of "three cobblers equal one mastermind": ChatGPT, Gemini, and DeepSeek combine for the top score on an AGI benchmark

36kr
07-09

ChatGPT's conversational fluency, Gemini's multimodal capabilities, DeepSeek's long-context analysis... could they be combined so the models solve problems together?

Sakana AI, the star AI company founded by Transformer co-author Llion Jones, has proposed a new method called AB-MCTS. Its core idea:

The greatest achievements often stem from collaboration between different ideas, and we believe this principle applies equally to artificial intelligence.

AB-MCTS, short for Adaptive Branching Monte Carlo Tree Search, is an algorithm that lets multiple AI models work on a problem at the same time: the models exchange and refine one another's suggestions, collaborating like a human team.

On the challenging ARC-AGI-2 benchmark, Multi-LLM AB-MCTS solved more problems than any single model working alone (Single-LLM AB-MCTS).

There are situations where only a combination of different models can yield the correct answer.

Sakana AI has open-sourced the algorithm under the name TreeQuest, with the link available at the end of the article.

Two Search Strategies

AB-MCTS combines two different search strategies: it can refine existing solutions (going deeper) and also try entirely new approaches (going wider).

The main technical challenge is introducing unbounded branching into MCTS.

Standard MCTS only selects and expands leaf nodes (with each node expanded at most once), and expansion adds a fixed number of child nodes. However, since each LLM query can produce different outputs from the same prompt at non-zero temperature, the branching factor is theoretically infinite.

To fully exploit the potential of unbounded branching, AB-MCTS allows a node that has already been expanded to be expanded again and branch further, and it introduces GEN nodes to explicitly represent the action of generating a new child node.

In the AB-MCTS search tree, every node N is accompanied by a GEN child node. When the GEN child of N is selected, a new child node of N is generated by querying the LLM again.
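
A minimal sketch of such a node is shown below. This is illustrative code, not the TreeQuest API; `llm_generate` and `evaluate` are placeholder callables standing in for an LLM call and an external scoring function.

```python
# Illustrative sketch of an AB-MCTS-style tree node (not the TreeQuest API).
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    solution: Optional[str]                      # candidate answer held at this node (None at the root)
    score: float = 0.0                           # external feedback for this candidate
    children: List["Node"] = field(default_factory=list)

    def expand_via_gen(self,
                       llm_generate: Callable[[Optional[str]], str],
                       evaluate: Callable[[str], float]) -> "Node":
        """The GEN action: ask the LLM for a brand-new child of this node.

        Unlike standard MCTS, this action never disappears, so a node that was
        already expanded can keep branching -- "unbounded branching".
        """
        candidate = llm_generate(self.solution)  # refine this node's solution, or create one from scratch
        child = Node(solution=candidate, score=evaluate(candidate))
        self.children.append(child)
        return child
```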

Unlike traditional MCTS, AB-MCTS does not fix the search width as a static hyperparameter.

Instead, at each node in the search tree, AB-MCTS adaptively decides whether to explore ("go wide") by generating new candidate responses, or exploit ("go deep") by improving existing responses, using external feedback signals.

Underlying this, AB-MCTS estimates each node's potential through Bayesian posterior predictive distributions and selects actions via Thompson sampling, so that every expansion balances exploration and exploitation in a principled manner.

This design naturally generalizes repeated sampling, enabling AB-MCTS to tap into the diverse and vast output space of LLMs when necessary.
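
As a heavily simplified illustration of this adaptive choice, the sketch below treats "generate a new child (GEN)" and "refine an existing child" as bandit arms with Beta posteriors and picks one by Thompson sampling. The real AB-MCTS variants use richer probabilistic models; the binary reward here is only a stand-in for external feedback.

```python
# Simplified "go wide vs. go deep" decision via Thompson sampling.
# Assumption: feedback is treated as a Bernoulli outcome so a Beta posterior applies.
import random


def thompson_select(arms: dict) -> str:
    """arms maps an action name to (successes, failures); sample each Beta
    posterior and return the arm with the highest sampled value."""
    return max(arms, key=lambda a: random.betavariate(arms[a][0] + 1, arms[a][1] + 1))


# Arms at one tree node: "GEN" proposes a new child (explore / go wide),
# "child_0" and "child_1" refine existing candidates (exploit / go deep).
arms = {"GEN": (1, 1), "child_0": (4, 2), "child_1": (2, 5)}

for step in range(5):
    choice = thompson_select(arms)
    reward = random.random() < 0.5               # placeholder for real external feedback
    s, f = arms[choice]
    arms[choice] = (s + 1, f) if reward else (s, f + 1)
    print(step, choice, arms[choice])
```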

On this basis, Sakana AI also proposed two variants: AB-MCTS-M and AB-MCTS-A.

Simply put:

AB-MCTS-M: More hierarchical. Uses mixed-effects models to share statistical information between subtrees, balancing global and local exploration through hierarchical Bayesian inference.

AB-MCTS-A: More lightweight. Explicitly separates "generation" and "optimization" actions through CONT nodes, and achieves efficient posterior updates based on conjugate priors, simplifying computation.
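
For intuition about the conjugate-prior bookkeeping that AB-MCTS-A relies on, here is a minimal Normal-Normal update with known observation noise; the paper's exact likelihoods and priors may differ.

```python
# Illustrative conjugate (Normal-Normal, known observation noise) posterior update,
# the kind of closed-form Bayesian bookkeeping AB-MCTS-A uses to stay lightweight.

def normal_normal_update(prior_mean: float, prior_var: float,
                         obs: float, obs_var: float) -> tuple:
    """Return the posterior (mean, variance) after observing one score `obs`."""
    precision = 1.0 / prior_var + 1.0 / obs_var
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var


mean, var = 0.5, 1.0             # prior belief about a node's quality
for score in (0.7, 0.8, 0.65):   # external feedback arriving over successive visits
    mean, var = normal_normal_update(mean, var, score, obs_var=0.1)
print(mean, var)                 # belief concentrates around the observed scores
```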

Where It Shines

Benchmarking showed that AB-MCTS performed consistently well across various benchmarks and LLMs, achieving the highest average ranking and outperforming established baselines.

This sustained success stems from AB-MCTS's ability to dynamically adjust its search strategy, precisely balancing exploration and exploitation to fit each problem's requirements, an ability the baseline methods largely lack.

LiveCodeBench and CodeContest

The left and middle panels of the graph report GPT-4o's success rate as a function of generation budget on LiveCodeBench and CodeContest; all methods improve as the computational budget increases. On these two benchmarks, the AB-MCTS algorithms typically outperform the baseline methods.

In LiveCodeBench, AB-MCTS begins to surpass baseline methods even with a small budget; in CodeContest, AB-MCTS shows performance superior to baselines when the budget is 32 or higher.

ARC-AGI

The right side of the graph shows GPT-4o's performance on the particularly challenging ARC-AGI benchmark. Repeated sampling proves to be a strong baseline in this setting, indicating that broad exploration is crucial for this task.

While standard MCTS only brings minimal improvements as budget increases, the AB-MCTS framework achieved performance comparable to repeated sampling. This indicates that AB-MCTS can effectively explore potential solutions by dynamically expanding its search scope when advantageous.

MLE-Bench

The table shows the performance using GPT-4o in three MLE-Bench competitions. Since MLE-Bench requires substantial GPU resources for training and evaluating machine learning models, the research team used only GPT-4o and focused on baseline methods and AB-MCTS-M.

The results show that the best-performing baseline methods vary across competitions, again emphasizing that different tasks benefit from different exploration-exploitation trade-offs.

In contrast, AB-MCTS-M performed consistently well across these tasks.

This consistent success across different competitions highlights the inherent advantages of AB-MCTS-M in effectively adapting its search strategy to different problem structures.

To quantitatively analyze how AB-MCTS balances exploration and exploitation, the research team also examined the generated search trees' average depth and their average width at each depth.

As the graph shows, compared to standard MCTS, AB-MCTS methods tend to generate wider trees. This is because AB-MCTS can adaptively decide to explore wider (select GEN nodes) from any existing node, whereas standard MCTS cannot. This mechanism enables more flexible exploration at different tree depths.

Beyond flexible exploration width, AB-MCTS also performed well on benchmarks where sequential refinement works best, indicating that it can identify promising branches and exploit them by selecting existing child nodes for further refinement. This adaptivity lets it combine the advantages of exploration and exploitation, which is reflected in its strong performance across benchmarks.
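
For reference, the tree-shape measurement described above (how many nodes sit at each depth) takes only a few lines; the nested-dict tree here is a toy stand-in for the paper's actual search trees.

```python
# Count nodes per depth of a tree -- the "width at each depth" measurement.
from collections import Counter


def width_per_depth(children_of, root) -> Counter:
    counts, frontier, depth = Counter(), [root], 0
    while frontier:
        counts[depth] = len(frontier)
        frontier = [c for node in frontier for c in children_of(node)]
        depth += 1
    return counts


# Toy tree: node -> list of children.
tree = {"root": ["a", "b", "c"], "a": ["a1"], "b": [], "c": ["c1", "c2"],
        "a1": [], "c1": [], "c2": []}
print(dict(width_per_depth(lambda n: tree[n], "root")))   # {0: 1, 1: 3, 2: 3}
```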

To study AB-MCTS's scalability, experiments on ARC-AGI were extended using DeepSeek-V3, increasing the generation budget to 512. As shown in the graph, as the budget increases from 200 to 500, AB-MCTS's performance continues to improve significantly, while the improvement rate of repeated sampling begins to plateau.

Standard MCTS continues to improve after increasing the budget, but compared to the AB-MCTS method, its success rate is significantly lower. This performance gap indicates that AB-MCTS more effectively guides the search towards more promising branches in the search tree.

The above image shows an example of search trees generated by AB-MCTS-M and standard MCTS. These visualizations demonstrate the stronger adaptive branching characteristics of AB-MCTS-M compared to standard MCTS.

This adaptability indicates that AB-MCTS-M flexibly balances exploration and exploitation throughout the search, dynamically allocating budget between exploring diverse new candidates ("going wider") and refining promising ones ("digging deeper").

These results suggest that, even accounting for the inherent strengths of repeated sampling, AB-MCTS remains a promising method that can use the generation budget efficiently to achieve better results across a range of scenarios.

On the challenging ARC-AGI-2 benchmark, AB-MCTS combined with ChatGPT, Gemini, and DeepSeek solved 30% of the puzzles, while the best individual model solved only 23%.

The results show that in several cases, only the combination of different models could arrive at the correct answer.
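
One way to picture the multi-model setup: "which LLM answers next" becomes another choice the search learns online. The hedged sketch below uses per-model Beta posteriors and Thompson sampling; the model names and feedback loop are placeholders, and the exact mechanism in Multi-LLM AB-MCTS may differ.

```python
# Hedged sketch: learn online which model to query, via Thompson sampling.
import random

model_stats = {"chatgpt": (1, 1), "gemini": (1, 1), "deepseek": (1, 1)}   # (successes, failures)


def pick_model() -> str:
    return max(model_stats,
               key=lambda m: random.betavariate(model_stats[m][0] + 1, model_stats[m][1] + 1))


def record(model: str, solved: bool) -> None:
    s, f = model_stats[model]
    model_stats[model] = (s + 1, f) if solved else (s, f + 1)


for _ in range(10):
    m = pick_model()
    record(m, solved=random.random() < 0.4)   # placeholder for real puzzle feedback
print(model_stats)                            # models that solve more puzzles get picked more often
```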

Natural Inspiration and Path of Innovation

The research on AB-MCTS did not emerge out of thin air; it builds on Sakana AI's 2024 work on evolutionary model merging, with the team's focus shifting from "mixing to create" new models to "mixing to use" existing powerful AI.

They said:

At Sakana AI, we are always committed to pioneering innovative AI systems by applying principles inspired by nature, such as evolution and collective intelligence.

And they indeed did so:

Beyond the 2024 evolutionary model merging work, this May Sakana AI also collaborated with researchers from the University of British Columbia to develop the Darwin Gödel Machine (DGM), an AI framework designed to self-evolve: rather than being optimized for fixed goals, it draws inspiration from biological evolution and scientific discovery, generating new solutions through open-ended search and continuous self-modification.

Recently, two physicists drew on the self-assembly processes of biological systems to explain the origins of "creativity" in diffusion models...

These discoveries and creations are manifestations of "natural inspiration".

Reference Links:

[1] https://the-decoder.com/sakana-ais-new-algorithm-lets-large-language-models-work-together-to-solve-complex-problems/

[2] https://x.com/SakanaAILabs/status/1939854145856708910

This article is from the WeChat public account "Quantum Bit" (QbitAI) and is published by 36Kr with authorization.
