Sequoia Capital interviews OpenAI team: first disclosure of ChatGPT Agent development details

On July 23, Sequoia Capital held a conversation with members of OpenAI's ChatGPT Agent team about the product's technical innovations and future potential. The conversation was co-hosted by Sequoia Capital partners Sonya Huang and Lauren Reeder and joined by Isa Fulford, Casey Chu, and Zhiqing (Edward) Sun, OpenAI team members who took part in the ChatGPT Agent release event.

In this conversation, they shared the development process of ChatGPT Agent and discussed how ChatGPT Agent combines the advantages of Deep Research and Operator to achieve efficient execution of cross-domain tasks. They also discussed ChatGPT Agent's security measures and a wide range of application scenarios.

According to OpenAI's vision, ChatGPT Agent will gain stronger independent judgment, provide customized services based on each user's habits and needs, and support multiple modes of communication such as voice, text, and images. Longer term, OpenAI aims to build a general super-agent that can handle almost any task a human can do on a computer.

The following is a condensed version of the conversation:

Moderator: Today, we will discuss the evolution of AI agents with Isa Fulford, Casey Chu, and Zhiqing Sun from the OpenAI team. You have developed the new ChatGPT Agent. Please introduce its core functions and major breakthroughs.

Fulford: Thanks for having us on the show. ChatGPT Agent is a collaborative effort between the Deep Research and Operator teams. This AI agent can perform complex, multi-step tasks that take up to an hour. We equipped it with a virtual computer environment that integrates text browsing, visual browsing, terminal access, and API integration, all of which share state, similar to how multiple applications share a file system when a person uses a computer.

This design lets ChatGPT Agent handle a wide variety of complex tasks flexibly, significantly improving its efficiency and capabilities. We are particularly pleased with the model's performance in multi-turn conversations, where it can keep working on a task and refining the result. In the future, we hope to strengthen personalization and memory so that ChatGPT Agent can carry out tasks without the user having to initiate them.

1 Birth and evolution

Moderator: Can you share the origin story of this project? How did it start?

Casey Chu: This project grew out of combining Deep Research and Operator. In January 2025, we released Operator, which can carry out tasks on the web such as online shopping.

Two weeks later, we launched Deep Research, which focused on browsing and synthesizing web information to generate detailed research reports with citations. As we were charting our future development path, we realized that the two products could complement each other.

Operator is good at handling visual interactions, such as clicking on web page elements, while Deep Research is better at handling text, such as reading long articles. User feedback showed that people wanted Deep Research to access paid content, something Operator could already do. Therefore, combining the two was a natural choice.

Zhiqing Sun: Our team has achieved a huge leap in capabilities by unifying the architectures of Deep Research and Operator. All tools share state, and users can switch smoothly between text analysis, visual browsing, and code execution. We do not pre-program how the tools should be used. Instead, we use reinforcement learning to let the model discover the best strategy on its own across thousands of virtual machines.

This approach enables ChatGPT Agent to collaborate with users for hours, asking clarifying questions and accepting corrections mid-task, which greatly expands the ways people can interact with AI agents. We still face challenges around security and task complexity; date selection, for example, remains difficult for AI. That a small team achieved this breakthrough through careful data curation suggests AI development has entered a new stage in which product insight matters as much as computing power.

Fulford: ChatGPT Agent can perform complex tasks that would take a human a lot of time. We provide it with a virtual computer environment that includes a variety of tools: a text browser (similar to the Deep Research tool) for efficient access to online information; a visual browser (similar to the Operator tool) that can interact with graphical user interfaces, supporting operations such as clicking, filling in forms, scrolling, and dragging; and a terminal tool for running code, analyzing files, and generating output such as spreadsheets or slides.

In addition, through API integration, ChatGPT Agent can access services such as GitHub, Google Drive, and SharePoint. All tools share state, similar to the shared file system that applications use on a person's computer. This design enables ChatGPT Agent to respond flexibly to complex tasks and provide strong support for users.
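As a rough illustration of what "tools share state" can mean, the sketch below has two toy tools read and write the same in-memory file store. The names Workspace, TextBrowser, and Terminal are hypothetical and chosen for this example; nothing here reflects how OpenAI actually implements the virtual computer.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical shared state: one file store visible to every tool."""
    files: dict[str, str] = field(default_factory=dict)


class TextBrowser:
    """Toy stand-in for a text browsing tool."""

    def __init__(self, ws: Workspace) -> None:
        self.ws = ws

    def save_page(self, name: str, text: str) -> None:
        # Stand-in for downloading a page; no network access happens here.
        self.ws.files[name] = text


class Terminal:
    """Toy stand-in for a terminal tool."""

    def __init__(self, ws: Workspace) -> None:
        self.ws = ws

    def word_count(self, name: str) -> int:
        # Reads a file written by another tool through the same workspace.
        return len(self.ws.files[name].split())


ws = Workspace()
browser = TextBrowser(ws)
terminal = Terminal(ws)
browser.save_page("notes.txt", "shared state across tools")
count = terminal.word_count("notes.txt")
```

Because both tools hold a reference to the same Workspace, whatever one tool saves is immediately available to the other, which is the property the shared file system analogy describes.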

Moderator: Can you talk about this combination process in detail? How did you achieve the effect of "1+1 is greater than 2"?

Casey Chu: Our team developed Operator and Deep Research separately. Operator is good at handling visual interactions, such as clicking on a web page or filling out a form, but not good at reading long articles; Deep Research is good at efficiently browsing and synthesizing text information, but has difficulty handling highly interactive visual elements. We noticed that users tried Deep Research-type tasks on Operator, such as "research travel and then book it."

Therefore, combining the two is a natural choice. We not only merged the two tools, but also added terminal tools, image generation tools, and API call functions to enable ChatGPT Agent to perform a wider range of tasks. For example, the terminal tool can run commands for calculations, the image generation tool can add visual elements to slides, and the API call can generate PowerPoint presentations.

Zhiqing Sun: This combination significantly enhances the capabilities of ChatGPT Agent. For example, it can efficiently search for information with a text browser, then switch to a visual browser to view images or interactive elements, or even run code in a terminal to generate artifacts. All tools share state, allowing ChatGPT Agent to move between different applications seamlessly, as a human would.

Our team member Eric analyzed users’ prompts on Operator and found that many tasks involved Deep Research-type requirements, such as “research travel and then book it,” which further confirmed the need for the combination.

2 Multi-scenario task capabilities

Moderator: What are the specific application scenarios of ChatGPT Agent? How do users use it?

Fulford: We intentionally designed an open-ended agent, named ChatGPT Agent, to encourage users to explore its potential. We trained it on Deep Research tasks, such as generating detailed reports; Operator tasks, such as booking flights or shopping online; and Data Analysis tasks, such as creating spreadsheets or slides. Given its flexibility, we expect users will discover many unanticipated uses for it.

For example, Deep Research users discovered the code search capability by accident. We hope that ChatGPT Agent can play a role in both consumer and enterprise scenarios, such as helping professional users generate detailed reports or planning activities for individual users. Whether it is a consumer willing to wait 30 minutes for a detailed report or an enterprise user relying on it at work, it can serve both.

Casey Chu: I personally use it to process data in Google Docs and generate slides to present the data. Another interesting case is that I use it to study new developments in the field of ancient DNA. Since the information in this field is scattered and there is a lack of comprehensive reference materials, ChatGPT Agent can collect information from the Internet and synthesize it into reports or slides, which greatly simplifies my work.

Zhiqing Sun: I use it for online shopping, especially for scenarios that require visual browsing, such as viewing product images or selecting styles through search filters. It is also very useful for planning activities, such as scheduling trips or events. My favorite shopping task is buying clothes, because many websites require a visual browser to work with search filters or see what the products look like.

Moderator: You demonstrated a cool example earlier. Can you share it?

Fulford: Absolutely! Our colleagues asked ChatGPT Agent to estimate OpenAI’s valuation based on web information and generate a financial model, including a spreadsheet, summary analysis, and slides showing the results. This task took 28 minutes, showing its ability to handle long tasks. The ChatGPT Agent’s predictions were pretty bold, and the quality of the slides was impressive!

Casey Chu: This example points to a new paradigm: users can walk away after assigning a task, and ChatGPT Agent comes back some time later with a detailed report. As ChatGPT Agent becomes more autonomous, tasks may run even longer.

Moderator: 28 minutes is already a long time! Do you have any longer tasks? How do you ensure that ChatGPT Agent does not go off track when running for a long time?

Zhiqing Sun: I recently ran a task that lasted an hour, which is probably the longest task we have seen so far. To ensure stability, we developed tools to extend the context length of ChatGPT Agent, so that it records task progress and completes complex tasks step by step.

In addition, we designed a flexible human-computer interaction mechanism so that users can correct ChatGPT Agent, provide additional instructions, or request status updates at any time. For example, users can ask it to summarize the current progress, or add instructions such as "I only want blue sneakers."
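The interaction pattern described here, where the user can add instructions or ask for status while a task is running, can be sketched as a loop that checks a message queue between steps. This is a minimal illustration only: run_task, the inbox queue, and the literal "status" message are assumptions made for the example, not details from the interview.

```python
from collections import deque


def run_task(steps, inbox):
    """Run steps in order, draining a hypothetical user inbox before each one."""
    completed = []    # (step, constraints in force when it ran)
    constraints = []  # extra instructions received mid-task
    reports = []      # answers to status requests
    for step in steps:
        # Handle any pending user messages before starting the next step.
        while inbox:
            message = inbox.popleft()
            if message == "status":
                reports.append(f"{len(completed)}/{len(steps)} steps done")
            else:
                constraints.append(message)
        completed.append((step, tuple(constraints)))
    return completed, reports


inbox = deque(["I only want blue sneakers", "status"])
completed, reports = run_task(["search", "compare", "summarize"], inbox)
```

In this toy version every later step sees the added constraint, and a status request is answered from the progress recorded so far.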

Fulford: This collaboration model mimics the way people communicate over Slack. ChatGPT Agent asks for permission or poses clarifying questions when needed, for example seeking the user's consent before a destructive action or when a login is required.

Our interface also allows users to monitor the operation of ChatGPT Agent in real time and even take over the virtual computer environment after the task is completed, such as logging into an account or entering credit card information. This "watch colleagues operate and take over at any time" experience is very intuitive and enhances the user's sense of control over ChatGPT Agent.

3 Training and breakthrough

Moderator: From a technical perspective, how is ChatGPT Agent trained?

Casey Chu: We trained it with reinforcement learning (RL), giving it a text browser, GUI browser, terminal, image generation tools, and more in a virtual machine environment.

We designed complex tasks to allow ChatGPT Agent to discover the best tool usage strategy through experiments, and reward it based on the quality and efficiency of task completion. For example, ChatGPT Agent may first search for restaurant information with a text browser, then use a GUI browser to view dish images and reservation availability, or download data from the website and process it in the terminal. This shared state tool design enables ChatGPT Agent to seamlessly switch tools and complete diverse tasks.
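The interview says only that the agent is rewarded for the quality and efficiency of task completion. One simple way to combine the two, shown purely as an assumed toy formula, is a weighted sum of a quality score and the unused fraction of a step budget. The function name, the weighting, and the step budget are all hypothetical.

```python
def task_reward(quality: float, steps_used: int, step_budget: int,
                efficiency_weight: float = 0.2) -> float:
    """Toy reward: weighted mix of output quality and step efficiency.

    This formula is an illustrative assumption, not OpenAI's reward.
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be in [0, 1]")
    if step_budget <= 0:
        raise ValueError("step_budget must be positive")
    # Fraction of the budget left unused, floored at zero when over budget.
    efficiency = max(0.0, 1.0 - steps_used / step_budget)
    return (1.0 - efficiency_weight) * quality + efficiency_weight * efficiency


fast = task_reward(quality=1.0, steps_used=50, step_budget=100)
slow = task_reward(quality=1.0, steps_used=90, step_budget=100)
```

With equal quality, the run that uses fewer steps scores higher, so a policy trained against such a signal is nudged toward efficient tool use without being told which tool to pick.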

Fulford: Unlike earlier approaches to tool use, all tools share state, similar to how humans use multiple applications on a computer. This design enables ChatGPT Agent to efficiently handle tasks that involve interacting with the web, file systems, and code. Instead of pre-specifying rules for tool use, we let the model discover the best strategy by itself through reinforcement learning, and the effect is almost magical. Reinforcement learning requires much less data than pre-training, and we teach the model new skills through carefully selected high-quality data sets.

Zhiqing Sun: Reinforcement learning is very data efficient, and we need only a small amount of high-quality data to teach new skills. For example, we created a diverse set of tasks, including finding niche information, writing long reports, and so on. As long as the output quality can be evaluated, reinforcement learning can effectively improve performance. To make the Operator capability work well, we have invested a lot of time over the past two or three years in enabling the model to understand visual elements and page interactions, which laid the foundation for the current ChatGPT Agent.

Moderator: Is this reinforcement learning method the standard way OpenAI trains AI agents?

Fulford: We think this approach has great potential. This release is a minimum viable product (MVP) that our team worked on together, but it already shows strong capabilities. For example, the slideshow generation feature is very good, thanks to the hard work of many team members. We believe that we can improve it further using the same technology, but it may require the introduction of other technologies.

Casey Chu: This approach is amazing: the same reinforcement learning algorithm applies to Deep Research, Operator, and now the computer-using ChatGPT Agent. We have achieved these results in a short period of time, and there is still a lot of room for improvement.

Moderator: Are there any special training methods for interactivity in reinforcement learning?

Zhiqing Sun: We focus on end-to-end performance, from user prompts to task completion. ChatGPT Agent performs well in interacting with users, in part because we incorporate diverse task trajectories in training. Users can intervene at any time to provide clarifications or corrections, and it can adjust its behavior based on feedback.

Moderator: The early World of Bits project (a general AI training platform developed by OpenAI) tried to use reinforcement learning to control the mouse path, but the problem was too complex. What has changed now to make this problem solvable?

Zhiqing Sun: ChatGPT Agent can be traced back to the World of Bits project in 2017; we jokingly call the agent "World of Bits 2". The biggest change is the scale of training. Across pre-training and reinforcement learning, the amount of compute has probably grown by a factor of hundreds of thousands. That growth in data and computing power is what made the goal achievable.

4 How to prevent “losing control”

Moderator: How does ChatGPT Agent ensure security and reliability when performing external operations?

Fulford: Since ChatGPT Agent is able to interact with the outside world, such as accessing websites or calling APIs, security is a core concern.

Compared with Deep Research's read-only mode, ChatGPT Agent can pose greater risks. For example, it might take unintended destructive actions while completing a task, such as buying 100 different options to make sure the user is satisfied. To address this, we have implemented multi-layered security measures, including internal and external red team testing, real-time monitoring systems (similar to antivirus software), and protocols for responding rapidly to new threats. We pay special attention to severe risks such as biological ones, for example preventing ChatGPT Agent from being used to create biological weapons.

Casey Chu: The Internet is full of risks, including phishing attacks, fraud, and other threats. Our models have been trained to identify some risks, but sometimes they may be too eager to complete the task and be deceived. We have developed a real-time monitoring system to check ChatGPT Agent's behavior. If a suspicious operation is found (such as visiting an abnormal website), the task is suspended immediately.
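To make the "suspend on suspicious operation" idea concrete, here is a minimal sketch in which a monitor checks each URL against a blocklist before the agent visits it. The blocklist, the host names, and run_with_monitor are hypothetical; the interview does not describe what signals the real monitor uses, and a production system would rely on far richer ones.

```python
from urllib.parse import urlparse

# Hypothetical blocklist used only for this example.
SUSPICIOUS_HOSTS = {"phish.example"}


def run_with_monitor(planned_urls):
    """Visit URLs in order, suspending the task at the first flagged host."""
    visited = []
    for url in planned_urls:
        host = urlparse(url).hostname or ""
        if host in SUSPICIOUS_HOSTS:
            # Stop before the visit happens and report the suspension.
            return visited, "suspended"
        visited.append(url)
    return visited, "completed"


visited, status = run_with_monitor([
    "https://shop.example/item",
    "https://phish.example/login",
    "https://shop.example/cart",
])
```

The key design point is that the check runs before each action rather than after the task, so a flagged step halts everything that would have followed it.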

Additionally, we have protocols in place to quickly respond to new threats, similar to updating antivirus software. Thanks to the mitigation work of our corporate biorisk team, we conducted weeks of red team testing to ensure that the model could not be used for harmful purposes.

Fulford: Security training is a cross-team effort involving security, governance, legal, research, and engineering teams. We have implemented protections at every level and will continue to iterate to address new threats. For example, we ensure that ChatGPT Agent asks for user permission before performing sensitive actions (such as logging into a bank account).
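The permission check mentioned above can be pictured as a gate in front of a set of sensitive action types. The sketch below is an assumption-laden illustration: the action categories, the execute function, and the ask_user callback are invented for the example and are not part of any OpenAI interface.

```python
# Hypothetical categories of actions that need explicit user consent.
SENSITIVE_ACTIONS = {"login", "purchase", "delete"}


def execute(action: str, ask_user) -> str:
    """Run an action, asking the user first if it is marked sensitive.

    ask_user is a callback that takes the action name and returns True
    when the user approves.
    """
    if action in SENSITIVE_ACTIONS and not ask_user(action):
        return f"skipped {action}"
    return f"ran {action}"


always_deny = lambda action: False
results = [execute("scroll", always_deny), execute("login", always_deny)]
```

Routine actions pass straight through, while a sensitive one proceeds only when the callback returns True, which mirrors the "ask before logging into a bank account" behavior described in the interview.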

5 Teamwork behind the scenes

Moderator: How does the development team collaborate? How large is it?

Fulford: Our team is a merger of the research and application teams of Deep Research and Operator, and the total number of people is not large. The Deep Research team initially had only 3-4 people, and the Operator team was about 6-8 people, plus an excellent engineering and product design team led by Yash Kumar. The research and application teams work closely together, and are user scenario-oriented from defining product features to model training. This small team collaboration has enabled us to achieve remarkable results in a short period of time.

Casey Chu: The boundary between the research and application teams is not strict. Application engineers participate in model training, and researchers also participate in model deployment. This cross-functional cooperation keeps the project energized, and the team atmosphere is very good. Fulford and I are old friends, and that rapport also helps the teamwork.

Zhiqing Sun: A small team can accomplish great things. We completed this project in a few months, and the research and application teams worked together to define product features from the beginning to ensure that they were user-oriented. Although ChatGPT Agent has not yet fully achieved all its goals, this framework enables us to iterate quickly.

Moderator: What was the biggest challenge during training?

Zhiqing Sun: Training stability is a huge challenge. Deep Research only involves text browsing and Python, while ChatGPT Agent needs to handle multiple new tools at the same time, such as GUI browsers, terminals, image generation tools, and API calls, all running in the same virtual machine environment. We need to run thousands of virtual machines accessing the network at the same time, and we often encounter problems such as website downtime, API restrictions, or insufficient network capacity.

For example, some websites may be temporarily unavailable due to traffic overload, or API calls may fail due to rate limits, which requires us to add robustness mechanisms to training to ensure that ChatGPT Agent can handle these abnormal situations. Despite these challenges, we successfully trained the model by optimizing the virtual machine environment and improving the training algorithm, making it perform well in a variety of tasks.
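The interview does not say which robustness mechanisms were added. One common technique for rate limits and transient outages is retrying with exponential backoff, sketched below as an assumption rather than a description of OpenAI's training stack. RateLimited and call_with_backoff are hypothetical names; the sleep function is injectable so the example can run without waiting.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised when a site or API rejects a call."""


def call_with_backoff(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(), retrying on RateLimited with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            sleep(base_delay * 2 ** attempt)


calls = {"n": 0}
delays = []


def flaky():
    # Fails twice, then succeeds, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "ok"


result = call_with_backoff(flaky, sleep=delays.append)
```

Passing delays.append as the sleep function records the waits instead of performing them, which makes the backoff schedule easy to inspect.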

Fulford: In the future, we hope to further enhance ChatGPT Agent's multi-round dialogue capabilities, personalization, and memory functions. Currently, all tasks are initiated by users, but we envision that ChatGPT Agent will be able to autonomously identify user needs and proactively perform tasks in the future. For example, it may predict needs based on user historical behavior, automatically generate reports, or plan activities.

We are also exploring new user interfaces and interaction modes, such as more intuitive non-chat interactions, voice commands, or graphical interfaces, to enhance the user experience. In addition, we plan to optimize the context management of ChatGPT Agent to better maintain task coherence during long-term tasks while reducing dependence on computing resources.

Casey Chu: From a coding perspective, I find ChatGPT Agent excellent for code search and small code edits because it reads documents accurately, which reduces hallucinations. For example, it can access GitHub through an API, search a specific code repository, and extract relevant code snippets. I use it the way I use o3, for interactive coding tasks, while Codex is better suited to well-defined problems. Users will find more new use cases, just as Deep Research users discovered code search.

In the future, we hope that ChatGPT Agent can be further improved in programming tasks, such as supporting more complex code debugging or automatically generating complete applications. In addition, we are studying how to make ChatGPT Agent better understand user intent, such as automatically inferring the functions that users want in code editing without detailed instructions.

6 Building a general superintelligence

Moderator: Will you develop specialized sub-agents, such as a financial analysis agent or an event planning agent, or will you stick to the vision of a single super-agent?

Fulford: We prefer to build a general superintelligence. If an agent can flexibly call on all the tools as needed, like an all-powerful chief of staff, it will be a simple and efficient solution.

Our training data shows that there is positive transfer between different tasks. For example, the visual interaction skills learned in the shopping task can be applied to web navigation in the research task. Therefore, the single-agent model has more potential in scalability and versatility. We hope that through continuous optimization, ChatGPT Agent can seamlessly handle a variety of tasks from simple queries to complex workflows, reducing users' dependence on multiple dedicated models.

Casey Chu: Although customized models may have market value at product launch, from a training perspective, general agents can better take advantage of the transferability of skills. For example, ChatGPT Agent may use the terminal to perform budget calculations in shopping tasks without the need for specialized financial analysis tools. We are also exploring how to further improve its generalization capabilities through reinforcement learning, such as allowing it to adapt quickly to new tasks without a large amount of additional training data. In the future, ChatGPT Agent may dynamically adjust its behavior by learning from user feedback to further improve the accuracy of task completion.

Zhiqing Sun: Our goal is to have ChatGPT Agent handle almost all tasks that humans perform on computers. Users can even ask it to "try to make money online", although the execution is not perfect at the moment. We will improve the quality and accuracy of task completion through iterative deployment. For example, we plan to optimize ChatGPT Agent's decision-making process in complex tasks, reduce the possibility of erroneous operations, and improve its adaptability in dynamic environments. In addition, we hope to continuously improve the performance of ChatGPT Agent through user feedback and actual usage data, making it more intelligent and efficient in handling cross-domain tasks.

Moderator: Looking into the future, what is your vision for ChatGPT Agent?

Fulford: We provide ChatGPT Agent with a toolset that covers most of the tasks that humans can accomplish on computers. We will work on improving the performance of the model on a variety of tasks, optimizing the user interaction experience, and exploring new interaction modes, such as more personalized memory functions or autonomous task initiation.

We hope that ChatGPT Agent will be able to autonomously perceive and respond to user needs in the future. For example, ChatGPT Agent may automatically plan meetings based on the user's schedule, or recommend personalized solutions based on historical preferences.

Casey Chu: We are excited about improving the user interface and experience. The current chat-based interaction is just the starting point, and there may be more innovative interaction methods in the future, such as gesture-based or multimodal input interfaces.

We hope that users will discover new capabilities of ChatGPT Agent, just as Deep Research users discovered code search. ChatGPT Agent has also surpassed human benchmarks in data science tasks, thanks to the work of our colleague John Blackman on spreadsheets and data analysis. In the future, we plan to further improve ChatGPT Agent's data processing and visualization, such as automatically generating interactive dashboards.

Zhiqing Sun: Since the release of Operator in January, we have significantly improved the accuracy of clicks and form filling, although there is still room for improvement in tasks such as date selection. We provide ChatGPT Agent with a general toolset that covers most tasks that humans do on computers. The challenge ahead is to ensure that the model performs well on all tasks and to develop new interaction paradigms, such as more natural voice interactions or real-time collaboration tools. We look forward to users forming a more natural collaborative relationship with ChatGPT Agent and ushering in a new era of AI agents.

Moderator: Thank you very much for sharing! Congratulations on the launch of the new product, and I look forward to seeing more of what it can do!

This article comes from the WeChat public account "Tencent Technology", translated by Wu Ji, edited by Helen, and published by 36Kr with authorization.
