IBM’s strongest AI debater scores close to the level of professional human debaters

In the time it takes to drink just one cup of coffee, the system can analyze a corpus of some 400 million news articles.

IBM’s strongest AI debater can hold its own in live, logical debate against professional human debaters, both composing its own opening statement and rebutting its opponent’s arguments.

In the end, across 78 debate topics, Project Debater’s scores came close to the average of professional human debaters.

Over the decades, AI systems have won many human-machine contests: in 1997, IBM’s Deep Blue defeated the world chess champion; 14 years later, IBM’s Watson beat human champions on the quiz show Jeopardy!; and in 2016, Google’s AlphaGo defeated the world Go champion. Yet IBM researchers argue that these game-playing systems still operate within AI’s “comfort zone”.

These games have much in common: a clearly defined winner and fixed rules make it easy for an AI to use reinforcement learning to work out a strategy that guarantees victory. Debate offers none of this: the scoring lies in the hands of a human audience, and the AI cannot win with a strategy that humans cannot follow.

Thus, in the debate arena, humans still have the upper hand, and the challenge for debate AI lies outside the AI comfort zone.

In response to this challenge, IBM researchers set themselves a new mission in 2012: to develop a fully autonomous AI system that can debate humans in real time, named Project Debater.

I. Autonomous debate goes beyond previous language research efforts

In February 2019, Harish Natarajan, winner of the 2012 European Universities Debating Championship, took part in a special debate, standing in front of a live audience of about 800 people to face a black, pillar-shaped computer.

▲Dr. Ranit Aharonov (left), Project Debater (center), Dr. Noam Slonim (right)

The computer was Project Debater, an AI debating system designed by IBM that speaks with a female voice, and the topic of the debate was “Should we subsidize preschool education?”

After the topic was announced, each side was given 15 minutes to prepare, and the two sides then alternated through a 4-minute opening statement, a 4-minute rebuttal, and a 2-minute closing statement.

▲Details of the debate process (Source: Nature)

This was the first public demonstration of Project Debater’s debating ability, and although it lost the match, it impressed both its opponent and the audience with its strong summarization skills and human-like delivery.

Rhetoric and debate have always been distinctly human arts. According to Aristotle, the art of persuasion rests on three basic appeals: credibility (Ethos), emotional appeal (Pathos), and logical structure (Logos). In this debate, the AI system successfully demonstrated all three.

The AI debater did not win its first battle, but IBM research director Dario Gil said its goal was not to beat humans, but to create AI systems that can master the rich complexity of human language.

Just a few years ago, finding evidence to support a claim by analyzing human discourse was still a very difficult ability for AI to achieve.

Today, more than 50 labs around the world are working on this problem, including the research teams of many large software companies.

In recent years, language models have made substantial progress on understanding tasks. On simple tasks, such as predicting the sentiment of a given sentence, state-of-the-art systems perform very well; on more complex tasks, such as machine translation, automatic summarization, and dialogue systems, AI still falls short of human level.

Debate is a cognitive activity that demands both extensive language comprehension and language generation skills, so an autonomous debating system appears to lie beyond the scope of previous language research efforts.

In response, IBM Research built its latest autonomous AI system and, in a recent paper published in the journal Nature, comprehensively reported its performance across a wide range of topics.

II. The four core modules of the autonomous debating AI system

The title of this IBM paper is “An autonomous debating system”.

Specifically, the Project Debater system consists of four main modules: argument mining, argument knowledge base (AKB), argument rebuttal, and argument construction.

▲ Debating AI system architecture (Source: Nature)
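The figure above shows the four modules feeding one another, from corpus to finished speech. Purely as an illustration of that flow, and assuming hypothetical names and interfaces (none of them come from IBM’s paper or any public API), a minimal pipeline sketch might look like this:

```python
from dataclasses import dataclass, field

# Hypothetical end-to-end sketch of the four-module pipeline described above.
# Every name here is illustrative; the real system's interfaces are not public in this form.

@dataclass
class DebateContext:
    topic: str
    arguments: list = field(default_factory=list)   # from argument mining
    akb_texts: list = field(default_factory=list)   # from the argument knowledge base (AKB)
    rebuttals: list = field(default_factory=list)   # from argument rebuttal

def mine_arguments(topic):
    return [f"Evidence sentence supporting '{topic}' (stub)"]

def match_akb(topic):
    return [f"Principled argument relevant to '{topic}' (stub)"]

def rebut(opponent_text):
    return [f"Response to: {opponent_text} (stub)"]

def construct_speech(ctx):
    # In the real module, arguments are clustered, filtered, and poured into templates.
    return " ".join(ctx.akb_texts + ctx.arguments + ctx.rebuttals)

ctx = DebateContext(topic="We should subsidize preschools")
ctx.arguments = mine_arguments(ctx.topic)
ctx.akb_texts = match_akb(ctx.topic)
ctx.rebuttals = rebut("Subsidies crowd out private providers.")
print(construct_speech(ctx))
```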

  1. Argument mining: indexing relevant sentences from 400 million articles

Argument mining is divided into two phases.

In the offline stage, a large corpus of about 400 million news articles is broken down into sentences, and each sentence is indexed by the words, Wikipedia concepts, and predefined lexicon entries it contains.
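As a toy illustration of that offline phase, the sketch below splits a couple of articles into sentences and builds a plain inverted word index; the real system also indexes Wikipedia concepts and predefined lexicons, at the scale of hundreds of millions of articles:

```python
import re
from collections import defaultdict

# Toy offline indexing: split articles into sentences and build an inverted
# index from lower-cased words to sentence ids. Purely illustrative.

def build_sentence_index(articles):
    sentences = []
    index = defaultdict(set)
    for article in articles:
        for sent in re.split(r"(?<=[.!?])\s+", article.strip()):
            sid = len(sentences)
            sentences.append(sent)
            for word in re.findall(r"[a-z']+", sent.lower()):
                index[word].add(sid)
    return sentences, index

sentences, index = build_sentence_index([
    "Preschool programs improve later outcomes. Critics cite their cost.",
])
print(index["preschool"])  # {0}
```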

Once the debate topic is known, in the online stage the system relies on this index to perform corpus-wide, sentence-level argument mining, retrieving claims and arguments related to the debate topic.

First, the AI system retrieves sentences likely to contain such arguments using customized queries. Next, a neural model ranks these sentences by the probability that they represent relevant arguments. Finally, a combination of neural and knowledge-based methods classifies the stance of each argument toward the debate topic.
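The following sketch mimics those three online steps with crude lexical stand-ins: a keyword filter for retrieval, word overlap for ranking, and a word list for stance. The real system uses tuned queries, a neural ranker, and neural plus knowledge-based stance classification; all data and scoring rules here are invented:

```python
# Illustrative retrieve -> rank -> stance pipeline with lexical placeholders.

CANDIDATES = [
    "Studies suggest preschool programs improve later school outcomes.",
    "Opponents argue that universal subsidies waste public money.",
    "The city council met on Tuesday.",
]

def retrieve(query_terms, candidates):
    # Keep sentences containing any query term (stand-in for index retrieval).
    return [s for s in candidates if any(t in s.lower() for t in query_terms)]

def rank(candidates, query_terms):
    # Order by how many query terms each sentence mentions (stand-in for a neural ranker).
    overlap = lambda s: sum(t in s.lower() for t in query_terms)
    return sorted(candidates, key=overlap, reverse=True)

def stance(sentence):
    # Naive pro/con word lists (stand-in for the stance classifier).
    pro, con = ("improve", "benefit"), ("waste", "cost", "harm")
    score = sum(w in sentence.lower() for w in pro) - sum(w in sentence.lower() for w in con)
    return "pro" if score > 0 else "con" if score < 0 else "neutral"

query = ["preschool", "subsid"]
for cand in rank(retrieve(query, CANDIDATES), query):
    print(stance(cand), "|", cand)
```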

At this stage, the system also uses a topic expansion component to better cover the range of relevant arguments. If this component identifies other concepts related to the debate topic, it makes the argument mining module search for arguments describing these concepts as well.

In addition, the argument mining module searches for arguments supporting the other side, preparing a set of arguments the opponent may use along with evidence that can serve as a retort; these are used later by the argument rebuttal module.

  2. AKB: capturing commonalities between different debates

The Argument Knowledge Base (AKB) contains principled arguments, counter-arguments, and common examples that may be relevant to a wide range of debates. These texts are written by hand, or automatically extracted and then manually edited, and are grouped into thematic categories.

Given a new debate topic, the system is able to use a feature-based classifier to determine which classes are relevant to that debate topic.

All texts associated with the matching classes can then be used in the speech, and the system selects those with the highest predicted semantic relevance to the debate topic.

These texts include not only arguments, but also inspiring quotes, colorful analogies, appropriate frameworks for debates, etc.
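As a rough illustration of how a new topic might be matched to AKB thematic classes, the toy classifier below scores each class by keyword overlap with the topic. The class names, feature sets, and texts are invented; the real AKB relies on a trained feature-based classifier over curated content:

```python
# Toy AKB: each thematic class has keyword features and hand-written texts.

AKB = {
    "government spending": {
        "features": {"subsidize", "fund", "tax", "budget"},
        "texts": ["Public investment pays off when the social returns exceed the cost."],
    },
    "education": {
        "features": {"preschool", "school", "education", "students"},
        "texts": ["Education is widely seen as a key driver of equal opportunity."],
    },
}

def match_akb(topic, threshold=1):
    words = set(topic.lower().split())
    selected = []
    for cls in AKB.values():
        if len(words & cls["features"]) >= threshold:   # crude relevance score
            selected.extend(cls["texts"])
    return selected

print(match_akb("We should subsidize preschool education"))
```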

  3. Argument rebuttal: predicting the opponent’s arguments in advance

For argument rebuttal, the system uses the argument mining module, the AKB module, and arguments extracted from iDebate to compile a list of arguments the opponent is likely to raise, referred to as “clues”.

Next, IBM Watson’s automatic speech-to-text service, with customized language and acoustic models, converts the human opponent’s speech into text, which neural models then segment into sentences and punctuate.

In the next step, dedicated components determine which of the pre-predicted arguments the opponent actually stated and generate targeted rebuttals. In addition to this claim-based rebuttal, key sentiment terms are identified and used to index simpler forms of rebuttal from the AKB.
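A hedged sketch of that matching step: pre-compiled “clues” are compared against the opponent’s transcribed sentences, and a matching clue triggers its prepared response. The word-overlap similarity below stands in for the speech recognition and neural matching the real system uses; the clues and responses are invented:

```python
# Illustrative matching of pre-compiled rebuttal "clues" against opponent sentences.
# Speech-to-text, neural sentence splitting, and semantic matching are out of scope here.

CLUES = {
    "subsidies are too expensive": "Long-term savings in welfare and crime outweigh the upfront cost.",
    "parents should pay themselves": "Many families simply cannot afford quality preschool.",
}

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def find_rebuttals(opponent_sentences, threshold=0.2):
    rebuttals = []
    for sent in opponent_sentences:
        for clue, response in CLUES.items():
            if jaccard(sent, clue) >= threshold:   # did the opponent actually raise this point?
                rebuttals.append(response)
    return rebuttals

print(find_rebuttals(["These subsidies are simply too expensive for taxpayers."]))
```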

  4. Argument construction: assembling the speech statements

The argument construction module is a rule-based system with integrated cluster analysis. After removing arguments flagged as redundant, the remaining arguments are clustered by semantic similarity, with each cluster identified with a theme, typically a Wikipedia concept.

The system then selects a set of high-quality arguments, applies various text normalization and rephrasing techniques to improve fluency, and finally generates each speech, paragraph by paragraph, from predefined templates, completing its exchange with the opponent.
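To make the construction step concrete, the sketch below clusters near-duplicate arguments by word overlap, keeps one representative per cluster, and fills a fixed opening-statement template. The real module’s clustering, normalization, and rephrasing machinery is far richer; everything here is illustrative:

```python
# Toy construction: group near-duplicate arguments, keep one per group,
# and fill a predefined opening-statement template.

def word_overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def deduplicate(arguments, threshold=0.5):
    kept = []
    for arg in arguments:
        if all(word_overlap(arg, k) < threshold for k in kept):
            kept.append(arg)
    return kept

def opening_statement(topic, arguments):
    body = " Furthermore, ".join(arguments)
    return f"We are here to argue that {topic.lower()}. {body}"

args = [
    "Preschool boosts long-term academic achievement.",
    "Preschool boosts long-term academic achievement for children.",
    "Subsidies make quality care affordable for low-income families.",
]
print(opening_statement("We should subsidize preschools", deduplicate(args)))
```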

III. AI debate performance is close to that of professional human debaters

Evaluating the performance of a debate system is challenging because there is no single accepted metric for determining the winner and loser.

In public debates, audience voting before and after the debate can determine the winning side, but there are limitations to this approach.

If the pre-debate vote is highly unbalanced, the side that starts with more support carries the heavier burden. In the February 2019 human-machine debate, for example, 79% of the audience supported the AI’s side before the debate and only 13% supported the human’s side, so the AI could win over at most another 21% of the audience, while the human contestant could potentially win over up to 87%.
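Using the figures quoted above, the asymmetry is easy to make explicit:

```python
# Maximum share of the audience each side could still win over,
# given the pre-debate split quoted above (79% vs. 13%).
ai_support, human_support = 0.79, 0.13
print(f"AI headroom:    {1 - ai_support:.0%}")     # 21%
print(f"Human headroom: {1 - human_support:.0%}")  # 87%
```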

In addition, voting involves personal opinions that are difficult to quantify and control, and creating a live debate with a large and impartial audience is very difficult.

To evaluate Project Debater’s overall performance, the researchers compared it to a variety of baselines, with 15 virtual audience members scoring the debate performance of the AI system and professional human debaters on 78 debate topics.

Other than Project Debater, the researchers did not find any other method that could engage in a full debate, so the scope of the comparison was relatively limited.

In Figure a, the bars show the average score, that is, the average level of agreement that the opening statement was good, on a scale from 5 (“strongly agree”) down to 1 (“strongly disagree”). Hatched bars indicate systems whose speeches were generated by humans or relied on manually curated arguments.

The results show that Project Debater’s average scores are closest to the average scores of professional human debaters.

▲ Comparison of Debate Scores (Source: Nature)

In scoring the final system, the researchers again covered the 78 debates: 20 raters watched three types of debate speeches and scored them without knowing their source.

The results are shown in Figure b. Project Debater’s average score was above the neutral value of 3 across all debates, and exceeded 4 in 50 of the 78 debates, indicating that in at least 64% of the debates the raters judged that Project Debater performed well.

However, despite scoring above the baseline and control groups, there is still a significant gap between Project Debater’s performance and that of human debaters.

IV. Debates with high ratings are more relevant

In a follow-up evaluation, the researchers assessed Project Debater on an independent set of 36 debates; the results showed minimal overfitting.

Further analysis of these 36 debates showed that errors fell broadly into two categories: local errors, which affect specific content units within a speech, and broader errors, which propagate through multiple elements and affect the entire speech.

The most common local errors were incorrectly classified argumentative stances and content that was off-topic or broke the narrative’s coherence; in broader errors, the same kind of mistake recurred throughout the speech.

The researchers divided these debates into three groups by rating: “high” (above 3.5 points), “medium” (3–3.5 points), and “low” (below 3 points).

Notably, broader errors occurred only in the “low” group, whereas local errors appeared to some extent in almost all debates, including those in the “high” group.

In addition, the most obvious difference between the groups was the amount of content. The average total word count for the “high”, “medium”, and “low” debates was 1,496, 1,155, and 793 words, respectively.

This signature of “low” debates reflects the challenge of building a system that depends on many component outputs, each of which must be accurate, across a wide variety of topics.

Specifically, in order for the system to find relevant content, the topic of the debate must be discussed in the corpus, and the specific content units to be included in the final output must pass multiple confidence thresholds, which are set very tightly to ensure high accuracy.

This in turn has the potential to result in a lot of relevant content being filtered out, making the generation of several minutes of spoken content a difficult task.
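A small numerical illustration of why stacked thresholds shrink the usable pool: if each of three gates (say, retrieval relevance, argument quality, and stance confidence; the gates and pass rates below are made up) keeps only its most confident candidates, yield drops multiplicatively:

```python
# Hypothetical pass rates for three sequential confidence gates.
candidates = 1000
pass_rates = [0.30, 0.50, 0.60]

remaining = candidates
for rate in pass_rates:
    remaining = int(remaining * rate)

print(f"{remaining} of {candidates} candidate sentences survive all gates")  # 90 of 1000
```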

Another distinctive feature is the quality of the narrative framing provided by the AKB component in the opening and closing statements. “High” debates typically contained framing that accurately captured the essence of the debate, while “medium” debates tended to be framed in an acceptable but less relevant way.

Finally, the researchers analyzed the relative amount of content of five types covering the entire system output: mined arguments, AKB texts, rebuttals, rebuttal clues, and “canned” text (human pre-written sentence fragments).

▲Content type analysis (Source: Nature)

As shown in the figure, across all content types the “low” debates contain relatively little content, consistent with the previous analysis. The largest gap is in mined content, further suggesting that high-quality output depends on the richness of relevant arguments in the corpus and on a precise argument mining module.

In addition, the researchers examined the relative distribution of content types across all 78 debates in the original evaluation set. Less than 18% of the content was pre-written “canned” text; the rest was supplied by the more advanced underlying system components.

V. AI weaknesses: imitating the coherence of human debaters

Chris Reed, a computer scientist at the University of Dundee in Scotland, believes that the achievement of the IBM team is evident from the real-time performance of Project Debater, which not only uses knowledge extracted from a huge dataset, but also responds to human discourse instantly.

He also noted that perhaps the weakest aspect of the system is that it struggles to mimic the coherence and fluidity of a human debater. The problem lies at the highest levels of argument selection, abstraction, and orchestration.

This limitation, however, is not unique to Project Debater.

Despite more than two thousand years of human research, the understanding of argument structure remains limited.

Depending on whether the focus of argument research is on language use, epistemology, cognitive processes, or logical validity, different key features have been proposed for a coherent combination of argumentation and reasoning.

A final challenge for debating technology is whether to treat arguments as localized fragments of discourse shaped by a single set of considerations, or to situate them within larger, society-scale debates.

To a large extent, this is about designing the problem to be solved, not designing the solution.

Setting the boundaries of an argument a priori offers a theoretical simplification that brings a major computational advantage: identifying the “premises” of an argument becomes an explicit task that machines can perform almost as reliably as humans.

The problem is that humans are not particularly good at this task, precisely because it is an artificial one.

In an open discussion, a given piece of discourse may serve as a claim in one context and as a premise in another. Moreover, in the real world there are no clear boundaries delimiting an argument: discourse outside the debating chamber is not discrete, but is connected by cross-references, analogies, examples, and generalizations.

Theories have been proposed on how AI can solve such argumentative networks. But the theoretical challenges and socio-technical problems associated with these implementations are enormous.

Designing ways to attract large audiences to such systems is as difficult as designing straightforward mechanisms to enable them to engage with these complex networks of argumentation.

Conclusion: Looking at the future of AI systems from the ambitions of debate AI

Project Debater is ambitious, both as an AI system and as a grand challenge for the AI field.

AI and NLP research tends to focus on “narrow AI,” tasks that require fewer resources, often have clear evaluation metrics, and are amenable to end-to-end solutions.

In contrast, “composite AI” tasks touch a broader range of human cognitive activities, require multiple skills to be applied simultaneously, and receive less attention from the AI community. IBM decomposed its composite AI task into a tangible set of narrow tasks and developed a corresponding solution for each.

The results show that a system that properly organizes these components can meaningfully engage in complex human activities that IBM researchers believe are not readily amenable to a single end-to-end solution.

IBM’s research shows that AI can engage in complex human activities.

Project Debater challenges difficult problems that go far beyond the comfort of current AI technology. It offers the promising prospect that when AI can better understand human language while becoming more transparent and interpretable, humans will also be able to make better decisions with the help of AI.

At a time of rampant fake news, polarized opinion, and lazy reasoning, AI could support humans in creating, processing, and sharing complex arguments, among other things.

As Project Debater said when greeting its opponent at the San Francisco demonstration, “I’ve heard that you hold the world record for wins in debate competitions against humans, but I’m afraid you’ve never debated a machine. Welcome to the future.”