AI models are getting better at answering questions, but they’re not perfect



Late last year, the Allen Institute for AI, the research institute founded by the late Microsoft co-founder Paul Allen, quietly open-sourced a large AI language model called Macaw. Unlike other language models that have recently captured public attention (see OpenAI’s GPT-3), Macaw is quite limited in what it can do, only answering and generating questions. But the researchers behind Macaw say it can outperform GPT-3 on a set of questions, despite being an order of magnitude smaller.

Answering questions might not be the most exciting application of AI. But question-answering technologies are becoming increasingly valuable in business. Rising customer call and email volumes during the pandemic have prompted companies to turn to automated chat assistants – according to Statista, the chatbot market will exceed $1.25 billion in size by 2025. But chatbots and other conversational AI technologies remain quite rigid, bound by the questions they were trained on.

Today, the Allen Institute released an interactive demo for exploring Macaw, to accompany the GitHub repository containing Macaw’s code. The lab believes the performance and “practical” size of the model – about 16 times smaller than GPT-3 – illustrate how large language models are being “commodified” into something much more widely accessible and deployable.

Answering questions

Built on UnifiedQA, the Allen Institute’s previous attempt at a generalizable question-answering system, Macaw was fine-tuned on datasets containing thousands of yes/no questions, stories designed to test reading comprehension, explanations for questions, and science and English exam questions. The largest version of the model – the one used in the demo, and the one that is open source – contains 11 billion parameters, significantly fewer than GPT-3’s 175 billion.

Given a question, Macaw can produce an answer and an explanation. Given an answer, the model can generate a question (possibly a multiple-choice question) and an explanation. And given an explanation, Macaw can produce a question and an answer.
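These input/output combinations can be sketched as a simple prompt-building step. The slot names (`$answer$`, `$question$`, `$explanation$`) and the semicolon-separated format below are assumptions drawn from the conventions documented in the allenai/macaw GitHub repository, not details stated in this article – treat this as an illustrative sketch rather than the definitive interface:

```python
# Hypothetical sketch of Macaw's slot-based input format: the requested
# output slots come first, followed by the slots whose values are given.
# Slot names and separators are assumed from the allenai/macaw repo.

def build_macaw_input(output_slots, **input_slots):
    """Compose a Macaw-style prompt string from requested and given slots."""
    requested = " ; ".join(f"${slot}$" for slot in output_slots)
    given = " ; ".join(f"${slot}$ = {text}" for slot, text in input_slots.items())
    return f"{requested} ; {given}"

# Ask for an answer and an explanation, given a question:
prompt = build_macaw_input(
    ["answer", "explanation"],
    question="How would you make a house conduct electricity?",
)
print(prompt)
# $answer$ ; $explanation$ ; $question$ = How would you make a house conduct electricity?
```

The same helper covers the other directions the article describes – for instance, requesting `["question", "answer"]` while supplying an `explanation` slot.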

“Macaw was built by training Google’s T5 transformer model on approximately 300,000 questions and answers, collected from several existing datasets that the natural language community has created over the years,” Peter Clark and Oyvind Tafjord of the Allen Institute, who participated in Macaw’s development, told VentureBeat via email. “The Macaw models were trained on a Google Cloud TPU (v3-8). The training builds on the pretraining already done by Google on its T5 model, avoiding a significant expense (both financial and environmental) in building Macaw. Starting from T5, the additional fine-tuning we did for the largest model took 30 hours of TPU time.”

Above: Examples of Macaw’s abilities.

Image Credit: Allen Institute

In machine learning, parameters are the parts of the model learned from historical training data. Generally speaking, in the language domain, the correlation between parameter count and sophistication has held up remarkably well. But Macaw punches above its weight. When tested on 300 questions created by Allen Institute researchers specifically to “break” Macaw, it outperformed not only GPT-3 but also AI21 Labs’ recent Jurassic-1 Jumbo model, which is even larger than GPT-3.

According to the researchers, Macaw shows some ability to reason about novel hypothetical situations, which allows it to answer questions such as “How would you make a house conduct electricity?” with “Paint it with metallic paint.” The model also hints at an awareness of the roles objects play in different situations, and seems to grasp implication, for example answering the question “If a bird didn’t have wings, how would it be affected?” with “It would be unable to fly.”

But the model has limits. Macaw is often fooled by questions with false presuppositions, like “How old was Mark Zuckerberg when he founded Google?” It sometimes makes mistakes on questions that require commonsense reasoning, such as “What happens if I drop a glass on a bed of feathers?” (Macaw replies, “The glass breaks”). In addition, the model generates answers that are overly brief, breaks down when questions are rephrased, and repeats answers to certain questions.

The researchers also note that Macaw, like other large language models, is not free of bias and toxicity, which it could pick up from the datasets used to train it. Clark added: “Macaw is released without any usage restrictions. As an openly released model, it comes without guarantees (in terms of bias, inappropriate language, etc.), so we expect its initial use to be for research purposes – for example, to study what current models are capable of.”


Macaw may not solve the outstanding challenges in language model design, one of which is bias. And the model still requires fairly powerful hardware to run – the researchers recommend 48GB of total GPU memory. (Two of Nvidia’s 3090 GPUs, which have 24GB of memory each, cost $3,000 or more, not counting the other components needed to run them.) But Macaw does demonstrate, the Allen Institute argues, that capable language models are becoming more accessible than they used to be. GPT-3 isn’t open source, but if it were, one estimate pegs the cost of running it on a single Amazon Web Services instance at a minimum of $87,000 per year.

Macaw from the Allen Institute

Macaw joins other multitask open source models released over the past few years, including EleutherAI’s GPT-Neo and BigScience’s T0. DeepMind recently showed off a 7-billion-parameter model, RETRO, which it says can beat models 25 times its size by leveraging a large text database. These models have already found new applications and spawned startups. Macaw – and similar question-answering systems – might be on the verge of doing the same.



