Artificial intelligence (AI) chatbots can carry out impressive tasks such as composing music, writing poetry and debugging code, yet they still seem to lack a coherent understanding of the world.
Large language models (LLMs) may be able to complete a sentence fluently, but they may not be as impressive as they seem.
A new study found that a widely used generative AI model performed excellently at first, but when it was asked additional questions, some complex and some not, the quality of its answers began to deteriorate.
Gen AI Put to the Test
The researchers began by asking a gen AI model to provide directions around New York City. The model produced highly precise turn-by-turn driving instructions, even though it had not formed an accurate internal map of the city.
However, the AI model’s ability to navigate plummeted when the researchers closed some streets and added detours.
Eventually, the scientists realised that the gen AI model had devised its own internal representation of New York City, despite never being explicitly trained on existing maps.
That representation, however, included many nonexistent streets curving across the grid and connecting distant points in the city.
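To illustrate the kind of comparison involved, here is a minimal Python sketch, not the authors' actual setup, in which the city is treated as a directed graph of intersections, a hypothetical reconstruction of the streets the model believes in is compared against the true map, and a street closure simulates a detour. All of the graphs and names below are illustrative assumptions.

```python
# A toy "true" street graph: intersection -> intersections reachable by one street.
true_map = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
}

# Hypothetical edges recovered from a model's turn-by-turn directions.
# The edge B -> A does not exist on the real map: a "phantom" street.
model_map = {
    "A": {"B", "C"},
    "B": {"C", "A"},
    "C": {"A"},
}

def phantom_edges(model_graph, true_graph):
    """Streets the model believes in that the real map does not contain."""
    return {
        (src, dst)
        for src, dsts in model_graph.items()
        for dst in dsts
        if dst not in true_graph.get(src, set())
    }

def close_street(graph, src, dst):
    """Simulate a detour by removing one directed street from the map."""
    trimmed = {node: set(dsts) for node, dsts in graph.items()}
    trimmed[src].discard(dst)
    return trimmed

print(phantom_edges(model_map, true_map))    # {('B', 'A')}
detoured = close_street(true_map, "A", "B")
print(phantom_edges(model_map, detoured))    # {('B', 'A'), ('A', 'B')}
```

In this toy comparison, closing a single street immediately exposes a second mismatch between the model's picture of the city and reality, which is the kind of fragility the study describes.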
Experts believe this could have significant implications for gen AI models used in the real world: a model that performs well in one context could break down if the task or the environment changes even slightly.
“Such incoherence creates fragility – using a generative model to solve related but subtly different tasks can lead it to fail badly,” the study’s authors stated.
They suggest that building generative models which meaningfully capture the underlying logic of the domains they model would be immensely valuable.
“Our results suggest new ways to assess how close a given model is to that goal,” the authors added.
Examining Gen AI’s Navigation and Gaming Capabilities
The examinations focused on transformer models, the architecture that forms the foundation of AI chatbots such as GPT-4.
According to MIT News, transformers are trained on a massive amount of language-based data to predict the next token in a sequence, such as the next word in a sentence.
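As a rough illustration of that training objective, the toy Python sketch below, an illustrative simplification rather than how a real transformer works, counts which word follows which in a tiny corpus and predicts the most frequent continuation; actual transformers learn the same next-token objective with attention layers trained over vast datasets.

```python
from collections import Counter, defaultdict

# Tiny corpus standing in for the "massive amount of language-based data".
corpus = "the model predicts the next word in the next sentence".split()

# For each word, count which words have followed it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the continuation seen most often after `word`, if any."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))   # 'next' -- it followed 'the' twice in this corpus
```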
To test whether such LLMs form coherent world models, the researchers designed two metrics, called sequence distinction and sequence compression.
Roughly speaking, sequence distinction checks whether the model recognises that two different states, such as two different board positions, are different, while sequence compression checks whether it knows that two identical states share the same set of possible next steps.
For instance, the researchers tested the models on navigating intersections in a city and on Othello board configurations.
When asked to generate valid moves in Othello, the model could predict them precisely from patterns in its training data, but it did not develop a coherent understanding of the game's rules.
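A loose Python sketch of how those two checks could be expressed is shown below. It is not the authors' implementation: the "world" is a toy finite-state machine, and the stand-in model function simply echoes the true valid moves, whereas a real evaluation would query an LLM's next-token probabilities.

```python
# Toy "true world": a partial finite-state machine over moves 'a' and 'b'.
# (state, move) -> next state; moves missing from the table are illegal.
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 2,
    (2, "b"): 0,
}

def true_state(seq, start=0):
    """Replay a legal move sequence and return the state it ends in."""
    state = start
    for move in seq:
        state = TRANSITIONS[(state, move)]
    return state

def true_valid_next(seq):
    """Moves the true world model allows after this sequence."""
    s = true_state(seq)
    return {move for (state, move) in TRANSITIONS if state == s}

def model_valid_next(seq):
    """Stand-in for the generative model: the moves it would accept next.
    A real evaluation would threshold an LLM's next-token probabilities;
    this placeholder simply echoes the true model, so both checks pass."""
    return true_valid_next(seq)

def sequence_compression_ok(seq_a, seq_b):
    """Two sequences reaching the SAME true state should get identical
    sets of accepted continuations from a coherent model."""
    assert true_state(seq_a) == true_state(seq_b)
    return model_valid_next(seq_a) == model_valid_next(seq_b)

def sequence_distinction_ok(seq_a, seq_b):
    """Two sequences reaching DIFFERENT true states should get different
    sets of accepted continuations from a coherent model."""
    assert true_state(seq_a) != true_state(seq_b)
    return model_valid_next(seq_a) != model_valid_next(seq_b)

print(sequence_compression_ok(["a", "a", "b"], ["b"]))   # True: both end in state 0
print(sequence_distinction_ok(["a"], ["a", "a"]))        # True: states 1 and 2 differ
```

In this toy setting both checks pass by construction; the study's point is that a real model can predict valid moves accurately while still failing such checks, which is what the Othello result illustrates.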
The authors stated that if scientists want to determine whether an LLM has formed an accurate model of the world, measuring the accuracy of its predictions doesn’t go far enough.
Senior author Ashesh Rambachan, from the MIT Laboratory for Information and Decision Systems (LIDS), says one hope is that, because LLMs can accomplish all these amazing things in language, the same tools might also be useful in other parts of science.
“But the question of whether LLMs are learning coherent world models is very important if we want to use these techniques to make new discoveries,” he added.
Rambachan said that people often see these models do impressive things and assume they must have understood something about the world.
“I hope we can convince people that this is a question to think very carefully about, and we don’t have to rely on our own intuitions to answer it,” he cautioned.