GPT-4’s highly anticipated release in March 2023 generated a great deal of interest, but as its flaws have emerged, criticism has grown. Traffic to ChatGPT fell by a noticeable 9.7% in June. Concerns deepened in July, when a Stanford University study revealed that the performance of GPT-4 and GPT-3.5 had changed significantly over time, with GPT-4 declining on several tasks.
The Study Results
The Stanford research concentrated on four key areas: solving math problems, answering sensitive questions, visual reasoning, and code generation. GPT-4’s accuracy on the math problems fell precipitously, from a respectable 97.6% in March to a meager 2.4% in June. Surprisingly, GPT-3.5’s accuracy rose over the same period, from 7.4% to an impressive 86.8%.
GPT-4’s rate of responding to sensitive questions declined from 21% in March to barely 5% in June, while GPT-3.5’s rose from 2% to 8%.
In code generation, both models became less likely to produce directly executable code. GPT-4 saw a considerable decline, from 52% to 10%, while GPT-3.5 dropped from 22% to 2%.
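To illustrate what “directly executable” means in practice, here is a minimal sketch of how such a check could work: the model’s raw response is saved to a file and run as-is, so any extra formatting (such as markdown code fences around the code) causes the check to fail. The helper name and the use of Python’s subprocess module are assumptions for illustration, not the study’s actual evaluation harness.

```python
import subprocess
import sys
import tempfile


def is_directly_executable(response: str, timeout: int = 10) -> bool:
    """Return True if the model's raw response runs as Python code without edits.

    Note: running untrusted model output is risky; a real evaluation would
    need sandboxing. This is only a minimal illustration.
    """
    # Write the response verbatim to a temporary file. If the model wrapped its
    # code in markdown fences, the file will not run as-is, which is exactly
    # the kind of regression the study reports.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(response)
        path = f.name

    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```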
In addition, the research showed that GPT-4 made mistakes on visual reasoning tasks it had handled correctly in March.
The research raised awareness of the problem of “LLM drift”: changes in a model’s behavior over time that can disrupt downstream pipelines built on the assumption of consistent outputs.
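To make the downstream-pipeline concern concrete, the sketch below shows one simple way drift could be monitored: a fixed set of prompts with known expected answers is re-run against the model on a schedule, and an accuracy drop beyond a tolerance is flagged. The `query_model` wrapper and the threshold values are placeholders for illustration, not anything prescribed by the study.

```python
from typing import Callable


def check_for_drift(
    eval_set: list[tuple[str, str]],      # (prompt, expected answer) pairs
    query_model: Callable[[str], str],    # placeholder wrapper around the LLM API
    baseline_accuracy: float,
    tolerance: float = 0.05,
) -> bool:
    """Re-run a fixed evaluation set and flag drift if accuracy falls too far."""
    correct = sum(
        1
        for prompt, expected in eval_set
        if expected.strip().lower() in query_model(prompt).strip().lower()
    )
    accuracy = correct / len(eval_set)
    drifted = accuracy < baseline_accuracy - tolerance
    if drifted:
        print(f"Drift detected: accuracy {accuracy:.1%} vs baseline {baseline_accuracy:.1%}")
    return drifted
```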
Although the precise causes of the performance drop are unclear, critics worry that GPT-4 is becoming “lazier” and less capable. Peter Welinder, VP of Product at OpenAI, disputes these claims, contending that users simply become more aware of the technology’s limitations the more they use it.
Other GPT-4 Problems
Along with the performance decline, GPT-4 and other large language models (LLMs) face several additional problems. These include language limitations, weaknesses in cybersecurity procedures, and errors in classifying malware risks. LLMs also frequently give users false or fabricated information, raising further questions about their reliability and trustworthiness.
Despite the mixed reviews, a survey by the Capgemini Research Institute found that 73% of users trust content created by generative AI, and many are open to receiving guidance from virtual assistants powered by the technology. At the same time, much of the public remains concerned about LLMs’ capabilities.