The competition among companies to develop advanced artificial intelligence is intense. To gain an edge against rivals like OpenAI’s GPT-4 and Meta’s Llama 2, Google has launched a new AI model called Gemini. The model was built from scratch to handle multiple types of data at once, including text, code, audio, images, and video. It will be available in three sizes: Ultra for complex tasks, Pro for scaling across a broad range of tasks, and Nano for on-device tasks. Gemini is the first model from Google DeepMind since the merger of the company’s AI research units, DeepMind and Google Brain.
Late last year, OpenAI launched ChatGPT, powered by GPT-3.5, and it created a buzz around the world. Initially caught off guard, Google is now preparing itself for competition. It recently launched Gemini and made some bold claims about it, but questions have been raised about their accuracy. Google DeepMind claims that Gemini outperforms GPT-4 on 30 of the 32 standard performance measures, though the margins are narrow. Although Google presented a futuristic vision to the public, concerns about accuracy are now at the forefront.
According to benchmark tests conducted by Google, Gemini Ultra scored 90 percent on the Massive Multitask Language Understanding (MMLU) test, surpassing both human experts (89.8 percent) and GPT-4 (86.4 percent). MMLU covers 57 subjects, such as math, physics, history, law, medicine, and ethics, to evaluate both problem-solving abilities and world knowledge.
It’s worth noting that Google used different prompting techniques for the two models. GPT-4’s score of 86.4 percent relied on the industry-standard “5-shot” prompting technique, while Gemini Ultra’s 90 percent result was based on a different method: “chain-of-thought with 32 samples.”
Also, it’s important to mention that Google conducted these tests using an outdated version of GPT-4, as indicated in the yellow box in the image above. The note mentions that they used a “previous state-of-the-art” (SOTA) version for GPT-4.
When the 5-shot MMLU was used to evaluate both models, GPT-4 achieved a score of 86.4 percent, surpassing Gemini Ultra’s score of 83.7 percent. Additionally, when the 10-shot HellaSwag was used to measure commonsense reasoning, GPT-4 outperformed both Gemini Ultra (87.8 percent) and Gemini Pro (84.7 percent) by scoring 95.3 percent.
In the context of these benchmarks, the term “shot” refers to the number of worked examples included in the prompt, not to examples used during training. In 5-shot prompting, for instance, the model is shown five solved examples of the task before the question it must answer, and its weights are not updated. (The related term few-shot learning, as used in classification, describes training a model on only a small number of labeled examples per class.)
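To make the prompting sense of “5-shot” concrete, here is a minimal sketch of how such a prompt could be assembled. The function name `build_few_shot_prompt`, the “Question:/Answer:” layout, and the sample questions are illustrative assumptions, not the format Google or OpenAI actually used in their evaluations.

```python
def build_few_shot_prompt(examples, question, k=5):
    """Build a k-shot prompt: k solved examples followed by the open question.

    `examples` is a list of (question, answer) pairs. The layout below is a
    hypothetical format chosen for illustration only.
    """
    shots = examples[:k]  # keep only the first k solved examples
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    # The final question is left unanswered for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Made-up examples, purely for demonstration.
solved = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
    ("How many days are in a week?", "7"),
    ("What is the chemical symbol for water?", "H2O"),
    ("What is 10 / 2?", "5"),
    ("What color is the sky on a clear day?", "Blue"),
]

prompt = build_few_shot_prompt(solved, "What is 3 * 3?", k=5)
print(prompt)
```

The model sees the five solved examples only as text in its input, which is why changing the number of shots can change a benchmark score without any retraining.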
Chain of thought (CoT) refers to the logical progression or sequence of steps that an AI model follows to arrive at a decision or output. In other words, CoT prompting involves guiding the model to think step by step before generating an answer.
Google used different prompting techniques for benchmarks like GSM8K (grade-school math), DROP (reading comprehension), and HumanEval (coding). Gemini Ultra achieved its highest accuracy with a chain-of-thought prompting approach that accounts for model uncertainty. The model generates multiple chain-of-thought samples for the same question. If the samples reach a consensus, it selects that answer; otherwise, it falls back to a greedy sample.
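The consensus-or-greedy rule described above can be sketched in a few lines. This is a simplified illustration, not Google’s implementation: the function name `route_answer` and the 0.5 agreement threshold are assumptions, and the sampled answers are supplied as plain strings rather than produced by a model.

```python
from collections import Counter


def route_answer(sampled_answers, greedy_answer, threshold=0.5):
    """Pick the majority answer if agreement is high enough, else the greedy one.

    `sampled_answers` holds the final answers extracted from several
    chain-of-thought samples (for example, 32 of them). `threshold` is a
    hypothetical cutoff for what counts as a consensus.
    """
    if not sampled_answers:
        return greedy_answer
    # Find the most common answer and how many samples voted for it.
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    agreement = votes / len(sampled_answers)
    # Consensus reached: trust the vote. Otherwise fall back to greedy decoding.
    return answer if agreement >= threshold else greedy_answer


# 20 of 32 samples agree on "B", so the vote wins over the greedy answer "A".
confident = route_answer(["B"] * 20 + ["A"] * 12, greedy_answer="A")

# The samples are split evenly four ways, so the greedy answer "D" is used.
uncertain = route_answer(["A", "B", "C", "D"] * 8, greedy_answer="D")

print(confident, uncertain)
```

The design point is that sampling many reasoning paths costs far more compute than a single 5-shot answer, which is part of why scores obtained the two ways are not directly comparable.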
An op-ed by Olson clarified that the voice in Google’s demo video was simply reciting predetermined prompts that humans had fed into Gemini, alongside static images. This differs from what Google appeared to suggest, namely that a person could hold a seamless, real-time conversation with Gemini as it observed and responded to its surroundings. In the demo video, Google had reduced latency and shortened Gemini’s outputs for the sake of brevity.
Oriol Vinyals, Google DeepMind’s VP of Research and co-lead of Gemini, said in an X post that he was pleased with the attention the “Hands-on with Gemini” video had received. Vinyals confirmed that all of the user prompts and outputs shown in the video were authentic but shortened for brevity. He said the video illustrates what multimodal user experiences built with Gemini might look like and was intended to inspire developers.