Google introduced Spotlight, a foundational model for mobile UI understanding that targets tasks such as command grounding, screen summarization, tappability prediction, and widget captioning. Traditional mobile UI modeling typically relies on view hierarchy information, but view hierarchies are often unavailable or corrupted. Spotlight not only bypasses the need for view hierarchies but also outperforms existing models. It builds on established Google architectures, notably the Vision Transformer (ViT) for encoding screenshot images and the Text-To-Text Transfer Transformer (T5) for generating language.
In a nutshell, Spotlight handles multiple UI tasks through a unified input-output representation. The input consists of three elements: the screenshot, the region of interest on the screen, and a text description of the task. The output is a text description of the region of interest. Most notably, a Focus Region Extractor, which leverages region-of-interest alignment and region summarization, ensures that the model can focus effectively on the region of interest within the screenshot. The research paper also discusses three strategies for adapting the pre-trained model to downstream tasks: task-specific fine-tuning, multi-task learning, and few-shot prompting.
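To make this concrete, the snippet below is a minimal, runnable sketch of the focus-region idea, not Google's implementation: it assumes ViT patch features laid out on a grid and simply collects the patches overlapping a normalized region of interest, standing in for the paper's ROI-alignment-based Focus Region Extractor. All names and shapes are illustrative.

```python
# Minimal sketch (assumptions, not Google's code): select the ViT patch tokens that
# overlap a normalized region of interest, so a decoder can attend to that region.
import numpy as np

def focus_region_tokens(patch_feats: np.ndarray, roi: tuple) -> np.ndarray:
    """patch_feats: (grid_h, grid_w, d) ViT patch embeddings for one screenshot.
    roi: (x1, y1, x2, y2) in [0, 1] normalized screen coordinates.
    Returns the embeddings of the grid cells that overlap the region of interest."""
    grid_h, grid_w, _ = patch_feats.shape
    x1, y1, x2, y2 = roi
    # Map normalized coordinates to patch-grid indices, keeping partially covered cells.
    c1, c2 = int(np.floor(x1 * grid_w)), int(np.ceil(x2 * grid_w))
    r1, r2 = int(np.floor(y1 * grid_h)), int(np.ceil(y2 * grid_h))
    region = patch_feats[r1:r2, c1:c2, :]
    return region.reshape(-1, patch_feats.shape[-1])  # flatten to a token sequence

# Example: a 14x14 grid of 768-d patch features, focusing on a button in the top-right corner.
feats = np.random.randn(14, 14, 768)
tokens = focus_region_tokens(feats, (0.7, 0.0, 1.0, 0.15))
print(tokens.shape)  # (15, 768): the region's patch tokens, ready for the text decoder
```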
Some of Google’s other innovations included:
- BARD, a LaMDA-powered experimental conversational AI service
- Dreamix, a diffusion-based video editing tool
- Grounded Decoding, a new robot control framework
- ROSIE (Robot Learning with Semantically Imagined Experience) for scaling learning in robotics
- SPEAR-TTS, a multi-speaker text-to-speech system
- Scaled Q-Learning, a novel approach for pre-training in reinforcement learning
- ViT-22B, the largest vision transformer model to date
Meta announced its foundational model LLaMA (Large Language Model Meta AI) for closed-book question answering, code generation, common sense reasoning, mathematical reasoning, reading comprehension, and related NLP tasks. The associated research paper reports that the 13B-parameter version of LLaMA outperforms GPT-3 (175B parameters) on most of these benchmarks, while the 65B version is shown to be on par with other top-performing language models, such as Chinchilla and PaLM.
LLaMA's network architecture incorporates several modern innovations: (i) normalization of each transformer sub-layer's input with RMSNorm, (ii) SwiGLU instead of ReLU as the non-linear activation, and (iii) rotary positional embeddings in place of absolute positional embeddings. Furthermore, the paper evaluates LLaMA's biases across nine categories (including gender, race/color, and religion) using the CrowS-Pairs and WinoGender benchmark datasets; this evaluation primarily measures the model's preference for stereotypes over anti-stereotypes. Additionally, the TruthfulQA benchmark is used to assess how likely the model is to generate false claims and misinformation.
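To make these three architectural components concrete, the numpy sketches below show one common way each is implemented. The shapes, hyperparameters, and the half-split rotary variant are illustrative assumptions, not Meta's code.

```python
# Minimal numpy sketches (illustrative assumptions, not Meta's implementation) of the
# three components highlighted above: RMSNorm, SwiGLU, and rotary positional embeddings.
import numpy as np

def rms_norm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Normalize by the root-mean-square of the activations (no mean subtraction, no bias).
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def swiglu(x: np.ndarray, w: np.ndarray, v: np.ndarray) -> np.ndarray:
    # Gated feed-forward activation: SiLU(x @ w) multiplied elementwise by a linear branch.
    a = x @ w
    return (a * (1.0 / (1.0 + np.exp(-a)))) * (x @ v)

def rotary_embed(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    # Rotate dimension pairs of each position's query/key vector by a position-dependent
    # angle, so relative position is encoded directly inside the attention dot product.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Inside each transformer block, RMSNorm is applied to the sub-layer input (pre-normalization), SwiGLU forms the feed-forward network, and the rotary transform is applied to queries and keys before attention.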
Meta also announced Toolformer, a self-supervised language model that decides which external tools to use (e.g., a calculator, a search engine, or a translation system) and how to use them in order to generate accurate answers. The company, in partnership with academia, also introduced Multi-task Video Grounding From Multimodal Queries (MINOTAUR) for query-based video understanding. Another innovation is LEVER (Learning to Verify Language-to-Code Generation), Meta's simple but practical method for improving pre-trained code language models (CodeLMs) by verifying generated programs against their execution results.
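To illustrate the intuition behind execution-based verification, here is a simplified, runnable sketch. LEVER itself trains a verifier that scores each candidate program together with the problem description and the program's execution result; the snippet below substitutes a crude heuristic, keeping the candidate whose execution result agrees with the majority. Function names are illustrative.

```python
# Simplified sketch of execution-guided candidate selection (not LEVER's learned verifier):
# run each generated program, discard ones that fail, and prefer the consensus result.
from collections import Counter

def execute(program: str):
    """Run a candidate snippet in an isolated namespace and return its 'answer' variable."""
    scope = {}
    try:
        exec(program, scope)   # illustrative only; never run untrusted code like this
        return scope.get("answer")
    except Exception:
        return None            # execution error disqualifies the candidate

def pick_best(candidates: list[str]) -> str:
    results = {prog: execute(prog) for prog in candidates}
    valid = {p: r for p, r in results.items() if r is not None}
    if not valid:
        return candidates[0]
    # Prefer the candidate whose execution result agrees with the most other candidates.
    majority = Counter(valid.values()).most_common(1)[0][0]
    return next(p for p, r in valid.items() if r == majority)

# Example: three candidate programs for "what is 12 * 7 + 5?"
cands = ["answer = 12 * 7 + 5", "answer = 12 * (7 + 5)", "answer = 12 * 7 + 5 "]
print(pick_best(cands))  # picks a program whose result (89) is the consensus
```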
Amazon introduced a novel Multimodal Chain-of-Thought (MM-CoT) reasoning model in which vision and language inputs are combined to generate intermediate reasoning chains, which then serve as the rationale for inferring the final answer. The model is developed through a two-stage training process: rationale generation and answer inference. Both stages share the same model architecture but differ in their inputs and outputs. The researchers showed that their model, with fewer than 1B parameters, outperformed GPT-3.5 and human performance on the ScienceQA benchmark.
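The two-stage flow can be sketched structurally as follows; `rationale_model` and `answer_model` are hypothetical stand-ins for the two fine-tuned vision-language models, not Amazon's released code.

```python
# Structural sketch of the two-stage MM-CoT pipeline (stand-in callables, not real models).
# Both stages share an architecture, but stage 2 additionally consumes stage 1's rationale.
from typing import Callable

def mm_cot_infer(
    question: str,
    image_features,                               # vision input (e.g., extracted image features)
    rationale_model: Callable[[str, object], str],
    answer_model: Callable[[str, object], str],
) -> str:
    # Stage 1: generate an intermediate reasoning chain from the multimodal input.
    rationale = rationale_model(question, image_features)
    # Stage 2: append the rationale to the language input and infer the final answer.
    augmented_question = f"{question}\nRationale: {rationale}"
    return answer_model(augmented_question, image_features)

# Toy usage with stand-in models, just to show the data flow.
toy_rationale = lambda q, img: "The diagram shows a closed circuit, so current flows."
toy_answer = lambda q, img: "(B) The bulb lights up."
print(mm_cot_infer("Which option describes the bulb?", None, toy_rationale, toy_answer))
```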
NVIDIA introduced the Retrieval-augmented Visual Language Model (Re-ViLM) for improving zero-shot and few-shot image-to-text generation performance. The company also announced ACE-VC, an adaptive and controllable zero-shot voice conversion system based on self-supervised learning. Meanwhile, Stanford researchers announced Hungry Hungry Hippos (H3), a state-space-based approach to language modeling. While attention remains the most popular approach for language modeling, innovations in state space models (SSMs) can yield significant improvements in scaling and hardware efficiency.
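For readers unfamiliar with SSMs, the sketch below shows the basic linear state-space recurrence that such layers build on; real H3 layers use structured (shift and diagonal) state matrices and an FFT-based convolutional view for efficient training, which are omitted here. All names and dimensions are illustrative.

```python
# Minimal numpy sketch of the linear state-space recurrence underlying SSM layers like H3.
import numpy as np

def ssm_scan(u: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Sequentially apply x_t = A x_{t-1} + B u_t, y_t = C x_t over a 1-D input sequence.
    Unlike attention, each step touches only a fixed-size state, so cost is linear in length."""
    state = np.zeros(A.shape[0])
    outputs = []
    for u_t in u:
        state = A @ state + B * u_t
        outputs.append(C @ state)
    return np.array(outputs)

# Example: a 4-dimensional state processing a length-10 input sequence.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                 # simple stable (diagonal) state matrix
B = rng.standard_normal(4)
C = rng.standard_normal(4)
print(ssm_scan(rng.standard_normal(10), A, B, C).shape)  # (10,)
```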