Building Next-Gen Artificial Intelligence Systems Through Multimodal Machine Learning

Human perception is multimodal. We make sense of objects and events through multiple modalities (sensory organs), and that is why we excel at understanding the world around us. Similarly, in many real-world problems, Artificial Intelligence systems become more efficient when they process inputs (signals) from multiple modalities and then generate outputs (predictions). This is achieved through Multimodal Machine Learning.

Applications of Multimodal Machine Learning

Multimodal AI research and development has made significant progress in the past two or three years. For instance, machine vision is increasingly being integrated with natural language (text) and audition (audio/speech signals) to solve many real-life problems. The large-scale availability of various types of sensor data (e.g., from IoT applications), the exponential increase in audio-video content, the maturity of data storage and processing technologies, critical innovations in deep learning, and other factors have been instrumental in triggering the shift towards building better machine perception.

Three critical factors give multimodal learning an edge over unimodal machine learning.

  1. Predictions tend to be more robust when real-life phenomena are observed and interpreted through multiple modalities.
  2. Data from multiple modalities often encode complementary information that may not be easily available (or identifiable) in single modal data.
  3. Systems built through multiple modalities can operate even when individual modalities are partly or fully unavailable. For example, in many real-life use cases, emotion recognition can be conducted through visual signals even when the subjects are not speaking.

As a result, multimodal learning offers two critical advantages in building intelligent systems – an increased awareness of context, and a deeper understanding of patterns. Some of the important applications are highlighted below.

  • Autonomous driving, e.g. monitoring & planning tasks based on data generated through multiple sensor modalities.
  • Healthcare, e.g. integrating data from signals (EEG), vision (scans) & natural language (text) for faster, better diagnostics.
  • Human activity recognition, e.g. predicting people’s movement in security-related applications.
  • Hyper-personalization in advertising, recommendations, etc.
  • Hate & sarcasm detection, e.g. detecting hate speech in memes where one modality (say, text) may be benign but the other (say, image) may be harmful.
  • Multimedia content analysis, e.g. extracting rich semantic & contextual information, image & video captioning, etc.
  • Multimodal emotion recognition & sentiment analysis
  • Personality analysis, e.g. in HR candidate screening.
  • Privacy & security checks, e.g. advanced authentication by integrating data from multiple biometric sources.
  • Real-world event detection, including understanding the sequence of multiple events of interest.
  • Robotics, e.g. enabling real-time collaboration between human workers and robots in manufacturing.
  • Visual question answering, visual commonsense reasoning, visual storytelling, etc.

Core Elements of Multimodal Learning

Multimodal learning is a complex exercise. Broadly speaking, this learning paradigm involves five core elements (or problems) that need to be addressed within the boundaries of the multimodal setting.

Multimodal Representation

Data in a multimodal setting needs to be represented in a meaningful way using information from multiple entities. A good representation should exhibit desirable properties such as exploiting complementarity across modalities, natural clustering, similarity of concepts, smoothness, and spatio-temporal coherence. This is a complex problem due to the heterogeneity of the data (e.g., audio waves are continuous signals while natural language is symbolic), the varying degrees of noise present in each modality, and other issues.

Two types of multimodal representations are adopted – coordinated representations (separate representations are learnt for each modality, and then coordinated through constraints) and joint representations (unimodal representations are projected together into a multimodal space).

Additionally, multimodal representation may also need to consider the features extracted from the different modalities. Note that feature extraction in one modality is generally independent of the feature extraction in the other modalities, and this tends to increase the scope of the problem.
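As a rough illustration of the joint-representation idea, the PyTorch sketch below projects pre-extracted image and text feature vectors into a single shared space. The feature dimensions (2048 for images, 768 for text) and the module names are illustrative assumptions, not part of any specific framework.

```python
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    """Projects pre-extracted unimodal features into one shared (joint) space."""
    def __init__(self, image_dim=2048, text_dim=768, joint_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.fusion = nn.Sequential(
            nn.Linear(2 * joint_dim, joint_dim),
            nn.ReLU(),
        )

    def forward(self, image_feat, text_feat):
        # Project each modality, concatenate, then map into one joint embedding.
        z_img = torch.relu(self.image_proj(image_feat))
        z_txt = torch.relu(self.text_proj(text_feat))
        return self.fusion(torch.cat([z_img, z_txt], dim=-1))

# Example: a batch of 4 image/text feature pairs.
model = JointRepresentation()
joint = model(torch.randn(4, 2048), torch.randn(4, 768))
print(joint.shape)  # torch.Size([4, 512])
```

A coordinated representation would instead keep the two projections separate and tie them together through a constraint such as a similarity or ranking loss rather than a shared fusion layer.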

Translation

Translation is the process of mapping data from one modality to another. It is a complex process because of the heterogeneity of the data and the often subjective nature of the relationship between the different modalities. For instance, while speech recognition generally involves a single correct translation, tasks like speech synthesis, language translation, and media description often involve multiple correct answers.

Two broad approaches are used in translation: generative modeling (e.g., grammar-based, encoder-decoder, or continuous generation-based) and example-based modeling (e.g., retrieval-based or combination-based). The generative approach involves directly constructing models that can produce the required translation between the modalities. On the other hand, the example-based approach relies on some form of reference dictionary during the translation process.
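As a sketch of the generative, encoder-decoder route, the following PyTorch snippet conditions a GRU-based word decoder on a pre-extracted image feature vector, in the spirit of simple image captioning. The dimensions, vocabulary size, and class names are hypothetical.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Minimal encoder-decoder: image features condition a GRU that emits word logits."""
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, hidden_dim)        # image features -> initial state
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, caption_tokens):
        # Encode the visual modality into the decoder's initial hidden state,
        # then produce a distribution over words at each time step.
        h0 = torch.tanh(self.encoder(image_feat)).unsqueeze(0)  # (1, batch, hidden)
        emb = self.embed(caption_tokens)                        # (batch, seq, hidden)
        outputs, _ = self.gru(emb, h0)
        return self.out(outputs)                                # (batch, seq, vocab)

model = CaptionDecoder()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```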

Multimodal Alignment

Multimodal alignment is the process of determining the relationships between the sub-elements from the different modalities. For instance, the visual part of a video frame must be properly aligned with the audio associated with that frame. Key challenges include the technical complexity of designing the similarity metrics between the different modalities (particularly in cases of multiple possible alignment options, or the absence of correspondences between the modalities), the lack of adequate datasets with explicitly annotated alignments, and others.

Two types of alignment are practiced: explicitly aligning the sub-components between the modalities (explicit alignment), and aligning the sub-components mostly as an intermediary step (implicit alignment).
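The snippet below sketches one common form of implicit alignment: a cosine-similarity matrix between video-frame embeddings and word embeddings, normalised so that each word softly attends over the frames. It assumes both modalities have already been projected into a shared embedding space; the dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment(frame_embs, word_embs):
    """Soft (implicit) alignment between video frames and caption words."""
    frames = F.normalize(frame_embs, dim=-1)   # (num_frames, dim)
    words = F.normalize(word_embs, dim=-1)     # (num_words, dim)
    similarity = words @ frames.T              # cosine similarity, (num_words, num_frames)
    return similarity.softmax(dim=-1)          # each row: a word's attention over frames

# Example: 8 frames and 5 words, both projected to a shared 256-d space.
attn = cross_modal_alignment(torch.randn(8, 256), torch.randn(5, 256))
print(attn.shape)  # torch.Size([5, 8])
```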

Multimodal Fusion

Multimodal fusion is the process of integrating information from various modalities to predict an outcome. For instance, in audio-visual speech recognition, the motion of the lips needs to be integrated with the speech signal to predict the spoken words. This is generally achieved through two approaches.

  • Aggregation-based Fusion, where multimodal networks are combined into a single network through averaging, concatenation or self-attention.
  • Alignment-based Fusion, where the embeddings of the networks are aligned.

Furthermore, the fusion process can be early-stage (data fusion, or feature-level fusion), intermediate-stage, or late-stage (decision-level fusion).
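The following sketch contrasts feature-level (early) fusion with a simple decision-level (late) fusion for an audio-visual classifier. The dimensions and the averaging scheme are illustrative assumptions; real systems often use attention or learned weighting instead.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Feature-level (early) fusion: concatenate unimodal features before predicting."""
    def __init__(self, audio_dim=128, visual_dim=512, num_classes=10):
        super().__init__()
        self.classifier = nn.Linear(audio_dim + visual_dim, num_classes)

    def forward(self, audio, visual):
        return self.classifier(torch.cat([audio, visual], dim=-1))

class LateFusion(nn.Module):
    """Decision-level (late) fusion: average the predictions of per-modality heads."""
    def __init__(self, audio_dim=128, visual_dim=512, num_classes=10):
        super().__init__()
        self.audio_head = nn.Linear(audio_dim, num_classes)
        self.visual_head = nn.Linear(visual_dim, num_classes)

    def forward(self, audio, visual):
        return (self.audio_head(audio) + self.visual_head(visual)) / 2

audio, visual = torch.randn(4, 128), torch.randn(4, 512)
print(EarlyFusion()(audio, visual).shape, LateFusion()(audio, visual).shape)
```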

Co-learning

Co-learning is the process of enabling the learning of one modality with the help of the other modalities. It is particularly helpful when one modality has limited resources, such as limited annotated data, noisy data, or unreliable labels.

Two types of co-learning approaches are generally used. In some cases, a hybrid approach may also be leveraged.

  • Parallel Data Approach, where the observations from one modality are directly linked to the observations from other modalities. This approach is used when the multimodal observations come from the same instances (e.g., the speech and video samples come from the same speaker in an audio-visual speech dataset).
  • Non-Parallel Data Approach, where the modalities do not have shared instances but share common categories or concepts. This approach enables better semantic and conceptual understanding and supports tasks such as zero-shot learning (a minimal sketch follows this list).
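As a minimal illustration of the non-parallel case, the sketch below performs zero-shot classification by comparing an image embedding against class-name (text) embeddings in a shared space. It assumes the embeddings have already been produced by separately trained encoders; all names and dimensions are hypothetical.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_name_embs):
    """Score an image against classes it was never trained on by comparing
    embeddings in a shared multimodal space, and return the best match."""
    img = F.normalize(image_emb, dim=-1)            # (dim,)
    classes = F.normalize(class_name_embs, dim=-1)  # (num_classes, dim)
    scores = classes @ img                          # cosine similarity per class
    return scores.argmax().item()

# Example: one image embedding scored against 3 unseen class-name embeddings,
# all assumed to live in a shared 512-d space.
predicted = zero_shot_classify(torch.randn(512), torch.randn(3, 512))
print(predicted)
```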

Challenges & Innovations in Multimodal Learning

Multimodal learning has shown a lot of promise, but there are still many challenges to be addressed. These can be understood under two broad categories.

Firstly, inherent challenges exist in each of the core elements of multimodal learning. Examples include the high degree of complexity in representing heterogeneous data (especially data with varied scales and dimensions), the unknown and subjective relationships between the sub-components of the multiple modalities, and others.

Secondly, many of the existing challenges of regular (unimodal) AI are not only prevalent in multimodal AI but may even be amplified. Here are a couple of examples.

  • Explainability and bias detection are generally more difficult to achieve with multimodal data. Moreover, biases tend to be aggravated when multiple modalities contain similar biases, and the consequences can be extremely harmful (e.g., serious discrimination in candidate screening).
  • Multimodal networks generally carry a higher risk of over-fitting than their unimodal counterparts. Furthermore, each modality generalizes at its own rate, so a single optimization strategy may not work well, and designing a multi-modality optimization strategy is a complex task (a simple illustration follows this list).
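One simple mitigation for the differing generalization rates, sketched below, is to give each modality's sub-network its own optimizer settings (learning rate and weight decay). The module names and hyper-parameter values are purely illustrative.

```python
import torch

# Assume a model with per-modality sub-networks; each modality gets its own
# parameter group so its learning rate and regularization can be tuned separately.
model = torch.nn.ModuleDict({
    "audio_net": torch.nn.Linear(128, 64),
    "visual_net": torch.nn.Linear(512, 64),
})
optimizer = torch.optim.AdamW([
    {"params": model["audio_net"].parameters(), "lr": 1e-3, "weight_decay": 1e-4},
    {"params": model["visual_net"].parameters(), "lr": 1e-4, "weight_decay": 1e-2},
])
```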

Most of the current research in multimodal learning is focused on addressing the challenges of the core elements. For instance, multimodal fusion has witnessed several key innovations in recent times, such as adaptive fusion, architectural search, bilinear pooling, channel exchanging, x-flow (cross-modal), and low-rank fusion.

The top AI players are developing superior capabilities in this space. In 2020, Facebook announced their deep learning-based Multimodal Framework (MMF) for vision and language. This is an improved version of their Pythia framework for multimodal learning that was open-sourced in 2019. Earlier this year, they announced the ‘Learning from Videos’ project that aims to automatically learn audio, textual, and visual representations from public videos. This leverages their Generalized Data Transformations method for video embeddings that systematically learns the relations between sound and images in videos.

In recent months, Google published the ALIGN framework for better vision representation learning by leveraging large-scale noisy image-text data. Additionally, they have released new datasets for enabling better understanding of vision-language navigation, improving cross-modal associations, etc.

The past two years have also witnessed key innovations in multimodal transformer architectures (e.g., multimodal bitransformer, multi-copy mesh, spatially-aware multimodal transformers, ViLBERT & VisualBERT). Sophisticated attention mechanisms have been developed, particularly for better spatial reasoning. Additionally, many annotation platforms are getting upgraded by their respective vendors to enable multimodal training.

To conclude, the objective of multimodal machine learning is to extract vital information and deep patterns from the individual modalities, and to integrate them in a meaningful, cohesive way to produce richer representations and a better understanding of the objects or phenomena under study. The challenges are many, but the progress made over the last few years has given the global AI community the confidence to tackle them and build the next generation of intelligent systems.

Acknowledgement:

Multimodal Machine Learning: A Survey & Taxonomy by Baltrusaitis, Ahuja & Morency
