Beyond Language: How Multimodal AI Sees the Bigger Picture

By Matthew R. Carey on January 4, 2024

AI chatbots have grown increasingly ubiquitous over the last year. For example, the basic version of ChatGPT is a conversational chatbot capable of understanding natural language inputs and generating highly coherent text responses. However, newer multimodal AI models, such as Google’s Gemini, go further by working with kinds of data beyond text.

What distinguishes these two varieties of artificial intelligence? How may such multimodal systems further extend machine learning’s capacities? And by what means might novel implementations leveraging multiple modalities secure patent rights?

Overview of AI Chatbots

AI chatbots typically incorporate Large Language Models (LLMs), which are AI models designed to understand user-supplied input and generate human-like textual output. These models, such as GPT-3.5 as incorporated in ChatGPT, are trained on vast amounts of diverse textual data and can perform a wide range of natural language processing tasks. They excel at language translation, text summarization, question answering, and even creative writing. On a technical level, AI chatbots analyze the text that users input in order to interpret intent, and then generate relevant natural language responses.

Overview of Multimodal Models

Multimodal AI takes a huge leap forward by integrating multiple data modes beyond just text. By understanding relationships between modalities, such as various combinations of images, videos, speech data, and text, multimodal models can reason about concepts more abstractly and generate a wider range of creative output.

Multimodal models are trained on diverse datasets that include both textual and visual information, which enables them to perform tasks that combine text with images or video, such as image captioning, visual question answering, and generating images from textual descriptions.
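One common way to relate text and images is to place both in a shared numeric space and compare them there. The following is a minimal sketch of that idea, not a description of how Gemini or any particular model works. The file names, captions, and three-number vectors are hypothetical and hand-written for illustration; a real model would learn such vectors from paired text and image data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (assumption): hand-written stand-ins for the
# vectors a trained multimodal model would produce for each input.
IMAGE_EMBEDDINGS = {
    "photo_of_dog.jpg": [0.9, 0.1, 0.0],
    "photo_of_motorcycle.jpg": [0.1, 0.9, 0.2],
}
TEXT_EMBEDDINGS = {
    "a dog playing fetch": [0.8, 0.2, 0.1],
    "a motorcycle on a street": [0.0, 1.0, 0.1],
}

def best_caption(image_name):
    """Pick the caption whose vector points most nearly the same way as the image's."""
    image_vector = IMAGE_EMBEDDINGS[image_name]
    return max(
        TEXT_EMBEDDINGS,
        key=lambda caption: cosine(image_vector, TEXT_EMBEDDINGS[caption]),
    )

print(best_caption("photo_of_motorcycle.jpg"))
```

Because both modalities live in the same space, the same comparison can run in either direction: finding a caption for an image, or finding an image for a piece of text.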

World Modeling

Think of an LLM as a phone call between you and a friend – it can have an intelligent dialogue, but cannot see or directly experience what you see in your physical world.

In contrast, a multimodal AI is like a conversation with a friend sitting next to you: it can converse naturally and also perceive objects and environments through vision, sound, touch, and other inputs, much as human senses do.

Just as your friend can fluidly switch between discussing an abstract topic and commenting on the motorcycle roaring by outside that distracted you both, multimodal AI combines conversational abilities with real-world situational awareness. While pure language models reason only over symbolic representations of knowledge, multimodal models dynamically ground themselves in physical environments using multiple integrated perception modes.

This allows multimodal AI to achieve contextually relevant bidirectional dialogue informed by visual, auditory, and other sensory insights – moving closer to human-like interaction. Just as your friend engages differently witnessing an event unfold versus hearing about it second-hand, multimodal AI can discuss its real-time experiences rather than conceptual abstractions.

These improved capabilities further enable the concept of “world modeling” in which models aim to simulate or capture the structure, dynamics, and behavior of the environment in which an AI system operates. This involves creating a representation of the external environment that an AI system can use to make predictions, plan actions, and understand the consequences of its decisions.
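The two roles described above, predicting consequences and planning actions, can be sketched in a few lines. This is a toy illustration under stated assumptions: the environment is a hypothetical one-dimensional track of five cells, and `predict` and `plan` are names invented here, not part of any library or of any model discussed in this post.

```python
ACTIONS = ("left", "stay", "right")

def predict(state, action):
    """Toy world model: predict the next cell on a track numbered 0 to 4."""
    step = {"left": -1, "stay": 0, "right": 1}[action]
    # The track has walls at both ends, so the position is clamped.
    return min(4, max(0, state + step))

def plan(state, goal, horizon=6):
    """Greedy planning: repeatedly choose the action whose predicted
    outcome lands closest to the goal, without acting in the real world."""
    chosen = []
    for _ in range(horizon):
        if state == goal:
            break
        action = min(ACTIONS, key=lambda a: abs(predict(state, a) - goal))
        chosen.append(action)
        state = predict(state, action)
    return chosen

print(plan(1, 3))
```

The point of the sketch is that the planner never touches the environment itself. It consults only the model's predictions, which is why the quality of the model's representation of the world matters so much.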

LLMs are limited in their world modeling capabilities: a text-only chatbot is confined to answering questions, providing information, or helping users navigate specific domains through text-based interactions.

Multimodal models offer vastly improved abilities to perform various world modeling tasks. For example, multimodal models may analyze and interpret complex visual scenes by recognizing objects, relationships between objects, and contextual information within images; navigate and interact in the physical world by processing visual inputs to understand the environment, plan actions, and make decisions based on both visual and non-visual sensory data; analyze medical images for diagnosis or treatment planning by extracting information from medical images, identifying structures, and assisting in medical decision-making; and improve augmented reality (AR) by combining real-world visual data with digital overlays to create an augmented representation of the environment.

Patenting an Application of a Multimodal Model

How could one approach claiming a multimodal model in a patent application? Imagine that you are a surgeon who has invented a way to leverage machine learning and AR to assist in surgical procedures.

A genericized, sample patent claim directed to this concept could look as follows:

1. A robotic surgery assistance system for use with surgical procedures, comprising:

a robotic surgical arm equipped with a camera configured to capture a live video feed of a patient organ during surgery performed by a surgeon;

a computer processor configured to:

train a multimodal neural network using a first dataset comprising medical scans paired with a second dataset comprising videos of corrective procedures,

analyze, using the multimodal neural network that was trained, the live video feed captured by the camera of the robotic surgical arm, and

based on analyzing the live video feed, generate an augmented reality (AR) overlay illustrating a set of recommended procedural actions tailored to a set of anatomical structures and anomalies of the patient organ as identified in the live video feed; and

a heads-up stereoscopic display worn by the surgeon and configured to display the AR overlay.
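For readers who think in software terms, the data flow recited by the processor elements of the sample claim (train, analyze, generate an overlay) can be sketched as three functions. This is only a rough illustration of the order of steps, not an implementation of the claimed system and not a statement of claim scope. Every function name, the `Finding` type, and the placeholder labels such as `anomaly_A` are assumptions made for this sketch, and a lookup table stands in for the neural network.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    structure: str  # placeholder label for an anatomical structure
    anomaly: str    # placeholder label for an anomaly on that structure

def train(scan_labels, procedure_labels):
    """Stand-in for training: pair each scan label with a procedure label.
    A real system would fit a multimodal neural network instead."""
    return dict(zip(scan_labels, procedure_labels))

def analyze(model, frame_observations):
    """Stand-in for analyzing one video frame: keep the observations
    whose anomaly label the trained model recognizes."""
    return [
        Finding(structure, anomaly)
        for structure, anomaly in frame_observations
        if anomaly in model
    ]

def generate_overlay(model, findings):
    """Produce one recommended action per finding, as text for an AR overlay."""
    return [f"{f.structure}: {model[f.anomaly]}" for f in findings]

# Hypothetical placeholder data, not real medical content.
model = train(["anomaly_A", "anomaly_B"], ["procedure_A", "procedure_B"])
frame = [("structure_1", "anomaly_B"), ("structure_2", "none")]
overlay = generate_overlay(model, analyze(model, frame))
print(overlay)
```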

Of course, this surgical claim is just one illustrative case; innovative applications of multimodal models can be patented across many industries. As research continues to advance techniques for unified understanding of text, vision, speech, sensor inputs, and other data, multimodal AI promises to push the boundaries of what technology can do in sectors from medicine to manufacturing to transportation. If you have a novel multimodal AI solution that solves a pressing problem, the door is open to protect your intellectual property and chart the future.

****

Subscribe to get updates to this post or to receive future posts from PatentNext. Start a discussion or reach out to the author, Matt Carey, at mcarey@marshallip.com (Tel: 312-474-9581). Connect with or follow Matt on LinkedIn.
