Earlier this month, researchers at the Allen Institute for AI, a nonprofit founded by the late Microsoft cofounder Paul Allen, released an interactive demo of a system they describe as part of a “new generation” of AI applications that can analyze, search across, and respond to questions about videos “at scale.” The researchers had the system, called Merlot Reserve, “watch” 20 million YouTube videos to learn the relationships between images, sounds, and subtitles, allowing it to answer questions such as “What meal does the person in the video want to eat?” or “Has the boy in this video swam in the ocean before?”
Merlot Reserve and its predecessor, Merlot, aren’t the first “multimodal” AI systems of their kind. Systems that can process and relate information from audio, visuals, and text have been around for years, and they continue to improve in their ability to understand the world more like humans do. DALL-E, released in 2021 by San Francisco research lab OpenAI, can generate images of objects, real or imagined, from simple text descriptions like “an armchair in the shape of an avocado.” A more recent system out of Google called VATT can not only caption events in videos (e.g., “a man swimming”) but also classify audio clips and recognize objects in images.
Until recently, however, these multimodal AI systems were confined to the research domain. That’s changing: increasingly, they’re being commercialized.
“Different multimodal technologies including automatic speech recognition, image labeling and recognition, neural networks and traditional machine learning models [can help to] gain an understanding of text, voice, and images — [especially when paired] with text processing,” Aaron Sloman, the cofounder and CTO of CLIPr, told VentureBeat via email. CLIPr is among the nascent cohort of companies using multimodal AI systems for applications like analyzing video. Tech giants including Meta (formerly Facebook) and Google are represented in the group, as are startups like Twelve Labs, which claims that its systems can recognize features in videos including objects, text on screen, speech, and people.
“[My fellow cofounders and I] sought out a solution to help us easily extract important and relevant clips from videos as an alternative to skipping around at 10-15 second intervals, and when we weren’t able to find a solution, we decided to build one … Our namesake video indexing platform … ingests recorded video and helps make it searchable by transcription, topics, and subtopics,” Sloman said. “Analyzing prosody is also critical for us, which is the rhythm, stress and intonation of speech. We leverage it against image analysis, such as meeting presentation slides, to help evaluate the accuracy of these tonal changes or [look] for animated gestures with the participants who are on video.”
Sloman claims that CLIPr has clients in a “variety” of industries, chiefly media publishing, enterprise, and events. In the future, the startup aims to apply its technology to livestream video and create “role-specific” bots that can, for example, take keynote sessions from an event and automatically create a highlight reel.
“It is our belief that video is the most important and underutilized form of modern communication, and our goal is to make video as accessible as written content,” Sloman continued.
Outside of multimodal systems, AI doesn’t experience the world the same way that people do. For example, a speech recognition system can only understand one type of data — speech — and doesn’t comprehend the context of that speech. By contrast, people use all of their senses (e.g., sight, sound, smell) to process and ground events in time. From images and captions of someone cooking popcorn, for example, a person can imagine what the sounds of the scene might be, like raw kernels scattering in an empty pot and the “pops” of the popcorn expanding.
“[M]any of these multimodal models are image-specific and focus on visual recognition — describing what is literally shown,” Rowan Zellers, a computer science Ph.D. candidate at the University of Washington and the lead researcher on the Merlot Reserve project, told VentureBeat via email. “We could see models answer questions about what people are doing (and why) in videos, possibly for search applications.”
Twelve Labs, for instance, claims that its system makes any video database analyzable by transforming clips into mathematical representations known as vector embeddings. Customers have used it to build recommendation engines, content moderation systems, and media analytics dashboards, according to CEO Jae Lee.
“[Twelve Labs is] working on building [a] model that can create powerful video embeddings that can be used for not only semantic search, but also other variety of tasks, such as caption, highlight, and summary generations,” Lee told VentureBeat via email. “Our video models are trained under language supervision. We extract diverse modules — multimodality — of information such as images, audio, transcription, motion, etc. from the video and fuse that information into a single vector representation. That representation is trained under relevant text — sentences — that is processed using natural language processing (NLP) technology.”
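To make the idea concrete, here is a minimal Python sketch of how fused embeddings could support semantic search. It is not Twelve Labs’ implementation: the three-dimensional vectors are invented, the `fuse` function simply averages per-modality vectors where a real system would learn the fusion, and the names `fuse`, `search`, and `index` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def fuse(modality_vectors):
    """Toy fusion: average the normalized per-modality vectors, then renormalize.

    Assumption: a production model would learn this step; averaging is a stand-in.
    """
    normalized = [l2_normalize(v) for v in modality_vectors]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, index, top_k=1):
    """Return the ids of the top_k clips most similar to the query vector."""
    q = l2_normalize(query_vec)
    scored = [(cosine(q, emb), clip_id) for clip_id, emb in index.items()]
    scored.sort(reverse=True)
    return [clip_id for _, clip_id in scored[:top_k]]


# Hypothetical clips, each with made-up "image" and "audio" vectors.
index = {
    "cooking_clip": fuse([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]),
    "swimming_clip": fuse([[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]]),
}

# A made-up query vector standing in for an embedded text query.
print(search([1.0, 0.0, 0.0], index))
```

In this toy setup, a text query would be embedded into the same vector space and matched against every clip by cosine similarity; at scale, an approximate nearest-neighbor index would typically replace the linear scan.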
Beyond startups, Google revealed last year that it plans to use a multimodal AI system called multitask unified model (MUM) to enhance Google Search experiences across different languages and devices. Among other improvements, MUM will power new Google Search features that take a query (e.g., “acrylic paintings”), spotlight resources like step-by-step instructions, and pick out subjects in videos (e.g., “acrylic techniques”) based on the audio, text, and visual content.
Meta recently said that it’s also applying a multimodal system, called Few-Shot Learner (FSL), to determine whether the content of Facebook and Instagram messages — including text, images, and URLs — violates its community guidelines. The company claims FSL was developed against a database of billions of Facebook posts and images in more than 100 languages.
Zellers believes that, in the future, these sorts of multimodal models could be used to create products that not only analyze online video, audio, and related forms of content, but also assist users with vision or hearing challenges. “This could involve anything from answering basic questions, all the way to contextual interaction,” he added.
While commercialized multimodal AI is more common than it used to be, several hurdles must be overcome before these types of systems reach wide-scale deployment. It’s partly a case of making the economics work: running an existing system isn’t typically expensive, at least compared with developing a new one, but the cost depends on the nature of the workload and the skill level of the company’s data science team.
“Initial model [development] is easily the most costly aspect because it includes perfecting the data science in parallel,” Sloman said. “For example, the process of distinguishing what is or is not a slide across thousands of verified Zoom meetings is very expensive.”
For example, Merlot Reserve took roughly three weeks to develop on a cluster of 512 of Google’s third-generation tensor processing units (TPUs), chips designed to accelerate certain aspects of the AI creation process. A pod of thirty-two third-generation TPUs costs $32 per hour to evaluate, according to current public pricing. At that rate, three weeks on a single pod comes to just over $16,000; if all 512 TPUs (16 such pods) were billed at the same rate, the total would be closer to $258,000 (assuming no volume, annual, or academic discounts).
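The arithmetic behind such an estimate can be checked with a few lines of Python. This sketch assumes exactly 21 days of continuous use, the $32-per-hour rate for a 32-TPU pod cited above, and a price that scales linearly with the number of pods; the function name `training_cost` is hypothetical. Note how much the result depends on whether one pod or the full 512-TPU cluster is billed.

```python
HOURS_PER_DAY = 24


def training_cost(num_tpus, days, pod_size=32, pod_price_per_hour=32.0):
    """Estimate cost in dollars, assuming linear per-pod pricing and no discounts."""
    pods = num_tpus / pod_size        # number of 32-TPU pods in use
    hours = days * HOURS_PER_DAY      # continuous use, around the clock
    return pods * pod_price_per_hour * hours


single_pod = training_cost(num_tpus=32, days=21)
full_cluster = training_cost(num_tpus=512, days=21)

print(f"One 32-TPU pod for 21 days: ${single_pod:,.0f}")
print(f"512 TPUs (16 pods) for 21 days: ${full_cluster:,.0f}")
```

Under these assumptions, a single pod comes to $16,128 and the full cluster to $258,048.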
“We currently run seven different models, some of which are large-scale open source repositories of data with hundreds of millions of objects, while others are proprietary,” Sloman explained. “Our proprietary models have been training for over a year now, and while it’s hard to say for the open source models we use, they have likely been training for much longer than that … I suspect that the next sweeping change in multimodal AI will be building more standardized linkages between different types of siloed models. We’ve had to patchwork several AI models, each of which does one type of analysis well. Eventually, with many companies building products using multimodal, we will see more open source offerings, making it easier and less expensive to train and run experiments.”
Today’s multimodal systems suffer from technical flaws, too, like picking up biases in the data (e.g., YouTube videos) from which they learn. For instance, because Merlot Reserve “watches” a large volume of YouTube videos, it’s biased by YouTube’s recommendations and, more broadly, by the economic pressures that shape which content people are encouraged to produce.
“The content moderation on YouTube disproportionately filters out [minority] voices … People’s roles in YouTube videos [also] tend to be highly gendered, which might bias situation understanding,” Zellers and his colleagues wrote in a study describing Merlot Reserve’s capabilities. “The automatic captions in YouTube are known to suffer from gender bias, which our model (like neural models generally) might in turn amplify. The transcriptions on YouTube are also likely poor at handling important identity markers, like pronouns.”
Biases aside, there’s nothing preventing bad actors from using multimodal systems for controversial purposes, like identifying events or activities in surveillance footage. In a paper published by Stanford’s Institute for Human-Centered Artificial Intelligence, the coauthors argue that advances in multimodal models like DALL-E will result in higher-quality, machine-generated content that’ll be easier to personalize for “misuse purposes” — like publishing misleading articles targeted to different political parties, nationalities, and religions.
Sloman says that CLIPr, for its part, takes steps to mitigate model bias and misuse through a “human-in-the-loop” approach. The company encourages customers to point out mistakes the CLIPr system makes so that it can correct them — and ideally improve model development on the backend.
“Multimodal has its advantages, because if done correctly, it has less chance to produce bias compared to more siloed models,” he said. “The real danger comes from not acknowledging the complexity and imperfection of multimodal AI and using data points that lead you down a particular linear decisioning path that limits the spectrum of answers or matches.”
Lee said that Twelve Labs, too, has implemented bias mitigation strategies. The company takes a three-phase approach that includes collecting datasets from diverse sources, creating documentation for the datasets, and curating the raw video and text information.
“Computer vision models are used to detect and filter visual content that may contain toxicity or sensitive content,” Lee explained. “Then, the transcription of the raw video is analyzed by leveraging block words (i.e., removing any text containing words from a list of selected words) and advanced NLP techniques to filter content that may contain political, socio-economic, or demographic bias. Block words and NLP techniques are also used to filter text labels that may contain toxicity and bias … Understanding and mitigating potential biases when leveraging multimodal models is integral to the success of Twelve Labs.”
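The block-word step Lee describes can be sketched in a few lines of Python. This is an illustration only, not Twelve Labs’ code: the word list is a placeholder, the tokenization is a simple assumption, and the function names `contains_block_word` and `filter_texts` are hypothetical. It covers only the list-based filter, not the “advanced NLP techniques” mentioned alongside it.

```python
import re


def contains_block_word(text, block_words):
    """Return True if any whole word in the text appears in the block list."""
    blocked = {w.lower() for w in block_words}
    # Assumption: words are runs of letters, digits, or apostrophes.
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return not tokens.isdisjoint(blocked)


def filter_texts(texts, block_words):
    """Drop every text that contains at least one block word."""
    return [t for t in texts if not contains_block_word(t, block_words)]


# Placeholder block list and transcripts, invented for illustration.
block_words = {"badword", "slur"}
transcripts = [
    "A man swimming",
    "This has a BADWORD in it.",
    "Cooking popcorn",
]

kept = filter_texts(transcripts, block_words)
print(kept)
```

Matching on whole tokens rather than substrings avoids discarding harmless text that merely contains a blocked word inside a longer one, though it also misses deliberate misspellings, which is one reason such lists are usually paired with model-based filters.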