In today's world, video understanding applications play a crucial role in our daily lives. Videos are ubiquitous, from social media to surveillance cameras, and contain a wealth of information that can be extracted and used for various purposes. For example, video data can be used to improve public safety, optimize traffic flow, and enhance customer experience. However, extracting meaningful information from videos is a complex task that requires sophisticated computer vision algorithms and software infrastructure.
If you are a technical builder who wants to build real-world video understanding applications, you need to have insights into how to manage video data and what solutions are available in the market to extract meaning from it. In this blog post, we will examine the technical challenges of video data management and take a look at two solutions that help solve these challenges: ApertureDB and Twelve Labs.
Individual video files are often huge (a few hundred megabytes to several gigabytes), which leads to high storage requirements and heavy memory requirements when decoding and processing them.
Additionally, video data can be complex and difficult to analyze, and the transforms it requires are unique. Common operations like resizing, cropping, and sampling (applied to individual frames or to whole videos) can be quite computationally expensive. Large videos are often split into snippets for storage, sometimes causing an interesting event to span multiple snippets. Working with such videos can require dealing with finicky multimedia tools like FFmpeg and knowing about container types, codecs, muxing, and so on.
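As a rough illustration of what even a simple transform involves, the sketch below shells out to FFmpeg from Python to sample a video down to one frame per second and resize it in a single pass. It assumes FFmpeg is installed and on your PATH, and the file names are placeholders.

```python
import subprocess

# A minimal sketch: re-encode a large video into a much smaller, sampled
# version suitable for downstream processing. File names are placeholders.
subprocess.run(
    [
        "ffmpeg",
        "-i", "input.mp4",             # source video (placeholder)
        "-vf", "fps=1,scale=224:224",  # sample to 1 fps and resize each frame
        "-c:v", "libx264",             # re-encode the sampled frames with H.264
        "sampled.mp4",                 # output file (placeholder)
    ],
    check=True,
)
```

Even this small step forces you to think about frame rates, codecs, and output containers, which is exactly the kind of low-level detail that tends to leak into application code.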
Managing video data becomes particularly challenging for machine learning use cases, given the sheer volume of data involved. Videos arrive in high volumes, and given the complexity of processing them, working with many videos typically means applications need to implement highly concurrent processing and lean on other optimizations like keyframe indexing, re-muxing, and re-encoding. All of these require significant programming effort and testing, particularly at very large scales.
Therefore, you need sufficiently large machine resources to download and process raw video files. This can quickly add to enterprise costs and pose a significant challenge to organizations, particularly those with limited resources or smaller budgets.
ML use cases also bring unique preprocessing requirements: training some models might call for sampled videos instead of the original high-frame-rate content, or might work only with individual frames. Some models expect RGB input while the video is stored in YUV, so the pipeline must handle conversion between color spaces. Other use cases might only require small but interesting clips out of a video. In such cases, it may become necessary to download gigabytes of data just to sample it at a lower frame rate or to extract and train on a few key frames. This can be a time-consuming and resource-intensive process, making it difficult for individuals and teams to work with video data efficiently.
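As a minimal sketch of this kind of preprocessing (assuming OpenCV is available; the file name and sampling rate are placeholders), the snippet below decodes a video, keeps only every Nth frame, and converts each kept frame to the RGB layout many models expect:

```python
import cv2  # OpenCV; decodes frames as BGR arrays

SAMPLE_EVERY = 30  # keep roughly one frame per second for a 30 fps video

cap = cv2.VideoCapture("input.mp4")  # placeholder file name
kept_frames = []
decoded = 0
while True:
    ok, frame_bgr = cap.read()
    if not ok:
        break  # end of stream (or a decode error)
    if decoded % SAMPLE_EVERY == 0:
        # Convert from OpenCV's BGR layout to RGB before handing off to a model.
        kept_frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    decoded += 1
cap.release()

print(f"Kept {len(kept_frames)} of {decoded} decoded frames")
```

Note that even though most frames are discarded, every frame still has to be downloaded and decoded locally, which is precisely the waste described next.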
Expanding a video into frames or storing clips preemptively, in addition to the original video, can significantly increase the footprint of the data, which can further complicate the process of managing and working with it. This is also wasteful since not all frames contain valuable information. On the flip side, it is also a waste of time and resources to download the entire, often large video, to then discard a sizable portion of it.
As such, it is important to carefully consider the impact of video data on storage and processing resources and to develop strategies for managing it effectively.
Querying, visualizing, and debugging videos are essential steps in managing video data. However, these tasks can be challenging, often due to the amount and complexity of metadata associated with video files. Metadata contains valuable information about the visual object, the surrounding application, annotations, and embeddings that describe the content. As we automate video understanding, this metadata can evolve to include the newer insights discovered in existing or incoming videos. Thus, managing this metadata can quickly become complicated, and without proper tools, it can be challenging to query, visualize, and debug videos effectively.
Another challenge is that video metadata is often scattered across different sources, such as different storage devices, databases, and applications. This scattered metadata can make it difficult to create a unified view of the video data and derive meaningful insights from it. Additionally, the metadata may be organized differently in each source, further complicating the process of querying, visualizing, and debugging videos. To address this challenge, you must have a comprehensive metadata management strategy that incorporates various sources and provides a unified view of the metadata.
Video data may require manual annotation in order to be useful for machine learning, which can be time-consuming and labor-intensive. This is because the data must be properly categorized and segmented in order to be effectively utilized. Furthermore, the annotation process often involves multiple rounds of review and correction to ensure accuracy, which can add even more time and effort to the task.
More specifically, video annotation involves adding labels, tags, textual descriptions, or other metadata directly onto the video timeline itself. Annotators spend time watching clips manually and tagging relevant moments where necessary. Given that a single hour of video at 24fps contains 86,400 frames, imagine how much effort it takes to annotate entire seasons or feature films!
One reason why video modeling hasn't received as much attention as text or image modeling is the high computing burden it entails. As mentioned earlier, videos are much larger than text or images and demand significantly more processing power. This issue is even more pronounced with the Transformer architecture (the default method for developing large pre-trained models nowadays), whose complexity is quadratic in the length of the frame sequence.
To illustrate, assume a 10-hour video at a typical 30 frames (images) per second. The video then contains 10 * 3600 * 30 = 1,080,000 (approximately 1 million) frames. Given the quadratic complexity of the Transformer, attending over the full sequence requires on the order of 1 million squared, or 1e12, operations.
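The back-of-the-envelope arithmetic is easy to check:

```python
hours, fps = 10, 30
frames = hours * 3600 * fps      # 1,080,000 frames, roughly a million
attention_pairs = frames ** 2    # quadratic self-attention over the full sequence
print(f"{frames:,} frames -> ~{attention_pairs:.1e} frame-pair interactions")
```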
Additionally, temporal modeling poses a unique challenge. Unlike text and images, videos contain a temporal dimension that must be taken into account when analyzing them, which requires specialized techniques not commonly used in other modalities.
Finally, in addition to the visual information presented in video clips, there are synchronized audio cues that require additional processing. These cues can include sounds or conversations happening within the video, providing additional context and information to the viewer. Audio is often just as informative as the visuals, so processing it is a crucial aspect of video analysis and deserves the same level of attention as the visual cues.
In the previous section, we highlighted some of the challenges associated with managing video data for machine learning use cases. Fortunately, there are tools available that can help overcome these challenges. In this section, we will take a closer look at two such tools: ApertureDB and Twelve Labs.
ApertureDB is a purpose-built database for visual data such as images, videos, feature vectors, and associated metadata, including annotations. By the nature of its implementation, ApertureDB natively supports the management of images and videos. It provides the necessary preprocessing operations like zoom, crop, sampling, and creating thumbnails as you access this data. The ApertureDB video API supports multiple video encodings and container formats, together with efficient frame-level access. ApertureDB can store and access this data from cloud buckets like S3 or Google Cloud Storage, or the data can be stored and managed by the ApertureDB server.
Since this data is typically accompanied by valuable application metadata, we manage that metadata as a knowledge graph, which helps capture the relationships between data and metadata in order to enable complex visual searches. ApertureDB's unified API can simplify the process of managing and querying video data. In fact, the ability to attach annotations to videos, or to interesting clips within them, makes it easy to avoid duplicating per-frame or per-clip data while still using it to query and derive valuable business insights.
With vector indexing and search supported through the same unified API, it is also possible to find similar videos using embeddings stored in ApertureDB.
Native API support in ApertureDB for preprocessing videos, extracting frames or clips, and indexing keyframes can reduce the need for customized deployments and simplify training and inference workflows when working with videos. Workflows that would otherwise take days can come down to minutes of writing and executing queries, as in the sketch below.
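As a rough sketch of what such a query can look like (the command and operation names follow ApertureDB's JSON query interface, but treat the property names, operation parameters, and connection details here as illustrative assumptions to verify against the ApertureDB documentation):

```python
from aperturedb import Connector  # ApertureDB Python client

# Connection details are placeholders.
db = Connector.Connector("aperturedb.example.com", user="admin", password="admin")

# Ask the server to find a video by a hypothetical "name" property and return
# a short, resized clip instead of shipping the whole file back.
query = [{
    "FindVideo": {
        "constraints": {"name": ["==", "traffic_cam_01"]},    # hypothetical property
        "operations": [
            {"type": "interval", "start": 0, "stop": 900},    # e.g. the first 900 frames
            {"type": "resize", "width": 224, "height": 224},  # shrink frames server-side
        ],
        "blobs": True,
    }
}]

response, blobs = db.query(query)
print(response)        # query status and matched metadata
clip_bytes = blobs[0]  # the preprocessed clip, ready for an ML pipeline
```

Because the preprocessing happens where the data lives, only the small clip travels over the network.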
Scale, performance, and optimized data access are key features of ApertureDB.
Video search solutions act as powerful indexing tools, allowing for faster browsing and retrieval of desired scenes within multimedia assets. By taking care of tedious labor involving low-level visual analysis, these technologies facilitate locating relevant shots, enabling efficient tagging and faster turnaround times during projects.
Consider, for example, an animation studio producing TV shows or movies. Instead of wasting months going through footage manually and painstakingly creating shot lists and keyframe notes, the team can leverage cutting-edge search engines integrated into digital asset management platforms. They can then locate relevant scenes or character expressions for reference later, helping speed up production cycles.
All told, leveraging optimized video search solutions greatly reduces the burden placed upon people responsible for manually tagging video assets.
Twelve Labs provides a multimodal foundation model in the form of a suite of APIs, designed to assist you in creating applications that leverage the power of video understanding. One of these is the Video Search API, which seamlessly finds specific moments of interest in your videos with natural language queries.
At a high level, the Twelve Labs API enables three search options: audio-visual search, conversation search, and search for text that appears on screen.
Once you upload a video, the system automatically initiates the video indexing process. Twelve Labs describes video indexing as using a multimodal foundation model to incorporate temporal context and extract information such as movements, objects, sounds, on-screen text, and speech from your videos, generating powerful video embeddings. This subsequently allows you to find specific moments within your videos using everyday language, or to categorize video segments based on provided labels and prompts.
Once the system completes indexing the video and generating video embeddings, you can leverage them to find specific moments using the Search API. This API identifies the exact start and end time codes in relevant videos that correspond to the semantic meaning of the query you enter. Depending on the indexing options you selected, you'll have a subset of the same options to choose from for your semantic video search. For instance, if you enabled all the options for the index, you'll be able to search audio-visually, for conversations, and for any text appearing within the videos. The same set of options appears at both the index and search levels so that you can decide how to analyze your video content, and how to search across it, using whatever combination of options suits your current context.
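For example, a search request might look roughly like the sketch below, which calls the Twelve Labs REST API from Python. The endpoint version, field names, and the index ID are assumptions to verify against the current Twelve Labs API reference.

```python
import requests

API_KEY = "tlk_..."          # your Twelve Labs API key (placeholder)
INDEX_ID = "65a1b2c3d4e5f6"  # an index you created and populated earlier (placeholder)

# Find moments matching a natural-language query, searching across the visual
# track, conversations, and on-screen text.
resp = requests.post(
    "https://api.twelvelabs.io/v1.2/search",  # assumed endpoint version
    headers={"x-api-key": API_KEY},
    json={
        "index_id": INDEX_ID,
        "query": "a goalkeeper saves a penalty kick",
        "search_options": ["visual", "conversation", "text_in_video"],
    },
)
resp.raise_for_status()

for clip in resp.json().get("data", []):
    # Each result carries the video ID plus the start/end time codes of the moment.
    print(clip["video_id"], clip["start"], clip["end"], clip["score"])
```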
Managing video data can be a challenging task, requiring ample computing resources, careful metadata management, and manual annotation. However, with the help of innovative solutions like ApertureDB and Twelve Labs, these challenges can be overcome. By leveraging the capabilities of these tools, developers can focus on building real-world video understanding applications rather than worrying about the complexities of video data management.
Looking toward the future, we envision a world where managing video data is no longer a daunting task. As video AI continues to evolve, we expect to see more advanced solutions that can help automate the process of video data management and annotation. With these tools, we can unlock the full potential of video data and create a world where the benefits of video understanding are accessible to everyone. So why not check out ApertureDB and Twelve Labs today and see how they can help you take your video-understanding applications to the next level?