Twelve Labs raises $5M
Author
Jae Lee
Date Published
Mar 16, 2022
Tags
Partnership
Investment
Hello World!

Twelve Labs is excited to announce that we’ve raised a $5M Seed funding round led by Index Ventures.

The Twelve Labs mission is to help developers build programs that can see, hear, and understand the world as we do by giving them the world’s most powerful video understanding infrastructure.

And these are the incredible folks who share our conviction that the world deserves an intelligent video understanding infrastructure, and have joined the round to build that future with us.

The world’s leading AI luminaries
  • Fei-Fei Li of Stanford University
  • Silvio Savarese of Stanford University
  • Aidan Gomez, co-creator of the Transformer and CEO of Cohere
  • Oren Etzioni, CEO of the Allen Institute for AI (AI2)
The most innovative founders disrupting the world
  • Alexandr Wang, CEO of Scale AI
  • Jack Conte, CEO of Patreon
  • Dug Song, CEO of Duo Security
  • Aaron Katz, CEO of ClickHouse
  • John Kim, CEO of Sendbird
  • Jean Paoli, CEO of Docugami
Paradigm-shifting investors
  • Index Ventures
  • Radical Ventures
  • Expa
  • Techstars Seattle
The world’s reliance on video only grows stronger

Today, more than 80% of the world’s data is video. In fact, Cisco estimated in 2020 that it would take more than 5 million years to watch the amount of video that will cross global IP networks each month. According to Nielsen, US adults spent 5 hours and 21 minutes a day watching video in 2021. That’s one-third of our total waking hours!

This seems like a lot, but it’s not that difficult to believe when we consider all the time we spend each day watching YouTube or Netflix, taking calls on Zoom, or recording videos of our kids on our phones. Video is here to stay, and it’s only becoming more deeply ingrained in every part of our lives.

But there’s still no CTRL+F

Despite the enormous amount of video data that we consume and create each day, the content within videos is still not searchable. If you were to search for a phrase within a 300-page text document, you’d find it in less than a second with a simple CTRL+F. Across videos? Not possible.

Instead, we’ve had to rely on workarounds that are either wildly time-consuming or ineffective. The obvious one is to manually watch all the videos until you find what you are looking for. Larger organizations and enterprises have people spend hours writing tags (metadata) for each timecode so that scenes can be located later through text-based matches on those tags. The more tech-enabled approach of today is to use tech giants’ object detection APIs to auto-generate those tags from the objects detected in each frame.

Unfortunately, no finite number of tags could possibly be enough to fully describe a scene. If a scene had not been tagged properly, it wouldn’t be found through a metadata search. But most importantly, tagging can’t take any sort of context into consideration. And context matters.

Why does context matter? Humans understand the world by forming relationships between objects in a scene and making connections between past and present. The way we search is the way we perceive and remember the world. Unless the tags are complex enough to include contextual understanding, they can’t help with search.

So we made CTRL+F for videos.

We built the search that we believe the world deserves. From noteworthy discussion points within an organization’s extensive Zoom recordings, to urgently needed scenes within a media company’s archive, all the way to that special day with your firstborn, all it takes is a search to find the exact moment you are looking for. The beauty is that you can just type in whatever comes to mind when you remember it, and you will be brought to the exact time code and file relevant to your query. It’s not a tag match; it’s a real search.

And this is how it works.

This is what our AI does: it views and understands the content of a video, including visuals such as action and movement, as well as conversation. (Situational and temporal context included, of course!) It then transforms everything about the video into a powerful intermediary data format called vectors, which are essentially lists of floating-point numbers that statistically represent the content of the video. When a user types in a search query, it finds the vectors that are closest to the query and automatically outputs the most relevant scene and video file name.
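To make that concrete, here is a minimal, purely illustrative sketch of a nearest-vector lookup. The embeddings, file names, and the cosine_similarity and search helpers below are toy assumptions, not our actual model or pipeline; a real system embeds both videos and queries with the same model and searches millions of vectors with an approximate-nearest-neighbor index.

```python
import numpy as np

# Toy stand-ins for real video embeddings: each indexed scene gets a vector.
scene_embeddings = {
    "all_hands.mp4 @ 00:12:30": np.array([0.9, 0.1, 0.0, 0.2]),
    "archive_reel.mov @ 01:03:05": np.array([0.1, 0.8, 0.3, 0.0]),
    "first_birthday.mp4 @ 00:00:45": np.array([0.2, 0.1, 0.9, 0.4]),
}

def cosine_similarity(a, b):
    """Similarity between two embedding vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding, top_k=2):
    """Rank indexed scenes by how close their vectors are to the query vector."""
    ranked = sorted(
        scene_embeddings.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return ranked[:top_k]

# In a real system the query text would be embedded by the same model that
# embedded the videos; here we simply make up a query vector.
query_vector = np.array([0.85, 0.15, 0.05, 0.1])
for scene, emb in search(query_vector):
    print(scene, round(cosine_similarity(query_vector, emb), 3))
```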

And we’ve come up with intuitive interfaces for developers to access this AI. With simple Index and Search API calls, developers can integrate powerful semantic video search into their video applications, without ever having to actually think about search!
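As a rough illustration of that index-then-search workflow only: the base URL, endpoint paths, and field names below are hypothetical placeholders, not the actual Twelve Labs API.

```python
import requests

# All names below are hypothetical placeholders, not the real Twelve Labs API.
API_URL = "https://api.example.com/v1"
HEADERS = {"x-api-key": "your-api-key"}

# 1) Index: submit a video so the platform can embed and store its content.
index_resp = requests.post(
    f"{API_URL}/index",
    headers=HEADERS,
    json={"video_url": "https://example.com/videos/all-hands-march.mp4"},
)
index_id = index_resp.json().get("index_id")

# 2) Search: ask for scenes matching a natural-language query.
search_resp = requests.post(
    f"{API_URL}/search",
    headers=HEADERS,
    json={"index_id": index_id, "query": "discussion about the Q2 roadmap"},
)

# Each result points to a file and the time code of the matching scene.
for result in search_resp.json().get("results", []):
    print(result.get("filename"), result.get("start"), result.get("end"))
```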

And when we say powerful, it’s literally the best.

And we are officially the best in the world at it. At the end of last year, we got tired of being asked questions like, “So, are you better than the tech giants?” That’s when we decided to participate in the 2021 ICCV VALUE Challenge for Video Retrieval (i.e., search), hosted by Microsoft. And we won first place!

We are proud to say that we bested the tech giants and outperformed Microsoft’s previous state of the art with no venture funding and just 12 people on the team. Here is an account from Aiden, our CTO, of how we were able to beat the giants of the world.

We are building a foundation model for videos.

We believe that to understand video is to understand the world. A strong video understanding infrastructure that can most accurately transform videos into vectors will pave the way for even better search and for other intelligent applications that power the next generation of video: video-to-video search, summary generation, and content recommendation, to name just a few.

By building a foundation model that understands videos, we help developers build programs that can see, hear, and understand the world as we do.

Though Twelve Labs is the best in the world at video search today, we know that there is still much more science to do to improve on what we have. We are extremely grateful for, and excited about, the innovation we can make possible with the support of our partners.

