Discover Jockey, an advanced conversational video agent that leverages LangGraph's flexible framework and Twelve Labs' powerful video understanding APIs. This article covers Jockey's architecture, how it integrates with LangGraph and Twelve Labs, and how to customize and extend its capabilities. Big thanks to the LangChain team for collaborating with us on Jockey’s development!
Twelve Labs provides state-of-the-art video understanding APIs that extract rich insights and information from video content. Powered by advanced video foundation models (VFMs), these APIs work with video natively, without relying on intermediary representations like pre-generated captions. This allows for more accurate and contextual understanding of the full video, including visuals, audio, on-screen text, and temporal relationships.
The key capabilities of Twelve Labs APIs include video search, classification, summarization, question answering, and more. Developers can easily integrate these APIs to build applications for use cases like content discovery, video editing automation, interactive video FAQs, and AI-generated highlight reels. With enterprise-grade security and the ability to scale to handle even the largest video archives, Twelve Labs APIs open up exciting new possibilities for video-powered applications.
LangChain recently released LangGraph v0.1, a framework for building agentic and multi-agent applications with greater control and precision. Unlike the legacy LangChain AgentExecutor, LangGraph provides a flexible API for custom cognitive architectures, allowing developers to control the flow of code, prompts, and LLM calls. LangGraph also facilitates human-agent collaboration through its built-in persistence layer, enabling features like human approval before task execution and 'time travel' for editing and resuming agent actions. LangGraph's flexibility and low-level control allow for both single-agent and multi-agent setups while also increasing the reliability of agents, which has been critical for companies like Replit, Norwegian Cruise Line, and Elastic.
To complement the LangGraph framework, LangChain also launched LangGraph Cloud, currently available in closed beta. LangGraph Cloud provides scalable infrastructure purpose-built for deploying LangGraph agents, managing horizontally scaling servers and task queues to efficiently handle numerous concurrent users and store large states. The platform supports real-world interaction patterns, including double-texting to handle new user inputs on currently running threads and asynchronous background jobs for long-running tasks. LangGraph Cloud integrates with LangGraph Studio, a tool for visualizing and debugging agent trajectories, enabling rapid iteration and feedback for developers building agentic applications.
Jockey, an open-source conversational video agent, has evolved significantly in its latest v1.1 release. Originally built on LangChain in v1.0, Jockey now harnesses the power of LangGraph, offering enhanced scalability and improved functionality for both frontend and backend operations. This transition marks a pivotal advancement in Jockey's architecture, enabling more efficient and precise control over complex video workflows.
At its core, Jockey combines the strengths of Large Language Models (LLMs) with Twelve Labs' specialized video APIs through LangGraph's flexible framework. The LangGraph UI, as shown in the image above, illustrates the intricate network of nodes that comprise Jockey's decision-making process. This includes key components such as the supervisor, planner, video-editing, video-search, and video-text-generation nodes, each playing a crucial role in processing user queries and executing video-related tasks.
One of the primary advantages of adopting LangGraph is the granular control it provides over each step of the workflow. This allows Jockey to precisely manage which information is passed between nodes and how their responses contribute to the overall state. Such fine-tuned control not only optimizes token usage but also enables more accurate guidance of node responses, resulting in more efficient and effective video processing.
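To make the idea of controlling what each node sees more concrete, here is a minimal, framework-free sketch in plain Python. It is not Jockey's actual code and does not use the LangGraph API; the `AgentState` fields, the `run_node` helper, and the stub `video_search_node` are all hypothetical names chosen for illustration. The point is only that a node can be handed a narrow view of the state and return a partial update that is merged back in.

```python
from typing import Callable, TypedDict


class AgentState(TypedDict, total=False):
    """Hypothetical shared state; the field names are illustrative only."""
    chat_history: list
    active_plan: str
    tool_output: str


def run_node(
    state: AgentState,
    node: Callable[[dict], dict],
    visible_keys: tuple,
) -> AgentState:
    """Run one node, exposing only `visible_keys`, and merge its partial update."""
    # The node never sees keys outside visible_keys, which keeps its prompt small.
    view = {key: state[key] for key in visible_keys if key in state}
    update = node(view)
    # Build a new state instead of mutating the old one.
    return {**state, **update}


def video_search_node(view: dict) -> dict:
    # A real worker would call a video search API here; this stub echoes the plan.
    return {"tool_output": f"searched for: {view['active_plan']}"}


state: AgentState = {
    "chat_history": ["user: find clips of dunks"],
    "active_plan": "find dunk clips",
}
new_state = run_node(state, video_search_node, visible_keys=("active_plan",))
print(new_state["tool_output"])
```

In this sketch the search node receives only the plan, not the chat history, which is one simple way to picture how limiting the information passed between nodes can reduce token usage.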
The above data-flow diagram of Jockey offers a clear visual representation of how information moves through the system. It showcases the decision-making process from the initial query input, through LangGraph's analysis to determine query complexity, and the subsequent routing to either a simple text response or a more complex chain of video processing steps. This includes retrieving videos from Twelve Labs APIs, combining or segmenting video content as needed, and presenting the final results to the user.
By leveraging LangGraph's scalable architecture and Twelve Labs' powerful video APIs, Jockey demonstrates a sophisticated approach to video understanding and manipulation. This combination allows for seamless handling of tasks ranging from video search and editing to generating text based on video content, all orchestrated through an intelligent, conversational interface.
Jockey's architecture is designed to efficiently handle complex video-related tasks through a sophisticated multi-agent system. As illustrated in the diagram above, the system comprises three main components: the Supervisor, the Planner, and the Workers, each playing a crucial role in processing user requests and executing video-related tasks.
The Supervisor acts as the central coordinator, responsible for routing tasks between different nodes and managing the overall workflow. It receives user input and determines the next course of action, whether it's engaging the Planner for complex requests or directing tasks to specific Workers. The Supervisor also handles error recovery and ensures the system adheres to the current plan or initiates replanning when necessary.
The Planner is called upon by the Supervisor to create detailed, step-by-step plans for complex user requests. This component is essential for breaking down intricate tasks into manageable steps that can be executed by the Workers. The Planner's role is particularly important when dealing with multi-step video processing workflows that require a strategic approach.
The Workers section consists of two key components:
This architecture allows Jockey to handle a wide range of video-related tasks with flexibility and precision. The system can dynamically adapt to different types of queries, from simple text responses to complex video manipulation tasks. By leveraging LangGraph's framework, Jockey can efficiently manage the state between nodes, optimize token usage, and provide granular control over the execution of each step in the video processing workflow.
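The Supervisor, Planner, and Worker roles described above can be sketched as a small dispatch loop. This is a toy illustration under stated assumptions, not Jockey's implementation: the `WORKERS` registry, the keyword-based `planner`, and the `needs_video` callback are hypothetical stand-ins for what would be LLM calls and Twelve Labs API calls in the real system, and error recovery and replanning are omitted.

```python
from typing import Callable

# Hypothetical worker registry; the keys mirror the node names mentioned in the article.
WORKERS: dict = {
    "video-search": lambda task: f"[clips matching '{task}']",
    "video-editing": lambda task: f"[edited video for '{task}']",
    "video-text-generation": lambda task: f"[text about '{task}']",
}


def planner(request: str) -> list:
    """Toy planner returning (worker, task) steps; a real planner would be an LLM call."""
    steps = [("video-search", request)]
    if "combine" in request or "highlight" in request:
        steps.append(("video-editing", "combine found clips"))
    return steps


def supervisor(request: str, needs_video: Callable[[str], bool]) -> list:
    """Answer simple requests directly; otherwise plan, then dispatch each step."""
    if not needs_video(request):
        return ["text reply: " + request]
    results = []
    for worker_name, task in planner(request):
        results.append(WORKERS[worker_name](task))
    return results


def is_video_request(request: str) -> bool:
    # Crude keyword check standing in for the supervisor's routing decision.
    return "clip" in request or "video" in request


simple = supervisor("hello", is_video_request)
complex_result = supervisor("highlight reel of dunk clips", is_video_request)
print(simple)
print(complex_result)
```

In a layout like this, adding a new worker amounts to registering one more entry in the registry and teaching the planner when to emit a step for it.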
The modular design of Jockey also facilitates customization and extension. Developers can modify prompts, extend the state to handle more complex scenarios, or add new workers to address specific use cases, making Jockey a versatile foundation for building advanced video AI applications.
Prompt-level customization uses Jockey's language model capabilities to introduce new functionality without changing the underlying code. For example, you could create a prompt that instructs Jockey to identify and extract specific types of scenes from videos, such as action sequences or emotional moments, without changing the core system.
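As a rough illustration of adding a capability through a prompt alone, the sketch below appends feature-specific instructions to a base prompt. The prompt texts and the `with_feature` helper are hypothetical and are not taken from the Jockey repository; the actual prompts and how they are loaded may differ.

```python
# Hypothetical base prompt; Jockey's real prompts live in its repository.
BASE_PLANNER_PROMPT = (
    "You are a planner for a video agent. Break the request into steps."
)

# Hypothetical feature instructions describing the scene-extraction behavior.
SCENE_EXTRACTION = """
When the user asks for a type of scene (for example action sequences or
emotional moments), search for matching segments and return their start and
end timestamps.
"""


def with_feature(base_prompt: str, feature_instructions: str) -> str:
    """Append feature-specific instructions to an existing prompt."""
    return base_prompt.rstrip() + "\n\n" + feature_instructions.strip()


prompt = with_feature(BASE_PLANNER_PROMPT, SCENE_EXTRACTION)
print(prompt)
```

No node or routing code changes in this sketch; only the text handed to the language model does.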
For more substantial customizations, developers can directly modify Jockey's components:
Jockey represents a powerful fusion of LangGraph's flexible agent framework and Twelve Labs' cutting-edge video understanding APIs, opening up new possibilities for intelligent video processing and interaction. To get started, visit the Jockey GitHub repository to access the source code and documentation: https://github.com/twelvelabs-io/tl-jockey. You can deploy Jockey locally for development and testing, or leverage LangGraph Cloud for scalable production deployments.
To run Jockey, you need to sign up for a Twelve Labs account and get your API key from https://api.twelvelabs.io/. With Twelve Labs' comprehensive suite of video understanding tools and Jockey's intelligent orchestration, you'll be well-equipped to build the next generation of video-powered applications that can search, analyze, and manipulate video content with unprecedented ease and accuracy.
Finally, join Twelve Labs' Multimodal Minds Discord server if you have questions or encounter issues when working with Jockey!