Tutorial
How to perform Video OCR using Twelve Labs API?


Ankit Khare
A tutorial on performing Video OCR using Twelve Labs API


May 19, 2023
13 min
Introduction
Video Optical Character Recognition (OCR) involves detecting and extracting text from video frames using computer vision and machine learning algorithms. With video OCR, you can easily sift through your video content, pinpointing the exact moments where certain words, phrases or even entire sentences make their appearance on the screen. Imagine the applications - from streamlining content search and navigation, to diving deep into content analysis, optimizing advertisement placement, summing up content, turbocharging SEO, and ensuring compliance and monitoring.
Examples of elements that can be recognized by video OCR include:
Slide content during presentations or meetings
Product names as they're showcased on screen, such as in advertisements, films, or TV shows
Athlete or team names as they're displayed on jerseys during sports broadcasts
Nametags and names visible during meetings or conferences
Scribbles on whiteboards within lecture videos
Documents captured within video footage
Handwritten texts appearing on screen
License plate numbers and building names
Subtitles, captions, and ending credits within films and interviews
In this tutorial, we will explore how the Twelve Labs platform enables video OCR at two distinct levels. At the video level, we take on an entire video in one fell swoop, harvesting every morsel of text it holds. At the index level, we sharpen the focus to a specific keyword or cluster of keywords, which we input as natural language queries to perform a comprehensive search across a library of videos indexed on the Twelve Labs platform.
The cherry on top? With Twelve Labs API at your disposal, you can accomplish all of this without worrying about the nitty-gritty of implementing and maintaining the OCR process. We've got your back from development to infrastructure, and even ongoing support. So gear up, and let's embark on this exciting expedition into the realm of video OCR together.
Prerequisites
The Twelve Labs platform is presently in its open beta phase, and we are offering free video indexing credits for up to 10 hours upon sign-up. It'll be advantageous for you to sign up and get acquainted with the foundational aspects of the Twelve Labs platform before diving into this tutorial. Things like understanding video indexing, indexing options, the Task API, and search options are vital to following this tutorial smoothly, all of which I've covered extensively in my first tutorial. However, if you hit a roadblock or find yourself lost at any juncture, don't hesitate to reach out. By the way, our response times on our Discord server are lightning fast 🚅🏎️⚡️ if Discord is your preferred platform.
Quick tour of the tutorial
Following our previous discourse, we will explore video OCR from two distinct angles and levels. Accordingly, I've divided this tutorial into two pivotal sections, followed by a finale where we bring everything together in a working demo web app:
Video OCR - A three-step process
The process of extracting all recognized text from a specific video entails these three steps; a compact code sketch of the flow follows the list:
Video Indexing - No surprises at this step; if you've been following along with my past tutorials, this step should feel like a familiar friend.
Retrieve the unique identifier of the video - Once the Twelve Labs platform finishes indexing our video, we will retrieve the unique identifier of the video we require the OCR for.
Extract the text that appears on the screen - We'll pinpoint the video by using the specific index we created and the video id associated with the video we need OCR for. The API will do the heavy lifting, serving up the results we're after.
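To make the flow concrete, here's a compact sketch of steps 2 and 3 (step 1, indexing, is assumed to have already finished). The `<YOUR_API_KEY>` and `<YOUR_INDEX_ID>` values are placeholders; the full, runnable versions of each step follow later in this tutorial.
<pre><code class="python">import requests

API_URL = "https://api.twelvelabs.io/v1.1"
headers = {"x-api-key": "<YOUR_API_KEY>"}   # placeholder: your Twelve Labs API key
INDEX_ID = "<YOUR_INDEX_ID>"                # placeholder: the index created in step 1

# Step 2: list the videos in the index and grab the ID of the one we want OCR for
videos = requests.get(f"{API_URL}/indexes/{INDEX_ID}/videos", headers=headers).json()
video_id = videos["data"][0]["_id"]

# Step 3: retrieve every piece of on-screen text recognized in that video
ocr = requests.get(
    f"{API_URL}/indexes/{INDEX_ID}/videos/{video_id}/text-in-video",
    headers=headers,
).json()
for item in ocr["data"]:
    print(item["start"], item["end"], item["value"])
</code></pre>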
Text-in-video Search - searching for specific text within all indexed videos
Video OCR lets us scrutinize an entire video and distill every instance of on-screen text. The text-in-video search feature, in turn, empowers us to zero in on the precise moments or video snippets where the searched text appears. This greatly diminishes the time spent perusing a sizable catalogue of videos, yielding accurate search results based on how well the search terms align with the text that becomes visible on screen during playback.
In our initial tutorials, we delved into content search within indexed videos, using natural language queries and various search options like visual (audio-visual search), conversation (dialogue search), and text-in-video (OCR). In this tutorial, we're going to repurpose our approach, harnessing only OCR technology to search for text within videos. To optimize processing time and costs, we'll create an index using solely the text_in_video indexing option. Then, we'll fire off our search query with the text_in_video search option, enabling us to discover relevant text matches within the indexed videos.
Building the Demo App
To bring it all home, we'll take the data yielded by the API endpoints and showcase them on a webpage, spinning up a Flask-based demo app that serves up a simple HTML page. The result of the video OCR will be neatly tabulated, displaying timestamps and associated text, while the text search will show the query we used and the corresponding video segments we found in response.
Video OCR - A three-step process
For the sake of simplicity, I've uploaded just two videos to an index using a pre-existing account. Feel free to sign up; given we're currently in open beta, you'll receive complimentary credits allowing you to index up to 10 hours of video content. If your needs extend beyond that, check out our pricing page for upgrading to the Developer plan.
Video Indexing
Here, we’re going to delve into the essential elements that we'll need to include in our Jupyter notebook. This includes the necessary imports, defining API URLs, creating the index, and uploading videos from our local file system to kick off the indexing process:
<pre><code class="python">%env API_URL = https://api.twelvelabs.io/v1.1
%env API_KEY= tlk_2FGGACN2TFAH1N2H1HBXR0BDQ9GV

!pip install requests

import os
import requests
import glob
from pprint import pprint

# Retrieve the URL of the API and my API key
API_URL = os.getenv("API_URL")
assert API_URL
API_KEY = os.getenv("API_KEY")
assert API_KEY
</code></pre>
<pre><code class="python"># Construct the URL of the `/indexes` endpoint
INDEXES_URL = f"{API_URL}/indexes"

# Set the header of the request
default_header = {
    "x-api-key": API_KEY
}

# Define a function to create an index with a given name
def create_index(index_name, index_options, engine):
    # Declare a dictionary named data
    data = {
        "engine_id": engine,
        "index_options": index_options,
        "index_name": index_name,
    }
    # Create an index
    response = requests.post(INDEXES_URL, headers=default_header, json=data)
    # Store the unique identifier of your index
    INDEX_ID = response.json().get('_id')
    # Check if the status code is 201 and print success
    if response.status_code == 201:
        print(f"Status code: {response.status_code} - The request was successful and a new index was created.")
    else:
        print(f"Status code: {response.status_code}")
    pprint(response.json())
    return INDEX_ID

# Create the index
INDEX_ID = create_index(index_name="extract_text", index_options=["text_in_video"], engine="marengo2.5")

# Print the created index ID
print(f"Created index ID: {INDEX_ID}")
</code></pre>
Next, we'll upload two videos to the index we've just created. The videos are titled "A Brief History of Film" (courtesy of Film Thought Project, available at https://www.youtube.com/watch?v=utntGgcsZWI) and "GPT - Explained!" (courtesy of CodeEmporium, available at https://www.youtube.com/watch?v=3IweGfgytgY). I have downloaded these videos from their respective YouTube channels and saved them in a folder named 'static' on my local hard drive. We'll use these local files to index the videos onto the Twelve Labs platform:
<pre><code class="python">import os
import requests
from concurrent.futures import ThreadPoolExecutor

TASKS_URL = f"{API_URL}/tasks"
TASK_ID_LIST = []
video_folder = 'static'  # folder containing the video files

def upload_video(file_name):
    # Validate if a video already exists in the index
    task_list_response = requests.get(
        TASKS_URL,
        headers=default_header,
        params={"index_id": INDEX_ID, "filename": file_name},
    )
    if "data" in task_list_response.json():
        task_list = task_list_response.json()["data"]
        if len(task_list) > 0:
            if task_list[0]['status'] == 'ready':
                print(f"Video '{file_name}' already exists in index {INDEX_ID}")
            else:
                print("task pending or validating")
            return

    # Proceed further to create a new task to index the current video if the video didn't exist in the index already
    print("Entering task creation code for the file: ", file_name)
    if file_name.endswith('.mp4'):  # Make sure the file is an MP4 video
        file_path = os.path.join(video_folder, file_name)  # Get the full path of the video file
        with open(file_path, "rb") as file_stream:
            data = {
                "index_id": INDEX_ID,
                "language": "en"
            }
            # The video will be indexed on the platform using the same name as the video file itself.
            file_param = [
                ("video_file", (file_name, file_stream, "application/octet-stream")),
            ]
            response = requests.post(TASKS_URL, headers=default_header, data=data, files=file_param)
            TASK_ID = response.json().get("_id")
            TASK_ID_LIST.append(TASK_ID)
            # Check if the status code is 201 and print success
            if response.status_code == 201:
                print(f"Status code: {response.status_code} - The request was successful and a new resource was created.")
            else:
                print(f"Status code: {response.status_code}")
            print(f"File name: {file_name}")
            pprint(response.json())
            print("\n")

# Get list of video files
video_files = [f for f in os.listdir(video_folder) if f.endswith('.mp4')]

# Create a ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
    # Use executor to run upload_video in parallel for all video files
    executor.map(upload_video, video_files)
</code></pre>
Retrieve the unique identifier of the video
Now let's enumerate all the videos in our index. This allows us to retain the video ID of a specific video, the goal being to extract all the text embedded within it. Furthermore, akin to our methods in prior tutorials, I'm assembling a list of video IDs and their respective titles, designed to be subsequently fed into our Flask application.
<pre><code class="python"># List all the videos in an index
default_header = {
    "x-api-key": API_KEY
}

INDEX_ID = '644a73aa8b1dd6cde172a933'
INDEXES_VIDEOS_URL = f"{API_URL}/indexes/{INDEX_ID}/videos"

response = requests.get(INDEXES_VIDEOS_URL, headers=default_header)
response_json = response.json()
pprint(response_json)

video_id_name_list = [
    {'video_id': video['_id'], 'video_name': video['metadata']['filename']}
    for video in response_json['data']
]
pprint(video_id_name_list)
</code></pre>
Output:
<pre><code class="python">{'data': [{'_id': '###a917186daab572f349243',
           'created_at': '2023-04-27T14:18:48Z',
           'metadata': {'duration': 1300.173875,
                        'engine_id': 'marengo2.5',
                        'filename': 'A Brief History of Film.mp4',
                        'fps': 23.976023976023978,
                        'height': 720,
                        'size': 188214297,
                        'width': 1280},
           'updated_at': '2023-04-27T14:20:11Z'},
          {'_id': '###3da86daab572f349241',
           'created_at': '2023-04-27T13:08:19Z',
           'metadata': {'duration': 550.7,
                        'engine_id': 'marengo2.5',
                        'filename': 'GPT - Explained!.mp4',
                        'fps': 30,
                        'height': 720,
                        'size': 22838593,
                        'width': 1152},
           'updated_at': '2023-04-27T13:08:42Z'}],
 'page_info': {'limit_per_page': 10,
               'page': 1,
               'total_duration': 5402.873875,
               'total_page': 1,
               'total_results': 3}}
[{'video_id': '###a849b86daab572f349242',
  'video_name': 'A Brief History of Film.mp4'},
 {'video_id': '###a73da86daab572f349241',
  'video_name': 'GPT - Explained!.mp4'}]
</code></pre>
Extract the text that appears on the screen
Time to put our plan into action! We'll now proceed to extract all textual content from the chosen video:
<pre><code class="python">VIDEO_ID = '###a849b86daab572f349242'
TEXT_IN_VIDEO_URL = f"{API_URL}/indexes/{INDEX_ID}/videos/{VIDEO_ID}/text-in-video"

response = requests.get(TEXT_IN_VIDEO_URL, headers=default_header)
print(f"Status code: {response.status_code}")

ocr_data = response.json()
pprint(ocr_data)
</code></pre>
Output:
<pre><code class="python">Status code: 200
{'data': [{'end': 3, 'start': 1, 'value': 'Film Thought Project'},
          {'end': 6, 'start': 5, 'value': 'Film'},
          {'end': 22, 'start': 18, 'value': "'L'arrivée d'un train en gare de La Ciotat"},
          {'end': 28, 'start': 18, 'value': 'Year:'},
          {'end': 28, 'start': 23, 'value': '2015'},
          {'end': 28, 'start': 23, 'value': 'Production Co.'},
          {'end': 28, 'start': 23, 'value': 'Alejandro G. Iñárritu'},
          {'end': 28, 'start': 23, 'value': 'Regency Enterprises'},
          {'end': 28, 'start': 23, 'value': "'The Revenant'"},
          {'end': 30, 'start': 29, 'value': "Let's"},
          {'end': 40, 'start': 32, 'value': 'Film:'},
          {'end': 34, 'start': 33, 'value': 'Film Thought Project'},
          {'end': 40, 'start': 35, 'value': 'Director:'},
          {'end': 40, 'start': 35, 'value': 'Production Co.'},
          {'end': 40, 'start': 36, 'value': 'Alfred Hitchcock'},
          {'end': 40, 'start': 36, 'value': '1958'},
          {'end': 40, 'start': 36, 'value': 'Alfred J. Hitchcock Productions'},
          {'end': 40, 'start': 37, 'value': 'Year:'},
          {'end': 40, 'start': 38, 'value': "'Vertigo'"},
          {'end': 45, 'start': 44, 'value': 'PRESS START'},
          {'end': 46, 'start': 45, 'value': '2020'},
          {'end': 47, 'start': 46, 'value': '2018'},
          {'end': 48, 'start': 47, 'value': '1975'},
          {'end': 53, 'start': 49, 'value': '1870s'},
          {'end': 61, 'start': 67, 'value': 'Eadweard Muybridge'},
          {'end': 69, 'start': 75, 'value': 'See you soon'}],
 'id': '###a849b86daab572f349242',
 'index_id': '###a73aa8b1dd6cde172a933'}
</code></pre>
As you can see, the API extracted all the text that appears on screen, line by line, like a charm. You can save these results as metadata for downstream workflows such as filtering, classifying, and searching content.
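As an example of that, here's one simple way you might persist the OCR results as metadata: dump the `ocr_data` response we just printed to a JSON file next to the video. The filename and structure below are just illustrative choices, not something the API prescribes.
<pre><code class="python">import json

# Keep only the fields downstream jobs care about: the video ID and each
# recognized text snippet with its start/end timestamps (in seconds).
metadata = {
    "video_id": ocr_data["id"],
    "texts": [
        {"start": item["start"], "end": item["end"], "text": item["value"]}
        for item in ocr_data["data"]
    ],
}

with open("a_brief_history_of_film_ocr.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, ensure_ascii=False, indent=2)
</code></pre>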
Text-in-video Search - searching for specific text within all indexed videos
Now let's launch our search query with the text_in_video search option to uncover pertinent text matches within our collection of indexed videos:
<pre><code class="python"># Construct the URL of the `/search` endpoint
SEARCH_URL = f"{API_URL}/search/"

# Declare a dictionary named `data`
data = {
    "index_id": INDEX_ID,
    "query": "horse",
    "search_options": ["text_in_video"]
}

# Make a search request
response = requests.post(SEARCH_URL, headers=default_header, json=data)
if response.status_code == 200:
    print(f"Status code: {response.status_code} - Success")
else:
    print(f"Status code: {response.status_code}")

pprint(response.json())
search_data = response.json()
</code></pre>
Output:
<pre><code class="python">Status code: 200 - Success
{'data': [{'confidence': 'high',
           'end': 64,
           'metadata': [{'text': 'THE HORSE IN MOTION.', 'type': 'text_in_video'}],
           'score': 92.28,
           'start': 63,
           'video_id': '###a849b86daab572f349242'},
          {'confidence': 'high',
           'end': 91,
           'metadata': [{'text': 'THE HORSE IN MOTION.', 'type': 'text_in_video'}],
           'score': 92.28,
           'start': 88,
           'video_id': '###a849b86daab572f349242'}],
 'page_info': {'limit_per_page': 10,
               'page_expired_at': '2023-05-12T00:03:43Z',
               'total_results': 2},
 'search_pool': {'index_id': '###a73aa8b1dd6cde172a933',
                 'total_count': 3,
                 'total_duration': 5403}}
</code></pre>
💡Bear in mind that the text-in-video search feature is set up to locate all occurrences within the indexed videos where the input query aligns (not necessarily word-for-word) with the text visually presented on screen as the video plays. For instance, if I enter "horse moving," the system will identify instances where the on-screen text reads "horse in motion." However, the confidence level of this match will be lower than if I had entered "horse in motion". The confidence level depends on the proportion of query words that match the on-screen text. For example, a match on two out of three query words will yield a higher confidence level than a match on only one word.
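To build intuition for that last point, here's a toy word-overlap calculation. This is purely illustrative and is not how the platform actually computes its confidence or score; it just shows why "horse in motion" matches the on-screen text "THE HORSE IN MOTION." more strongly than "horse moving" does.
<pre><code class="python">def word_overlap(query, on_screen_text):
    """Fraction of query words that also appear in the on-screen text (toy heuristic)."""
    query_words = set(query.lower().split())
    screen_words = set(on_screen_text.lower().replace(".", "").split())
    return len(query_words & screen_words) / len(query_words)

print(word_overlap("horse in motion", "THE HORSE IN MOTION."))  # 1.0 -> all query words match
print(word_overlap("horse moving", "THE HORSE IN MOTION."))     # 0.5 -> only 'horse' matches
</code></pre>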

A peek at Twelve Labs Playground's text-in-video search results for a given query

The specific video instance aligning with the input query being played

The model's confidence increases as the query aligns more closely with the on-screen text
Preparing the data for the Flask application to ensure our results will be presented neatly:
<pre><code class="python"># Reshape the search results and group them by the video they belong to
video_data = [
    {'video_id': d['video_id'], 'start': d['start'], 'end': d['end'],
     'confidence': d['confidence'], 'text': d['metadata'][0]['text']}
    for d in search_data['data']
]

video_search_dict = {}
for vd in video_data:
    video_id = vd.pop('video_id')  # group each result under its own video's ID
    if video_id in video_search_dict:
        video_search_dict[video_id].append(vd)
    else:
        video_search_dict[video_id] = [vd]

pprint(video_search_dict)
</code></pre>
Output:
<pre><code class="python">{'###a849b86daab572f349242': [{'confidence': 'high',
                               'end': 64,
                               'start': 63,
                               'text': 'THE HORSE IN MOTION.'},
                              {'confidence': 'high',
                               'end': 91,
                               'start': 88,
                               'text': 'THE HORSE IN MOTION.'}]}
</code></pre>
Further data preparation for the video OCR results, followed by our standard procedure of pickling everything (including the search query, so the web page can display it later):
<pre><code class="python">video_id = ocr_data.get('id')
data_list = ocr_data.get('data')

data_to_save = {
    'video_id': video_id,
    'data_list': data_list,
    'video_id_name_list': video_id_name_list,
    'video_search_dict': video_search_dict,
    'input_query': data['query']  # the search query from the request above, so the template can display it
}

import pickle

# Save data to a pickle file
with open('data.pkl', 'wb') as f:
    pickle.dump(data_to_save, f)
</code></pre>
Building the Demo App
We're now at the final leg of our video OCR adventure - bringing together all elements to animate our results. Besides the standard configuration we implement for fetching videos from the local folder and loading the pickled data dispatched from the Jupyter notebook, this time we have some additional requirements - a conversion of timestamps from a seconds-only format to a minutes-and-seconds format. This makes the data visualization on the webpage more intuitive. Here's the code for the app.py file:
<pre><code class="python">from flask import Flask, render_template, send_from_directory
import pickle
import os

app = Flask(__name__)

# Load data from a pickle file
with open('data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

# Access the data
video_id = loaded_data['video_id']
data_list = loaded_data['data_list']
video_id_name_list = loaded_data['video_id_name_list']
video_search_dict = loaded_data['video_search_dict']
input_query = loaded_data.get('input_query', '')  # the search query, displayed on the results page

VIDEO_DIRECTORY = os.path.join(os.path.dirname(os.path.realpath(__file__)), "static")


@app.route('/<path:filename>')
def serve_video(filename):
    print(VIDEO_DIRECTORY, filename)
    return send_from_directory(directory=VIDEO_DIRECTORY, path=filename)


@app.route('/')
def home():
    # Convert the seconds-only timestamps into a minutes:seconds format for display
    for item in data_list:
        if ":" not in str(item['start']):
            item['start'] = int(item['start'])
            item['start'] = f"{item['start'] // 60}:{item['start'] % 60:02}"
        if ":" not in str(item['end']):
            item['end'] = int(item['end'])
            item['end'] = f"{item['end'] // 60}:{item['end'] % 60:02}"

    video_id_name_dict = {video['video_id']: video['video_name'] for video in video_id_name_list}

    return render_template('index.html',
                           data=data_list[:10],
                           video_id_name_dict=video_id_name_dict,
                           video_id=video_id,
                           input_query=input_query,
                           video_search_dict=video_search_dict)


if __name__ == '__main__':
    app.run(debug=True)
</code></pre>
HTML Template
Now, it's time to craft the final piece: our Jinja2-based HTML template. It uses all the data we've passed in through the Flask app.py file. Our first task is to display the Video OCR results. The video player covers the entire duration of the video, and beneath it a table lists the start time, end time, and text discovered during that interval on screen. For clarity, the timestamps are presented in a minutes-and-seconds format, and they are clickable, letting us jump to a specific timestamp and start playback from that point. Note that the playVideo JavaScript function converts these minutes-and-seconds timestamps back into seconds before seeking, since the HTML video element expects the playback position in seconds.
<pre><code class="language-html"><!DOCTYPE html>
<html>
<head>
    <link rel="shortcut icon" href="#" />
    <title>Video OCR</title>
    <style>
        body {
            text-align: center;
            font-family: Arial, sans-serif;
            color: #333;
            background-color: #f5f5f5;
        }
        h1, h2 {
            color: #444;
        }
        table {
            margin: 0 auto;
            border-collapse: collapse;
            width: 80%;
            margin-top: 20px;
        }
        th, td {
            border: 1px solid #ddd;
            padding: 8px;
            text-align: center;
        }
        th {
            padding-top: 12px;
            padding-bottom: 12px;
            text-decoration: underline;
            color: black;
        }
        video {
            width: 40%;
            height: auto;
            margin-top: 20px;
        }
        /* search style */
        .video-container {
            text-align: center;
            margin-bottom: 2em;
            padding: 1em;
            background-color: #fff;
            border: 1px solid #ddd;
            border-radius: 4px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }
        table {
            margin: 0 auto;
            margin-bottom: 1em;
        }
        th, td {
            padding: 0.5em;
            border: 1px solid #ddd;
        }
    </style>
    <script>
        function playVideo(timeString) {
            var timeParts = timeString.split(":");
            var time = parseInt(timeParts[0]) * 60 + parseInt(timeParts[1]);
            var video = document.querySelector('#mainVideo');
            video.currentTime = time;
            video.play();
        }
    </script>
</head>
<body>
    <h1>Video OCR</h1>
    <h3>Video file: <i>{{ video_id_name_dict[video_id] }}</i></h3>
    <video id="mainVideo" controls>
        <source src="{{ url_for('static', filename=video_id_name_dict[video_id]|string) }}" type="video/mp4">
        Your browser does not support the video tag.
    </video>
    <br />
    <br />
    <br />
    <table>
        <tr>
            <th>Start</th>
            <th>End</th>
            <th>Value</th>
        </tr>
        {% for item in data %}
        <tr>
            <td><a href="javascript:void(0)" onclick="playVideo('{{ item['start'] }}')">{{ item['start'] }}</a></td>
            <td>{{ item['end'] }}</td>
            <td>{{ item['value'] }}</td>
        </tr>
        {% endfor %}
    </table>
    <br />
    <br />
    {% for video_id, results in video_search_dict.items() %}
    <div class="video-container">
        <h1>Text-in-video Search Results</h1>
        <h2>Video file: <i>{{ video_id_name_dict[video_id] }}</i></h2>
        <h2>Entered query: <i>{{ input_query }}</i></h2>
        {% for result in results %}
        <video controls preload="metadata" style="width: 40%;">
            <source src="{{ url_for('static', filename=video_id_name_dict[video_id]) }}#t={{ result['start'] }},{{ result['end'] }}" type="video/mp4">
            Your browser does not support the video tag.
        </video>
        <table>
            <tr>
                <th>Start</th>
                <th>End</th>
                <th>Confidence</th>
                <th>Text</th>
            </tr>
            <tr>
                <td>{{ result['start'] }}</td>
                <td>{{ result['end'] }}</td>
                <td>{{ result['confidence'] }}</td>
                <td>{{ result['text'] }}</td>
            </tr>
        </table>
        {% endfor %}
    </div>
    {% endfor %}
</body>
</html>
</code></pre>
Running the Flask app
Awesome! Let's just run the last cell of our Jupyter notebook to launch our Flask app:
<pre><code class="python">%run app.py </code></pre>
You should see an output similar to the one below, confirming that everything went as anticipated 😊:

After clicking on the URL link http://127.0.0.1:5000, you should be greeted with the following web page:


Here's the Jupyter Notebook containing the complete code that we've put together throughout this tutorial - https://drive.google.com/drive/folders/1D97_UU2Z0lvp3y52BHV5GKkSNOQKv3Xi?usp=share_link
Outro
Anticipate more thrilling content on the horizon! If you haven't already, I warmly invite you to become part of our lively Discord community, teeming with individuals who share a fervor for multimodal AI.
See you next time,
Ankit
Crafting stellar Developer Experiences @Twelve Labs
Cheers!
Introduction
Video Optical Character Recognition (OCR) involves detecting and extracting text from video frames using computer vision and machine learning algorithms. With video OCR, you can easily sift through your video content, pinpointing the exact moments where certain words, phrases or even entire sentences make their appearance on the screen. Imagine the applications - from streamlining content search and navigation, to diving deep into content analysis, optimizing advertisement placement, summing up content, turbocharging SEO, and ensuring compliance and monitoring.
Examples of elements that can be recognized by video OCR include:
Slide content during presentations or meetings
Product names as they're showcased on screen, such as in advertisements, films, or TV shows
Athlete or team names as they're displayed on jerseys during sports broadcasts
Nametags and names visible during meetings or conferences
Scribbles on whiteboards within lecture videos
Documents captured within video footage
Handwritten texts appearing on screen
License plate numbers and building names
Subtitles, captions, and ending credits within films and interviews
In this tutorial we will explore how the Twelve Labs platform enables video OCR at two distinct levels. On the video level, we're taking on the entire video in one fell swoop, harnessing every morsel of text it holds. On the other hand, the index level approach sharpens our focus, honing in on a specific keyword or a cluster of keywords, which we'll input as natural language queries to perform a comprehensive search across a library of videos indexed on the Twelve Labs platform.
The cherry on top? With Twelve Labs API at your disposal, you can accomplish all of this without worrying about the nitty-gritty of implementing and maintaining the OCR process. We've got your back from development to infrastructure, and even ongoing support. So gear up, and let's embark on this exciting expedition into the realm of video OCR together.
Prerequisites
The Twelve Labs platform is presently in its open beta phase, and we are offering free video indexing credits for up to 10 hours upon sign-up. It'll be advantageous for you to signup and get acquainted with the foundational aspects of the Twelve Labs platform before diving into this tutorial. Things like understanding video indexing, indexing options, the Task API, and search options are vital to smoothly follow through this tutorial, all of which I've covered extensively in my first tutorial. However, if you hit a roadblock or find yourself lost at any juncture, don't hesitate to reach out. By the way, our response times on our Discord server are lightning fast 🚅🏎️⚡️ if Discord is your preferred platform.
Quick tour of the tutorial
Following our previous discourse, we will explore video OCR tackling it from two distinct angles and levels. Accordingly, I've divided this tutorial into two pivotal sections, followed by a finale where we bring everything together in a working demo web-app:
Video OCR - A three step process
The process of extracting all recognized text from a specific video entails these three steps:
Video Indexing - No surprises at this step; if you've been following along with my past tutorials, this step should feel like a familiar friend.
Retrieve the unique identifier of the video - Once Twelve Labs platform finishes indexing our video we will retrieve the unique identifier of the video we require the OCR for.
Extract the text that appears on the screen - We'll pinpoint the video by using the specific index we created and the video id associated with the video we need OCR for. The API will do the heavy lifting, serving up the results we're after.
Text-in-video Search - searching for specific text within all indexed videos
Video OCR enabled us to scrutinize an entire video and distill all instances of text. Now, the text-in-video search feature empowers us to zero in on precise moments or video snippets where the input or searched text materializes. This greatly diminishes the time spent perusing a sizable catalogue of videos, yielding accurate search results predicated on alignment of search terms with the text that becomes visible on screen during video playbacks.
In our initial tutorials, we delved into content search within indexed videos, using natural language queries and various search options like visual (audio-visual search), conversation (dialogue search), and text-in-video (OCR). In this tutorial, we're going to repurpose our approach, harnessing only OCR technology to search for text within videos. To optimize processing time and costs, we'll create an index using solely the text_in_video indexing option. Then, we'll fire off our search query with the text_in_video search option, enabling us to discover relevant text matches within the indexed videos.
Building the Demo App
To bring it all home, we'll take the data yielded by the API endpoints and showcase them on a webpage, spinning up a Flask-based demo app that serves up a simple HTML page. The result of the video OCR will be neatly tabulated, displaying timestamps and associated text, while the text search will show the query we used and the corresponding video segments we found in response.
Video OCR - A three step process
For the sake of simplicity, I've uploaded just two videos to an index using a pre-existing account. Feel free to sign up; given we're currently in open beta, you'll receive complimentary credits allowing you to index up to 10 hours of video content. If your needs extend beyond that, check out our pricing page for upgrading to the Developer plan.
Video Indexing
Here, we’re going to delve into the essential elements that we'll need to include in our Jupyter notebook. This includes the necessary imports, defining API URLs, creating the index, and uploading videos from our local file system to kick off the indexing process:
<pre><code class="python">%env API_URL = https://api.twelvelabs.io/v1.1 %env API_KEY= tlk_2FGGACN2TFAH1N2H1HBXR0BDQ9GV !pip install requests import os import requests import glob from pprint import pprint # Retrieve the URL of the API and my API key API_URL = os.getenv("API_URL") assert API_URL API_KEY = os.getenv("API_KEY") assert API_KEY </code></pre>
<pre><code class="python"># Construct the URL of the `/indexes` endpoint INDEXES_URL = f"{API_URL}/indexes" # Set the header of the request default_header = { "x-api-key": API_KEY } # Define a function to create an index with a given name def create_index(index_name, index_options, engine): # Declare a dictionary named data data = { "engine_id": engine, "index_options": index_options, "index_name": index_name, } # Create an index response = requests.post(INDEXES_URL, headers=default_header, json=data) # Store the unique identifier of your index INDEX_ID = response.json().get('_id') # Check if the status code is 201 and print success if response.status_code == 201: print(f"Status code: {response.status_code} - The request was successful and a new index was created.") else: print(f"Status code: {response.status_code}") pprint(response.json()) return INDEX_ID # Create the indexes INDEX_ID = create_index(index_name = "extract_text", index_options=["text_in_video"], engine = "marengo2.5") # Print the created index IDs print(f"Created index IDs: {INDEX_ID}") </code></pre>
Uploading two videos to the index we've just created. The videos are titled "A Brief History of Film" (courtesy of Film Thought Project, available at https://www.youtube.com/watch?v=utntGgcsZWI) and "GPT - Explained!" (courtesy of CodeEmporium, available at https://www.youtube.com/watch?v=3IweGfgytgY). I have downloaded these videos from their respective YouTube channels and saved them in a folder named 'static' on my local hard drive. We'll use these local files to index the videos onto the Twelve Labs platform:
<pre><code class="python">import os import requests from concurrent.futures import ThreadPoolExecutor TASKS_URL = f"{API_URL}/tasks" TASK_ID_LIST = [] video_folder = 'static' # folder containing the video files def upload_video(file_name): # Validate if a video already exists in the index task_list_response = requests.get( TASKS_URL, headers=default_header, params={"index_id": INDEX_ID, "filename": file_name}, ) if "data" in task_list_response.json(): task_list = task_list_response.json()["data"] if len(task_list) > 0: if task_list[0]['status'] == 'ready': print(f"Video '{file_name}' already exists in index {INDEX_ID}") else: print("task pending or validating") return # Proceed further to create a new task to index the current video if the video didn't exist in the index already print("Entering task creation code for the file: ", file_name) if file_name.endswith('.mp4'): # Make sure the file is an MP4 video file_path = os.path.join(video_folder, file_name) # Get the full path of the video file with open(file_path, "rb") as file_stream: data = { "index_id": INDEX_ID, "language": "en" } file_param = [ ("video_file", (file_name, file_stream, "application/octet-stream")),] #The video will be indexed on the platform using the same name as the video file itself. response = requests.post(TASKS_URL, headers=default_header, data=data, files=file_param) TASK_ID = response.json().get("_id") TASK_ID_LIST.append(TASK_ID) # Check if the status code is 201 and print success if response.status_code == 201: print(f"Status code: {response.status_code} - The request was successful and a new resource was created.") else: print(f"Status code: {response.status_code}") print(f"File name: {file_name}") pprint(response.json()) print("\n") # Get list of video files video_files = [f for f in os.listdir(video_folder) if f.endswith('.mp4')] # Create a ThreadPoolExecutor with ThreadPoolExecutor() as executor: # Use executor to run upload_video in parallel for all video files executor.map(upload_video, video_files) </code></pre>
Retrieve the unique identifier of the video
Now let's enumerate all the videos in our index. This allows us to retain the video ID of a specific video, the goal being to extract all the text embedded within it. Furthermore, akin to our methods in prior tutorials, I'm assembling a list of video IDs and their respective titles, designed to be subsequently fed into our Flask application.
<pre><code class="python"># List all the videos in an index default_header = { "x-api-key": API_KEY } INDEX_ID='644a73aa8b1dd6cde172a933' INDEXES_VIDEOS_URL = f"{API_URL}/indexes/{INDEX_ID}/videos" response = requests.get(INDEXES_VIDEOS_URL, headers=default_header) response_json = response.json() pprint(response_json) video_id_name_list = [{'video_id': video['_id'], 'video_name': video['metadata']['filename']} for video in response_json['data']] pprint(video_id_name_list) </code></pre>
Output:
<pre><code class="python">{'data': [{'_id': '###a917186daab572f349243', 'created_at': '2023-04-27T14:18:48Z', 'metadata': {'duration': 1300.173875, 'engine_id': 'marengo2.5', 'filename': 'A Brief History of Film.mp4', 'fps': 23.976023976023978, 'height': 720, 'size': 188214297, 'width': 1280}, 'updated_at': '2023-04-27T14:20:11Z'}, {'_id': '###3da86daab572f349241', 'created_at': '2023-04-27T13:08:19Z', 'metadata': {'duration': 550.7, 'engine_id': 'marengo2.5', 'filename': 'GPT - Explained!.mp4', 'fps': 30, 'height': 720, 'size': 22838593, 'width': 1152}, 'updated_at': '2023-04-27T13:08:42Z'}], 'page_info': {'limit_per_page': 10, 'page': 1, 'total_duration': 5402.873875, 'total_page': 1, 'total_results': 3}} [{'video_id': '###a849b86daab572f349242', 'video_name': 'A Brief History of Film.mp4'}, {'video_id': '###a73da86daab572f349241', 'video_name': 'GPT - Explained!.mp4'}] </code></pre>
Extract the text that appears on the screen
Time to put our plan into action! We'll now proceed to extract all textual content from the chosen video:
<pre><code class="python">VIDEO_ID = '###a849b86daab572f349242' TEXT_IN_VIDEO_URL = f"{API_URL}/indexes/{INDEX_ID}/videos/{VIDEO_ID}/text-in-video" response = requests.get(TEXT_IN_VIDEO_URL, headers=default_header) print (f"Status code: {response.status_code}") ocr_data = response.json() pprint (ocr_data) </code></pre>
Output:
<pre><code class="python">Status code: 200 {'data': [{'end': 3, 'start': 1, 'value': 'Film Thought Project'}, {'end': 6, 'start': 5, 'value': 'Film'}, {'end': 22, 'start': 18, 'value': "'L'arrivée d'un train en gare de La Ciotat"}, {'end': 28, 'start': 18, 'value': 'Year:'}, {'end': 28, 'start': 23, 'value': '2015'}, {'end': 28, 'start': 23, 'value': 'Production Co.'}, {'end': 28, 'start': 23, 'value': 'Alejandro G. Iñárritu'}, {'end': 28, 'start': 23, 'value': 'Regency Enterprises'}, {'end': 28, 'start': 23, 'value': "'The Revenant'"}, {'end': 30, 'start': 29, 'value': "Let's"}, {'end': 40, 'start': 32, 'value': 'Film:'}, {'end': 34, 'start': 33, 'value': 'Film Thought Project'}, {'end': 40, 'start': 35, 'value': 'Director:'}, {'end': 40, 'start': 35, 'value': 'Production Co.'}, {'end': 40, 'start': 36, 'value': 'Alfred Hitchcock'}, {'end': 40, 'start': 36, 'value': '1958'}, {'end': 40, 'start': 36, 'value': 'Alfred J. Hitchcock Productions'}, {'end': 40, 'start': 37, 'value': 'Year:'}, {'end': 40, 'start': 38, 'value': "'Vertigo'"}, {'end': 45, 'start': 44, 'value': 'PRESS START'}, {'end': 46, 'start': 45, 'value': '2020'}, {'end': 47, 'start': 46, 'value': '2018'}, {'end': 48, 'start': 47, 'value': '1975'}, {'end': 53, 'start': 49, 'value': '1870s'}, {'end': 61, 'start': 67, 'value': 'Eadweard Muybridge'}, {'end': 69, 'start': 75, 'value': 'See you soon'}], 'id': '###a849b86daab572f349242', 'index_id': '###a73aa8b1dd6cde172a933'} </code></pre>
As you can see, the API extracted all the texts on screen, line by line, like a charm. You can save these texts as metadata for downstream workflows such as filtering, classifying and searching content.
Text-in-video Search - searching for specific text within all indexed videos
Launching our search query utilizing the text_in_video search option to uncover pertinent text matches within our collection of indexed videos:
<pre><code class="python"># Construct the URL of the `/search` endpoint SEARCH_URL = f"{API_URL}/search/" # Declare a dictionary named `data` data = { "index_id": INDEX_ID, "query": "horse", "search_options": [ "text_in_video" ] } # Make a search request response = requests.post(SEARCH_URL, headers=default_header, json=data) if response.status_code == 200: print(f"Status code: {response.status_code} - Success") else: print(f"Status code: {response.status_code}") pprint(response.json()) search_data = response.json() </code></pre>
Output:
<pre><code class="python">Status code: 200 - Success {'data': [{'confidence': 'high', 'end': 64, 'metadata': [{'text': 'THE HORSE IN MOTION.', 'type': 'text_in_video'}], 'score': 92.28, 'start': 63, 'video_id': '###a849b86daab572f349242'}, {'confidence': 'high', 'end': 91, 'metadata': [{'text': 'THE HORSE IN MOTION.', 'type': 'text_in_video'}], 'score': 92.28, 'start': 88, 'video_id': '###a849b86daab572f349242'}], 'page_info': {'limit_per_page': 10, 'page_expired_at': '2023-05-12T00:03:43Z', 'total_results': 2}, 'search_pool': {'index_id': '###a73aa8b1dd6cde172a933', 'total_count': 3, 'total_duration': 5403}} </code></pre>
💡Bear in mind that the text-in-video search feature is set up to locate all occurrences within the indexed videos where the input query aligns (not necessarily word-for-word) with the text visually presented on screen as the video plays. For instance, if I enter "horse moving," the system will identify instances where the on-screen text reads "horse in motion." However, the confidence level of this match will be lower compared to when I input "horse in motion”. The confidence level depends on the percentage of words matched with the natural language query we input. For example, a two out of three-word match will yield a higher confidence level than a match with only one word.

A peek at Twelve Labs Playground's text-in-video search results for a given query

The specific video instance aligning with the input query being played

The model's confidence increases as soon as the query aligns with the on-screen text
Preparing the data for the Flask application to ensure our results will be presented neatly:
<pre><code class="python">video_data = [{'start': d['start'], 'end': d['end'], 'confidence': d['confidence'], 'text': d['metadata'][0]['text']} for d in search_data['data']] video_search_dict = {} for vd in video_data: if search_data['data'][0]['video_id'] in video_search_dict: video_search_dict[search_data['data'][0]['video_id']].append(vd) else: video_search_dict[search_data['data'][0]['video_id']] = [vd] pprint(video_search_dict) </code></pre>
Output:
<pre><code class="python"> {'###a849b86daab572f349242': [{'confidence': 'high', 'end': 64, 'start': 63, 'text': 'THE HORSE IN MOTION.'}, {'confidence': 'high', 'end': 91, 'start': 88, 'text': 'THE HORSE IN MOTION.'}]} </code></pre>
Further data preparation for the video OCR results, followed by our standard procedure of pickling everything:
<pre><code class="python">video_id = ocr_data.get('id') data_list = ocr_data.get('data') data_to_save = { 'video_id': video_id, 'data_list': data_list, 'video_id_name_list': video_id_name_list, 'video_search_dict': video_search_dict } import pickle # Save data to a pickle file with open('data.pkl', 'wb') as f: pickle.dump(data_to_save, f) </code></pre>
Building the Demo App
We're now at the final leg of our video OCR adventure - bringing together all elements to animate our results. Besides the standard configuration we implement for fetching videos from the local folder and loading the pickled data dispatched from the Jupyter notebook, this time we have some additional requirements - a conversion of timestamps from a seconds-only format to a minutes-and-seconds format. This makes the data visualization on the webpage more intuitive. Here's the code for the app.py file:
<pre><code class="python">from flask import Flask, render_template, send_from_directory import pickle import os from collections import defaultdict app = Flask(__name__) # Load data from a pickle file with open('data.pkl', 'rb') as f: loaded_data = pickle.load(f) # Access the data video_id = loaded_data['video_id'] data_list = loaded_data['data_list'] video_id_name_list = loaded_data['video_id_name_list'] video_search_dict = loaded_data['video_search_dict'] VIDEO_DIRECTORY = os.path.join(os.path.dirname(os.path.realpath(__file__)), "static") @app.route('/<path:filename>') def serve_video(filename): print(VIDEO_DIRECTORY, filename) return send_from_directory(directory=VIDEO_DIRECTORY, path=filename) @app.route('/') def home(): for item in data_list: if ":" not in str(item['start']): item['start'] = int(item['start']) item['start'] = f"{item['start'] // 60}:{item['start'] % 60:02}" if ":" not in str(item['end']): item['end'] = int(item['end']) item['end'] = f"{item['end'] // 60}:{item['end'] % 60:02}" video_id_name_dict = {video['video_id']: video['video_name'] for video in video_id_name_list} # video_name = video_id_name_dict.get(video_id) return render_template('index.html', data=data_list[:10], video_id_name_dict=video_id_name_dict, video_id=video_id, video_search_dict = video_search_dict) if __name__ == '__main__': app.run(debug=True) </code></pre>
HTML Template
Now, it's time to craft the final piece: our Jinja-2 based HTML template code. This utilizes all the data we've transmitted through the Flask app.py file. Our first task is to exhibit the Video OCR results. The video player will encompass the entire duration of the video, and beneath it, a table will display the start, end, and text discovered during that time interval on the screen. For enhanced clarity, the timestamps will be presented in a minutes-and-seconds format, and they will be clickable, enabling us to jump to the specific timestamp and get the video playing from that point. It's important to note that I've converted the timestamps back to seconds when passing them to the JavaScript function playVideo. This is due to the function being configured to accept timestamps in a seconds-only format for video playback.
<pre><code class="language-html"><!DOCTYPE html> <html> <head> <link rel="shortcut icon" href="#" /> <title>Video OCR</title> <style> body { text-align: center; font-family: Arial, sans-serif; color: #333; background-color: #f5f5f5; } h1, h2 { color: #444; } table { margin: 0 auto; border-collapse: collapse; width: 80%; margin-top: 20px; } th, td { border: 1px solid #ddd; padding: 8px; text-align: center; } th { padding-top: 12px; padding-bottom: 12px; text-decoration: underline; color: black; } video { width: 40%; height: auto; margin-top: 20px; } /* search style */ .video-container { text-align: center; margin-bottom: 2em; padding: 1em; background-color: #fff; border: 1px solid #ddd; border-radius: 4px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); } table { margin: 0 auto; margin-bottom: 1em; } th, td { padding: 0.5em; border: 1px solid #ddd; } </style> <script> function playVideo(timeString) { var timeParts = timeString.split(":"); var time = parseInt(timeParts[0]) * 60 + parseInt(timeParts[1]); var video = document.querySelector('#mainVideo'); video.currentTime = time; video.play(); } </script> </head> <body> <h1>Video OCR</h1> <h3>Video file: <i>{{ video_id_name_dict[video_id]}}</i></h3> <video id="mainVideo" controls> <source src="{{ url_for('static', filename=video_id_name_dict[video_id]|string) }}" type="video/mp4"> Your browser does not support the video tag. </video> <br /> <br /> <br /> <table> <tr> <th>Start</th> <th>End</th> <th>Value</th> </tr> {% for item in data %} <tr> <td><a href="javascript:void(0)" onclick="playVideo('{{ item['start'] }}')">{{ item['start'] }}</a></td> <td>{{ item['end'] }}</td> <td>{{ item['value'] }}</td> </tr> {% endfor %} </table> <br /> <br /> {% for video_id, results in video_search_dict.items() %} <div class="video-container"> <h1>Text-in-video Search Results</h1> <h2>Video file: <i>{{ video_id_name_dict[video_id] }}</i></h2> <h2>Entered query: <i>{{input_query}}</i></h2> {% for result in results %} <video controls preload="metadata" style="width: 40%;"> <source src="{{ url_for('static', filename=video_id_name_dict[video_id]) }}#t={{ result['start'] }},{{ result['end'] }}" type="video/mp4"> Your browser does not support the video tag. </video> <table> <tr> <th>Start</th> <th>End</th> <th>Confidence</th> <th>Text</th> </tr> <tr> <td>{{ result['start'] }}</td> <td>{{ result['end'] }}</td> <td>{{ result['confidence'] }}</td> <td>{{ result['text'] }}</td> </tr> </table> {% endfor %} </div> {% endfor %} </body> </html> </code></pre>
Running the Flask app
Awesome! let’s just run the last cell of our Jupyter notebook to launch our Flask app:
<pre><code class="python">%run app.py </code></pre>
You should see an output similar to the one below, confirming that everything went as anticipated 😊:

After clicking on the URL link http://127.0.0.1:5000, you should be greeted with the following web page:


Here's the Jupyter Notebook containing the complete code that we've put together throughout this tutorial - https://drive.google.com/drive/folders/1D97_UU2Z0lvp3y52BHV5GKkSNOQKv3Xi?usp=share_link
Outro
Anticipate more thrilling content on the horizon! If you haven't already, I warmly invite you to become part of our lively Discord community, teeming with individuals who share a fervor for multimodal AI.
See you next time,
Ankit
Crafting stellar Developer Experiences @Twelve Labs
Cheers!
Introduction
Video Optical Character Recognition (OCR) involves detecting and extracting text from video frames using computer vision and machine learning algorithms. With video OCR, you can easily sift through your video content, pinpointing the exact moments where certain words, phrases or even entire sentences make their appearance on the screen. Imagine the applications - from streamlining content search and navigation, to diving deep into content analysis, optimizing advertisement placement, summing up content, turbocharging SEO, and ensuring compliance and monitoring.
Examples of elements that can be recognized by video OCR include:
Slide content during presentations or meetings
Product names as they're showcased on screen, such as in advertisements, films, or TV shows
Athlete or team names as they're displayed on jerseys during sports broadcasts
Nametags and names visible during meetings or conferences
Scribbles on whiteboards within lecture videos
Documents captured within video footage
Handwritten texts appearing on screen
License plate numbers and building names
Subtitles, captions, and ending credits within films and interviews
In this tutorial we will explore how the Twelve Labs platform enables video OCR at two distinct levels. On the video level, we're taking on the entire video in one fell swoop, harnessing every morsel of text it holds. On the other hand, the index level approach sharpens our focus, honing in on a specific keyword or a cluster of keywords, which we'll input as natural language queries to perform a comprehensive search across a library of videos indexed on the Twelve Labs platform.
The cherry on top? With Twelve Labs API at your disposal, you can accomplish all of this without worrying about the nitty-gritty of implementing and maintaining the OCR process. We've got your back from development to infrastructure, and even ongoing support. So gear up, and let's embark on this exciting expedition into the realm of video OCR together.
Prerequisites
The Twelve Labs platform is presently in its open beta phase, and we are offering free video indexing credits for up to 10 hours upon sign-up. It'll be advantageous for you to signup and get acquainted with the foundational aspects of the Twelve Labs platform before diving into this tutorial. Things like understanding video indexing, indexing options, the Task API, and search options are vital to smoothly follow through this tutorial, all of which I've covered extensively in my first tutorial. However, if you hit a roadblock or find yourself lost at any juncture, don't hesitate to reach out. By the way, our response times on our Discord server are lightning fast 🚅🏎️⚡️ if Discord is your preferred platform.
Quick tour of the tutorial
Following our previous discourse, we will explore video OCR tackling it from two distinct angles and levels. Accordingly, I've divided this tutorial into two pivotal sections, followed by a finale where we bring everything together in a working demo web-app:
Video OCR - A three step process
The process of extracting all recognized text from a specific video entails these three steps:
Video Indexing - No surprises at this step; if you've been following along with my past tutorials, this step should feel like a familiar friend.
Retrieve the unique identifier of the video - Once Twelve Labs platform finishes indexing our video we will retrieve the unique identifier of the video we require the OCR for.
Extract the text that appears on the screen - We'll pinpoint the video by using the specific index we created and the video id associated with the video we need OCR for. The API will do the heavy lifting, serving up the results we're after.
Text-in-video Search - searching for specific text within all indexed videos
Video OCR enabled us to scrutinize an entire video and distill all instances of text. Now, the text-in-video search feature empowers us to zero in on precise moments or video snippets where the input or searched text materializes. This greatly diminishes the time spent perusing a sizable catalogue of videos, yielding accurate search results predicated on alignment of search terms with the text that becomes visible on screen during video playbacks.
In our initial tutorials, we delved into content search within indexed videos, using natural language queries and various search options like visual (audio-visual search), conversation (dialogue search), and text-in-video (OCR). In this tutorial, we're going to repurpose our approach, harnessing only OCR technology to search for text within videos. To optimize processing time and costs, we'll create an index using solely the text_in_video indexing option. Then, we'll fire off our search query with the text_in_video search option, enabling us to discover relevant text matches within the indexed videos.
Building the Demo App
To bring it all home, we'll take the data yielded by the API endpoints and showcase them on a webpage, spinning up a Flask-based demo app that serves up a simple HTML page. The result of the video OCR will be neatly tabulated, displaying timestamps and associated text, while the text search will show the query we used and the corresponding video segments we found in response.
Video OCR - A three step process
For the sake of simplicity, I've uploaded just two videos to an index using a pre-existing account. Feel free to sign up; given we're currently in open beta, you'll receive complimentary credits allowing you to index up to 10 hours of video content. If your needs extend beyond that, check out our pricing page for upgrading to the Developer plan.
Video Indexing
Here, we’re going to delve into the essential elements that we'll need to include in our Jupyter notebook. This includes the necessary imports, defining API URLs, creating the index, and uploading videos from our local file system to kick off the indexing process:
<pre><code class="python">%env API_URL = https://api.twelvelabs.io/v1.1 %env API_KEY= tlk_2FGGACN2TFAH1N2H1HBXR0BDQ9GV !pip install requests import os import requests import glob from pprint import pprint # Retrieve the URL of the API and my API key API_URL = os.getenv("API_URL") assert API_URL API_KEY = os.getenv("API_KEY") assert API_KEY </code></pre>
<pre><code class="python"># Construct the URL of the `/indexes` endpoint INDEXES_URL = f"{API_URL}/indexes" # Set the header of the request default_header = { "x-api-key": API_KEY } # Define a function to create an index with a given name def create_index(index_name, index_options, engine): # Declare a dictionary named data data = { "engine_id": engine, "index_options": index_options, "index_name": index_name, } # Create an index response = requests.post(INDEXES_URL, headers=default_header, json=data) # Store the unique identifier of your index INDEX_ID = response.json().get('_id') # Check if the status code is 201 and print success if response.status_code == 201: print(f"Status code: {response.status_code} - The request was successful and a new index was created.") else: print(f"Status code: {response.status_code}") pprint(response.json()) return INDEX_ID # Create the indexes INDEX_ID = create_index(index_name = "extract_text", index_options=["text_in_video"], engine = "marengo2.5") # Print the created index IDs print(f"Created index IDs: {INDEX_ID}") </code></pre>
Uploading two videos to the index we've just created. The videos are titled "A Brief History of Film" (courtesy of Film Thought Project, available at https://www.youtube.com/watch?v=utntGgcsZWI) and "GPT - Explained!" (courtesy of CodeEmporium, available at https://www.youtube.com/watch?v=3IweGfgytgY). I have downloaded these videos from their respective YouTube channels and saved them in a folder named 'static' on my local hard drive. We'll use these local files to index the videos onto the Twelve Labs platform:
<pre><code class="python">import os import requests from concurrent.futures import ThreadPoolExecutor TASKS_URL = f"{API_URL}/tasks" TASK_ID_LIST = [] video_folder = 'static' # folder containing the video files def upload_video(file_name): # Validate if a video already exists in the index task_list_response = requests.get( TASKS_URL, headers=default_header, params={"index_id": INDEX_ID, "filename": file_name}, ) if "data" in task_list_response.json(): task_list = task_list_response.json()["data"] if len(task_list) > 0: if task_list[0]['status'] == 'ready': print(f"Video '{file_name}' already exists in index {INDEX_ID}") else: print("task pending or validating") return # Proceed further to create a new task to index the current video if the video didn't exist in the index already print("Entering task creation code for the file: ", file_name) if file_name.endswith('.mp4'): # Make sure the file is an MP4 video file_path = os.path.join(video_folder, file_name) # Get the full path of the video file with open(file_path, "rb") as file_stream: data = { "index_id": INDEX_ID, "language": "en" } file_param = [ ("video_file", (file_name, file_stream, "application/octet-stream")),] #The video will be indexed on the platform using the same name as the video file itself. response = requests.post(TASKS_URL, headers=default_header, data=data, files=file_param) TASK_ID = response.json().get("_id") TASK_ID_LIST.append(TASK_ID) # Check if the status code is 201 and print success if response.status_code == 201: print(f"Status code: {response.status_code} - The request was successful and a new resource was created.") else: print(f"Status code: {response.status_code}") print(f"File name: {file_name}") pprint(response.json()) print("\n") # Get list of video files video_files = [f for f in os.listdir(video_folder) if f.endswith('.mp4')] # Create a ThreadPoolExecutor with ThreadPoolExecutor() as executor: # Use executor to run upload_video in parallel for all video files executor.map(upload_video, video_files) </code></pre>
Retrieve the unique identifier of the video
Now let's enumerate all the videos in our index. This allows us to retain the video ID of a specific video, the goal being to extract all the text embedded within it. Furthermore, akin to our methods in prior tutorials, I'm assembling a list of video IDs and their respective titles, designed to be subsequently fed into our Flask application.
<pre><code class="python"># List all the videos in an index default_header = { "x-api-key": API_KEY } INDEX_ID='644a73aa8b1dd6cde172a933' INDEXES_VIDEOS_URL = f"{API_URL}/indexes/{INDEX_ID}/videos" response = requests.get(INDEXES_VIDEOS_URL, headers=default_header) response_json = response.json() pprint(response_json) video_id_name_list = [{'video_id': video['_id'], 'video_name': video['metadata']['filename']} for video in response_json['data']] pprint(video_id_name_list) </code></pre>
Output:
<pre><code class="python">{'data': [{'_id': '###a917186daab572f349243', 'created_at': '2023-04-27T14:18:48Z', 'metadata': {'duration': 1300.173875, 'engine_id': 'marengo2.5', 'filename': 'A Brief History of Film.mp4', 'fps': 23.976023976023978, 'height': 720, 'size': 188214297, 'width': 1280}, 'updated_at': '2023-04-27T14:20:11Z'}, {'_id': '###3da86daab572f349241', 'created_at': '2023-04-27T13:08:19Z', 'metadata': {'duration': 550.7, 'engine_id': 'marengo2.5', 'filename': 'GPT - Explained!.mp4', 'fps': 30, 'height': 720, 'size': 22838593, 'width': 1152}, 'updated_at': '2023-04-27T13:08:42Z'}], 'page_info': {'limit_per_page': 10, 'page': 1, 'total_duration': 5402.873875, 'total_page': 1, 'total_results': 3}} [{'video_id': '###a849b86daab572f349242', 'video_name': 'A Brief History of Film.mp4'}, {'video_id': '###a73da86daab572f349241', 'video_name': 'GPT - Explained!.mp4'}] </code></pre>
Extract the text that appears on the screen
Time to put our plan into action! We'll now proceed to extract all textual content from the chosen video:
<pre><code class="python">VIDEO_ID = '###a849b86daab572f349242' TEXT_IN_VIDEO_URL = f"{API_URL}/indexes/{INDEX_ID}/videos/{VIDEO_ID}/text-in-video" response = requests.get(TEXT_IN_VIDEO_URL, headers=default_header) print (f"Status code: {response.status_code}") ocr_data = response.json() pprint (ocr_data) </code></pre>
Output:
<pre><code class="python">Status code: 200 {'data': [{'end': 3, 'start': 1, 'value': 'Film Thought Project'}, {'end': 6, 'start': 5, 'value': 'Film'}, {'end': 22, 'start': 18, 'value': "'L'arrivée d'un train en gare de La Ciotat"}, {'end': 28, 'start': 18, 'value': 'Year:'}, {'end': 28, 'start': 23, 'value': '2015'}, {'end': 28, 'start': 23, 'value': 'Production Co.'}, {'end': 28, 'start': 23, 'value': 'Alejandro G. Iñárritu'}, {'end': 28, 'start': 23, 'value': 'Regency Enterprises'}, {'end': 28, 'start': 23, 'value': "'The Revenant'"}, {'end': 30, 'start': 29, 'value': "Let's"}, {'end': 40, 'start': 32, 'value': 'Film:'}, {'end': 34, 'start': 33, 'value': 'Film Thought Project'}, {'end': 40, 'start': 35, 'value': 'Director:'}, {'end': 40, 'start': 35, 'value': 'Production Co.'}, {'end': 40, 'start': 36, 'value': 'Alfred Hitchcock'}, {'end': 40, 'start': 36, 'value': '1958'}, {'end': 40, 'start': 36, 'value': 'Alfred J. Hitchcock Productions'}, {'end': 40, 'start': 37, 'value': 'Year:'}, {'end': 40, 'start': 38, 'value': "'Vertigo'"}, {'end': 45, 'start': 44, 'value': 'PRESS START'}, {'end': 46, 'start': 45, 'value': '2020'}, {'end': 47, 'start': 46, 'value': '2018'}, {'end': 48, 'start': 47, 'value': '1975'}, {'end': 53, 'start': 49, 'value': '1870s'}, {'end': 61, 'start': 67, 'value': 'Eadweard Muybridge'}, {'end': 69, 'start': 75, 'value': 'See you soon'}], 'id': '###a849b86daab572f349242', 'index_id': '###a73aa8b1dd6cde172a933'} </code></pre>
As you can see, the API extracted all the text that appeared on screen, line by line, like a charm. You can save this text as metadata for downstream workflows such as filtering, classifying, and searching content.
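As a purely illustrative example of such a downstream workflow, here's a tiny client-side filter over the `ocr_data` response we just printed that finds every entry containing a given keyword (the `find_keyword` helper is just something I'm sketching, not part of the API):
<pre><code class="python"># A small illustrative helper: find every OCR entry whose text contains a keyword
def find_keyword(ocr_data, keyword):
    keyword = keyword.lower()
    return [
        {"start": item["start"], "end": item["end"], "text": item["value"]}
        for item in ocr_data.get("data", [])
        if keyword in item["value"].lower()
    ]

pprint(find_keyword(ocr_data, "film"))
# e.g. [{'start': 1, 'end': 3, 'text': 'Film Thought Project'}, ...]
</code></pre>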
Text-in-video Search - searching for specific text within all indexed videos
Launching our search query utilizing the text_in_video search option to uncover pertinent text matches within our collection of indexed videos:
<pre><code class="python"># Construct the URL of the `/search` endpoint SEARCH_URL = f"{API_URL}/search/" # Declare a dictionary named `data` data = { "index_id": INDEX_ID, "query": "horse", "search_options": [ "text_in_video" ] } # Make a search request response = requests.post(SEARCH_URL, headers=default_header, json=data) if response.status_code == 200: print(f"Status code: {response.status_code} - Success") else: print(f"Status code: {response.status_code}") pprint(response.json()) search_data = response.json() </code></pre>
Output:
<pre><code class="python">Status code: 200 - Success {'data': [{'confidence': 'high', 'end': 64, 'metadata': [{'text': 'THE HORSE IN MOTION.', 'type': 'text_in_video'}], 'score': 92.28, 'start': 63, 'video_id': '###a849b86daab572f349242'}, {'confidence': 'high', 'end': 91, 'metadata': [{'text': 'THE HORSE IN MOTION.', 'type': 'text_in_video'}], 'score': 92.28, 'start': 88, 'video_id': '###a849b86daab572f349242'}], 'page_info': {'limit_per_page': 10, 'page_expired_at': '2023-05-12T00:03:43Z', 'total_results': 2}, 'search_pool': {'index_id': '###a73aa8b1dd6cde172a933', 'total_count': 3, 'total_duration': 5403}} </code></pre>
💡Bear in mind that the text-in-video search feature locates all occurrences within the indexed videos where the input query aligns (not necessarily word-for-word) with the text visually presented on screen as the video plays. For instance, if I enter "horse moving," the system will still identify instances where the on-screen text reads "horse in motion," but the confidence of that match will be lower than if I had entered "horse in motion" itself. The confidence level depends on the proportion of query words that match the on-screen text: a match on two out of three words will yield a higher confidence than a match on only one word. We'll verify this with a quick code sketch after the screenshots below.

A peek at Twelve Labs Playground's text-in-video search results for a given query

The specific video instance aligning with the input query being played

The model's confidence increases as soon as the query aligns with the on-screen text
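If you'd rather verify this behavior from the notebook than from the playground, you can fire the same search request with two different queries and compare the scores the API returns. Here's a quick sketch reusing the `SEARCH_URL`, headers, and index from above (the `search_scores` helper is mine, and the exact scores you see will depend on your own index):
<pre><code class="python"># A quick sketch: compare how closely two queries match the on-screen text "THE HORSE IN MOTION."
def search_scores(query):
    payload = {
        "index_id": INDEX_ID,
        "query": query,
        "search_options": ["text_in_video"]
    }
    response = requests.post(SEARCH_URL, headers=default_header, json=payload)
    return [(item["score"], item["confidence"], item["metadata"][0]["text"])
            for item in response.json().get("data", [])]

pprint(search_scores("horse in motion"))  # closer match, higher scores expected
pprint(search_scores("horse moving"))     # partial match, lower confidence expected
</code></pre>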
Preparing the data for the Flask application to ensure our results will be presented neatly:
<pre><code class="python">video_data = [{'start': d['start'], 'end': d['end'], 'confidence': d['confidence'], 'text': d['metadata'][0]['text']} for d in search_data['data']] video_search_dict = {} for vd in video_data: if search_data['data'][0]['video_id'] in video_search_dict: video_search_dict[search_data['data'][0]['video_id']].append(vd) else: video_search_dict[search_data['data'][0]['video_id']] = [vd] pprint(video_search_dict) </code></pre>
Output:
<pre><code class="python"> {'###a849b86daab572f349242': [{'confidence': 'high', 'end': 64, 'start': 63, 'text': 'THE HORSE IN MOTION.'}, {'confidence': 'high', 'end': 91, 'start': 88, 'text': 'THE HORSE IN MOTION.'}]} </code></pre>
Further data preparation for the video OCR results, followed by our standard procedure of pickling everything:
<pre><code class="python">video_id = ocr_data.get('id') data_list = ocr_data.get('data') data_to_save = { 'video_id': video_id, 'data_list': data_list, 'video_id_name_list': video_id_name_list, 'video_search_dict': video_search_dict } import pickle # Save data to a pickle file with open('data.pkl', 'wb') as f: pickle.dump(data_to_save, f) </code></pre>
Building the Demo App
We're now at the final leg of our video OCR adventure - bringing together all elements to animate our results. Besides the standard configuration we implement for fetching videos from the local folder and loading the pickled data dispatched from the Jupyter notebook, this time we have some additional requirements - a conversion of timestamps from a seconds-only format to a minutes-and-seconds format. This makes the data visualization on the webpage more intuitive. Here's the code for the app.py file:
<pre><code class="python">from flask import Flask, render_template, send_from_directory import pickle import os from collections import defaultdict app = Flask(__name__) # Load data from a pickle file with open('data.pkl', 'rb') as f: loaded_data = pickle.load(f) # Access the data video_id = loaded_data['video_id'] data_list = loaded_data['data_list'] video_id_name_list = loaded_data['video_id_name_list'] video_search_dict = loaded_data['video_search_dict'] VIDEO_DIRECTORY = os.path.join(os.path.dirname(os.path.realpath(__file__)), "static") @app.route('/<path:filename>') def serve_video(filename): print(VIDEO_DIRECTORY, filename) return send_from_directory(directory=VIDEO_DIRECTORY, path=filename) @app.route('/') def home(): for item in data_list: if ":" not in str(item['start']): item['start'] = int(item['start']) item['start'] = f"{item['start'] // 60}:{item['start'] % 60:02}" if ":" not in str(item['end']): item['end'] = int(item['end']) item['end'] = f"{item['end'] // 60}:{item['end'] % 60:02}" video_id_name_dict = {video['video_id']: video['video_name'] for video in video_id_name_list} # video_name = video_id_name_dict.get(video_id) return render_template('index.html', data=data_list[:10], video_id_name_dict=video_id_name_dict, video_id=video_id, video_search_dict = video_search_dict) if __name__ == '__main__': app.run(debug=True) </code></pre>
HTML Template
Now, it's time to craft the final piece: our Jinja2-based HTML template. This utilizes all the data we've transmitted through the Flask app.py file. Our first task is to exhibit the Video OCR results. The video player will encompass the entire duration of the video, and beneath it, a table will display the start, end, and text discovered during that time interval on the screen. For enhanced clarity, the timestamps are presented in a minutes-and-seconds format, and they are clickable, enabling us to jump to a specific timestamp and start the video playing from that point. It's important to note that the playVideo JavaScript function converts these timestamps back to seconds before seeking, because the video element's currentTime property expects a value in seconds.
<pre><code class="language-html"><!DOCTYPE html> <html> <head> <link rel="shortcut icon" href="#" /> <title>Video OCR</title> <style> body { text-align: center; font-family: Arial, sans-serif; color: #333; background-color: #f5f5f5; } h1, h2 { color: #444; } table { margin: 0 auto; border-collapse: collapse; width: 80%; margin-top: 20px; } th, td { border: 1px solid #ddd; padding: 8px; text-align: center; } th { padding-top: 12px; padding-bottom: 12px; text-decoration: underline; color: black; } video { width: 40%; height: auto; margin-top: 20px; } /* search style */ .video-container { text-align: center; margin-bottom: 2em; padding: 1em; background-color: #fff; border: 1px solid #ddd; border-radius: 4px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); } table { margin: 0 auto; margin-bottom: 1em; } th, td { padding: 0.5em; border: 1px solid #ddd; } </style> <script> function playVideo(timeString) { var timeParts = timeString.split(":"); var time = parseInt(timeParts[0]) * 60 + parseInt(timeParts[1]); var video = document.querySelector('#mainVideo'); video.currentTime = time; video.play(); } </script> </head> <body> <h1>Video OCR</h1> <h3>Video file: <i>{{ video_id_name_dict[video_id]}}</i></h3> <video id="mainVideo" controls> <source src="{{ url_for('static', filename=video_id_name_dict[video_id]|string) }}" type="video/mp4"> Your browser does not support the video tag. </video> <br /> <br /> <br /> <table> <tr> <th>Start</th> <th>End</th> <th>Value</th> </tr> {% for item in data %} <tr> <td><a href="javascript:void(0)" onclick="playVideo('{{ item['start'] }}')">{{ item['start'] }}</a></td> <td>{{ item['end'] }}</td> <td>{{ item['value'] }}</td> </tr> {% endfor %} </table> <br /> <br /> {% for video_id, results in video_search_dict.items() %} <div class="video-container"> <h1>Text-in-video Search Results</h1> <h2>Video file: <i>{{ video_id_name_dict[video_id] }}</i></h2> <h2>Entered query: <i>{{input_query}}</i></h2> {% for result in results %} <video controls preload="metadata" style="width: 40%;"> <source src="{{ url_for('static', filename=video_id_name_dict[video_id]) }}#t={{ result['start'] }},{{ result['end'] }}" type="video/mp4"> Your browser does not support the video tag. </video> <table> <tr> <th>Start</th> <th>End</th> <th>Confidence</th> <th>Text</th> </tr> <tr> <td>{{ result['start'] }}</td> <td>{{ result['end'] }}</td> <td>{{ result['confidence'] }}</td> <td>{{ result['text'] }}</td> </tr> </table> {% endfor %} </div> {% endfor %} </body> </html> </code></pre>
Running the Flask app
Awesome! Let's run the last cell of our Jupyter notebook to launch our Flask app:
<pre><code class="python">%run app.py </code></pre>
You should see an output similar to the one below, confirming that everything went as anticipated 😊:

After clicking on the URL link http://127.0.0.1:5000, you should be greeted with the following web page:


Here's the Jupyter Notebook containing the complete code that we've put together throughout this tutorial - https://drive.google.com/drive/folders/1D97_UU2Z0lvp3y52BHV5GKkSNOQKv3Xi?usp=share_link
Outro
Anticipate more thrilling content on the horizon! If you haven't already, I warmly invite you to become part of our lively Discord community, teeming with individuals who share a fervor for multimodal AI.
See you next time,
Ankit
Crafting stellar Developer Experiences @Twelve Labs
Cheers!