Are you struggling to understand video content in different languages? Or perhaps finding it difficult to make your content accessible to a global audience?
In this tutorial, we'll introduce you to the MultiLingual Video Transcriber Application and explain how it was developed as a solution. This application uses AI models from Twelve Labs to understand videos and provide seamless transcription across multiple languages.
What sets this program apart is its ability to adjust transcriptions based on user-selected proficiency levels: beginner, intermediate, and advanced. Users get transcriptions or translations tailored to their chosen level. Additionally, the application provides accurate timestamps, allowing users to track spoken words with their transcriptions. This feature enables easy navigation and comprehension of the content. Let's explore how this application works and how you can build similar solutions using the TwelveLabs Python SDK.
You can explore the demo of the application here: Video Multilingual Transcriber
If you want to access the code and experiment with the app directly, you can use this Replit Template.
This section outlines the application flow for developing the MultiLingual Video Transcriber Application. The process involves obtaining a video and generating transcriptions in various languages and proficiency levels based on user preferences. This application goes beyond simple transcription, offering a more comprehensive solution.
The system architecture comprises four main components: the Frontend Layer, Backend Layer, Storage Layer, and TwelveLabs Service. Here's how it works: users upload a video (potentially in a foreign language) and select their desired transcript language and proficiency level (beginner, intermediate, or advanced).
Upon clicking the submit button after video upload, an Index ID is generated and stored in the session state for future use. Next, the system creates a Task ID by uploading the video, which then yields a Video ID once indexing is complete. The indexing process utilizes Marengo 2.6 as the embedding engine, while the Pegasus 1.1 Engine (Generative Engine) handles transcript generation through the Generate API.
To enhance user accessibility, the application generates timestamps alongside the transcript. This feature allows for interactive synchronization between the transcript and the video playback.
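Condensed into SDK terms, the whole flow looks like the following sketch. The names here (client, index_name, engines, filepath, prompt) mirror the application code developed later in this tutorial:

# End-to-end flow, condensed (see the full implementation below)
index = client.index.create(name=index_name, engines=engines)       # once per session
task = client.task.create(index_id=index.id, file=filepath)         # upload and index the video
task.wait_for_done(sleep_interval=5)                                # poll until indexing completes
res = client.generate.text(video_id=task.video_id, prompt=prompt)   # generate the transcript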
Create a .env file with your API Key, placed alongside the main application file:
API_KEY=your_api_key_here
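As an optional sanity check (not part of the application itself), you can confirm the key loads correctly before starting the app:

from dotenv import load_dotenv
import os

load_dotenv()  # reads .env from the current working directory
assert os.getenv("API_KEY"), "API_KEY is missing - check your .env file"
print("API key loaded.")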
With these steps completed, you're now ready to dive in and develop the application!
In this tutorial, we'll build a Flask application with a minimal frontend. Here's the directory structure:
.
├── app.py
├── requirements.txt
├── static
│   ├── style.css
│   └── main.js
├── templates
│   └── index.html
└── uploads
Now that you've completed the previous steps, it's time to build the Flask application. This will provide a simple way for you to upload videos and generate transcripts in various languages based on user preferences.
You can find the required dependencies in requirements.txt.
Create a Python virtual environment, then install the application's dependencies with the following command:
pip install -r requirements.txt
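For reference, based on the modules imported in app.py, the requirements file should contain at least the following packages (a sketch; pin versions as appropriate for your setup):

flask
twelvelabs
python-dotenv
werkzeug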
This section focuses on the main application utility function, which contains the core logic and control flow. We'll break the main application down into several sections:
Here, we'll discuss how the index is configured using the Twelve Labs SDK. This Flask application allows users to upload and process videos for various purposes. It employs a secure filename handling system and session management to ensure reliable operation in a production environment.
# Import the necessary modules
from flask import Flask, render_template, request, jsonify, send_from_directory, session
from werkzeug.utils import secure_filename
import os
import uuid
from twelvelabs import TwelveLabs
from twelvelabs.models.task import Task
from dotenv import load_dotenv

load_dotenv()

# Load the Twelve Labs API key from the environment
API_KEY = os.getenv("API_KEY")

app = Flask(__name__)
app.secret_key = os.urandom(24)

UPLOAD_FOLDER = 'uploads'
ALLOWED_EXTENSIONS = {'mp4', 'avi', 'mov'}
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
app.config['MAX_CONTENT_LENGTH'] = 100 * 1024 * 1024  # 100 MB upload limit

# Initialize the Twelve Labs SDK client
client = TwelveLabs(api_key=API_KEY)

def allowed_file(filename):
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

@app.route('/')
def index():
    return render_template('index.html')

# Utility callback to log the indexing task status
def on_task_update(task: Task):
    print(f"Status={task.status}")

def process_video(filepath, language, difficulty):
    try:
        if 'index_id' not in session:
            # Unique index name per session
            index_name = f"Translate{uuid.uuid4().hex[:8]}"
            # Engine configuration
            engines = [
                {
                    "name": "pegasus1.1",
                    "options": ["visual", "conversation"]
                },
                {
                    "name": "marengo2.6",
                    "options": ["visual", "conversation", "text_in_video", "logo"]
                }
            ]
            # Create the index with the engine configuration
            index = client.index.create(
                name=index_name,
                engines=engines
            )
            # Store the Index ID in the session for reuse
            session['index_id'] = index.id
            print(f"Created new index with ID: {index.id}")
        else:
            print(f"Using existing index with ID: {session['index_id']}")

        # Create the indexing task for the uploaded video
        task = client.task.create(index_id=session['index_id'], file=filepath)
        task.wait_for_done(sleep_interval=5, callback=on_task_update)
        if task.status != "ready":
            raise RuntimeError(f"Indexing failed with status {task.status}")
        print(f"The unique identifier of your video is {task.video_id}.")
We manage the session state to handle the creation of the Index ID robustly. A random UUID suffix is appended to the index name, ensuring a unique name for each new session. The new Index ID is saved in the session for future reference.
The index configuration defines two engines: Marengo 2.6 (Embedding Engine) for indexing the video, and Pegasus 1.1 (Generative Engine) for accessing the indexed video and generating content based on open-ended prompts or other parameters.
To create a task, we provide the Index ID and video file path. We can track the task's processing status using task.status. Once indexing is complete, the resulting Video ID is used in the next step.
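If you prefer explicit control over polling instead of wait_for_done, a manual loop along the following lines should work. This is a sketch that assumes the SDK's client.task.retrieve method for re-fetching a task's state by ID:

import time

def wait_until_ready(client, task_id, interval=5):
    """Poll the indexing task until it reaches a terminal state (sketch)."""
    while True:
        task = client.task.retrieve(task_id)  # assumed: re-fetches the current task state
        print(f"Status={task.status}")
        if task.status in ("ready", "failed"):
            return task
        time.sleep(interval)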
This section covers the generation of text from the indexed video based on user-selected difficulty levels and language preferences. Our prompt engineering system adapts the video transcript's complexity and detail level to match the user's comprehension level while maintaining accurate timestamps and translations.
        # Prompt templates keyed by difficulty level
        difficulty_prompts = {
            "beginner": "Provide a simplified and easy-to-understand",
            "intermediate": "Provide a moderately detailed",
            "advanced": "Provide a comprehensive and detailed"
        }
        # Select the template for the user's difficulty, defaulting to intermediate
        base_prompt = difficulty_prompts.get(difficulty, difficulty_prompts["intermediate"])
        # Open-ended prompt for generation
        prompt = f"Provide the Only Transcript in the Translated {language.capitalize()} Language, {base_prompt} level with the timestamp duration (in the format of ss : ss) of the Indexed Video Content."
        res = client.generate.text(video_id=task.video_id, prompt=prompt, temperature=0.25)
        print(res)
        return {
            'status': 'ready',
            'message': 'File processed successfully',
            'transcript': res.data,
            'video_path': f'/uploads/{os.path.basename(filepath)}'
        }
    except Exception as e:
        print(f"Error processing video: {str(e)}")
        return {'status': 'error', 'message': str(e)}
We use a dictionary-based approach to map difficulty levels to appropriate prompt templates, with a default fallback to the intermediate level. The system maintains a low temperature value of 0.25 to ensure consistent and reliable output generation. The response includes the processed transcript with timestamps.
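To see how the difficulty mapping shapes the final prompt, here is a standalone snippet that reproduces the prompt construction for each level, including the fallback for an unrecognized value:

difficulty_prompts = {
    "beginner": "Provide a simplified and easy-to-understand",
    "intermediate": "Provide a moderately detailed",
    "advanced": "Provide a comprehensive and detailed"
}

language = "german"
for difficulty in ("beginner", "intermediate", "advanced", "expert"):
    # Unknown levels such as "expert" fall back to the intermediate template
    base_prompt = difficulty_prompts.get(difficulty, difficulty_prompts["intermediate"])
    prompt = (
        f"Provide the Only Transcript in the Translated {language.capitalize()} Language, "
        f"{base_prompt} level with the timestamp duration (in the format of ss : ss) "
        f"of the Indexed Video Content."
    )
    print(f"{difficulty}: {prompt}")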
In this section, we'll explore how we generate and manage secure file upload endpoints and handle user file submissions in our Flask application. This component manages the file upload process, validates file types, handles language and difficulty preferences, and ensures secure storage of uploaded videos while maintaining proper error handling throughout the workflow.
# Route to handle file uploads - only accepts POST requests
@app.route('/upload', methods=['POST'])
def upload_file():
    # Check if a file was included in the request
    if 'file' not in request.files:
        return jsonify({'status': 'error', 'message': 'No file part'}), 400
    # Get the file from the request
    file = request.files['file']
    # Extract language and difficulty preferences from the form data,
    # defaulting to German and intermediate if not specified
    language = request.form.get('language', 'german')
    difficulty = request.form.get('difficulty', 'intermediate')
    # Validate that a file was actually selected
    if file.filename == '':
        return jsonify({'status': 'error', 'message': 'No selected file'}), 400
    # Check if the file extension is allowed and proceed with processing
    if file and allowed_file(file.filename):
        filename = secure_filename(file.filename)
        filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
        file.save(filepath)
        # Process the video with the specified language and difficulty
        result = process_video(filepath, language, difficulty)
        # Return the processing result as JSON
        return jsonify(result)
    # Return an error if the file type is not allowed
    return jsonify({'status': 'error', 'message': 'File type not allowed'}), 400

# Serve uploaded files
@app.route('/uploads/<filename>')
def uploaded_file(filename):
    return send_from_directory(app.config['UPLOAD_FOLDER'], filename)
We provide an upload endpoint that processes incoming video files and their associated metadata. The upload route uses Werkzeug's secure_filename utility for safe filename handling and implements robust error checking with appropriate HTTP status codes. If no language or difficulty preference is set, the system defaults to German and intermediate difficulty. The file serving route allows secure access to uploaded files from a designated directory, ensuring proper access control and file system security.
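To exercise the endpoint end to end, you could post a file to it with a small client script. This is a sketch that assumes the requests library is installed, the Flask development server is running at its default http://localhost:5000, and a test clip named sample.mp4 exists:

import requests

# sample.mp4 is a hypothetical test clip; use any .mp4/.avi/.mov under 100 MB
with open("sample.mp4", "rb") as f:
    response = requests.post(
        "http://localhost:5000/upload",
        files={"file": ("sample.mp4", f, "video/mp4")},
        data={"language": "french", "difficulty": "beginner"},
    )

result = response.json()
print(result["status"])
if result["status"] == "ready":
    print(result["transcript"])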
This section explores the key utility functions found in the main.js file. This JavaScript file manages file uploads, form submissions, video playback synchronization with time-stamped transcripts, error handling, and other application parameters. We'll break main.js down into two sections:
This section provides an overview of a comprehensive client-side file validation and submission system. It implements essential functionality for validating video files, handling file removals, and managing asynchronous form submissions throughout the upload and processing workflow.
// Validates the uploaded file
function validateFile(file) {
    const validTypes = ['video/mp4', 'video/avi', 'video/quicktime'];
    const maxSize = 100 * 1024 * 1024; // 100MB
    if (!validTypes.includes(file.type)) {
        updateStatus('Please select a valid video file (MP4, AVI, or MOV)', 'error');
        return false;
    }
    if (file.size > maxSize) {
        updateStatus('File size must be less than 100MB', 'error');
        return false;
    }
    return true;
}

// Utility function to handle the removal of a selected file
function handleFileRemove(e) {
    e.preventDefault();
    fileInput.value = '';
    selectedFile.classList.add('hidden');
    uploadPrompt.textContent = 'Choose a video or drag it here';
    updateStatus('', '');
}

// Form submission handling
async function handleFormSubmit(e) {
    e.preventDefault();
    const formData = new FormData(e.target);
    if (!fileInput.files || !fileInput.files[0]) {
        updateStatus('Please select a file first', 'error');
        return;
    }
    const loadingOverlay = document.getElementById('loading-overlay');
    loadingOverlay.classList.remove('hidden');
    updateStatus('Uploading file and processing...', 'loading');
    // Hide the previous results
    hideResult();
    try {
        // Send a POST request to the upload endpoint with the form data
        const response = await fetch('/upload', {
            method: 'POST',
            body: formData
        });
        // Parse the JSON response
        const data = await response.json();
        console.log('Server response:', data);
        // Check if upload and processing were successful
        if (response.ok && data.status === 'ready') {
            // Update status and display results
            updateStatus('Processing complete!', 'success');
            showResult();
            displayTranscript(data.transcript);
            displayVideo(data.video_path);
        } else {
            // Throw an error if processing failed
            throw new Error(data.message || 'An error occurred during processing');
        }
    } catch (error) {
        // Log and display errors
        console.error('Error:', error);
        updateStatus(`Error: ${error.message}`, 'error');
    } finally {
        // Hide the loading overlay regardless of outcome
        loadingOverlay.classList.add('hidden');
    }
}
The validateFile function ensures only supported video formats and file sizes are accepted, while handleFileRemove resets the file selection state upon removal. The handleFormSubmit function enables asynchronous uploads, comprehensively managing errors, loading states, and user feedback.
The process features appropriate UI updates, including loading overlays, success messages, and error notifications, at various stages. To maintain a smooth user experience, the implementation ensures consistent error handling and validation throughout the uploading and processing phases.
This section focuses on developing a robust JavaScript transcript parsing system. The code handles various timestamp formats and text patterns common in video transcripts, ensuring reliable extraction and structuring of time-coded content.
The parser converts raw transcript data into a format suitable for display and synchronization with video playback. It also manages multiple edge cases and format variations that may arise from the transcription process.
// Parses a transcript string into structured data with timestamps and text
function parseTranscript(transcript) {
    console.log('Raw transcript:', transcript);
    // Initialize a Map to store unique entries (prevents duplicates)
    const entries = new Map();
    if (!transcript) {
        console.error('Empty transcript received');
        return [];
    }
    // Extract data from the JSON response and handle escapes
    let transcriptText = transcript;
    try {
        if (typeof transcript === 'string' && (transcript.includes('"id":') || transcript.includes("'id':"))) {
            const dataMatch = transcript.match(/['"]data['"]\s*:\s*['"]([^]+?)['"]\s*$/);
            if (dataMatch && dataMatch[1]) {
                transcriptText = dataMatch[1]
                    .replace(/\\n/g, '\n')
                    .replace(/\\'/g, "'")
                    .replace(/\\"/g, '"')
                    .replace(/\\\\/g, '\\');
            }
        }
    } catch (e) {
        console.error('Error parsing JSON response:', e);
    }
    // Different timestamp patterns
    const patterns = [
        // MM:SS - MM:SS : "text"
        /(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2})\s*:\s*["']([^"']+)["']/g,
        // MM:SS - MM:SS: text
        /(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2})\s*:\s*([^"\n]+)/g,
        // Simple format with quotes
        /(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2})\s*["']([^"']+)["']/g
    ];
    // Try each pattern against the transcript text
    for (const pattern of patterns) {
        let match;
        while ((match = pattern.exec(transcriptText)) !== null) {
            try {
                const [_, startMin, startSec, endMin, endSec, text] = match;
                // Convert timestamps to seconds
                const startTime = parseInt(startMin) * 60 + parseInt(startSec);
                const endTime = parseInt(endMin) * 60 + parseInt(endSec);
                // Skip invalid timestamps
                if (isNaN(startTime) || isNaN(endTime)) continue;
                // Clean up the matched transcript text
                const cleanText = text
                    .replace(/^["'\s]+|["'\s]+$/g, '')
                    .replace(/\\n/g, ' ')
                    .replace(/\*\*/g, '')
                    .replace(/\\'/g, "'")
                    .replace(/\\"/g, '"')
                    .replace(/\\\\/g, '\\')
                    .replace(/\s+/g, ' ')
                    .trim();
                if (cleanText && !cleanText.includes('Note:')) {
                    const key = `${startTime}-${cleanText.substring(0, 50)}`;
                    entries.set(key, {
                        start: startTime,
                        end: endTime,
                        text: cleanText
                    });
                }
            } catch (e) {
                console.error('Error processing match:', e);
                continue;
            }
        }
    }
    // Convert the Map to an array and sort by start time
    const sortedEntries = Array.from(entries.values())
        .sort((a, b) => a.start - b.start);
    // Log the processed result
    console.log('Parsed entries:', sortedEntries);
    return sortedEntries;
}
For consistent processing, the system employs regular expressions to match different timestamp patterns, converts time markers to seconds, and performs comprehensive text cleaning to remove unwanted artifacts. A Map structure prevents duplicate entries and maintains chronological order. Detailed logging is implemented throughout the parsing process for debugging purposes.
The final output is a list of transcript entries containing start time, end time, and cleaned text content, ready for integration with video players and transcript display components.
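Although the production parser lives in JavaScript, the same timestamp-extraction idea is easy to prototype server-side. Here is a minimal Python sketch, illustrative only, covering just the MM:SS - MM:SS : text pattern:

import re

# Minimal pattern: MM:SS - MM:SS : text (the JS parser handles more variants)
PATTERN = re.compile(r'(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2})\s*:\s*(.+)')

def parse_transcript(text):
    """Extract (start_seconds, end_seconds, text) entries from a raw transcript."""
    entries = []
    for line in text.splitlines():
        match = PATTERN.search(line)
        if not match:
            continue
        sm, ss, em, es, body = match.groups()
        entries.append((
            int(sm) * 60 + int(ss),     # start time in seconds
            int(em) * 60 + int(es),     # end time in seconds
            body.strip().strip('"\''),  # strip surrounding quotes
        ))
    return sorted(entries, key=lambda e: e[0])

sample = '00:00 - 00:12 : "Willkommen zu diesem Video."\n00:12 - 00:25 : "Heute lernen wir etwas Neues."'
print(parse_transcript(sample))
# [(0, 12, 'Willkommen zu diesem Video.'), (12, 25, 'Heute lernen wir etwas Neues.')]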
You can find the complete version of the JavaScript file containing the code discussed above in main.js.
First, the user selects the target language for translation and the desired difficulty level. Then, they upload the video. Once the upload is complete, the indexing process begins.
After indexing and task creation are finished, the video ID is used to generate the transcription based on the user's preferences. The demo below showcases this generation process using the Twelve Labs SDK.
To explore Twelve Labs further, try generating video-based use cases in various sectors such as content creation, education, or any other area that interests you.
Understanding how multilingual video transcription applications work and how they're developed allows you to implement innovative ideas and create products that address a wide variety of video-related needs. Here are some use cases that could benefit users working with video content:
Global Content Creators: Generate transcriptions in multiple languages simultaneously, enabling instant content localization for video content.
International Education: Make educational content accessible by automatically transcribing lectures into various languages and difficulty levels. This can also be leveraged for language learning from video content where resources are limited.
Cross-Cultural Business: Facilitate communication in multinational settings by generating meeting transcripts.
Thank you for following along with this tutorial on the development and functionality of the Video Multi-Lingual Transcriber application with Twelve Labs. We hope this guide sparks ideas for combining video understanding with your own application workflows. We welcome your ideas on how to enhance the user experience and address any challenges.
Learn more about the engines used in this tutorial, Marengo 2.6 (Embedding Engine) and Pegasus 1.1 (Generative Engine). To further explore Twelve Labs and enhance your understanding of video content analysis, check out these valuable resources:
We encourage you to leverage these resources to expand your knowledge and create innovative applications using Twelve Labs video understanding technology.
Introducing our new Embed API in Open Beta, enabling customers to generate state-of-the-art multimodal embeddings.
Learn how to build a semantic video search engine with the powerful integration of Twelve Labs' Embed API with ApertureDB for advanced semantic video search.
Whether you're looking to find the perfect berry-toned lipstick or just curious about spotting specific colors in your videos, this guide will help you leverage cutting-edge AI to do so effortlessly.
Leverage Twelve Labs Embed API and LanceDB to create AI applications that can process and analyze video content with unprecedented accuracy and efficiency.