Author
Hrishikesh Yadav
Date Published
November 19, 2024
Tags
API Tutorial
Applications
Developers
Generate API
Open Source
Video Language Models

Introduction

Are you struggling to understand video content in different languages? Or perhaps finding it difficult to make your content accessible to a global audience? 🌍

In this tutorial, we'll introduce you to the MultiLingual Video Transcriber Application and explain how it was developed as a solution. This application uses AI models from Twelve Labs to understand videos and provide seamless transcription across multiple languages.

What sets this application apart is its ability to adjust transcriptions based on user-selected proficiency levels: beginner, intermediate, and advanced. Users get transcriptions or translations tailored to their chosen level. Additionally, the application provides accurate timestamps, allowing users to follow the spoken words alongside the transcription, which makes the content easy to navigate and comprehend. Let's explore how this application works and how you can build similar solutions using the TwelveLabs Python SDK.

You can explore the demo of the application here: Video Multilingual Transcriber

If you want to access the code and experiment with the app directly, you can use this Replit Template.

‍

Prerequisites
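To follow along, you'll want a Twelve Labs account with an API key (available from the Twelve Labs Playground), a working Python 3 environment, and basic familiarity with Flask and JavaScript.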

Working of the Application

This section outlines the application flow for developing the MultiLingual Video Transcriber Application. The process involves obtaining a video and generating transcriptions in various languages and proficiency levels based on user preferences. This application goes beyond simple transcription, offering a more comprehensive solution.

The system architecture comprises four main components: the Frontend Layer, Backend Layer, Storage Layer, and TwelveLabs Service. Here's how it works: users upload a video (potentially in a foreign language) and select their desired transcript language and proficiency level (beginner, intermediate, or advanced).

Upon clicking the submit button after video upload, an Index ID is generated and stored in the session state for future use. Next, the system creates a Task ID by uploading the video, which then yields a Video ID once indexing is complete. The indexing process utilizes Marengo 2.6 as the embedding engine, while the Pegasus 1.1 Engine (Generative Engine) handles transcript generation through the Generate API.

To enhance user accessibility, the application generates timestamps alongside the transcript. This feature allows for interactive synchronization between the transcript and the video playback.
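Before diving into the full walkthrough, here is a condensed sketch of that flow using the TwelveLabs Python SDK. This is illustrative only: the index name, file path, and prompt below are placeholders, and the complete implementation with session handling and error checking follows later in the tutorial.

from twelvelabs import TwelveLabs

client = TwelveLabs(api_key="<YOUR_API_KEY>")

# 1. Create an index backed by Marengo 2.6 (embeddings) and Pegasus 1.1 (generation)
index = client.index.create(
    name="TranslateDemo",
    engines=[
        {"name": "pegasus1.1", "options": ["visual", "conversation"]},
        {"name": "marengo2.6", "options": ["visual", "conversation", "text_in_video", "logo"]},
    ],
)

# 2. Upload the video as an indexing task and wait for indexing to finish
task = client.task.create(index_id=index.id, file="uploads/sample.mp4")
task.wait_for_done(sleep_interval=5)

# 3. Generate a translated, timestamped transcript from the indexed video
res = client.generate.text(
    video_id=task.video_id,
    prompt="Provide the transcript translated into Spanish with timestamps.",
    temperature=0.25,
)
print(res.data)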

‍

Preparation Steps

  1. Obtain your API key from the Twelve Labs Playground and keep it ready for the environment variable.
  2. Clone the project from GitHub or use the Replit Template.
  3. Create a .env file alongside the main application file and add your API key:
API_KEY=your_api_key_here
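
As an optional sanity check (assuming python-dotenv is installed from the requirements), you can confirm the key loads correctly before starting the app:

# Quick check that the API key is readable from .env
from dotenv import load_dotenv
import os

load_dotenv()
print("API key loaded:", bool(os.getenv("API_KEY")))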

‍

With these steps completed, you're now ready to dive in and develop the application!

‍

Walkthrough for VidScribe - Video Multilingual Transcriber

In this tutorial, we'll build a Flask application with a minimal frontend. Here's the directory structure:

.
├── app.py
├── requirements.txt
├── static
│   ├── style.css
│   └── main.js
├── templates
│   └── index.html
└── uploads

‍

Creating the Flask Application

Now that you've completed the previous steps, it's time to build the Flask application. This will provide a simple way for you to upload videos and generate transcripts in various languages based on user preferences.

You can find the dependencies required for the application here: requirements.txt

Create a Python virtual environment, then install the dependencies using the following command:

pip install -r requirements.txt
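
The pinned versions live in the repository's requirements.txt; at a minimum it covers packages along these lines (an illustrative list, not the authoritative file):

flask
twelvelabs
python-dotenv
werkzeug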

‍

1 - Setting up the Main Application

This section focuses on the main application's utility functions, which contain the core logic and control flow. We'll break the main application down into three parts:

  • Creation of the Index
  • Generation of the Result
  • Uploading Component

‍

1.1 - Creation of the Index

Here, we'll discuss how the index is configured using the Twelve Labs SDK. This Flask application allows users to upload and process videos for various purposes. It employs a secure filename handling system and session management to ensure reliable operation in a production environment.

# Import the necessary modules

from flask import Flask, render_template, request, jsonify, send_from_directory, session
from werkzeug.utils import secure_filename
import os
import uuid
from twelvelabs import TwelveLabs
from twelvelabs.models.task import Task
from dotenv import load_dotenv

load_dotenv()

# Load the Twelve Labs API key from the environment
API_KEY = os.getenv("API_KEY")

app = Flask(__name__)
app.secret_key = os.urandom(24)  

UPLOAD_FOLDER = 'uploads'
ALLOWED_EXTENSIONS = {'mp4', 'avi', 'mov'}
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
app.config['MAX_CONTENT_LENGTH'] = 100 * 1024 * 1024


# Initialize the Twelve Labs SDK client
client = TwelveLabs(api_key=API_KEY)

def allowed_file(filename):
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

@app.route('/')
def index():
    return render_template('index.html')

# Utility function to check the status
def on_task_update(task: Task):
    print(f"Status={task.status}")


def process_video(filepath, language, difficulty):
    try:
        if 'index_id' not in session:
            # Index name with a random suffix so each new session gets a unique index
            index_name = f"Translate{uuid.uuid4().hex[:8]}"

            # Define the engines
            engines = [
                {
                    "name": "pegasus1.1",
                    "options": ["visual", "conversation"]
                },
                {
                    "name": "marengo2.6",
                    "options": ["visual", "conversation", "text_in_video", "logo"]
                }
            ]

            # Create the index with the above configuration
            index = client.index.create(
                name=index_name,
                engines=engines
            )

            # Store the Index ID in the session so it can be reused
            session['index_id'] = index.id
            print(f"Created new index with ID: {index.id}")
        else:
            print(f"Using existing index with ID: {session['index_id']}")

        # Create the indexing task for the uploaded video
        task = client.task.create(index_id=session['index_id'], file=filepath)
        task.wait_for_done(sleep_interval=5, callback=on_task_update)

        if task.status != "ready":
            raise RuntimeError(f"Indexing failed with status {task.status}")

        print(f"The unique identifier of your video is {task.video_id}.")

‍

We manage session state to handle creation of the Index ID robustly. A short UUID-based suffix is appended to the index name, ensuring a unique name for each new session, and the resulting Index ID is saved in the session for future reference.

The index configuration defines two engines: Marengo 2.6 (Embedding Engine) for indexing the video, and Pegasus 1.1 (Generative Engine) for accessing the indexed video and generating content based on open-ended prompts or other parameters.

To create a task, we provide the Index ID and video file path. We can track the task's processing status using task.status. Once indexing is complete, the resulting Video ID is used in the next step.

‍

1.2 - Generation of the Result

This section covers the generation of text from the indexed video based on user-selected difficulty levels and language preferences. Our prompt engineering system adapts the video transcript's complexity and detail level to match the user's comprehension level while maintaining accurate timestamps and translations.

        # Updated prompt with difficulty level
        difficulty_prompts = {
            "beginner": "Provide a simplified and easy-to-understand",
            "intermediate": "Provide a moderately detailed",
            "advanced": "Provide a comprehensive and detailed"
        }
        
        # Select the prompt template for the difficulty level chosen in the user form
        base_prompt = difficulty_prompts.get(difficulty, difficulty_prompts["intermediate"])
        # Open-ended prompt for transcript generation
        prompt = f"Provide the Only Transcript in the Translated {language.capitalize()} Language, {base_prompt} level with the timestamp duration (in the format of ss : ss) of the Indexed Video Content."
        
        res = client.generate.text(video_id=task.video_id, prompt=prompt, temperature=0.25)
        print(res)
        return {
            'status': 'ready',
            'message': 'File processed successfully',
            'transcript': res.data,
            'video_path': f'/uploads/{os.path.basename(filepath)}'
        }
        
    except Exception as e:
        print(f"Error processing video: {str(e)}")
        return {'status': 'error', 'message': str(e)}

‍

We use a dictionary-based approach to map difficulty levels to appropriate prompt templates, with a default fallback to the intermediate level. The system maintains a low temperature value of 0.25 to ensure consistent and reliable output generation. The response includes the processed transcript with timestamps.
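
As a quick illustration of how process_video can be exercised outside the upload route (hypothetical: it assumes a short clip already exists at uploads/sample.mp4, which is a placeholder path, and uses Flask's test request context so the session is available):

# Hypothetical local check of process_video
with app.test_request_context():
    result = process_video("uploads/sample.mp4", language="french", difficulty="beginner")
    print(result["status"])
    if result["status"] == "ready":
        print(result["transcript"][:200])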

‍

1.3 - Uploading Component

In this section, we'll explore how we define secure file upload endpoints and handle user file submissions in our Flask application. This component manages the file upload process, validates file types, handles language and difficulty preferences, and ensures secure storage of uploaded videos while maintaining proper error handling throughout the workflow.

# Route to handle file uploads - only accepts POST requests
@app.route('/upload', methods=['POST'])
def upload_file():

    # Check if a file was included in the request
    if 'file' not in request.files:
        return jsonify({'status': 'error', 'message': 'No file part'}), 400
    
    # Get the file from the request
    file = request.files['file']
    # Extract language preference from form data, default to German if not specified
    language = request.form.get('language', 'german')
    difficulty = request.form.get('difficulty', 'intermediate')
    
    # Validate that a file was actually selected
    if file.filename == '':
        return jsonify({'status': 'error', 'message': 'No selected file'}), 400

    # Check if file extension is allowed and proceed with processing
    if file and allowed_file(file.filename):
        filename = secure_filename(file.filename)
        filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
        file.save(filepath)
        
        # Process the video with the specified language and difficulty
        result = process_video(filepath, language, difficulty)

        # Return the processing result as JSON
        return jsonify(result)
    # Return error if file type is not allowed
    return jsonify({'status': 'error', 'message': 'File type not allowed'}), 400

# To serve uploaded files
@app.route('/uploads/<filename>')
def uploaded_file(filename):
    return send_from_directory(app.config['UPLOAD_FOLDER'], filename)

‍

We provide an upload endpoint that processes incoming video files and their associated metadata. The upload route uses Werkzeug's secure_filename utility for secure filename handling and implements robust error checking with appropriate HTTP status codes. If no language and difficulty preferences are set, the system defaults to German language and intermediate difficulty. The file serving route allows secure access to uploaded files from a designated directory, ensuring proper access control and file system security.
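
If you want to exercise the endpoint without the frontend, a small script like the following can serve as a smoke test (hypothetical: it assumes the Flask app is running locally on port 5000, a sample.mp4 exists in the working directory, and the requests package is installed):

import requests

# Hypothetical smoke test for the /upload route
with open("sample.mp4", "rb") as f:
    response = requests.post(
        "http://localhost:5000/upload",
        files={"file": ("sample.mp4", f, "video/mp4")},
        data={"language": "spanish", "difficulty": "beginner"},
    )
print(response.status_code)
print(response.json())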

‍

2 - Handling the component with JavaScript

This section explores the key handling utility functions found in the main.js file. This JavaScript file manages file uploads, form submissions, video playback synchronization with time-stamped transcripts, error handling, and other application parameters. We'll break main.js down into two sections:

  • File Upload and Validation
  • Transcript Processing

‍

2.1 - File Upload and Validation

This section provides an overview of a comprehensive client-side file validation and submission system. It implements essential functionality for validating video files, handling file removals, and managing asynchronous form submissions throughout the upload and processing workflow.

     // Validates the uploaded File
     function validateFile(file) {
        const validTypes = ['video/mp4', 'video/avi', 'video/quicktime'];
        const maxSize = 100 * 1024 * 1024; // 100MB

        if (!validTypes.includes(file.type)) {
            updateStatus('Please select a valid video file (MP4, AVI, or MOV)', 'error');
            return false;
        }

        if (file.size > maxSize) {
            updateStatus('File size must be less than 100MB', 'error');
            return false;
        }

        return true;
    }

    // Utility function to handle the removal of a selected file
    function handleFileRemove(e) {
        e.preventDefault();
        fileInput.value = '';
        selectedFile.classList.add('hidden');
        uploadPrompt.textContent = 'Choose a video or drag it here';
        updateStatus('', '');
    }

    // Form submission handling
    async function handleFormSubmit(e) {
        e.preventDefault();
        const formData = new FormData(e.target);
        
        if (!fileInput.files || !fileInput.files[0]) {
            updateStatus('Please select a file first', 'error');
            return;
        }

        const loadingOverlay = document.getElementById('loading-overlay');
        loadingOverlay.classList.remove('hidden');
        updateStatus('Uploading file and processing...', 'loading');
        // Hide the previous results
        hideResult();

        try {
         // Send POST request to upload endpoint with form
            const response = await fetch('/upload', {
                method: 'POST',
                body: formData
            });

	     // Parse the JSON response
            const data = await response.json();
            console.log('Server response:', data);

            // Check if upload and processing were successful
            if (response.ok && data.status === 'ready') {
                // Update status and display results
                updateStatus('Processing complete!', 'success');
                showResult();
                displayTranscript(data.transcript);
                displayVideo(data.video_path);
            } else {
                // Throw error if processing failed
                throw new Error(data.message || 'An error occurred during processing');
            }
        } catch (error) {
            // Log and display errors
            console.error('Error:', error);
            updateStatus(`Error: ${error.message}`, 'error');
        } finally {
            // Hide loading overlay regardless of outcome
            loadingOverlay.classList.add('hidden');
        }
    }

‍

The validateFile function ensures only supported video formats and file sizes are accepted, while handleFileRemove resets the file selection state upon removal. The handleFormSubmit function enables asynchronous uploads, comprehensively managing errors, loading states, and user feedback.

The process features appropriate UI updates at various stages, including loading overlays, success messages, and error notifications. To maintain a smooth user experience, the implementation ensures consistent error handling and validation throughout the upload and processing phases.

‍

2.2 - Transcript Parsing Utility for the Open-Ended Prompt Result

This section focuses on developing a robust JavaScript transcript parsing system. The code handles various timestamp formats and text patterns common in video transcripts, ensuring reliable extraction and structuring of time-coded content.

The parser converts raw transcript data into a format suitable for display and synchronization with video playback. It also manages multiple edge cases and format variations that may arise from the transcription process.

    // Parses a transcript string into structured data with timestamps and text
    function parseTranscript(transcript) {
        console.log('Raw transcript:', transcript);
        // Initialize Map to store unique entries (prevents duplicates)
        const entries = new Map();
        
        if (!transcript) {
            console.error('Empty transcript received');
            return [];
        }
    
        // Extract data from JSON response and handle escapes
        let transcriptText = transcript;
        try {
            if (typeof transcript === 'string' && (transcript.includes('"id":') || transcript.includes("'id':"))) {
                const dataMatch = transcript.match(/['"]data['"]\s*:\s*['"]([^]+?)['"]\s*$/);
                if (dataMatch && dataMatch[1]) {
                    transcriptText = dataMatch[1]
                        .replace(/\\n/g, '\n')
                        .replace(/\\'/g, "'")
                        .replace(/\\"/g, '"')
                        .replace(/\\\\/g, '\\');
                }
            }
        } catch (e) {
            console.error('Error parsing JSON response:', e);
        }
    
        // Different timestamp patterns
        const patterns = [
            // HH:MM - HH:MM : "text"
            /(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2})\s*:\s*["']([^"']+)["']/g,
            // HH:MM - HH:MM: text
            /(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2})\s*:\s*([^"\n]+)/g,
            // Simple format with quotes
            /(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2})\s*["']([^"']+)["']/g
        ];
    
        // Try each pattern against the transcript text for parsing
        for (const pattern of patterns) {
            let match;
            while ((match = pattern.exec(transcriptText)) !== null) {
                try {
                    const [_, startMin, startSec, endMin, endSec, text] = match;
                    // Convert timestamps to seconds
                    const startTime = parseInt(startMin) * 60 + parseInt(startSec);
                    const endTime = parseInt(endMin) * 60 + parseInt(endSec);
    
                    // Skip invalid timestamps
                    if (isNaN(startTime) || isNaN(endTime)) continue;
                    // Clean up the matched transcript text
                    const cleanText = text
                        .replace(/^["'\s]+|["'\s]+$/g, '')
                        .replace(/\\n/g, ' ')
                        .replace(/\*\*/g, '')
                        .replace(/\\'/g, "'")
                        .replace(/\\"/g, '"')
                        .replace(/\\\\/g, '\\')
                        .replace(/\s+/g, ' ')
                        .trim();
    
                    if (cleanText && !cleanText.includes('Note:')) {
                        const key = `${startTime}-${cleanText.substring(0, 50)}`;
                        entries.set(key, {
                            start: startTime,
                            end: endTime,
                            text: cleanText
                        });
                    }
                } catch (e) {
                    console.error('Error processing match:', e);
                    continue;
                }
            }
        }
    
        // Convert Map to Array and sort by start time
        const sortedEntries = Array.from(entries.values())
            .sort((a, b) => a.start - b.start);
        // Log processed result
        console.log('Parsed entries:', sortedEntries);
        return sortedEntries;
    }

‍

For consistent processing, the system employs regular expressions to match different timestamp patterns, converts time markers to seconds, and performs comprehensive text cleaning to remove unwanted artifacts. A Map structure prevents duplicate entries and maintains chronological order. Detailed logging is implemented throughout the parsing process for debugging purposes.

The final output is a list of transcript entries containing start time, end time, and cleaned text content, ready for integration with video players and transcript display components.
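
For instance, a raw line such as 00:05 - 00:12 : "Hola y bienvenidos" (a hypothetical snippet) would be parsed into an entry with start 5, end 12, and the cleaned text "Hola y bienvenidos", ready to be highlighted as the video plays.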

You can find the complete version of the JavaScript file containing the code discussed above in this main.js file.

‍

Demo Application

First, the user selects the target language for translation and the desired difficulty level. Then, they upload the video. Once the upload is complete, the indexing process begins.

After indexing and task creation are finished, the video ID is used to generate the transcription based on the user's preferences. The demo below showcases this generation process using the Twelve Labs SDK.

To explore Twelve Labs further, try generating video-based use cases in various sectors such as content creation, education, or any other area that interests you.

‍

More Ideas to Experiment with the Tutorial

Understanding how multilingual video transcription applications work and how they're developed allows you to implement innovative ideas and create products that satisfy a wide variety of video solutions. Here are some use cases that could benefit users working with video content:

🌍 Global Content Creators: Generate transcriptions in multiple languages simultaneously, enabling instant content localization for video content.

πŸŽ“ International Education: Make educational content accessible by automatically transcribing lectures into various languages and difficulty levels. This can also be leveraged for language learning from video content where resources are limited.

πŸ’Ό Cross-Cultural Business: Facilitate communication in multinational settings by generating meeting transcripts.

‍

Conclusion

Thank you for following along with this tutorial on the development and functionality of the Video Multilingual Transcriber application with Twelve Labs. We hope this guide sparks ideas for combining video understanding with your own application workflows. We welcome your ideas on how to enhance the user experience and address any challenges.

‍

Additional Resources

Learn more about the engines used for the generation task, Marengo 2.6 (Embedding Engine) and Pegasus 1.1 (Generative Engine). To further explore Twelve Labs and enhance your understanding of video content analysis, check out these valuable resources:

  • Discord Community: Join our vibrant community of developers and enthusiasts to discuss ideas, ask questions, and share your projects. Join the Twelve Labs Discord
  • Sample Applications: Explore a variety of sample applications to inspire your next project or learn new implementation techniques.
  • Explore Tutorials: Dive deeper into Twelve Labs capabilities with our comprehensive tutorials.

We encourage you to leverage these resources to expand your knowledge and create innovative applications using Twelve Labs video understanding technology.

