A Comparative Analysis of Text Detection Performance in Video Frames: Google Cloud Video Intelligence API and Leo AI

Published on: July 15, 2024

Leo AI and Vertex AI

Abstract

This study evaluates and compares the performance of the text detection models for video frames provided by Google's Cloud Video Intelligence API and Leo AI. The models were benchmarked with the accuracy formula described in Google's documentation on evaluating models, which is based on true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions. The analysis revealed substantial differences in performance between the two models, with Leo AI demonstrating higher accuracy and fewer false negatives.

Introduction

Text detection in video frames is a critical task for various applications, including automated content moderation, video indexing, and optical character recognition (OCR). Accurate text detection ensures reliable extraction of information and improved user experience. This study aims to compare the effectiveness of Google's Cloud Video Intelligence API and Leo AI's text detection model using a predefined accuracy formula.

Methodology

Data Collection

A dataset of video frames was used to evaluate the performance of the text detection models. Each frame was manually annotated to identify the presence or absence of text.

Evaluation Metrics

The models were assessed based on the following metrics:

  • True Positive (TP): Frames where text was correctly detected.
  • True Negative (TN): Frames correctly identified as containing no text.
  • False Positive (FP): Frames incorrectly identified as containing text.
  • False Negative (FN): Frames where text was present but not detected.
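
The four outcome categories above can be tallied mechanically once each frame has a manual annotation and a model prediction. The following is a minimal sketch, assuming one boolean label per frame (text present or not) and one boolean prediction per frame; the function name `tally_outcomes` and the four-frame example are hypothetical and are not taken from the study's dataset or tooling.

```python
def tally_outcomes(annotations, predictions):
    """Count TP, TN, FP and FN over paired per-frame labels.

    annotations: sequence of bools, True if the frame contains text.
    predictions: sequence of bools, True if the model detected text.
    """
    if len(annotations) != len(predictions):
        raise ValueError("annotations and predictions must have equal length")

    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for has_text, detected in zip(annotations, predictions):
        if has_text and detected:
            counts["TP"] += 1      # text present and detected
        elif not has_text and not detected:
            counts["TN"] += 1      # no text and none reported
        elif detected:
            counts["FP"] += 1      # no text, but text reported
        else:
            counts["FN"] += 1      # text present, but missed
    return counts


# Hypothetical four-frame example (assumption, not study data).
example = tally_outcomes(
    annotations=[True, True, False, False],
    predictions=[True, False, True, False],
)
print(example)
```

With this illustrative input, each of the four categories receives exactly one frame.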

Accuracy Formula

The accuracy formula used is defined as:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Models Evaluated
  1. Google Cloud Video Intelligence API
  2. Leo AI

Accuracy Calculation

Google's Cloud Video Intelligence API
\[ \text{Accuracy} = \frac{2 + 0}{2 + 0 + 6 + 7} = \frac{2}{15} \approx 0.133 \]
Leo AI
\[ \text{Accuracy} = \frac{8 + 0}{8 + 0 + 5 + 1} = \frac{8}{14} \approx 0.571 \]
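
The arithmetic above can be reproduced with a few lines of Python. This is a sketch for checking the reported figures; the function and variable names are illustrative, and the counts are the ones stated in the two calculations.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one frame is required")
    return (tp + tn) / total


# Counts as reported in the calculations above.
google_accuracy = accuracy(tp=2, tn=0, fp=6, fn=7)
leo_accuracy = accuracy(tp=8, tn=0, fp=5, fn=1)

print(f"Google Cloud Video Intelligence API: {google_accuracy:.3f}")  # 0.133
print(f"Leo AI: {leo_accuracy:.3f}")                                  # 0.571
```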

Discussion

Performance Comparison
  • Accuracy: Leo AI achieved a substantially higher accuracy (0.571, or 8 of 14 frames) than Google's Cloud Video Intelligence API (0.133, or 2 of 15 frames).
  • True Positives: Leo AI detected text correctly in four times as many frames as Google Cloud Video Intelligence API (8 versus 2).
  • False Positives: Google Cloud Video Intelligence API had 6 incorrect text detections, slightly more than Leo AI's 5.
  • False Negatives: Google Cloud Video Intelligence API missed text in seven times as many text-bearing frames as Leo AI (7 versus 1).

Analysis

The results indicate that Leo AI outperforms Google's Cloud Video Intelligence API in text detection accuracy. The higher true positive rate and lower false negative rate suggest that Leo AI is more reliable in identifying text in video frames. The comparable false positive counts (5 versus 6) indicate that both models struggle similarly with incorrectly predicting the presence of text in frames that contain none.
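
The true positive rate and false negative rate referred to in this analysis are not computed explicitly in the study. As a sketch, they can be derived from the reported counts using the standard definitions: recall (true positive rate) is TP / (TP + FN), the false negative rate is 1 minus recall, and precision is TP / (TP + FP). The function names and the `counts` dictionary layout below are illustrative assumptions.

```python
def recall(tp, fn):
    """True positive rate: share of text-bearing frames that were detected."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of reported detections that were correct."""
    return tp / (tp + fp)


# Counts as reported in the accuracy calculations of this study.
counts = {
    "Google Cloud Video Intelligence API": {"TP": 2, "FP": 6, "FN": 7},
    "Leo AI": {"TP": 8, "FP": 5, "FN": 1},
}

rates = {}
for model, c in counts.items():
    tpr = recall(c["TP"], c["FN"])
    rates[model] = {
        "recall": tpr,
        "false_negative_rate": 1 - tpr,
        "precision": precision(c["TP"], c["FP"]),
    }
    print(
        f"{model}: recall={tpr:.3f}, "
        f"FNR={1 - tpr:.3f}, "
        f"precision={rates[model]['precision']:.3f}"
    )
```

Under these definitions, recall is about 0.222 for Google Cloud Video Intelligence API and 0.889 for Leo AI, and precision is 0.250 and about 0.615 respectively.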

Limitations
  • Data Set: The study's results are based on a specific dataset, which may not be representative of all possible video frames.
  • True Negatives: Both models failed to identify any true negatives, which may be due to the nature of the dataset or model-specific limitations.

Conclusion

This comparative study highlights Leo AI's superior performance in text detection within video frames compared to Google's Cloud Video Intelligence API. The findings suggest that Leo AI's model is more effective in identifying text accurately and minimizing false negatives, making it a more reliable choice for applications requiring high text detection accuracy. Further research with larger and more diverse datasets could provide additional insights into the models' performance across different scenarios.

Future Work

Future research should explore:

  • Evaluation of models on more diverse datasets.
  • Optimization techniques to improve true negative detection.
  • Integration of text detection models with other video analysis tools for enhanced performance.

This case study provides valuable insights for developers and researchers looking to implement text detection in video frames, guiding them towards choosing the more accurate and reliable model for their specific needs.

Lera Leonteva

Lera Leonteva is the CEO and Founder of Leo AI. A former ethical hacker, she has worked on Harvey AI and ContractMatrix for Allen & Overy, led a team under PwC's AI initiative, and led the security team at a San Francisco startup.