Vedant Misra
Northrop Grumman Object Detection
ML | Completed (Hackathon MVP)

Multi-Model Ensemble Object Detection Hub

Developed a multi-model object detection platform that integrates YOLOv8, YOLOv7, and OWL-ViT, merges their detections with an IoU-based algorithm, and supports zero-shot localization.

Role

Full-stack Developer / AI Engineer

Timeline

2024

Context

Northrop Grumman Hackathon 2024

3 AI Models: YOLOv8, YOLOv7, OWL-ViT

Real-Time Video Analysis: Low-latency WebSockets

Zero-Shot Detection: Prompt-based localization

Python, Flask, Flask-SocketIO, PyTorch, YOLOv8, OWL-ViT, OpenCV, WebSockets

Overview

This web-based object detection system was developed for the 2024 Northrop Grumman Hackathon. It takes a multi-model ensemble approach, integrating YOLOv8, YOLOv7, and OWL-ViT (Vision Transformer for Open-World Localization) to detect objects in both static image uploads and real-time video streams. A custom IoU-based merging algorithm reconciles detections from the different architectures, so the system handles both standard and zero-shot object localization tasks.

Problem Statement

The project set out to build an object detection system that is not limited to the fixed class list of a single pre-trained model.

  • User/Business Need: Efficient, real-time object identification and localization for security or operational awareness.
  • Technical Problem: Balancing the speed of traditional YOLO models with the flexibility of open-set/zero-shot models like OWL-ViT.
  • Constraints: Real-time performance requirements for video streaming and the need for a user-friendly interface for non-technical evaluators.

Goals and Scope

  • Intended Goals: Deliver a working prototype that demonstrates high-accuracy object detection across multiple model types.
  • Functional Scope: Support for image uploads, real-time camera feed analysis, and selection between different AI models.
  • Technical Scope: Implementation of a Flask-based backend, integration of PyTorch-based models, and a responsive JavaScript frontend.
  • Expected Deliverables: A functional web application with two primary entry points (upload and real-time).

What I Built

I built a complete end-to-end object detection platform consisting of:

  1. Flask Backend: Managed model inference, image processing, and WebSocket communication.
  2. Detection Ensemble: An integration layer for YOLOv7, YOLOv8, and OWL-ViT.
  3. Real-Time Streaming Layer: A SocketIO-based implementation for low-latency video frame processing.
  4. Ensemble Merging Logic: A custom algorithm to deduplicate and prioritize detections from multiple concurrent models.
  5. Web Interface: A professional UI with model-specific confidence visualization and annotated results.

Key Features and Capabilities

1. Multi-Model Detection

  • Description: Users can select between YOLOv8 (high speed), YOLOv7 (legacy performance), or OWL-ViT (open-world localization).
  • How it works: The backend selects the appropriate inference pipeline based on user input from the dropdown menu.
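
The selection step described above can be sketched as a lookup table from dropdown value to inference function. This is a minimal illustration, not the project's code: the key strings, the `Detection` tuple layout, and the placeholder `run_*` functions are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# (x1, y1, x2, y2, confidence, label): a hypothetical detection format.
Detection = Tuple[float, float, float, float, float, str]
Detector = Callable[[bytes], List[Detection]]


# Placeholder detectors standing in for the real inference pipelines.
def run_yolov8(image: bytes) -> List[Detection]:
    return []


def run_yolov7(image: bytes) -> List[Detection]:
    return []


def run_owlvit(image: bytes) -> List[Detection]:
    return []


# Keys mirror hypothetical dropdown values sent by the frontend.
PIPELINES: Dict[str, Detector] = {
    "yolov8": run_yolov8,
    "yolov7": run_yolov7,
    "owlvit": run_owlvit,
}


def select_pipeline(model_name: str) -> Detector:
    """Return the detector for a dropdown value, rejecting unknown names."""
    try:
        return PIPELINES[model_name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown model: {model_name!r}") from None
```

Rejecting unknown names explicitly keeps a malformed form value from silently falling back to a default model.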

2. "Combined" Ensemble Mode

  • Description: Merges detections from YOLOv7 and OWL-ViT.
  • How it works: Uses a shared IoU (Intersection over Union) threshold to detect overlapping boxes. If a match is found, it prioritizes the detection with higher confidence.
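
One way to realize this merge is a greedy pass in descending confidence order, similar to non-maximum suppression. The sketch below is an assumption about how such logic could look, not the project's actual algorithm: the `Detection` fields, the 0.5 default threshold, and the class-agnostic matching are illustrative choices.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float
    label: str
    source: str  # which model produced the box (hypothetical field)


def iou(a: Detection, b: Detection) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_detections(
    first: List[Detection],
    second: List[Detection],
    iou_threshold: float = 0.5,  # assumed value; the write-up does not state it
) -> List[Detection]:
    """Keep the higher-confidence box whenever two boxes overlap enough."""
    kept: List[Detection] = []
    # Visiting boxes from most to least confident means any box that
    # overlaps an already-kept box is the lower-confidence duplicate.
    for det in sorted(first + second, key=lambda d: d.confidence, reverse=True):
        if all(iou(det, k) < iou_threshold for k in kept):
            kept.append(det)
    return kept
```

Because matching here ignores labels, a YOLO "person" box and an OWL-ViT box for a differently worded prompt are treated as duplicates when they cover the same region. Note that this greedy pass also suppresses overlapping boxes from the same model.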

3. Real-Time Video Analysis

  • Description: Processes webcam frames in real time.
  • How it works: Client-side JS captures frames from getUserMedia, sends them via WebSockets to the server, and the server returns annotated frames for display on an HTML5 canvas.
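
The server side of this round trip can be sketched with the standard library alone. The example assumes the client sends each frame as a `data:image/jpeg;base64,...` string (as `canvas.toDataURL()` would produce); the function names and the `annotate` callback, which stands in for model inference plus OpenCV drawing, are hypothetical.

```python
import base64
from typing import Callable

DATA_URL_PREFIX = "data:image/jpeg;base64,"


def decode_frame(data_url: str) -> bytes:
    """Extract raw image bytes from a base64 data URL."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    return base64.b64decode(payload, validate=True)


def encode_frame(jpeg_bytes: bytes) -> str:
    """Wrap encoded image bytes in a data URL the browser can draw."""
    return DATA_URL_PREFIX + base64.b64encode(jpeg_bytes).decode("ascii")


def handle_frame(data_url: str, annotate: Callable[[bytes], bytes]) -> str:
    """Decode one incoming frame, annotate it, and re-encode the result."""
    return encode_frame(annotate(decode_frame(data_url)))
```

In a Flask-SocketIO app, a function like `handle_frame` would typically be called from a handler registered with `@socketio.on(...)`, with the returned string sent back via `emit(...)`; the event names are left out here because the write-up does not specify them.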

4. Zero-Shot Localization

  • Description: Detects custom objects outside the classes the YOLO models were trained on.
  • How it works: Uses OWL-ViT with custom text prompts defined in constants.py (e.g., "bookshelf with books", "icecream").
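
A prompt-based detection call with the Hugging Face Transformers OWL-ViT classes might look like the sketch below. It is an assumption-laden illustration rather than the project's code: `TEXT_PROMPTS` only repeats the two prompts quoted above, the 0.1 threshold and the output dictionary layout are made up for the example, and the post-processing method name has changed across Transformers releases, so check it against the installed version.

```python
from typing import Dict, List

# Example prompts quoted in the write-up; the real list lives in constants.py.
TEXT_PROMPTS: List[str] = ["bookshelf with books", "icecream"]


def labels_to_prompts(label_ids: List[int], prompts: List[str]) -> List[str]:
    """Map OWL-ViT's integer label indices back to the prompt strings."""
    return [prompts[i] for i in label_ids]


def detect_with_owlvit(image, prompts: List[str], threshold: float = 0.1) -> List[Dict]:
    """Run zero-shot detection on a PIL image (downloads weights on first use)."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from transformers import OwlViTForObjectDetection, OwlViTProcessor

    checkpoint = "google/owlvit-base-patch32"
    processor = OwlViTProcessor.from_pretrained(checkpoint)
    model = OwlViTForObjectDetection.from_pretrained(checkpoint)

    # One list of text queries per image in the batch.
    inputs = processor(text=[prompts], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # PIL reports (width, height); post-processing expects (height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs=outputs, threshold=threshold, target_sizes=target_sizes
    )[0]

    names = labels_to_prompts(result["labels"].tolist(), prompts)
    return [
        {"box": box.tolist(), "confidence": float(score), "label": name}
        for box, score, name in zip(result["boxes"], result["scores"], names)
    ]
```

In a server, the processor and model would normally be loaded once at startup rather than on every call; they are loaded inside the function here only to keep the sketch self-contained.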

Technical Architecture

  • Major Components: Flask Web Server, SocketIO Engine, PyTorch Inference Engines.

  • Frontend/Backend Structure: Client-side JavaScript (State management and rendering) connected to a Python/Flask backend via REST (Upload) and WebSockets (Real-time).

  • Data Flow:

    • Image Upload -> Flask API -> Model Inference -> OpenCV Annotation -> Base64 Response.
    • Webcam Frame -> SocketIO -> Flask -> model_v8 -> Annotated Frame -> SocketIO Emit.
  • Model Components:

    • YOLOv8: Ultralytics implementation (yolov8n.pt).
    • YOLOv7: Loaded via torch.hub from WongKinYiu/yolov7.
    • OWL-ViT: Hugging Face Transformers (google/owlvit-base-patch32).
  • Storage Layers: Local filesystem for model weights; temporary in-memory processing for video frames.

Technical Decisions and Tradeoffs

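  • WebSocket vs REST: Chose WebSockets for real-time video so that every frame travels over one persistent connection instead of incurring per-request HTTP overhead (headers and, without keep-alive, a new TCP handshake).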
  • Ensemble vs Single Model: Opted for an ensemble approach to cover both common objects (YOLO) and specific, niche objects (OWL-ViT) at the cost of higher memory/CPU usage.
  • Base64 Transmission: Simplified implementation by transmitting base64-encoded images, which are easier to debug than raw binary streams but cost roughly 33% more bandwidth.
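
The bandwidth cost of the base64 choice follows from the encoding itself: every 3 raw bytes become 4 ASCII characters. The short check below makes that concrete; the 30,000-byte payload is an arbitrary stand-in for a compressed video frame, not a measured frame size from the project.

```python
import base64


def base64_overhead(n_bytes: int) -> float:
    """Ratio of base64 text length to raw payload length for n_bytes of data."""
    raw = b"\x00" * n_bytes  # content does not affect the encoded length
    return len(base64.b64encode(raw)) / len(raw)


if __name__ == "__main__":
    # A hypothetical 30,000-byte frame grows to 40,000 characters on the wire.
    print(f"{base64_overhead(30_000):.3f}x")
```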

Outcomes and Value

  • Feature Delivered: Built a fully functional end-to-end multi-model object detection web app.
  • Prototype Completed: Reached a professional, demo-ready MVP stage.
  • Measurable Outcome: Successfully integrated real-time processing with ensemble-based detection.