Vedant Misra — Software Engineer

Overview

A web-based object detection system developed for the 2024 Northrop Grumman Hackathon. The project integrates three models, YOLOv8, YOLOv7, and OWL-ViT (Vision Transformer for Open-World Localization), to detect objects in both static image uploads and real-time video streams. A custom IoU-based merging algorithm reconciles detections from different architectures, so the system covers both standard fixed-class detection and zero-shot object localization.

Problem Statement

The project aimed to solve the challenge of creating a versatile object detection system that isn't limited by the fixed classes of a single pre-trained model.

User/Business Need: Efficient, real-time object identification and localization for security or operational awareness.
Technical Problem: Balancing the speed of traditional YOLO models with the flexibility of open-set/zero-shot models like OWL-ViT.
Constraints: Real-time performance requirements for video streaming and the need for a user-friendly interface for non-technical evaluators.

Goals and Scope

Intended Goals: Deliver a working prototype that demonstrates high-accuracy object detection across multiple model types.
Functional Scope: Support for image uploads, real-time camera feed analysis, and selection between different AI models.
Technical Scope: Implementation of a Flask-based backend, integration of PyTorch-based models, and a responsive JavaScript frontend.
Expected Deliverables: A functional web application with two primary entry points (upload and real-time).

What I Built

I built a complete end-to-end object detection platform consisting of:

Flask Backend: Managed model inference, image processing, and WebSocket communication.
Detection Ensemble: An integration layer for YOLOv7, YOLOv8, and OWL-ViT.
Real-Time Streaming Layer: A SocketIO-based implementation for low-latency video frame processing.
Ensemble Merging Logic: A custom algorithm to deduplicate and prioritize detections from multiple concurrent models.
Web Interface: A professional UI with model-specific confidence visualization and annotated results.

Key Features and Capabilities

1. Multi-Model Detection

Description: Users can select between YOLOv8 (fastest), YOLOv7 (the older YOLO generation), or OWL-ViT (open-world, text-prompted localization).
How it works: The backend selects the appropriate inference pipeline based on user input from the dropdown menu.
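
One way to picture the selection step is a small registry that maps the dropdown value to an inference callable. This is a minimal sketch, not the project's actual code: the keys ("yolov8", "yolov7", "owlvit"), the function names, and the detection tuple shape are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Assumed detection shape: (label, confidence, (x1, y1, x2, y2)).
Detection = Tuple[str, float, Tuple[float, float, float, float]]
Detector = Callable[[bytes], List[Detection]]


def build_registry(yolov8: Detector, yolov7: Detector, owlvit: Detector) -> Dict[str, Detector]:
    """Map dropdown values (hypothetical keys) to inference callables."""
    return {"yolov8": yolov8, "yolov7": yolov7, "owlvit": owlvit}


def run_selected(registry: Dict[str, Detector], choice: str, image: bytes) -> List[Detection]:
    """Look up the pipeline for the user's choice and run it on the image."""
    try:
        detector = registry[choice.strip().lower()]
    except KeyError:
        # Reject unknown values instead of silently falling back to a default.
        raise ValueError(f"unknown model choice: {choice!r}") from None
    return detector(image)
```

Keeping the mapping in one place means adding a model only requires registering one more callable.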

2. "Combined" Ensemble Mode

Description: Merges detections from YOLOv7 and OWL-ViT.
How it works: Computes the IoU (Intersection over Union) between boxes from the two models and treats any pair above a shared threshold as the same object. For each matched pair, it keeps the detection with the higher confidence.
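
The merging idea can be sketched as follows. This is an illustrative version under stated assumptions, not the project's implementation: the `Detection` type, the function names, the default threshold of 0.5, and the choice to keep unmatched boxes from both models are all assumptions.

```python
from typing import List, NamedTuple, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


class Detection(NamedTuple):
    label: str
    confidence: float
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_detections(
    yolo: Sequence[Detection],
    owl: Sequence[Detection],
    iou_threshold: float = 0.5,  # assumed value; the source does not state one
) -> List[Detection]:
    """Merge two detection lists, keeping the more confident box on overlap."""
    merged = list(yolo)
    for cand in owl:
        # Find the YOLO box that overlaps this OWL-ViT box the most.
        best_idx, best_iou = -1, 0.0
        for i, ref in enumerate(yolo):
            overlap = iou(ref.box, cand.box)
            if overlap > best_iou:
                best_idx, best_iou = i, overlap
        if best_idx >= 0 and best_iou >= iou_threshold:
            # Same object seen by both models: keep the higher confidence.
            if cand.confidence > merged[best_idx].confidence:
                merged[best_idx] = cand
        else:
            # No match: assume the box is a distinct object and keep it.
            merged.append(cand)
    return merged
```

Note that confidences from different architectures are not necessarily calibrated against each other, so a direct comparison like this is a simplification.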

3. Real-Time Video Analysis

Description: Processes webcam frames in real time.
How it works: Client-side JavaScript captures frames from a getUserMedia stream and sends them to the server over WebSockets; the server returns annotated frames, which are drawn on an HTML5 canvas.
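
The server side of this round trip can be sketched with the standard library alone. This is a hedged illustration: the function names, the JPEG data-URL prefix, and the `annotate` callback (standing in for inference plus box drawing) are assumptions, not names taken from the project.

```python
import base64
from typing import Callable

# Assumed wire format: a JPEG data URL, as produced by canvas.toDataURL().
DATA_URL_PREFIX = "data:image/jpeg;base64,"


def decode_frame(data_url: str) -> bytes:
    """Drop the data-URL header (if present) and return the raw image bytes."""
    # Standard base64 never contains a comma, so splitting on the last one is safe.
    _, _, payload = data_url.rpartition(",")
    return base64.b64decode(payload)


def encode_frame(jpeg_bytes: bytes) -> str:
    """Wrap raw image bytes in a data URL the browser can draw directly."""
    return DATA_URL_PREFIX + base64.b64encode(jpeg_bytes).decode("ascii")


def handle_frame(data_url: str, annotate: Callable[[bytes], bytes]) -> str:
    """Decode one frame, run the (hypothetical) annotate step, and re-encode it."""
    return encode_frame(annotate(decode_frame(data_url)))
```

In a SocketIO setup, a function like `handle_frame` would be called once per incoming frame event, with its return value emitted back to the client.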

4. Zero-Shot Localization

Description: Detects custom objects outside the fixed set of classes the YOLO models were trained on.
How it works: Uses OWL-ViT with custom text prompts defined in constants.py (e.g., "bookshelf with books", "icecream").
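
With a text-prompted detector, each prediction refers back to one of the prompts by index. The sketch below shows how such indices could be turned into readable labels. The name `TEXT_PROMPTS`, the helper function, and the 0.1 score cutoff are assumptions for illustration; only the two prompt strings come from the project.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Hypothetical name; the two prompts are the examples given for constants.py.
TEXT_PROMPTS = ["bookshelf with books", "icecream"]


def label_detections(
    scores: Sequence[float],
    label_ids: Sequence[int],
    boxes: Sequence[Box],
    prompts: Sequence[str],
    min_score: float = 0.1,  # assumed cutoff, not taken from the project
) -> List[Tuple[str, float, Box]]:
    """Map prompt indices to prompt text and drop low-scoring predictions."""
    return [
        (prompts[idx], score, tuple(box))
        for score, idx, box in zip(scores, label_ids, boxes)
        if score >= min_score
    ]
```

Because the label set is just a list of strings, supporting a new object type means adding a prompt rather than retraining a model.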

Technical Architecture

Major Components: Flask Web Server, SocketIO Engine, PyTorch Inference Engines.
Frontend/Backend Structure: Client-side JavaScript (State management and rendering) connected to a Python/Flask backend via REST (Upload) and WebSockets (Real-time).
Data Flow:
- Image Upload -> Flask API -> Model Inference -> OpenCV Annotation -> Base64 Response.
- Webcam Frame -> SocketIO -> Flask -> model_v8 -> Annotated Frame -> SocketIO Emit.
Model Components:
- YOLOv8: Ultralytics implementation (yolov8n.pt).
- YOLOv7: Loaded via torch.hub from WongKinYiu/yolov7.
- OWL-ViT: Hugging Face Transformers (google/owlvit-base-patch32).
Storage Layers: Local filesystem for model weights; temporary in-memory processing for video frames.

Technical Decisions and Tradeoffs

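WebSocket vs REST: Chose WebSockets for real-time video so that every frame travels over one persistent connection, avoiding the per-request HTTP overhead (headers, plus a new TCP handshake whenever connections are not reused) that one REST call per frame would incur.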
Ensemble vs Single Model: Opted for an ensemble approach to cover both common objects (YOLO) and specific, niche objects (OWL-ViT) at the cost of higher memory/CPU usage.
Base64 Transmission: Transmitted images as base64 strings, which are simpler to implement and debug than raw binary streams, at the cost of roughly 33% more bandwidth.
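
To put a number on the bandwidth cost of base64: every 3 raw bytes become 4 encoded characters, so the payload grows by about one third. The short check below confirms that arithmetic; the function name and the 30,000-byte frame size are illustrative assumptions.

```python
import base64


def encoded_size(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes raw bytes."""
    # Each group of up to 3 input bytes produces exactly 4 output characters.
    return 4 * ((n_bytes + 2) // 3)


# Hypothetical frame size, used only to illustrate the ratio.
frame_bytes = 30_000
ratio = encoded_size(frame_bytes) / frame_bytes

# Cross-check the formula against the standard library encoder.
assert encoded_size(frame_bytes) == len(base64.b64encode(bytes(frame_bytes)))
```

For a prototype this overhead is usually acceptable; a binary WebSocket message would remove it at the cost of extra decoding code on the client.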

Outcomes and Value

Feature Delivered: Built a fully functional end-to-end multi-model object detection web app.
Prototype Completed: Reached a professional, demo-ready MVP stage.
Measurable Outcome: Successfully integrated real-time processing with ensemble-based detection.

Northrop Grumman Object Detection