Overview
A comprehensive web-based object detection system developed for the 2024 Northrop Grumman Hackathon. The project implements a multi-model ensemble approach, integrating YOLOv8, YOLOv7, and OWL-ViT (Open-World Localization) to provide high-accuracy detection for both static image uploads and real-time video streams. The system features a bespoke IoU-based merging algorithm to reconcile detections from different architectures, offering a robust solution for both standard and zero-shot object localization tasks.
Problem Statement
The project aimed to solve the challenge of creating a versatile object detection system that isn't limited by the fixed classes of a single pre-trained model.
- User/Business Need: Efficient, real-time object identification and localization for security or operational awareness.
- Technical Problem: Balancing the speed of traditional YOLO models with the flexibility of open-set/zero-shot models like OWL-ViT.
- Constraints: Real-time performance requirements for video streaming and the need for a user-friendly interface for non-technical evaluators.
Goals and Scope
- Intended Goals: Deliver a working prototype that demonstrates high-accuracy object detection across multiple model types.
- Functional Scope: Support for image uploads, real-time camera feed analysis, and selection between different AI models.
- Technical Scope: Implementation of a Flask-based backend, integration of PyTorch-based models, and a responsive JavaScript frontend.
- Expected Deliverables: A functional web application with two primary entry points (upload and real-time).
What I Built
I built a complete end-to-end object detection platform consisting of:
- Flask Backend: Managed model inference, image processing, and WebSocket communication.
- Detection Ensemble: An integration layer for YOLOv7, YOLOv8, and OWL-ViT.
- Real-Time Streaming Layer: A SocketIO-based implementation for low-latency video frame processing.
- Ensemble Merging Logic: A custom algorithm to deduplicate and prioritize detections from multiple concurrent models.
- Web Interface: A professional UI with model-specific confidence visualization and annotated results.
Key Features and Capabilities
1. Multi-Model Detection
- Description: Users can select between YOLOv8 (high speed), YOLOv7 (legacy performance), or OWL-ViT (open-world localization).
- How it works: The backend selects the appropriate inference pipeline based on user input from the dropdown menu.
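A minimal sketch of how such a dropdown-driven selection could look, assuming the form submits a short model key. The keys, the `Detection` shape, and the stub functions are hypothetical stand-ins for the project's real inference pipelines:

```python
from typing import Callable, Dict, List, Tuple

# Assumed detection shape: (label, confidence, (x1, y1, x2, y2)).
Detection = Tuple[str, float, Tuple[float, float, float, float]]
Detector = Callable[[bytes], List[Detection]]


def run_yolov8(image: bytes) -> List[Detection]:
    """Hypothetical stub standing in for the YOLOv8 pipeline."""
    return [("person", 0.91, (10.0, 20.0, 110.0, 220.0))]


def run_yolov7(image: bytes) -> List[Detection]:
    """Hypothetical stub standing in for the YOLOv7 pipeline."""
    return [("person", 0.88, (12.0, 22.0, 108.0, 218.0))]


def run_owlvit(image: bytes) -> List[Detection]:
    """Hypothetical stub standing in for the OWL-ViT pipeline."""
    return [("icecream", 0.42, (200.0, 50.0, 260.0, 120.0))]


# Registry keyed by the value submitted from the dropdown (keys are assumed).
PIPELINES: Dict[str, Detector] = {
    "yolov8": run_yolov8,
    "yolov7": run_yolov7,
    "owlvit": run_owlvit,
}


def detect(model_name: str, image: bytes) -> List[Detection]:
    """Look up the requested pipeline and run it, rejecting unknown keys."""
    try:
        pipeline = PIPELINES[model_name]
    except KeyError:
        raise ValueError(f"Unknown model: {model_name!r}") from None
    return pipeline(image)
```

A registry like this keeps the request handler free of if/else chains, and adding a model becomes a single dictionary entry.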
2. "Combined" Ensemble Mode
- Description: Merges detections from YOLOv7 and OWL-ViT.
- How it works: Uses a shared IoU (Intersection over Union) threshold to detect overlapping boxes. If a match is found, it prioritizes the detection with higher confidence.
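The merging step described for the Combined mode can be sketched as follows. This is an illustrative version, not the project's actual code: the 0.5 threshold, the `Detection` tuple shape, and the function names are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[str, float, Box]       # (label, confidence, box); assumed shape


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_detections(
    first: List[Detection],
    second: List[Detection],
    iou_threshold: float = 0.5,  # assumed value; the source does not state it
) -> List[Detection]:
    """Merge two detection lists, keeping the more confident of overlapping boxes."""
    merged = list(first)
    for candidate in second:
        best_index, best_overlap = None, iou_threshold
        # Find the already-kept box that overlaps the candidate the most.
        for i, kept in enumerate(merged):
            overlap = iou(kept[2], candidate[2])
            if overlap >= best_overlap:
                best_index, best_overlap = i, overlap
        if best_index is None:
            merged.append(candidate)        # no match: a new object
        elif candidate[1] > merged[best_index][1]:
            merged[best_index] = candidate  # match: higher confidence wins
    return merged
```

In this sketch a YOLOv7 box and an OWL-ViT box that cover the same object collapse into one entry, while a prompt-only detection such as "icecream" is carried through untouched.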
3. Real-Time Video Analysis
- Description: Processes webcam frames in real time.
- How it works: Client-side JS captures frames from getUserMedia, sends them via WebSockets to the server, and the server returns annotated frames for display on an HTML5 canvas.
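The server side of this round trip might look like the sketch below. It assumes the client sends each frame as a JPEG data URL (for example the output of a canvas `toDataURL` call), which the source does not specify; the helper names and the `annotate` callback are hypothetical.

```python
import base64
from typing import Callable

DATA_URL_PREFIX = "data:image/jpeg;base64,"  # assumed frame format


def decode_frame(data_url: str) -> bytes:
    """Drop the data-URL header, if present, and return the raw image bytes."""
    _, _, payload = data_url.rpartition(",")
    return base64.b64decode(payload)


def encode_frame(jpeg_bytes: bytes) -> str:
    """Wrap raw image bytes in a data URL the browser can draw directly."""
    return DATA_URL_PREFIX + base64.b64encode(jpeg_bytes).decode("ascii")


def handle_frame(data_url: str, annotate: Callable[[bytes], bytes]) -> str:
    """Decode one incoming frame, run the annotator, and re-encode the result.

    `annotate` stands in for model inference plus box drawing.
    """
    return encode_frame(annotate(decode_frame(data_url)))
```

In a Flask-SocketIO application, `handle_frame` would be called from inside an event handler and its return value emitted back to the client; the event names are left out here because the source does not list them.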
4. Zero-Shot Localization
- Description: Detects custom objects outside the fixed set of classes the YOLO models were trained on.
- How it works: Uses OWL-ViT with custom text prompts defined in constants.py (e.g., "bookshelf with books", "icecream").
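With the Hugging Face OWL-ViT post-processing, each predicted label is an index into the list of text queries, so the prompt list doubles as the class table. The sketch below shows that mapping step only. `TEXT_PROMPTS`, `name_detections`, and the 0.1 score cutoff are illustrative assumptions, not the contents of the project's constants.py:

```python
from typing import List, Sequence, Tuple

# Hypothetical prompt list, using the two examples mentioned above.
TEXT_PROMPTS: List[str] = ["bookshelf with books", "icecream"]

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def name_detections(
    boxes: Sequence[Sequence[float]],
    scores: Sequence[float],
    label_ids: Sequence[int],
    prompts: Sequence[str] = TEXT_PROMPTS,
    min_score: float = 0.1,  # assumed cutoff
) -> List[Tuple[str, float, Box]]:
    """Turn (box, score, label index) triples into named detections.

    Each label index points at the prompt that produced the match, so
    changing the prompt list changes what the detector can find.
    """
    named = []
    for box, score, idx in zip(boxes, scores, label_ids):
        if score >= min_score:
            named.append((prompts[idx], float(score), tuple(box)))
    return named
```

Because the vocabulary is just a list of strings, supporting a new object type means editing the prompts rather than retraining a model.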
Technical Architecture
- Major Components: Flask Web Server, SocketIO Engine, PyTorch Inference Engines.
- Frontend/Backend Structure: Client-side JavaScript (state management and rendering) connected to a Python/Flask backend via REST (upload) and WebSockets (real-time).
- Data Flow:
  - Image Upload -> Flask API -> Model Inference -> OpenCV Annotation -> Base64 Response.
  - Webcam Frame -> SocketIO -> Flask -> model_v8 -> Annotated Frame -> SocketIO Emit.
- Model Components:
  - YOLOv8: Ultralytics implementation (yolov8n.pt).
  - YOLOv7: Loaded via torch.hub from WongKinYiu/yolov7.
  - OWL-ViT: Hugging Face Transformers (google/owlvit-base-patch32).
- Storage Layers: Local filesystem for model weights; temporary in-memory processing for video frames.
Technical Decisions and Tradeoffs
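- WebSocket vs REST: Chose WebSockets for real-time video so that every frame travels over one persistent connection, avoiding the overhead of a separate HTTP request (headers and possible connection setup) for each frame.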
- Ensemble vs Single Model: Opted for an ensemble approach to cover both common objects (YOLO) and specific, niche objects (OWL-ViT) at the cost of higher memory/CPU usage.
- Base64 Transmission: Transmitted images as base64 strings to keep the implementation simple and easier to debug than raw binary streams, at the cost of roughly 33% more bandwidth.
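The bandwidth cost of the base64 choice is easy to quantify: base64 emits 4 output characters for every 3 input bytes. A small sketch (the helper name is hypothetical) that measures the ratio directly:

```python
import base64


def base64_overhead(num_bytes: int) -> float:
    """Return encoded size divided by raw size for a payload of num_bytes."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    raw = bytes(num_bytes)           # zero-filled payload; content does not affect size
    encoded = base64.b64encode(raw)  # 4 output bytes per 3 input bytes, padded
    return len(encoded) / len(raw)
```

For a 30,000-byte frame this gives a ratio of 4/3, so the encoded payload is about one third larger than the binary it carries. Switching to binary WebSocket messages would remove that overhead if bandwidth became a bottleneck.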
Outcomes and Value
- Feature Delivered: Built a fully functional end-to-end multi-model object detection web app.
- Prototype Completed: Reached a professional, demo-ready MVP stage.
- Measurable Outcome: Successfully integrated real-time processing with ensemble-based detection.

