Jetson Nano 4GB for AI

Using a Jetson Nano for my pet-monitoring setup

2025.06.06 ~


Hardware Composition

  • Jetson Nano 4GB Dev Kit : used for accelerated inference (CUDA, TensorRT)
  • Arduino Nano : used to control the servo motors (PWM generation)
  • Raspberry Pi Camera v2 : CSI camera connected directly to the Jetson Nano

Jetson - Arduino : USB serial connection / Arduino : servos on pins D9 and D10
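A minimal sketch of what the Arduino side might look like. The D9/D10 pins and the 'w'/'a'/'s'/'d' + ACK protocol come from this write-up; the 115200 baud rate, the one-degree step per command, and the 'K' ACK byte are assumptions:

    #include <Servo.h>

    Servo panServo;   // signal pin D9
    Servo tiltServo;  // signal pin D10
    int panAngle = 90, tiltAngle = 90;  // start centered

    void setup() {
      Serial.begin(115200);             // baud rate assumed
      panServo.attach(9);
      tiltServo.attach(10);
      panServo.write(panAngle);
      tiltServo.write(tiltAngle);
    }

    void loop() {
      if (Serial.available() > 0) {
        char cmd = Serial.read();
        switch (cmd) {
          case 'a': panAngle  = constrain(panAngle  - 1, 0, 180); break;
          case 'd': panAngle  = constrain(panAngle  + 1, 0, 180); break;
          case 'w': tiltAngle = constrain(tiltAngle - 1, 0, 180); break;
          case 's': tiltAngle = constrain(tiltAngle + 1, 0, 180); break;
          default:  return;  // ignore unknown bytes
        }
        panServo.write(panAngle);
        tiltServo.write(tiltAngle);
        Serial.write('K');  // ACK byte for synchronous mode (value assumed)
      }
    }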


Overview

A real-time object-tracking system implemented on the Jetson Nano 4GB, using a CenterNet model optimized with TensorRT for GPU acceleration. The system performs object detection at 1280x720 resolution and controls 2-axis servo motors via an Arduino to reposition the camera.

Core Architecture:

  • Multi-threaded pipeline: Video inference, center tracking, display, and control threads running concurrently
  • GPU-accelerated processing: CUDA kernels for preprocessing and postprocessing eliminate CPU-GPU memory transfers
  • PID servo control: Adaptive dead zone algorithm with object size-based stabilization
  • Double-buffered display: 30fps real-time visualization with video recording capability

Technical Stack:

  • Detection Model: CenterNet with 128x128 feature map output
  • Optimization: TensorRT FP16 inference on Maxwell GPU (sm_53)
  • Communication: USB serial protocol with ACK-based synchronization
  • Performance: GPU-only inference pipeline with atomic operations for thread safety

Key Features:

  • Adaptive step control: Variable servo movement (1-5 steps) based on PID output magnitude
  • Frame dropping strategy: Non-blocking display updates maintain real-time performance
  • Object-aware dead zone: Dynamic stabilization based on tracked object dimensions
  • Synchronous/asynchronous modes: Configurable operation for different performance requirements

Detection Model

CenterNet with a MobileNetV2 0.5x backbone


Code Explanation

global.h

global.h centralizes global variable declarations such as g_running, g_object_mutex, and g_threshold for use across the entire pipeline (video inference, center tracking, post-processing, and display). A sketch follows the key-features list below.

It provides thread synchronization through mutexes and atomic variables, ensuring safe access to shared data between multiple threads.

Key features:

  • ODR compliance: Uses extern declarations with definitions in main.cpp
  • Thread safety: Mutexes and atomic types for concurrent access
  • Modular communication: Single interface for inter-module data sharing
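A plausible shape for the header, based on the names used in this write-up; the exact member set and types are assumptions:

    // global.h : declarations only; definitions live in main.cpp (ODR)
    #pragma once
    #include <atomic>
    #include <mutex>
    #include "types.h"

    extern std::atomic<bool>  g_running;        // graceful-shutdown flag
    extern std::atomic<bool>  g_sync_mode;      // ACK-synchronous servo mode
    extern std::atomic<float> g_threshold;      // detection confidence gate
    extern std::mutex         g_object_mutex;   // guards g_tracked_object
    extern TrackedObject      g_tracked_object; // latest detection result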

types.h

types.h defines the globally shared structs TrackedObject and BBox.

Keeping both definitions in this one small header makes it easy to respect the ODR (One Definition Rule).
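A hypothetical version of the two structs, with field names inferred from how they are used elsewhere in this write-up:

    #pragma once
    #include <cstdint>

    struct BBox {
        float x, y, w, h;   // top-left corner and size, in image pixels
        float confidence;   // post-sigmoid heatmap peak score
        int   class_id;
    };

    struct TrackedObject {
        BBox    box;
        float   center_x, center_y;  // target for the tracking thread
        int64_t timestamp_ms;        // used for duplicate-update prevention
        bool    valid;
    };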


center_tracking.cpp

trackingThread() updates the servos on a fixed interval (10 ms); an update is issued only when the tracked object's timestamp has changed. (Concurrent access to the tracked object is guarded by a mutex shared with video inference.)

Defining Classes - PIDController / ServoController / AdaptiveDeadZone / CenterTracker

PIDController : Only the P and I terms are used (kp = 0.4f, ki = 0.002f); the derivative term means little when the tracking motion itself changes the frame, which makes real dt measurement unreliable. For integral windup protection, the integral is clamped to [-100.0f, 100.0f].
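A minimal PI-controller sketch using the gains and clamp quoted above; the class interface itself is an assumption:

    class PIDController {
    public:
        PIDController(float kp = 0.4f, float ki = 0.002f)
            : kp_(kp), ki_(ki), integral_(0.0f) {}

        float update(float error) {
            integral_ += error;
            // Integral windup protection: clamp to [-100, 100].
            if (integral_ >  100.0f) integral_ =  100.0f;
            if (integral_ < -100.0f) integral_ = -100.0f;
            return kp_ * error + ki_ * integral_;   // no derivative term
        }

        void reset() { integral_ = 0.0f; }  // after 3 s without a detection

    private:
        float kp_, ki_, integral_;
    };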

ServoController : Sends single-character commands over the serial link. For synchronous operation it offers isWaitingForAck(), though this mode does not yet work perfectly.
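A rough sketch of the host-side serial path, assuming a POSIX tty at /dev/ttyUSB0 and a single 'K' ACK byte (both assumptions; termios baud-rate and raw-mode setup is omitted):

    #include <fcntl.h>
    #include <termios.h>
    #include <unistd.h>

    class ServoController {
    public:
        bool openPort(const char* dev = "/dev/ttyUSB0") {
            fd_ = open(dev, O_RDWR | O_NOCTTY | O_NONBLOCK);
            return fd_ >= 0;  // termios config omitted for brevity
        }
        void sendCommand(char c) {
            if (fd_ >= 0 && write(fd_, &c, 1) == 1) waiting_ack_ = true;
        }
        bool isWaitingForAck() {
            char ack;
            if (waiting_ack_ && read(fd_, &ack, 1) == 1 && ack == 'K')
                waiting_ack_ = false;  // ACK consumed
            return waiting_ack_;
        }
    private:
        int  fd_ = -1;
        bool waiting_ack_ = false;
    };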

AdaptiveDeadZone : The key stabilization algorithm; it absorbs the blur caused by inference latency while the frame is in motion. The dead zone is sized from the larger of the tracked object's width and height, with its aspect ratio matched to the frame's (currently 1280:720).
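One way that sizing rule could be realized; the exact formula in the repo may differ:

    #include <algorithm>
    #include <cmath>

    struct AdaptiveDeadZone {
        float half_w = 0.0f, half_h = 0.0f;

        // Size the zone from the larger of the object's dimensions and
        // derive the other axis from the frame's 1280:720 aspect ratio.
        void update(float obj_w, float obj_h) {
            half_w = std::max(obj_w, obj_h) * 0.5f;
            half_h = half_w * (720.0f / 1280.0f);
        }

        // True while the center error is small enough that moving the
        // servos would only chase inference-latency blur.
        bool inside(float err_x, float err_y) const {
            return std::fabs(err_x) <= half_w && std::fabs(err_y) <= half_h;
        }
    };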

CenterTracker : Based on the current TrackedObject, it sends serial commands ('w', 'a', 's', 'd') to the Arduino over USB. Features include adaptive step control, where the step count (1-5) is calculated as min(5, max(1, abs(pid_output)/8.0 + 1)); timestamp-based duplicate prevention; confidence gating below the global threshold; and a PID reset after 3 seconds without a valid detection. The system operates in synchronous mode (waits for ACK) or asynchronous mode (continuous processing) depending on g_sync_mode.
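The step rule quoted above, written out as code (the helper name is hypothetical):

    #include <algorithm>
    #include <cmath>

    // steps = min(5, max(1, |pid_output| / 8.0 + 1))
    int stepsFromPid(float pid_output) {
        int steps = static_cast<int>(std::fabs(pid_output) / 8.0f + 1.0f);
        return std::min(5, std::max(1, steps));
    }
    // e.g. |pid_output| = 20  ->  20/8 + 1 = 3.5  ->  3 steps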


display.cpp

It operates using double buffering: one buffer serves as the front buffer for display output while the other serves as the back buffer, receiving new frame data from the inference pipeline (video_inference.cpp).

The updateDisplay() function writes new frames to the back buffer without blocking, while displayThread() reads from the front buffer at 30 fps and swaps the buffers atomically when new data is ready. When the back buffer is busy, incoming frames are dropped immediately to keep the main inference pipeline from stalling.
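A sketch of that handoff under the stated behavior; beyond the updateDisplay()/displayThread() names above, the class shape and member names are assumptions:

    #include <atomic>
    #include <mutex>
    #include <opencv2/core.hpp>

    class DoubleBuffer {
    public:
        // Writer side (inference pipeline): never blocks on the display.
        bool updateDisplay(const cv::Mat& frame) {
            std::unique_lock<std::mutex> lock(back_mutex_, std::try_to_lock);
            if (!lock.owns_lock()) return false;  // back buffer busy: drop frame
            frame.copyTo(back_);
            fresh_.store(true, std::memory_order_release);
            return true;
        }

        // Reader side (displayThread at ~30 fps): swap only when fresh.
        bool frontFrame(cv::Mat& out) {
            if (fresh_.exchange(false, std::memory_order_acquire)) {
                std::lock_guard<std::mutex> lock(back_mutex_);
                std::swap(front_, back_);  // O(1) header swap
            }
            if (front_.empty()) return false;
            front_.copyTo(out);  // copy out to avoid aliasing the next swap
            return true;
        }

    private:
        cv::Mat front_, back_;
        std::mutex back_mutex_;
        std::atomic<bool> fresh_{false};
    };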


main.cpp

Main Thread Flow: Argument validation → Signal handler registration → Component initialization → Thread launching → Video inference execution → Cleanup

Parallel Threads:

  • Control Thread: Handles user input (‘t’, ‘r’, ‘q’) for tracking/recording control

  • Tracking Thread: Executes trackingThread() for servo control coordination

  • Display Thread: Manages OpenCV display and video recording via startDisplayThread()

Synchronization Points: Global variables (g_running, g_sync_mode) coordinate thread communication and graceful shutdown across all components
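A rough outline of that flow; trackingThread() and startDisplayThread() are named in this write-up, while controlThread() and runVideoInference() are placeholder names:

    #include <csignal>
    #include <thread>
    #include "global.h"

    void controlThread();       // placeholder declarations; bodies live in
    void trackingThread();      // the modules described above
    void startDisplayThread();
    void runVideoInference();

    int main(int argc, char** argv) {
        // (argument validation omitted)
        std::signal(SIGINT, [](int) { g_running = false; });  // graceful shutdown

        std::thread control(controlThread);      // 't' / 'r' / 'q' key handling
        std::thread tracking(trackingThread);    // servo coordination
        std::thread display(startDisplayThread); // OpenCV display + recording

        runVideoInference();                     // blocks the main thread

        g_running = false;                       // let workers exit, then join
        control.join();
        tracking.join();
        display.join();
        return 0;
    }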


video_inference.cpp

CPU Frame → GPU Upload → GPU Resize/Normalize → GPU Inference → GPU Postprocess → CPU BBox → CPU Display

Key Features

  • TensorRT engine load, or a fresh engine build when no cached engine exists (sketched below)
  • Mutex-guarded updates of tracked_object
  • TensorRT input/output buffer binding
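A hedged sketch of the load-or-build step in generic TensorRT 8.x terms, not this repo's exact code; the runtime comes from nvinfer1::createInferRuntime(logger), and when nullptr is returned the caller would parse the ONNX model, enable the kFP16 builder flag for the Maxwell GPU, and cache the serialized engine:

    #include <NvInfer.h>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // Returns nullptr when no cached engine file exists; the caller then
    // builds from the model and serializes the result to engine_path.
    nvinfer1::ICudaEngine* loadOrBuildEngine(nvinfer1::IRuntime& runtime,
                                             const std::string& engine_path) {
        std::ifstream f(engine_path, std::ios::binary);
        if (!f) return nullptr;
        std::vector<char> blob((std::istreambuf_iterator<char>(f)),
                               std::istreambuf_iterator<char>());
        return runtime.deserializeCudaEngine(blob.data(), blob.size());
    }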

cuda_kernels.cu

It contains the two main CUDA kernels of the CenterNet inference pipeline, both operating directly on TensorRT buffers: normalizeAndReorder for preprocessing (ImageNet normalization + HWC→CHW conversion) and postProcessKernel for CenterNet-specific postprocessing (heatmap→bounding boxes with 3x3 NMS).
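A plausible form of normalizeAndReorder; the ImageNet mean/std constants are standard, while the signature and the assumption of 8-bit interleaved RGB input are mine:

    __global__ void normalizeAndReorder(const unsigned char* __restrict__ hwc,
                                        float* __restrict__ chw,
                                        int width, int height) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per pixel
        int n = width * height;
        if (idx >= n) return;

        const float mean[3]   = {0.485f, 0.456f, 0.406f};  // ImageNet stats
        const float stddev[3] = {0.229f, 0.224f, 0.225f};

        for (int c = 0; c < 3; ++c) {
            float v = hwc[idx * 3 + c] / 255.0f;           // HWC, interleaved
            chw[c * n + idx] = (v - mean[c]) / stddev[c];  // CHW, planar
        }
    }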

Key Architecture Benefits

  • GPU-only pipeline: Eliminates CPU-GPU memory transfers during inference
  • Parallel efficiency: Each pixel processed simultaneously (total pixels for preprocessing, 128x128 feature map for postprocessing)
  • Memory optimization: Direct TensorRT buffer integration with zero-copy operations
  • CenterNet optimizations: Local maximum detection, coordinate transformation, and thread-safe atomic operations
  • Real-time performance: Optimized for Jetson Nano’s Maxwell architecture (sm_53)

cuda_postprocess.cu

Contains CenterNet-specific post-processing kernel that converts TensorRT model outputs into bounding boxes. The postProcessKernel performs sigmoid activation on heatmap values, applies 3x3 local maximum suppression for peak detection, and transforms feature map coordinates through multiple scaling stages (feature→model→original image space) to generate final bounding boxes.
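A sketch of those steps with one thread per cell of the 128x128 heatmap; the real kernel signature, the single combined scale factor (the write-up describes multiple scaling stages), and the wh output layout are assumptions:

    #include "types.h"

    __global__ void postProcessKernel(const float* __restrict__ heatmap,
                                      const float* __restrict__ wh,
                                      BBox* out_boxes, int* box_count,
                                      float conf_thresh, float scale,
                                      int fmap_w, int fmap_h, int max_boxes) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= fmap_w || y >= fmap_h) return;

        // Sigmoid activation on the raw heatmap logit.
        float score = 1.0f / (1.0f + expf(-heatmap[y * fmap_w + x]));
        if (score < conf_thresh) return;

        // 3x3 local maximum suppression: keep only neighborhood peaks
        // (comparing raw logits is equivalent, since sigmoid is monotonic).
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int nx = x + dx, ny = y + dy;
                if (nx < 0 || ny < 0 || nx >= fmap_w || ny >= fmap_h) continue;
                if (heatmap[ny * fmap_w + nx] > heatmap[y * fmap_w + x]) return;
            }

        int i = atomicAdd(box_count, 1);  // thread-safe slot reservation
        if (i >= max_boxes) return;

        int cell = y * fmap_w + x;
        float w = wh[cell] * scale;                    // width head
        float h = wh[fmap_w * fmap_h + cell] * scale;  // height head
        out_boxes[i] = {x * scale - 0.5f * w, y * scale - 0.5f * h,
                        w, h, score, 0};
    }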

Key Architecture Benefits

  • CenterNet algorithm implementation: Heatmap peak detection with confidence thresholding
  • 3x3 Local NMS: Eliminates duplicate detections in neighborhood regions
  • Multi-stage coordinate transformation: Feature map (128x128) → Model input → Original image scaling
  • Thread-safe atomic operations: Uses atomicAdd() for concurrent bounding box storage
  • GPU-parallel processing: Each thread handles one pixel in 128x128 feature map simultaneously

cuda_postprocess.cpp

Contains GPUPostProcessor class implementation that serves as a C++ wrapper for CUDA post-processing operations. The class manages GPU memory allocation for bounding box storage, launches the postProcessKernel via launchPostProcessKernel(), and handles GPU-to-CPU memory transfers to return detection results as std::vector<BBox>.
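A plausible shape of the wrapper, with RAII over the two device buffers named above and a single device-to-host copy at the end; the launchPostProcessKernel() parameter list is assumed:

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <vector>
    #include "types.h"

    // Defined in cuda_postprocess.cu; this parameter list is assumed.
    void launchPostProcessKernel(const float* d_heatmap, const float* d_wh,
                                 BBox* d_boxes, int* d_count,
                                 float conf_thresh);

    class GPUPostProcessor {
    public:
        static constexpr int MAX_BOXES = 100;

        GPUPostProcessor() {                 // RAII: allocate in constructor
            cudaMalloc(&d_output_boxes, MAX_BOXES * sizeof(BBox));
            cudaMalloc(&d_box_count, sizeof(int));
        }
        ~GPUPostProcessor() {                // ...free in destructor
            cudaFree(d_output_boxes);
            cudaFree(d_box_count);
        }

        std::vector<BBox> process(const float* d_heatmap, const float* d_wh,
                                  float conf_thresh) {
            cudaMemset(d_box_count, 0, sizeof(int));
            launchPostProcessKernel(d_heatmap, d_wh, d_output_boxes,
                                    d_box_count, conf_thresh);
            int count = 0;
            cudaMemcpy(&count, d_box_count, sizeof(int),
                       cudaMemcpyDeviceToHost);
            count = std::min(count, MAX_BOXES);
            std::vector<BBox> boxes(count);
            cudaMemcpy(boxes.data(), d_output_boxes, count * sizeof(BBox),
                       cudaMemcpyDeviceToHost);
            return boxes;  // the only transfer: results stay on GPU until here
        }

    private:
        BBox* d_output_boxes = nullptr;
        int*  d_box_count = nullptr;
    };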

Key Architecture Benefits

  • RAII memory management: Automatic GPU memory allocation in constructor, cleanup in destructor
  • GPU-CPU bridge: Seamless integration between CUDA kernels and C++ STL containers
  • Memory optimization: Pre-allocated GPU buffers (d_output_boxes, d_box_count) with MAX_BOXES=100 limit
  • Zero-copy on GPU: All post-processing operations remain on GPU until final result transfer
  • Thread-safe design: Uses atomic operations from CUDA kernel for concurrent bounding box storage

Dev Log

Jetson Nano Development Log


Photos