Real time pedestrian and objects detection using enhanced YOLO integrated with learning complexity-aware cascades



INTRODUCTION
Real-time pedestrian detection using computer vision is crucial for safety in autonomous vehicles, surveillance, robotics, and automation. While you only look once (YOLO) models, particularly YOLOv3, have shown promise, challenges such as occlusion and false positives persist. Adaptations such as complexity-aware cascades and enhanced YOLOv2 aim to overcome these hurdles. Additionally, studies propose combining YOLOv3 with cascaded region-based convolutional neural networks (R-CNN) and employing Kalman filters for real-time object tracking. Recent algorithms such as YOLOv3-tiny, YOLOv4, and YOLOv4-tiny have found applications in pedestrian, vehicle, and obstacle detection [1]. This paper contributes an enhanced YOLO iteration that focuses on small object identification and false positive reduction.


Real time pedestrian and objects detection using enhanced YOLO integrated with … (Ahmed Lateef Khalaf) 363
Further research explores YOLO deployment for real-time object detection and tracking, as well as the synergy of YOLO with Kalman filters in low-light conditions [2]. While Liu's work emphasizes YOLOv3 for pedestrian recognition in smart urban settings [3], the broader field of deep learning-based intelligent transportation systems primarily centers on vehicle detection. An emerging breakthrough is the fully convolutional one-stage (FCOS) 3D object identification method [4]. Figure 1 illustrates bounding boxes incorporating predictive information about dimensions and spatial positioning. Researchers, exemplified by [5], have enhanced the YOLO model by introducing complexity-aware cascades for efficient real-time object and pedestrian recognition. Their study compares the proposed model with leading techniques such as Faster R-CNN and RetinaNet, showing competitive performance across publicly available datasets and emphasizing the importance of object content comparison in model development and evaluation [6]. In computer vision, the concept of "pedestrian well-exposure" is crucial for the clarity of pedestrian recognition systems. Researchers, as demonstrated in [7], have introduced an algorithm that assesses pedestrians' well-exposure in images using a deep neural network. This approach improves pedestrian recognition precision by predicting exposure levels for each pixel, as shown in Figure 2.
Thorough testing on diverse datasets demonstrates the superiority of their system over existing techniques. Additionally, log-normalized heat maps aid in visualizing these aspects. This research contributes insights into optimizing real-time pedestrian and object detection through effective exposure assessment. A comprehensive study [8] introduces a multifaceted framework for pedestrian recognition that incorporates tasks such as evaluating well-exposure with deep learning. Considering environmental factors, especially "pedestrian well-exposure," is crucial when developing computer vision algorithms for pedestrian identification, because it improves detection precision in real-world conditions. Addressing the challenge of pedestrian recognition also involves analyzing the statistical attributes of datasets, exemplified by the prevalence of smaller objects in the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset [9]. Figure 3 illustrates a heatmap indicating the clustering tendency of child pedestrians, emphasizing the need for specific attention to small object detection, especially in central image regions, for accurate identification of compact pedestrians [10].

RELATED WORK
Numerous efforts in pedestrian detection involve manually crafted features, such as the "histogram of oriented gradient for person detection" descriptor [11], strengthened by integral channel features (ICF). The aggregate channel features (ACF) technique extends Haar features, histograms, and local sums [12], demonstrating efficacy in real-time facial recognition when fused with AdaBoost classifiers. The Fast R-CNN method employs region proposals for object detection [13]. The KITTI vision benchmark suite is widely used to evaluate diverse object detection algorithms, including YOLO. The CityPersons dataset [14] focuses on pedestrians in urban environments, inspiring innovations such as the scale-aware Fast R-CNN. In contemporary object detection, deep convolutional neural networks (CNNs) are prominent, addressing objects of varying sizes and aspect ratios [15]. Scene-adaptive people recognition systems, trained with target data [16], are remarkably effective in pedestrian detection. Although CNN methodologies dominate, manually constructed feature-based algorithms such as the histogram of oriented gradients (HOG) remain useful. Strategies such as fast feature pyramids and ICF enhance model performance on the INRIA pedestrian dataset [17]. For the PASCAL VOC dataset, [18] suggests adopting scale-invariant CNNs to address objects with varying sizes and aspect ratios. Despite the strides made by deep CNNs, HOG and ICF continue to play a role in object detection. Recent breakthroughs and datasets such as CityPersons and KITTI have propelled object detection. Noteworthy is the rise of single-stage detectors prioritizing speed, exemplified by SSD [19]. YOLOv2 [20], an enhanced version, incorporates anchor boxes and K-means clustering to improve model training. EfficientDet proposes a compound scaling strategy for improved object detection performance, while RetinaNet introduces a focal loss function to mitigate class imbalance [21].
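The focal loss mentioned above can be illustrated with a minimal sketch. It assumes the binary form of the loss for a single prediction; the function name `focal_loss` and the default values `alpha=0.25` and `gamma=2.0` (the settings commonly cited for RetinaNet) are assumptions for illustration and are not taken from this paper.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction (illustrative sketch).

    p: predicted probability of the positive class, in [0, 1]
    y: ground-truth label, 1 (object) or 0 (background)
    """
    # p_t is the probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0) for extremely confident wrong predictions.
    p_t = min(max(p_t, 1e-12), 1.0)
    # The (1 - p_t) ** gamma factor shrinks the loss of easy examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    print("easy positive:", focal_loss(0.9, 1))
    print("hard positive:", focal_loss(0.1, 1))
```

With `gamma=0` the expression reduces to an alpha-weighted cross-entropy, which is why the modulating factor is the part that addresses the imbalance between abundant easy background examples and rare hard ones.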

METHOD
The common objects in context (COCO) dataset [22] is a widely used benchmark for object detection, segmentation, and captioning tasks. Recognized as one of the most expansive and diverse datasets, COCO contains more than 330,000 images, as illustrated in Figure 4, and more than 2.5 million annotated instances across 80 object categories. Its popularity in computer vision research is owed to its challenging nature: it often presents intricate scenes and instances of occlusion that test the limits of algorithms.
The Caltech pedestrian dataset [23] is a standard reference for evaluating pedestrian detection methods. It spans roughly ten hours of video captured at 640×480 resolution and 30 frames per second, annotated with bounding boxes for the individuals present in each scene. Comprising approximately 250,000 frames with about 350,000 annotated bounding boxes, its appeal arises from the intricate occlusion scenarios it covers and its realistic portrayal of pedestrian-congested environments.
A collaboration between the Karlsruhe Institute of Technology and the Toyota Technological Institute, the KITTI dataset has emerged as a pioneering benchmark in autonomous driving research. Notably, it incorporates a dedicated segment focused on pedestrian detection. With a collection surpassing 15,000 images annotated with individuals and 7,481 images capturing urban landscapes from a moving vehicle, its depiction of pedestrian scenarios amid natural barriers is widely regarded as among the most authentic and valuable resources available.
In our investigation of pedestrian detection methods, we initially used a pre-trained HOG detector from the OpenCV library, known for its efficiency [24]. The detector showed promising results on still images but exhibited limitations in dynamic video scenes. To assess its performance in a video context, we extracted frames, applied the HOG detector to each frame, and created a composite image to visualize detected pedestrians over time. Despite encouraging image-based results, the method struggled with video-based detection, particularly in real-time scenarios and dynamic contexts, revealing the need for more advanced object detection techniques. Recognizing these challenges is crucial for applications such as autonomous vehicles, surveillance, and robotics, where accurate identification is paramount in complex real-world scenarios. The goal was to leverage recent advances in object detection technology to overcome the limitations of the HOG-based method, especially in scenarios involving moving objects, occlusions, and varying lighting conditions [25].
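The frame-by-frame HOG baseline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses OpenCV's default people detector, the function names, the `step` parameter, and the `detectMultiScale` settings are hypothetical choices, and the video path is supplied by the caller. OpenCV is imported inside the function so that the small box-conversion helper can be used without it.

```python
def to_corner_boxes(rects):
    """Convert (x, y, w, h) rectangles to (x1, y1, x2, y2) corner boxes."""
    return [(int(x), int(y), int(x + w), int(y + h)) for x, y, w, h in rects]


def detect_pedestrians_in_video(video_path, step=1):
    """Run OpenCV's pre-trained HOG people detector on every `step`-th frame.

    Returns a dict mapping frame index to a list of corner boxes.
    """
    import cv2  # lazy import: only needed when a video is processed

    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    capture = cv2.VideoCapture(video_path)
    detections = {}
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of the video or read failure
            break
        if index % step == 0:
            # Window stride, padding and scale are illustrative settings.
            rects, _weights = hog.detectMultiScale(
                frame, winStride=(8, 8), padding=(8, 8), scale=1.05
            )
            detections[index] = to_corner_boxes(rects)
        index += 1
    capture.release()
    return detections
```

Because each frame is processed independently, such a baseline has no temporal consistency, which is one plausible reason for the weaker behaviour in dynamic video scenes noted above.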
The YOLO network originated in the Darknet framework but was adapted for use in Google Colab by moving to a Python-based neural network framework. This involved porting Darknet to TensorFlow, resulting in a customized version called Darkflow, as shown in Figure 5. The YOLOv2 network architecture was based on a chosen design, and its weights were initialized from the original authors' pretrained weights, trained on the COCO dataset, a comprehensive benchmark with a diverse array of object categories.
Pretrained weights gave the model a learning head start, improving its performance for this study. The output layer was refined to predict a single class (individuals), and the number of filters in the second-to-last convolutional layer was adjusted accordingly to improve predictions. The COCO dataset was used for training, which involved splitting videos into frames and yielded a subset of 75,057 frames featuring pedestrians. Annotation data was converted to the Darkflow-compatible XML format, and the dataset was split into a training set (67,552 frames) and a test set (7,505 frames). The SORT algorithm was implemented for real-time tracking, generating output files for video assembly.
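The two bookkeeping steps above, sizing the last convolutional layer for one class and splitting the frames, can be sketched as follows. The filter formula `num_anchors * (num_classes + 5)` with five anchors is the usual YOLOv2 convention and is assumed here; the 90/10 ratio is inferred from the reported counts, and the shuffling with a fixed seed is a hypothetical choice, since the text does not say how frames were assigned.

```python
import random


def yolov2_output_filters(num_classes, num_anchors=5):
    """Filters in the final YOLOv2 convolution.

    Each anchor predicts 4 box offsets, 1 objectness score and one score
    per class, hence num_anchors * (num_classes + 5).
    """
    return num_anchors * (num_classes + 5)


def split_frames(frame_ids, test_fraction=0.1, seed=0):
    """Shuffle frame identifiers and split them into (train, test) lists."""
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)  # assumption: random assignment
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


if __name__ == "__main__":
    print("filters for one class:", yolov2_output_filters(1))
    train, test = split_frames(range(75057))
    print(len(train), "training frames,", len(test), "test frames")
```

For consecutive video frames, a split by whole sequences rather than by individual frames would reduce leakage between training and test sets; the per-frame split here only reproduces the reported counts.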
The SA YOLOv4 system divides the input image into inner and outer halves for nearby and distant pedestrian detection. Both halves undergo identical convolutional layer processing, generating feature maps. Two compact neural networks are trained to discern pedestrians at various distances. Non-maximum suppression (NMS) merges the sub-network outcomes, assigning confidence scores and bounding boxes to each recognized pedestrian. SA YOLOv4 harmonizes the outputs, achieving comprehensive pedestrian detection across different scales.
In the YOLOv4 framework, class prediction during training is optimized with the binary cross-entropy loss. The CSPDarknet-53 architecture replaces earlier backbones, improving efficiency by eliminating redundant processing. A spatial attention module (SAM) captures spatial interdependencies for more precise object detection. YOLOv4 also combines bag of freebies (BOF) and bag of specials (BOS) strategies, which collectively improve detection precision and effectiveness.
The SA YOLOv4 architecture, an enhanced iteration of the YOLOv4 object detection model, was designed to identify individuals of varying statures on urban streets. Guided by the scene's geometry, the input image is first segmented into three distinct sections, with particular emphasis on pedestrians within the central portion depicted in Figure 6. Subsequently, both the entire image and its focal region are processed by the network. Through stacked input convolution, the network generates feature maps. These feature maps are then fed into two distinct networks, each specializing in pedestrians of a different scale (larger or smaller).
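The idea of processing a central region alongside the full image can be sketched with simple coordinate bookkeeping. The sketch assumes the three sections are equal-width vertical strips and that the central strip may be upscaled before detection; neither detail is stated in the text, and the function names and the 1242×375 example size are hypothetical.

```python
def central_region(width, height):
    """Return (x1, y1, x2, y2) of the middle of three vertical strips."""
    x1 = width // 3
    x2 = width - width // 3
    return (x1, 0, x2, height)


def to_full_image(box, region, scale=1.0):
    """Map a box detected inside the (optionally upscaled) central crop
    back to full-image coordinates.

    box:    (x1, y1, x2, y2) in crop coordinates
    region: crop rectangle returned by central_region
    scale:  factor by which the crop was enlarged before detection
    """
    rx1, ry1, _, _ = region
    x1, y1, x2, y2 = box
    return (x1 / scale + rx1, y1 / scale + ry1,
            x2 / scale + rx1, y2 / scale + ry1)


if __name__ == "__main__":
    region = central_region(1242, 375)  # example image size (assumption)
    print("central region:", region)
    print("mapped box:", to_full_image((10, 20, 30, 40), region, scale=2.0))
```

Mapping the small-scale detections back into the full-image frame is what allows the outputs of the two branches to be compared and merged in a single coordinate system.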
The network processes both the entire image and its central portion, creating feature maps through convolutional layers. It then splits into two sub-networks for pedestrian recognition at different scales. Each sub-network extracts scale-specific attributes through convolutional layers, and the resulting feature vector passes through fully connected layers that output classification scores and bounding box coordinates. NMS merges the outcomes into a unified detection result. The YOLO model, with cross-view computational models, shows promise in early skin disorder detection and crack resistance prediction. Its application in clinical studies benefits from combining deep learning with imaging technologies. However, domain complexity limits its full utilization. Hybrid deep learning algorithms offer increased predictive accuracy for image edge smoothing, with shorter training durations and competence on the COCO dataset. Refining YOLO architectures is crucial as deep learning gains popularity, with innovative design components contributing to advancements such as cross-validation and interpretable machines, as shown in Figure 7.
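The NMS step that merges the two sub-network outputs can be illustrated with the standard greedy algorithm. This is a generic sketch rather than the paper's implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples already expressed in full-image coordinates, and the 0.5 IoU threshold is an illustrative default.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    detections: list of (box, score) pairs from all sub-networks.
    Keeps the highest-scoring box and drops any later box that overlaps
    an already kept box by more than iou_threshold.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


if __name__ == "__main__":
    large_scale = [((0, 0, 100, 200), 0.9)]
    small_scale = [((5, 5, 100, 200), 0.8), ((300, 50, 320, 100), 0.7)]
    for box, score in nms(large_scale + small_scale):
        print(box, score)
```

In the usage example, the two heavily overlapping boxes describe the same pedestrian seen by both branches, so only the higher-scoring one survives, while the distant small pedestrian is kept.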

RESULTS AND DISCUSSION
The system proposed for real-time object recognition in edge contexts, particularly on embedded hardware platforms, achieved a processing time of 450 milliseconds. Complexity-aware cascade learning enhances performance by leveraging knowledge from the source task. To address color bias, the system employs color transformation and edge segmentation. Integrating deep learning into remote sensing shows potential for performance improvement, extracting general features and progressively specializing for specific tasks. Transfer learning initializes the weights from pre-trained deep YOLO models: a source network is trained on the COCO dataset, and its knowledge is transferred to a new network for a different task. Gamma correction and tone mapping enhance highlights and tonal appearance. RAW image capture results in dark images that are unsuitable for computer vision applications, as shown in Figure 9. The findings demonstrate YOLOv8's capability to efficiently retrieve frames from various sources while maintaining YOLO's object detection performance. On a GTX 1060 system, it achieved real-time object recognition for image dimensions of 128×128 and 256×256 with a latency of less than 0.1 seconds. Performance declined when processing larger images on the same hardware, with delays of 1.4 seconds for YOLOv8 at 416×416 and 2.8 seconds at 608×608. In contrast, YOLOv4 showed a processing delay under 0.3 seconds for the 608×608 image size. Object and pedestrian detection accuracy relative to existing methods is provided in Table 1.
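The gamma correction step mentioned above can be sketched for 8-bit images as a lookup table. The gamma value of 2.2 is a common display convention and is an assumption here, as the text does not report the value used; the function names are hypothetical, and the image is represented as a plain list of rows to keep the sketch dependency-free.

```python
def gamma_correct(value, gamma=2.2, max_value=255):
    """Brighten one pixel value with out = max * (in / max) ** (1 / gamma)."""
    normalized = value / max_value
    return round(max_value * normalized ** (1.0 / gamma))


def gamma_correct_image(pixels, gamma=2.2):
    """Apply gamma correction to a grayscale image given as rows of 0..255."""
    # Precompute all 256 outputs once instead of per pixel.
    lookup = [gamma_correct(v, gamma) for v in range(256)]
    return [[lookup[v] for v in row] for row in pixels]


if __name__ == "__main__":
    dark_image = [[0, 16, 32], [64, 128, 255]]
    print(gamma_correct_image(dark_image))
```

With a gamma above 1, dark values are lifted much more than bright ones while 0 and 255 stay fixed, which is why the operation helps with the dark RAW captures described above.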

CONCLUSION
The insights from the preceding sections reveal potential pathways to enhance the performance of the pedestrian detection algorithm. First, extending the training duration of the network with a diverse range of learning rates could notably improve its ability to identify individuals by helping the optimization escape local minima. Second, optimizing the balance between resolution and frame rate has the potential to improve real-time performance, with the added possibility of raising frame rates by lowering the resolution of the video material. Before real-world deployment of such systems, it is crucial to conduct thorough testing in controlled environments involving real vehicles. Additionally, a promising avenue of exploration is the feasibility of integrating the proposed YOLOv4 as an auxiliary element for real-time processing within existing object identification methods. These guidelines can offer valuable insights into determining the most suitable hardware configuration for intelligent video applications built upon the YOLO framework. By addressing these research objectives in the future, the resilience and practicality of the pedestrian recognition system can be significantly enhanced for real-world scenarios. This study sheds light on the limitations of the current YOLO algorithm in reliable real-time pedestrian identification, while proposing remedies such as real-world testing, integration of YOLOv4 for real-time processing, and extended training with varied learning rates. By addressing these deficiencies, this research not only paves the way for selecting optimal hardware configurations but also contributes to improved accuracy and real-time performance in object detection software.
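The trade-off between resolution and frame rate can be made concrete by converting the latencies reported in the results into approximate throughput. The latencies below are taken from the text ("less than" figures are treated as upper bounds); the function names and the 10 frames-per-second target are hypothetical illustrations, not requirements stated by the paper.

```python
def fps_from_latency(latency_seconds):
    """Frames per second achievable if each frame takes latency_seconds."""
    return 1.0 / latency_seconds


def meets_target(latency_seconds, target_fps):
    """True if the per-frame latency allows at least target_fps."""
    return fps_from_latency(latency_seconds) >= target_fps


# Latencies reported in the results section (seconds per frame).
REPORTED_LATENCIES = {
    "embedded platform": 0.45,
    "YOLOv8 at 128x128 or 256x256 (upper bound)": 0.1,
    "YOLOv8 at 416x416": 1.4,
    "YOLOv8 at 608x608": 2.8,
    "YOLOv4 at 608x608 (upper bound)": 0.3,
}

if __name__ == "__main__":
    for name, latency in REPORTED_LATENCIES.items():
        fps = fps_from_latency(latency)
        print(f"{name}: {fps:.2f} frames per second")
```

Under these figures, only the smallest input sizes approach double-digit frame rates on the reported hardware, which supports the recommendation to lower resolution when frame rate is the priority.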

Figure 1. Bounding boxes incorporating predictive information about dimensions and spatial positioning

Figure 2. Sequential operation of YOLO learning chips within an aware cascade framework, highlighting their modules, and fundamental concept

Figure 3. Example of COCO dataset

Figure 7. Steps for classifying an object and pedestrian for getting ensemble and holistic score

Figure 8. Three-dimensional graph illustrating how probability and distance interact during the decoding and reconstruction of encoded bars for object and pedestrian detection

Figure 9. Identification and prediction of pedestrians on the road across diverse instances

Table 1. Accuracy of object and pedestrian detection in comparison to contemporary methodologies