AI for object classification

In this post I will show you the easiest way to combine AI, a convolutional neural network (CNN), and a Docker container to classify objects in real time. All you need is basic knowledge of Docker and neural networks. If you are very new to programming, don't worry: just follow the steps below and you will have a program that classifies objects in real time.

In the video above I'm driving a car around with a camera on top, tracking other cars and the people inside them. I use CUDA + YOLO on an Nvidia GPU. You can do the same: all you need to do is download my Dockerfile and run it.

For those who want to understand the theory behind it, here is a short summary. The Dockerfile creates an Ubuntu Linux environment and installs the Nvidia GPU toolchain, OpenCV, and darknet into it. Darknet is a wonderful neural network framework; its pretrained model was trained on around 10 million pictures and can recognize about 70 categories in real time (car, dog, cat, ship, plane, ...). If you want to learn more about darknet, you can read my article: https://thanhnguyensite.net/2020/11/05/neural-network/

OK! Now let's go to the AI world:

Darknet Nvidia-Docker Ubuntu 16.04

Prerequisites

  1. Make sure you have the NVidia driver for your machine

Find your graphics card model

lspci | grep VGA

https://www.nvidia.com/Download/index.aspx?lang=en-us

How to install NVidia Drivers on Linux https://gist.github.com/wangruohui/df039f0dc434d6486f5d4d098aa52d07#install-nvidia-graphics-driver-via-runfile

  2. Install Docker and NVidia Docker https://github.com/NVIDIA/nvidia-docker

Steps to run

  1. Clone this repo:
git clone https://gitlab.com/thanhnguyen1181991/darknet-docker.git
  2. Build the image (this step might take a while, go make some coffee)
docker build -t darknet .
  3. In start.sh, make sure you have the correct address of your webcam (line 8 of start.sh). Whether you use a laptop's onboard webcam or an external one, the mapping has the same form, e.g. device=/dev/bus/usb/003/004:/dev/video0, but the USB bus path differs per machine, so find yours first.

Find your webcam bus

lsusb -t

Then update the following line with the correct webcam bus

--device=/dev/bus/usb/003/002:/dev/video0
  4. Map a local folder to the Docker Container

Format:

/local/folder:/docker/folder

In start.sh, change the following line:

-v /home/projects:/dev/projects \
  5. Run the machine with Webcam
sh start.sh
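For orientation, start.sh typically wraps a docker run command along these lines. The flags below are a sketch assembled from the snippets above (the image name from the build step, the volume mapping, and the webcam device), not the verbatim script, so check the actual file before editing:

```shell
# Sketch of what start.sh runs (placeholders, not the verbatim script)
nvidia-docker run -it \
    -v /home/projects:/dev/projects \
    --device=/dev/bus/usb/003/002:/dev/video0 \
    darknet
```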

Darknet

Make sure you have the weights for what you want to run

More information at https://pjreddie.com/darknet/

Keras basics

Keras is a high-level deep learning API designed to make neural network development easier and more productive. Today, it is most commonly used through TensorFlow as tf.keras.

Why Keras became popular

One reason Keras became popular is that it lets developers build useful models with a small amount of code. Instead of focusing on low-level tensor operations too early, you can focus on model structure, training, and evaluation.

Core concepts

  • Layers: building blocks such as Dense, Conv2D, LSTM, Dropout
  • Models: a stack or graph of layers
  • Loss functions: how the model measures error
  • Optimizers: how the model updates weights
  • Metrics: what you track during training

A minimal example

from tensorflow import keras
from tensorflow.keras import layers

# A small feed-forward network for binary classification on 10 input features
model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(10,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid")   # probability of the positive class
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
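To see the full loop, here is a hedged sketch of training and predicting with the same small model on synthetic data (the data is random, so the learned accuracy is not meaningful; this assumes tf.keras is installed):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Rebuild the same small binary classifier
model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(10,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Synthetic data: 200 samples with 10 features each, binary labels
x = np.random.rand(200, 10).astype("float32")
y = np.random.randint(0, 2, size=(200, 1))

model.fit(x, y, epochs=3, batch_size=32, verbose=0)

# Predict on a few samples: output shape is (5, 1), values in [0, 1]
preds = model.predict(x[:5], verbose=0)
print(preds.shape)
```

The same fit/predict pattern carries over unchanged when you swap in real data and a larger model.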

What Keras is good for

  • Rapid prototyping
  • Educational deep learning projects
  • Production training pipelines with TensorFlow
  • Computer vision, tabular ML, NLP, and time-series work

Final thoughts

Keras lowers the barrier to entry for deep learning. It is not only beginner-friendly; it is also powerful enough for many real-world machine learning systems.

Yolo for real-time detection

YOLO, short for You Only Look Once, changed how many engineers think about object detection. Earlier detection systems often used multi-stage pipelines that proposed regions first and classified them later. YOLO reframed detection as a direct prediction problem: take an image, run one forward pass, and predict bounding boxes plus class scores quickly enough for real-time use.

That design choice made YOLO especially attractive in robotics, video analytics, and autonomous systems, where latency matters as much as raw accuracy.

Bounding boxes for object detection example
Object detection is not just classification. The model must also localize the object with a useful bounding box. Source: Wikimedia Commons, Intersection over Union – object detection bounding boxes.jpg.

What problem YOLO solves

Image classification answers a simple question: what is in this image? Detection answers a harder one: what objects are present, where are they, and which class belongs to each box?

That difference matters in real systems. A vehicle or robot does not only need to know that a scene contains a pedestrian. It needs to know where the pedestrian is, how that person moves across frames, and whether the detection is stable enough to influence planning.

YOLO became popular because it made this detection step fast and practical at scale.

Why YOLO became influential

There are three main reasons YOLO spread so widely:

  • Speed: real-time inference made it attractive for video and edge deployment.
  • Simplicity: a single unified detector was easier to explain and deploy than older multi-stage systems.
  • Strong engineering ecosystem: later implementations and tooling made training, exporting, and deployment more accessible.

Over time, the YOLO family evolved a lot. Anchor strategies changed, backbones improved, post-processing changed, and modern variants added tasks such as segmentation, pose, tracking, and oriented boxes. But the core identity remained: detection should be fast enough to use in real systems.

How YOLO works at a practical level

Conceptually, YOLO takes an image and predicts object locations and categories in one inference path. A modern pipeline usually includes:

  1. image resize and normalization
  2. feature extraction with a backbone network
  3. multi-scale detection heads
  4. confidence scoring and class prediction
  5. post-processing to remove duplicate boxes

The exact architecture depends on the version you use, but the operational idea is stable: produce detections quickly enough for downstream systems to react.
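Step 5, removing duplicate boxes, is usually done with non-maximum suppression (NMS). A minimal pure-Python sketch of the idea (real implementations are vectorized and much faster):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two overlapping detections of the same object plus one separate object
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the lower-scoring duplicate of box 0 is suppressed
```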

from ultralytics import YOLO

# Load a small pretrained model and run it on a single image
model = YOLO("yolo11n.pt")
results = model("street_scene.jpg")

# Each result holds the detected boxes for one image
for result in results:
    for box in result.boxes:
        print(box.cls, box.conf, box.xyxy)  # class id, confidence, corner coordinates

The code above looks simple, but the real engineering work is often around dataset quality, deployment constraints, label consistency, camera setup, and tracking across frames.

Where YOLO fits in a larger system

YOLO is rarely the whole perception stack. In a deployed system it usually feeds into something larger:

  • multi-object tracking
  • sensor fusion
  • risk estimation
  • behavior planning
  • alerting or actuation logic

For example, in an autonomous-driving context, YOLO-style detection may identify vehicles, pedestrians, bikes, and traffic cones. But planning still needs temporal tracking, motion prediction, and safety rules before it can turn those detections into driving decisions.
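To give a flavor of why tracking sits on top of detection, here is a toy sketch of frame-to-frame association by IoU overlap. The `associate` helper and the threshold are hypothetical illustrations; production trackers such as SORT add motion models and identity management:

```python
def iou(a, b):
    """Overlap of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def associate(tracks, detections, threshold=0.3):
    """Greedily match previous-frame tracks to current-frame detections by IoU."""
    matches = {}
    unmatched = set(range(len(detections)))
    for t_id, t_box in tracks.items():
        best, best_iou = None, threshold
        for d in unmatched:
            score = iou(t_box, detections[d])
            if score > best_iou:
                best, best_iou = d, score
        if best is not None:
            matches[t_id] = best
            unmatched.remove(best)
    return matches, unmatched  # unmatched detections may start new tracks

# One known track and two detections in the current frame
tracks = {7: (100, 100, 140, 180)}        # track id 7: a pedestrian box
detections = [(102, 101, 143, 182), (300, 300, 340, 380)]
print(associate(tracks, detections))
```

The first detection overlaps the existing track and keeps its identity; the second starts a new track. Everything downstream (motion prediction, planning) depends on that identity being stable.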

What YOLO is especially good at

YOLO tends to work well when you need:

  • fast detection on live video
  • compact deployment on edge hardware
  • simple integration into monitoring or robotics pipelines
  • good tradeoffs between speed and accuracy

This is why it appears so often in drones, traffic monitoring, warehouse robots, industrial safety, and smart-camera systems.

Where YOLO is not enough by itself

Even a strong detector has limits. YOLO alone does not solve:

  • precise depth estimation
  • fine-grained pixel segmentation
  • long-term tracking identity under heavy occlusion
  • full scene understanding for planning
  • robust performance under severe domain shift without retraining

That is why practical systems usually pair detection with tracking, segmentation, map context, or other sensors.

Real engineering concerns

If you plan to deploy YOLO in a real product, the hard questions are usually not about the marketing benchmark. They are about:

  • label quality and class definitions
  • how often false positives appear in safety-critical scenes
  • how small distant objects can be while still being detected
  • latency on the exact target hardware
  • nighttime, rain, motion blur, or camera vibration
  • monitoring drift after deployment

In many teams, the biggest performance gains come from better data and better deployment choices, not from chasing a new model name every week.

Conclusion

YOLO became influential because it made object detection fast, practical, and easy to integrate into real systems. It remains a strong choice when engineers need real-time perception on images or video. But the best way to use YOLO is to treat it as one reliable module inside a broader perception stack, not as the full system by itself.
