Extended Kalman Filter basics

The Extended Kalman Filter, or EKF, is a state estimation algorithm used when the system or the sensor model is nonlinear. It is a natural extension of the standard Kalman Filter, which assumes linear dynamics and linear measurements.

Why do we need EKF?

In robotics and autonomous driving, many important relationships are nonlinear. For example, a radar sensor may measure range, angle, and radial velocity instead of simple Cartesian coordinates. A robot may also rotate while moving, which creates nonlinear motion equations.

Main idea

EKF still follows the same two-step structure as the standard Kalman Filter:

  • Prediction: estimate the next state based on the motion model
  • Update: correct the estimate using the sensor measurement

The difference is that EKF linearizes the nonlinear functions around the current estimate by using Jacobian matrices.

Where EKF is used

  • Robot localization
  • Sensor fusion with radar, lidar, and IMU
  • Mobile robot tracking
  • Autonomous vehicle state estimation

A simplified workflow

1. Predict x(k|k-1) using the motion model
2. Predict covariance P(k|k-1)
3. Compute Jacobian matrices
4. Compare expected measurement with real measurement
5. Update the state and covariance
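To make the workflow concrete, here is a minimal numpy sketch of one predict/update cycle. The model is a toy: a 1D constant-velocity state with a made-up nonlinear measurement h(x) = sqrt(px^2 + 1), and all numbers are illustrative, not from any real system:

```python
import numpy as np

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])      # linear motion model (its Jacobian is F itself)
Q = np.eye(2) * 0.01                       # process noise covariance
R = np.array([[0.1]])                      # measurement noise covariance

def h(x):
    # Nonlinear measurement function (toy range-like measurement)
    return np.array([np.sqrt(x[0]**2 + 1.0)])

def H_jacobian(x):
    # Jacobian of h: dh/dpx = px / sqrt(px^2 + 1), dh/dv = 0
    return np.array([[x[0] / np.sqrt(x[0]**2 + 1.0), 0.0]])

x = np.array([0.5, 1.0])                   # current state estimate [position, velocity]
P = np.eye(2)                              # current covariance

# 1-2. Predict state and covariance with the motion model
x_pred = F @ x
P_pred = F @ P @ F.T + Q

# 3. Compute the Jacobian at the predicted state
H = H_jacobian(x_pred)

# 4. Compare expected measurement with the real measurement (innovation)
z = np.array([1.3])                        # a made-up sensor reading
y_innov = z - h(x_pred)

# 5. Kalman gain, then update state and covariance
S = H @ P_pred @ H.T + R
K = P_pred @ H.T @ np.linalg.inv(S)
x = x_pred + K @ y_innov
P = (np.eye(2) - K @ H) @ P_pred
```

Note how the only difference from a linear Kalman Filter is that h(x) and its Jacobian H replace a fixed measurement matrix.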

Strengths and limitations

EKF works well when the system is only mildly nonlinear and the estimate remains close to reality. However, if the model is strongly nonlinear or the initial estimate is poor, EKF may diverge.

Final thoughts

EKF remains one of the most important filters in robotics and autonomous systems because it provides a practical compromise between mathematical tractability and real-world usefulness.

TensorFlow basics

Throughout this lesson, you’ll apply your knowledge of neural networks on real datasets using TensorFlow, an open source deep learning library created by Google.

You’ll use TensorFlow to classify images from the notMNIST dataset – a dataset of images of English letters from A to J. You can see a few example images below.

Your goal is to automatically detect the letter based on the image in the dataset. You’ll be working on your own computer for this lab, so, first things first, install TensorFlow!

Install

OS X, Linux, Windows

Prerequisites

Intro to TensorFlow requires Python 3.4 or higher and Anaconda. If you don’t meet all of these requirements, please install the appropriate package(s).

Install TensorFlow

You’re going to use an Anaconda environment for this class. If you’re unfamiliar with Anaconda environments, check out the official documentation. More information, tips, and troubleshooting for installing TensorFlow on Windows can be found here.

Note: If you’ve already created the environment for Term 1, you shouldn’t need to do so again here!

Run the following commands to set up your environment:

conda create --name=IntroToTensorFlow python=3 anaconda
source activate IntroToTensorFlow
conda install -c conda-forge tensorflow

That’s it! You have a working environment with TensorFlow. Test it out with the code in the Hello, world! section below.

Docker on Windows

Docker instructions were offered prior to the availability of a stable Windows installation via pip or Anaconda. Please try Anaconda first; the Docker instructions have been retained as an alternative to an installation via Anaconda.

Install Docker

Download and install Docker from the official Docker website.

Run the Docker Container

Run the command below to start a Jupyter notebook server with TensorFlow:

docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow

Users in China should use b.gcr.io/tensorflow/tensorflow instead of gcr.io/tensorflow/tensorflow.

You can access the Jupyter notebook at localhost:8888. The server includes three example TensorFlow notebooks, but you can create a new notebook to test all your code.

Hello, world!

Try running the following code in your Python console to make sure you have TensorFlow properly installed. The console will print “Hello World!” if TensorFlow is installed. Don’t worry about understanding what it does. You’ll learn about it in the next section.

import tensorflow as tf

# Create TensorFlow object called tensor
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)


Errors

If you’re getting the error tensorflow.python.framework.errors.InvalidArgumentError: Placeholder:0 is both fed and fetched, you’re running an older version of TensorFlow. Uninstall TensorFlow, and reinstall it using the instructions above. For more solutions, check out the Common Problems section.

TensorFlow Math

Getting the input is great, but now you need to use it. You’re going to use basic math functions that everyone knows and loves – add, subtract, multiply, and divide – with tensors. (There are many more math functions you can check out in the documentation.)

Addition

x = tf.add(5, 2)  # 7

You’ll start with the add function. The tf.add() function does exactly what you expect it to do. It takes in two numbers, two tensors, or one of each, and returns their sum as a tensor.

Subtraction and Multiplication

Here’s an example with subtraction and multiplication.

x = tf.subtract(10, 4) # 6
y = tf.multiply(2, 5)  # 10

The x tensor will evaluate to 6, because 10 - 4 = 6. The y tensor will evaluate to 10, because 2 * 5 = 10. That was easy!

Converting types

It may be necessary to convert between types to make certain operators work together. For example, if you tried the following, it would fail with an exception:

tf.subtract(tf.constant(2.0),tf.constant(1))  # Fails with ValueError: Tensor conversion requested dtype float32 for Tensor with dtype int32: 

That’s because the constant 1 is an integer but the constant 2.0 is a floating-point value, and subtract expects them to match.

In cases like these, you can either make sure your data is all of the same type, or you can cast a value to another type. In this case, converting the 2.0 to an integer before subtracting, like so, will give the correct result:

tf.subtract(tf.cast(tf.constant(2.0), tf.int32), tf.constant(1))   # 1

Quiz

Let’s apply what you learned to convert an algorithm to TensorFlow. The code below is a simple algorithm using division and subtraction. Convert the following algorithm in regular Python to TensorFlow and print the results of the session. You can use tf.constant() for the values 10, 2, and 1.
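As a reference point, the regular-Python version of such an algorithm might look like the sketch below (the exact values are assumed here, a simple combination of division and subtraction):

```python
# A simple algorithm using division and subtraction (regular Python)
x = 10
y = 2
z = x / y - 1   # 4.0
print(z)
```

Your TensorFlow version should produce the same result when the session is run.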

Convolutional network basics

Understanding of Convolutional Neural Network (CNN) — Deep Learning

In neural networks, the Convolutional Neural Network (ConvNet or CNN) is one of the main architectures for image recognition and image classification. Object detection, face recognition, and similar tasks are some of the areas where CNNs are widely used.

CNN image classification takes an input image, processes it, and classifies it under certain categories (e.g., dog, cat, tiger, lion). A computer sees an input image as an array of pixels, whose size depends on the image resolution. Based on the resolution, it sees h x w x d (h = height, w = width, d = dimension). E.g., an image may be a 6 x 6 x 3 array of RGB values (3 refers to the RGB channels), while a grayscale image may be a 4 x 4 x 1 array.

Figure 1: Array of RGB Matrix

Technically, in deep learning CNN models for training and testing, each input image is passed through a series of convolution layers with filters (kernels), pooling layers, fully connected (FC) layers, and finally a softmax function that classifies an object with probabilistic values between 0 and 1. The figure below shows the complete flow of a CNN processing an input image and classifying objects based on those values.

Figure 2: Neural network with many convolutional layers

Convolution Layer

Convolution is the first layer used to extract features from an input image. Convolution preserves the relationship between pixels by learning image features over small squares of input data. It is a mathematical operation that takes two inputs: an image matrix and a filter (or kernel).

Figure 3: Image matrix multiplies kernel or filter matrix

Consider a 5 x 5 image whose pixel values are 0 or 1, and a 3 x 3 filter matrix, as shown below:

Figure 4: Image matrix multiplies kernel or filter matrix

Sliding the 3 x 3 filter over the 5 x 5 image and multiplying element-wise produces an output called the “feature map”, as shown below:

Figure 5: 3 x 3 output matrix
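The feature-map computation above can be reproduced with a few lines of numpy; the 5 x 5 binary image and the 3 x 3 filter values here are illustrative:

```python
import numpy as np

# 5 x 5 binary image and 3 x 3 filter (example values)
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

# Slide the filter over the image (stride 1, no padding) and take the
# element-wise product sum at each position: this builds the feature map.
out = np.zeros((3, 3), dtype=int)
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

# out == [[4, 3, 4],
#         [2, 4, 3],
#         [2, 3, 4]]
```

A 5 x 5 input convolved with a 3 x 3 filter at stride 1 gives a 3 x 3 feature map, since (5 - 3) + 1 = 3.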

Convolving an image with different filters can perform operations such as edge detection, blurring, and sharpening. The example below shows various convolved images after applying different types of filters (kernels).

Figure 7: Some common filters

Strides

Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time; when the stride is 2, we move it 2 pixels at a time, and so on. The figure below shows how convolution works with a stride of 2.

Figure 6: Stride of 2 pixels

Padding

Sometimes the filter does not fit the input image perfectly. We have two options:

  • Pad the image with zeros (zero-padding) so that the filter fits
  • Drop the part of the image where the filter does not fit; this is called valid padding, which keeps only the valid part of the image
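A quick numpy sketch of the zero-padding option (the array values are illustrative):

```python
import numpy as np

image = np.arange(16).reshape(4, 4)

# Zero-padding with one pixel on each side turns a 4x4 input into 6x6,
# so a 3x3 filter at stride 1 then produces a 4x4 ("same"-sized) output.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
```

The original pixels sit untouched in the center; only the border is filled with zeros.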

Non Linearity (ReLU)

ReLU stands for Rectified Linear Unit, a non-linear operation. Its output is ƒ(x) = max(0, x).

Why is ReLU important? ReLU’s purpose is to introduce non-linearity into our ConvNet, since the real-world data we want our ConvNet to learn is mostly non-linear, while convolution itself is a linear operation.

Figure 7: ReLU operation

Other non-linear functions such as tanh or sigmoid can also be used instead of ReLU. Most practitioners use ReLU, since it tends to perform better than the other two.
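Since ƒ(x) = max(0, x) is applied element-wise, a one-line numpy sketch captures the whole operation:

```python
import numpy as np

x = np.array([-3.0, -0.5, 0.0, 2.0, 5.0])

# ReLU: every negative value becomes 0, non-negative values pass through
relu = np.maximum(0, x)
```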

Pooling Layer

A pooling layer reduces the number of parameters when the images are too large. Spatial pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map while retaining the important information. Spatial pooling can be of different types:

  • Max Pooling
  • Average Pooling
  • Sum Pooling

Max pooling takes the largest element from the rectified feature map. Average pooling instead takes the average of the elements in each window, and sum pooling takes their sum.

Figure 8: Max Pooling
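As a small illustration, 2 x 2 max pooling with stride 2 over an assumed 4 x 4 feature map can be written in numpy as:

```python
import numpy as np

# 4 x 4 rectified feature map (example values)
fmap = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])

# 2 x 2 max pooling with stride 2: split the map into non-overlapping
# 2x2 windows and keep the largest value in each one.
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))

# pooled == [[6, 8],
#            [3, 4]]
```

Each 2 x 2 window collapses to a single number, halving both spatial dimensions.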

Fully Connected Layer

In the layer we call the FC layer, we flatten our matrix into a vector and feed it into a fully connected layer, like an ordinary neural network.

Figure 9: After pooling layer, flattened as FC layer

In the above diagram, the feature map matrix is converted into a vector (x1, x2, x3, …). With the fully connected layers, we combine these features together to create a model. Finally, we apply an activation function such as softmax or sigmoid to classify the outputs as cat, dog, car, truck, etc.

Figure 10: Complete CNN architecture

Summary

  • Provide the input image to the convolution layer
  • Choose parameters, apply filters with strides, and pad if required; perform convolution on the image and apply ReLU activation to the matrix
  • Perform pooling to reduce the dimensionality
  • Add as many convolutional layers as needed
  • Flatten the output and feed it into a fully connected layer (FC layer)
  • Output the class using an activation function (logistic regression with a cost function) to classify the image

In the next post, I would like to talk about some popular CNN architectures such as AlexNet, VGGNet, GoogLeNet, and ResNet.

LeNet for traffic signs

Load Data

Load the MNIST data, which comes pre-loaded with TensorFlow.

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", reshape=False)
X_train, y_train           = mnist.train.images, mnist.train.labels
X_validation, y_validation = mnist.validation.images, mnist.validation.labels
X_test, y_test             = mnist.test.images, mnist.test.labels

assert(len(X_train) == len(y_train))
assert(len(X_validation) == len(y_validation))
assert(len(X_test) == len(y_test))

print()
print("Image Shape: {}".format(X_train[0].shape))
print()
print("Training Set:   {} samples".format(len(X_train)))
print("Validation Set: {} samples".format(len(X_validation)))
print("Test Set:       {} samples".format(len(X_test)))

The MNIST data that TensorFlow pre-loads comes as 28x28x1 images.

However, the LeNet architecture only accepts 32x32xC images, where C is the number of color channels.

In order to reformat the MNIST data into a shape that LeNet will accept, we pad the data with two rows of zeros on the top and bottom, and two columns of zeros on the left and right (28+2+2 = 32).

You do not need to modify this section.

import numpy as np

# Pad images with 0s
X_train      = np.pad(X_train, ((0,0),(2,2),(2,2),(0,0)), 'constant')
X_validation = np.pad(X_validation, ((0,0),(2,2),(2,2),(0,0)), 'constant')
X_test       = np.pad(X_test, ((0,0),(2,2),(2,2),(0,0)), 'constant')
    
print("Updated Image Shape: {}".format(X_train[0].shape))

Visualize Data

View a sample from the dataset.

You do not need to modify this section.

import random
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Pick a random training image (randint's upper bound is inclusive)
index = random.randint(0, len(X_train) - 1)
image = X_train[index].squeeze()

plt.figure(figsize=(1,1))
plt.imshow(image, cmap="gray")
print(y_train[index])

Preprocess Data

Shuffle the training data.

You do not need to modify this section.

from sklearn.utils import shuffle

X_train, y_train = shuffle(X_train, y_train)

Setup TensorFlow

The EPOCH and BATCH_SIZE values affect the training speed and model accuracy.

You do not need to modify this section.

import tensorflow as tf
EPOCHS = 10
BATCH_SIZE = 128

TODO: Implement LeNet-5

Implement the LeNet-5 neural network architecture.

This is the only cell you need to edit.

Input

The LeNet architecture accepts a 32x32xC image as input, where C is the number of color channels. Since MNIST images are grayscale, C is 1 in this case.

Architecture

Layer 1: Convolutional. The output shape should be 28x28x6.

Activation. Your choice of activation function.

Pooling. The output shape should be 14x14x6.

Layer 2: Convolutional. The output shape should be 10x10x16.

Activation. Your choice of activation function.

Pooling. The output shape should be 5x5x16.

Flatten. Flatten the output shape of the final pooling layer such that it’s 1D instead of 3D. The easiest way to do this is by using tf.contrib.layers.flatten, which is already imported for you.

Layer 3: Fully Connected. This should have 120 outputs.

Activation. Your choice of activation function.

Layer 4: Fully Connected. This should have 84 outputs.

Activation. Your choice of activation function.

Layer 5: Fully Connected (Logits). This should have 10 outputs.

Output

Return the result of the 2nd fully connected layer.
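The output shapes listed above follow the usual valid-convolution arithmetic, assuming 5x5 filters with stride 1 and 2x2 pooling with stride 2 (consistent with classic LeNet-5). A quick sanity check in plain Python:

```python
# Output size of a valid convolution: (input - filter) // stride + 1
def conv_out(size, k, stride=1):
    return (size - k) // stride + 1

s = conv_out(32, 5)   # Layer 1 convolution: 32 -> 28
s = s // 2            # 2x2 max pooling:     28 -> 14
s = conv_out(s, 5)    # Layer 2 convolution: 14 -> 10
s = s // 2            # 2x2 max pooling:     10 -> 5
flat = s * s * 16     # Flatten 5x5x16:      400 inputs to Layer 3
```

Working through this arithmetic first makes it much easier to pick the right weight shapes in the cell below.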

from tensorflow.contrib.layers import flatten

def LeNet(x):
    # Arguments used for tf.truncated_normal, randomly defines variables for the weights and biases for each layer
    mu = 0
    sigma = 0.1

    # TODO: Layer 1: Convolutional. Input = 32x32x1. Output = 28x28x6.
    # TODO: Activation.
    # TODO: Pooling. Input = 28x28x6. Output = 14x14x6.
    # TODO: Layer 2: Convolutional. Output = 10x10x16.
    # TODO: Activation.
    # TODO: Pooling. Input = 10x10x16. Output = 5x5x16.
    # TODO: Flatten. Input = 5x5x16. Output = 400.
    # TODO: Layer 3: Fully Connected. Input = 400. Output = 120.
    # TODO: Activation.
    # TODO: Layer 4: Fully Connected. Input = 120. Output = 84.
    # TODO: Activation.
    # TODO: Layer 5: Fully Connected. Input = 84. Output = 10.

    return logits

Features and Labels

Train LeNet to classify MNIST data.

x is a placeholder for a batch of input images. y is a placeholder for a batch of output labels.

You do not need to modify this section.

x = tf.placeholder(tf.float32, (None, 32, 32, 1))
y = tf.placeholder(tf.int32, (None))
one_hot_y = tf.one_hot(y, 10)

Transfer learning basics

The Four Main Cases When Using Transfer Learning

Transfer learning involves taking a pre-trained neural network and adapting the neural network to a new, different data set.

Depending on both:

  • the size of the new data set, and
  • the similarity of the new data set to the original data set

the approach for using transfer learning will be different. There are four main cases:

  1. new data set is small, new data is similar to original training data
  2. new data set is small, new data is different from original training data
  3. new data set is large, new data is similar to original training data
  4. new data set is large, new data is different from original training data

Four Cases When Using Transfer Learning

A large data set might have one million images. A small data set could have two thousand images. The dividing line between a large data set and a small data set is somewhat subjective. Overfitting is a concern when using transfer learning with a small data set.

Images of dogs and images of wolves would be considered similar; the images would share common characteristics. A data set of flower images would be different from a data set of dog images.

Each of the four transfer learning cases has its own approach. In the following sections, we will look at each case one by one.

Demonstration Network

To explain how each situation works, we will start with a generic pre-trained convolutional neural network and explain how to adjust the network for each case. Our example network contains three convolutional layers and three fully connected layers:

General Overview of a Neural Network

Here is a generalized overview of what the convolutional neural network does:

  • the first layer will detect edges in the image
  • the second layer will detect shapes
  • the third convolutional layer detects higher level features

Each transfer learning case will use the pre-trained convolutional neural network in a different way.

Case 1: Small Data Set, Similar Data

Case 1: Small Data Set with Similar Data

If the new data set is small and similar to the original training data:

  • slice off the end of the neural network
  • add a new fully connected layer that matches the number of classes in the new data set
  • randomize the weights of the new fully connected layer; freeze all the weights from the pre-trained network
  • train the network to update the weights of the new fully connected layer

To avoid overfitting on the small data set, the weights of the original network will be held constant rather than re-training the weights.

Since the data sets are similar, images from each data set will have similar higher level features. Therefore most or all of the pre-trained neural network layers already contain relevant information about the new data set and should be kept.

Here’s how to visualize this approach:

Neural Network with Small Data Set, Similar Data
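The freeze-and-retrain idea in Case 1 can be sketched with a toy numpy “network”: one frozen linear feature extractor standing in for the pre-trained layers, plus a new trainable head. The shapes, data, and learning rate here are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pre-trained" layers: a fixed feature extractor whose weights stay frozen.
W_frozen = rng.normal(size=(4, 3))
W_before = W_frozen.copy()

# New fully connected layer with randomized weights -- the only part we train.
w_head = rng.normal(size=(3,))

# A small "new data set" (toy regression targets the head can learn).
X = rng.normal(size=(32, 4))
y = (X @ W_frozen) @ np.array([1.0, -2.0, 0.5])

lr = 0.05
loss_history = []
for _ in range(300):
    feats = X @ W_frozen                       # forward pass through frozen layers
    pred = feats @ w_head
    err = pred - y
    loss_history.append(float(np.mean(err ** 2)))
    w_head -= lr * (feats.T @ err) / len(X)    # gradient step on the new head only
```

The loss falls while W_frozen never changes, which is exactly the Case 1 recipe: reuse the pre-trained features, train only the new layer.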

Case 2: Small Data Set, Different Data

Case 2: Small Data Set, Different Data

If the new data set is small and different from the original training data:

  • slice off most of the pre-trained layers near the beginning of the network
  • add to the remaining pre-trained layers a new fully connected layer that matches the number of classes in the new data set
  • randomize the weights of the new fully connected layer; freeze all the weights from the pre-trained network
  • train the network to update the weights of the new fully connected layer

Because the data set is small, overfitting is still a concern. To combat overfitting, the weights of the original neural network will be held constant, like in the first case.

But the original training set and the new data set do not share higher level features. In this case, the new network will only use the layers containing lower level features.

Here is how to visualize this approach:

Neural Network with Small Data Set, Different Data

Case 3: Large Data Set, Similar Data

Case 3: Large Data Set, Similar Data

If the new data set is large and similar to the original training data:

  • remove the last fully connected layer and replace with a layer matching the number of classes in the new data set
  • randomly initialize the weights in the new fully connected layer
  • initialize the rest of the weights using the pre-trained weights
  • re-train the entire neural network

Overfitting is not as much of a concern when training on a large data set; therefore, you can re-train all of the weights.

Because the original training set and the new data set share higher level features, the entire neural network is used as well.

Here is how to visualize this approach:

Neural Network with Large Data Set, Similar Data

Case 4: Large Data Set, Different Data

Case 4: Large Data Set, Different Data

If the new data set is large and different from the original training data:

  • remove the last fully connected layer and replace with a layer matching the number of classes in the new data set
  • retrain the network from scratch with randomly initialized weights
  • alternatively, you could just use the same strategy as the “large and similar” data case

Even though the data set is different from the training data, initializing the weights from the pre-trained network might make training faster. So this case is exactly the same as the case with a large, similar data set.

If using the pre-trained network as a starting point does not produce a successful model, another option is to randomly initialize the convolutional neural network weights and train the network from scratch.

Here is how to visualize this approach:

Neural Network with Large Data Set, Different Data

Computer vision in practice

Computer vision is no longer just a research topic for image classification demos. In robotics, autonomous driving, industrial inspection, and smart infrastructure, it has become a practical engineering discipline. The real question is not whether a model can recognize an object in a clean dataset. The real question is whether the full vision stack can keep working when calibration drifts, lighting changes, motion blur appears, and the system still has to make a decision in real time.

That is why modern computer vision should be understood as a pipeline, not as a single neural network. A production-grade vision system usually combines geometry, calibration, image preprocessing, feature extraction, learning-based perception, tracking, and sensor fusion.

Pinhole camera model diagram
Projection geometry still matters in practical vision systems. Source: Wikimedia Commons, Pinhole camera model technical version.svg.

Why advanced computer vision matters

A camera gives dense information, but raw pixels do not help a robot or vehicle by themselves. A useful system must convert pixels into structured understanding. Depending on the task, that may mean lane boundaries, traffic-light state, object boxes, semantic masks, depth estimates, keypoints, optical flow, or an updated vehicle pose.

In autonomous systems, advanced computer vision usually supports tasks such as:

  • object detection and classification
  • semantic and instance segmentation
  • depth estimation and 3D scene understanding
  • lane and road-boundary estimation
  • visual odometry, SLAM, and relocalization
  • sensor fusion with radar, LiDAR, IMU, and maps

Each of these tasks looks different on paper, but they share the same foundation: image geometry, stable calibration, robust preprocessing, and the ability to reason over time rather than over one frame only.

The foundation: calibration and geometry

Before discussing neural networks, it is worth remembering that a camera is still a geometric sensor. If the system does not know its intrinsics, distortion coefficients, and mounting relationship to the vehicle or robot frame, the rest of the pipeline becomes less trustworthy.

In practice, advanced computer vision often starts with:

  • intrinsics such as focal lengths and optical center
  • distortion coefficients for radial and tangential distortion
  • extrinsics between cameras and the body frame
  • synchronization across sensors and compute nodes

This is one reason OpenCV calibration tools remain important even in deep-learning pipelines. If the image geometry is inconsistent, depth, stitching, epipolar constraints, and multi-camera fusion all degrade.

import cv2 as cv
import numpy as np

frame = cv.imread("road_scene.jpg")

# fx, fy, cx, cy and k1..k3, p1, p2 come from your calibration procedure;
# they are shown symbolically here.
camera_matrix = np.array([
    [fx, 0, cx],
    [0, fy, cy],
    [0,  0,  1],
], dtype=np.float32)
dist_coeffs = np.array([k1, k2, p1, p2, k3], dtype=np.float32)

undistorted = cv.undistort(frame, camera_matrix, dist_coeffs)

That single undistortion step can reduce downstream errors in lane fitting, feature tracking, and multi-camera alignment.

The practical computer vision pipeline

A useful way to think about advanced computer vision is as a layered pipeline:

  1. Capture: acquire synchronized frames with known timing.
  2. Calibrate and rectify: correct distortion and align geometry.
  3. Preprocess: resize, normalize, denoise, or convert color spaces.
  4. Perceive: detect objects, segment classes, estimate flow or depth.
  5. Track: stabilize detections over time and estimate motion.
  6. Fuse: combine vision with radar, LiDAR, IMU, odometry, or maps.
  7. Decide: pass structured outputs to planning or control.

This layered view matters because many real failures come from the interfaces between stages, not from the headline model itself. A detector can be accurate in isolation and still fail in production if calibration is stale or the timestamps are misaligned.

What “advanced” really means in modern vision

In day-to-day engineering, advanced computer vision usually means combining several levels of reasoning instead of relying on one handcrafted trick.

1. Detection

Detection predicts what objects are present and where they are. This is the world of YOLO-style real-time detectors and larger transformer-based detectors. For many systems, detection is the first semantic layer that turns pixels into entities: vehicles, pedestrians, bikes, cones, or signs.

2. Segmentation

Segmentation goes beyond boxes. It asks which pixels belong to lanes, curb, sidewalk, road, sky, vehicle, or person. That matters when the system needs drivable-area estimation or precise free-space boundaries instead of only rough boxes.

3. Depth and geometry

Depth can come from stereo disparity, structure from motion, multi-view triangulation, or learned monocular depth models. In production systems, metric depth from vision alone is often less reliable than fused depth, but relative structure from vision remains extremely valuable.

4. Motion and tracking

A single frame can be ambiguous. Tracking over time makes vision more robust. This includes optical flow, keypoint tracking, re-identification, motion estimation, and multi-object tracking. In autonomous systems, temporal stability is often as important as per-frame accuracy.

Classical vision still matters

Deep learning dominates many benchmarks, but classical computer vision is still useful for real systems because it is interpretable, cheap, and often a strong debugging tool. Engineers still use:

  • thresholding and color filtering
  • edge detection and Hough transforms
  • homography and perspective transforms
  • feature matching and bundle adjustment
  • PnP, epipolar geometry, and triangulation

Lane detection pipeline diagram
Even simple pipelines show how multiple image-processing stages work together before a final decision is made. Source: Wikimedia Commons, Lane Detection Algorithm.svg.

These methods are especially helpful when building baselines, validating learned models, or narrowing down whether a failure is caused by geometry, data quality, or the network itself.
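As one tiny example of that classical toolbox, applying a homography to an image point is just a matrix multiply in homogeneous coordinates. The 3x3 matrix below is assumed for illustration, not taken from any real calibration:

```python
import numpy as np

# Illustrative 3x3 homography (in practice it would come from calibration
# or point correspondences, e.g. via a perspective-transform estimate).
H = np.array([
    [1.0, 0.2,  30.0],
    [0.0, 1.1,  10.0],
    [0.0, 0.001, 1.0],
])

def warp_point(H, x, y):
    """Apply a homography to a 2D point using homogeneous coordinates."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]   # divide out the projective scale

u, v = warp_point(H, 100.0, 50.0)
```

The same pattern underlies perspective rectification, bird's-eye-view lane fitting, and image stitching.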

Where advanced vision becomes difficult

Real-world scenes are messy. A system may work well in daytime testing and fail in heavy glare, rain, or crowded urban environments. Some of the hardest problems are:

  • small or distant objects
  • occlusion between dynamic agents
  • weather and low light
  • domain shift between training data and deployment scenes
  • latency and compute limits on edge hardware
  • uncertainty that is not communicated clearly to planning

This is why advanced vision is rarely just about model accuracy. It is also about timing budgets, hardware constraints, calibration lifecycle, monitoring, and fallback behavior.

What good engineers watch closely

If you are reviewing a production vision system, these questions matter more than a flashy benchmark slide:

  • How often is calibration checked and refreshed?
  • How stable is performance across weather and lighting conditions?
  • What is the end-to-end latency from frame capture to output?
  • How is temporal consistency enforced?
  • Which failures are handled by fusion with other sensors?
  • How is uncertainty exposed to downstream planning or control?

Those questions usually reveal whether the vision stack is a research demo or an engineering system that can survive outside the lab.

Conclusion

Advanced computer vision is best understood as a full pipeline that converts raw pixels into reliable scene understanding. Calibration, geometry, preprocessing, learned perception, temporal tracking, and sensor fusion all matter. When those pieces work together, cameras become one of the richest sensors in robotics and autonomous driving. When they do not, even a strong model can become unreliable very quickly.


Neural networks basics

Perhaps the hottest topic in the world right now is artificial intelligence. When people talk about this, they often talk about machine learning, and specifically, neural networks.

Now, neural networks should be familiar to you. If you put your hands like this, left and right, do it, then between your hands is a big neural network called your brain, with something like 10^11 neurons, which is crazy. What people have done in the last decades is abstract this big mass in your brain into a basic set of equations that emulate a network of artificial neurons. Then people invented ways to train these systems based on data.

So, rather than instructing a machine with rules like a piece of software, these neural networks are trained based on data.

So, you’re going to learn the very basics for now: perceptrons, backpropagation, terminology that doesn’t make sense yet. But by the end of this unit, you should be able to write code and train your own neural network.

That is so fun!

A Note on Deep Learning

The following lessons contain introductory and intermediate material on neural networks, building a neural network from scratch, using TensorFlow, and Convolutional Neural Networks:

  • Neural Networks
  • TensorFlow
  • Deep Neural Networks
  • Convolutional Neural Networks

Linear to Logistic Regression

Linear regression helps predict values on a continuous spectrum, like predicting what the price of a house will be.

How about classifying data among discrete classes?

Here are examples of classification tasks:

  • Determining whether a patient has cancer
  • Identifying the species of a fish
  • Figuring out who’s talking on a conference call

Classification problems are important for self-driving cars. Self-driving cars might need to classify whether an object crossing the road is a car, a pedestrian, or a bicycle. Or they might need to identify which type of traffic sign is coming up, or what a stop light is indicating.

In the next video, Luis will demonstrate a classification algorithm called “logistic regression”. He’ll use logistic regression to predict whether a student will be accepted to a university.

Linear regression leads to logistic regression and ultimately neural networks, a more advanced classification tool.

Quiz:

So let’s say we’re studying the housing market and our task is to predict the price of a house given its size. So we have a small house that costs $70,000 and a big house that costs $160,000.

We’d like to estimate the price of this medium-sized house over here. So how do we do it?

Well, first we put them in a grid where the x-axis represents the size of the house in square feet and the y-axis represents the price of the house. And to help us out, we have collected some previous data in the form of these blue dots.

These are other houses that we’ve looked at and we’ve recorded their prices with respect to their size. And here we can see the small house is priced at $70,000 and the big one at $160,000.

Now it’s time for a small quiz.

What do you think is the best estimate for the price of the medium house given this data?

Would it be $80,000, $120,000 or $190,000?

Yes, you are right: the answer is $120,000. But how do we do that?

Well, to help us out, we can see that these points can form a line. And we can draw the line that best fits this data. Now on this line, we can see that our best guess for the price of the house is this point here on the line, which corresponds to $120,000.

So if you said $120,000, that is correct.

This method is known as linear regression. You can think of linear regression as a painter who would look at your data and draw the best fitting line through it. And you may ask, “How do we find this line?”

Well, that’s what the rest of the section will be about.
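As a minimal sketch of that best-fitting line, here is linear regression with numpy.polyfit on hypothetical data. The house sizes and the three intermediate prices are invented for illustration; only the $70,000 and $160,000 endpoints come from the example.

```python
import numpy as np

# Hypothetical sizes (square feet) and prices: the $70,000 and $160,000
# endpoints match the example, the middle points stand in for the blue dots.
sizes = np.array([1100, 1400, 1700, 2000, 2300])
prices = np.array([70_000, 100_000, 120_000, 150_000, 160_000])

# Fit the best-fitting line (least squares, degree-1 polynomial).
slope, intercept = np.polyfit(sizes, prices, 1)

# Read the estimate for the medium-sized house off the line.
medium_size = 1700
estimate = slope * medium_size + intercept  # close to $120,000 for this data
```

A least-squares line always passes through the mean of the data, so with this (made-up) data the medium house lands right around $120,000, matching the quiz answer.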

Problem:

So, let’s start with one classification example.

Let’s say we are the admissions office at a university and our job is to accept or reject students. So, in order to evaluate students, we have two pieces of information, the results of a test and their grades in school.

So, let’s take a look at some sample students. We’ll start with Student 1 who got 9 out of 10 in the test and 8 out of 10 in the grades. That student did quite well and got accepted. Then we have Student 2 who got 3 out of 10 in the test and 4 out of 10 in the grades, and that student got rejected.

And now, we have a new Student 3 who got 7 out of 10 in the test and 6 out of 10 in the grades, and we’re wondering if the student gets accepted or not. So, our first way to find this out is to plot students in a graph with the horizontal axis corresponding to the score on the test and the vertical axis corresponding to the grades, and the students would fit here.

The student who got three and four is located at the point with coordinates (3,4), and the student who got nine and eight is located at the point with coordinates (9,8).

And now we’ll do what we do in most of our algorithms, which is to look at the previous data.

This is how the previous data looks. These are all the previous students who got accepted or rejected.

The blue points correspond to students that got accepted, and the red points to students that got rejected.

So we can see in this diagram that the students who did well in the test and grades are more likely to get accepted, and the students who did poorly in both are more likely to get rejected.

So let’s start with a quiz.

The quiz says, does the Student 3 get accepted or rejected?

What do you think?

The answer is: Student 3 gets accepted.

Correct. Well, it seems that this data can be nicely separated by a line which is this line over here,

and it seems that most students over the line get accepted and most students under the line get rejected.

So this line is going to be our model. The model makes a couple of mistakes, since there are a few blue points under the line and a few red points over the line. But we’re not going to worry about those. I will say that it’s safe to predict that if a point is over the line the student gets accepted, and if it’s under the line then the student gets rejected.

So based on this model, we look at the new student and see that they are over here at the point (7,6), which is above the line. So we can assume with some confidence that the student gets accepted. So if you answered yes, that’s the correct answer.

And now a question arises. The question is, how do we find this line?

So we can kind of eyeball it. But the computer can’t. We’ll dedicate the rest of the session to show you algorithms that will find this line, not only for this example, but for much more general and complicated cases. But we will talk about that in my next post. See you later!
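The line-based model above boils down to a simple rule: score each student with a linear combination of test and grades, and compare the score to a threshold. The particular boundary used here, the line test + grades = 10, is a hypothetical stand-in for whatever line the learning algorithm would actually find, but it classifies all three sample students correctly.

```python
def accepted(test, grades):
    # Hypothetical decision boundary: the line test + grades = 10.
    # Points above the line (score > 10) are predicted "accept".
    return test + grades > 10

print(accepted(9, 8))  # Student 1: True  (accepted)
print(accepted(3, 4))  # Student 2: False (rejected)
print(accepted(7, 6))  # Student 3: True  (accepted)
```

Logistic regression replaces this hard yes/no rule with a smooth probability, but the underlying object it learns is still a line like this one.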

Region masking for lane detection

Region masking is one of the simplest ideas in computer vision, yet it solves a very practical problem: most of the pixels in an image are irrelevant to the task you care about. If you are trying to detect lane lines from a front-facing road camera, the sky, nearby buildings, dashboard reflections, and other distant objects often add noise instead of useful signal.

Region masking fixes that by keeping only the part of the image where lane structure is likely to appear. In other words, it tells the pipeline: “Look here first, and ignore the rest.”

Lane detection example
Lane-detection pipelines often restrict processing to a selected road region before extracting lines. Source: Wikimedia Commons, Lane Detection Example.jpg.

Why region masking matters

A road image contains too much information. Even after color filtering or edge detection, many edges have nothing to do with lanes. Guard rails, shadows, trees, cars, and building contours can all create strong gradients. If we let the algorithm examine the whole frame equally, false positives become much more likely.

Region masking narrows the search space. It reduces distracting edges, makes later stages more stable, and often improves speed because fewer pixels need to be processed.

This is why region masking appears so often in educational lane-detection projects: it is simple, visual, and immediately useful.

The basic idea

The most common approach is to define a polygon that covers the road area ahead of the ego vehicle. On a forward-facing camera, that polygon is often trapezoidal:

  • wide near the bottom of the image, where the road is close to the car
  • narrower near the horizon, where perspective makes the lane lines converge

Everything outside that polygon is masked out. The result is an image where only the expected lane region remains.

This is a simple example of a region of interest, or ROI. OpenCV uses the term ROI broadly for image subregions, and this is exactly the kind of focused subregion that helps classical lane pipelines stay practical.

How masking fits into the lane pipeline

In a classical lane-detection workflow, region masking usually happens after an early preprocessing step but before final line fitting:

  1. load and optionally undistort the image
  2. convert color space or grayscale
  3. apply color thresholding or Canny edge detection
  4. apply a polygon mask to keep the road region only
  5. run a Hough transform or another line-extraction method
  6. fit and smooth the final lane estimate

Notice what region masking does not do: it does not detect lanes by itself. It improves the conditions for the rest of the pipeline by removing clutter.

OpenCV implementation

In OpenCV, region masking is often implemented with a blank mask image plus a filled polygon. The polygon is painted white, then combined with the processed image using a bitwise operation.

import cv2 as cv
import numpy as np

# Load the image in grayscale and extract edges first.
image = cv.imread("road.jpg", cv.IMREAD_GRAYSCALE)
edges = cv.Canny(image, 50, 150)

# Start from an all-black mask the same size as the edge image.
height, width = edges.shape
mask = np.zeros_like(edges)

# Trapezoidal road region: wide at the bottom, narrow near the horizon.
vertices = np.array([[
    (80, height),
    (width // 2 - 40, height // 2 + 40),
    (width // 2 + 40, height // 2 + 40),
    (width - 80, height),
]], dtype=np.int32)

# Paint the polygon white, then keep only the edge pixels inside it.
cv.fillPoly(mask, vertices, 255)
masked_edges = cv.bitwise_and(edges, mask)

The key functions here are:

  • cv.fillPoly() to define the polygonal road area
  • cv.bitwise_and() to keep only the selected region

These are simple tools, but they remain useful in real engineering pipelines because they make the geometry explicit.

How to choose the region correctly

The mask shape should reflect the camera geometry and road setup. If the camera is fixed in a known position, the lane region will usually appear in a predictable part of the image. That makes a hard-coded trapezoid acceptable for a first prototype.

But in a production setting, several factors complicate that assumption:

  • camera pitch may change with acceleration or road slope
  • different vehicles may mount the camera at different heights
  • curves, merges, and hills can move the useful lane region
  • urban scenes do not always follow simple straight-road geometry

That is why static masks are best understood as a baseline. In stronger systems, the region may be adjusted dynamically using calibration, perspective transforms, prior lane estimates, or even learned drivable-area segmentation.
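One lightweight step toward that flexibility is to parameterize the trapezoid by image size and an estimated horizon line instead of hard-coding pixel coordinates. The sketch below does exactly that; all parameter names and default values are hypothetical tuning knobs, and horizon_frac is the kind of quantity that could be updated online from pitch estimates or a prior lane fit.

```python
import numpy as np

def road_roi_vertices(width, height, horizon_frac=0.55,
                      top_half_width_frac=0.08, bottom_margin_frac=0.1):
    """Build a trapezoidal ROI scaled to the image rather than hard-coded.

    horizon_frac: estimated horizon height as a fraction of image height.
    top_half_width_frac / bottom_margin_frac: hypothetical shape parameters.
    """
    top_y = int(height * horizon_frac)
    cx = width // 2
    half_top = int(width * top_half_width_frac)
    margin = int(width * bottom_margin_frac)
    return np.array([[
        (margin, height - 1),               # bottom-left
        (cx - half_top, top_y),             # top-left
        (cx + half_top, top_y),             # top-right
        (width - margin - 1, height - 1),   # bottom-right
    ]], dtype=np.int32)

# Same function works for any camera resolution.
vertices = road_roi_vertices(1280, 720)
```

The resulting array has the shape cv.fillPoly() expects, so it can drop straight into the masking code above.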

Why this still matters in modern systems

You might ask whether region masking still matters now that deep learning can segment lanes directly. The answer is yes, but in a different role.

Even in modern pipelines, region masking can still help:

  • build interpretable baselines before using neural networks
  • reduce noise in classical preprocessing steps
  • speed up geometric post-processing
  • debug whether a failure comes from image quality or model quality

When a team cannot tell whether the camera sees the lane clearly, a simple ROI-based pipeline is often a very good diagnostic tool.

Common failure modes

Region masking works well only when the chosen region aligns with reality. It can fail when:

  • the road curves sharply outside the predefined polygon
  • the horizon shifts because of hills or vehicle pitch
  • lane boundaries are partly hidden by traffic
  • the camera viewpoint changes between datasets
  • the useful road structure falls outside the selected mask

If the mask is too wide, it lets noise in. If it is too narrow, it removes the actual lane. Good masking is always a tradeoff between focus and flexibility.

A practical engineering checklist

If I were reviewing a lane pipeline that uses region masking, I would ask:

  • Is the ROI still valid after camera calibration or mounting changes?
  • Does the mask hold up on curves, hills, and merges?
  • Is the mask applied before or after the most noise-sensitive step?
  • Can the region adapt over time, or is it fixed forever?
  • Is there a fallback when the lane estimate leaves the masked area?

Those questions reveal whether the ROI is just a tutorial shortcut or a well-understood engineering choice.

Conclusion

Region masking is a small technique with a large practical impact. By focusing computation on the likely road area, it reduces false positives and makes lane-detection pipelines cleaner and more stable. It is not a full perception solution by itself, but it remains a valuable building block for classical lane detection, debugging, and understanding how road geometry interacts with camera vision.


Canny edge detection in practice

Canny edge detection is one of those classic computer vision tools that still deserves attention. Even in an era dominated by deep learning, engineers keep returning to Canny because it is fast, interpretable, and useful for building baselines, debugging camera pipelines, and extracting geometric structure from images.

It is especially valuable when you want to highlight boundaries such as lane markings, road edges, object contours, or structural lines before running later stages of a pipeline.

Lane detection example using edge extraction
A simple lane pipeline often starts with edge extraction before fitting lines or estimating road structure. Source: Wikimedia Commons, Lane Detection Example.jpg.

Why Canny still matters

Most raw images contain far more information than a geometric pipeline needs. Canny helps reduce that image to a sparse set of likely boundaries. That makes it useful in applications such as:

  • lane detection prototypes
  • document and industrial inspection
  • line or contour extraction
  • preprocessing for feature-based pipelines
  • debugging lighting and contrast issues in camera systems

It is not a full perception system, but it is often a strong first step when the engineer wants structure rather than semantics.

How the algorithm works

Canny is a multi-stage edge detector. Its power comes from the fact that it does not simply compute gradients and stop there. It also tries to suppress noise and keep only meaningful, thin edges.

  1. Noise reduction: smooth the image, usually with a Gaussian filter.
  2. Gradient computation: estimate intensity changes, often with Sobel operators.
  3. Non-maximum suppression: keep only local maxima so the edge stays thin.
  4. Hysteresis thresholding: use lower and upper thresholds to decide which edges survive.

This last step is one reason Canny stays practical. Weak responses can still survive if they connect to strong edges, which often preserves useful line structure while discarding isolated noise.

What the thresholds really do

Most failures with Canny are not about the algorithm itself. They are about poor threshold choices.

  • If the thresholds are too low, noise floods the result.
  • If the thresholds are too high, meaningful boundaries disappear.
  • If the image is not smoothed well, texture and noise create unstable edges.

That is why Canny tuning is often scene-dependent. A bright daytime road scene may support different thresholds than a nighttime wet road.

import cv2 as cv

# Smooth first so Canny responds to structure rather than pixel noise.
image = cv.imread("road.jpg", cv.IMREAD_GRAYSCALE)
blurred = cv.GaussianBlur(image, (5, 5), 0)
edges = cv.Canny(blurred, threshold1=50, threshold2=150)

That simple code hides an important practical truth: preprocessing usually matters as much as the call to cv.Canny() itself.

How engineers use Canny in lane detection

Canny became a familiar tool in beginner autonomous-driving projects because it works well as part of a classical lane-detection pipeline. A common sequence is:

  1. convert the image to grayscale
  2. blur to reduce noise
  3. run Canny to extract edges
  4. apply a region of interest mask
  5. fit candidate lane lines with a Hough transform
Lane detection pipeline diagram
Canny often plays the role of edge extractor inside a broader lane-detection pipeline. Source: Wikimedia Commons, Lane Detection Algorithm.svg.

This does not solve all real road cases, but it teaches the underlying geometry clearly and remains useful for debugging modern lane systems.

Where Canny is strong

  • fast and cheap to run
  • easy to interpret visually
  • good for highlighting sharp structure
  • useful in classical pipelines and for debugging learned models

Where Canny is weak

  • it does not understand semantics
  • it is sensitive to threshold selection
  • it struggles with shadows, glare, and heavy texture
  • it cannot decide whether an edge belongs to a lane, a crack, or a shadow by itself

That is why modern systems usually combine Canny-like geometric ideas with learning-based perception when the scene is complex.

A practical way to use it today

Even if your final product relies on neural networks, Canny still has value. I would use it for:

  • building a classical baseline quickly
  • validating whether the camera image quality is good enough for geometry
  • checking if calibration or blur is destroying useful structure
  • explaining perception stages to new engineers on the team

It is one of those rare classical methods that remains useful both educationally and operationally.

Conclusion

Canny edge detection still matters because it turns a noisy image into a much clearer geometric signal. It is not a semantic perception tool, but it remains valuable for lane-finding pipelines, contour extraction, debugging, and building intuition about how vision systems interpret structure. For many engineering teams, Canny is still one of the quickest ways to understand what the camera can and cannot see.


Color selection basics

Finding Lane Lines on the Road

Which of the following features could be useful in the identification of lane lines on the road?

Answer : Color, shape, orientation, Position of the image.

Coding up a Color Selection

Let’s code up a simple color selection in Python.

No need to download or install anything, you can just follow along in the browser for now.

We’ll be working with the same image you saw previously.

Check out the code below. First, I import pyplot and image from matplotlib. I also import numpy for operating on the image.

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np

I then read in an image and print out some stats. I’ll grab the x and y sizes and make a copy of the image to work with. NOTE: Always make a copy of arrays or other variables in Python. If instead, you say “a = b” then all changes you make to “a” will be reflected in “b” as well!

# Read in the image and print out some stats
image = mpimg.imread('test.jpg')
print('This image is:', type(image),
      'with dimensions:', image.shape)

# Grab the x and y size and make a copy of the image
ysize = image.shape[0]
xsize = image.shape[1]
# Note: always make a copy rather than simply using "="
color_select = np.copy(image)

Next I define a color threshold in the variables red_threshold, green_threshold, and blue_threshold, and populate rgb_threshold with these values. This vector contains the minimum values for red, green, and blue (R,G,B) that I will allow in my selection.

# Define our color selection criteria
# Note: if you run this code, you'll find these are not sensible values!!
# But you'll get a chance to play with them soon in a quiz
red_threshold = 0
green_threshold = 0
blue_threshold = 0
rgb_threshold = [red_threshold, green_threshold, blue_threshold]

Next, I’ll select any pixels below the threshold and set them to zero.

After that, all pixels that meet my color criterion (those above the threshold) will be retained, and those that do not (below the threshold) will be blacked out.

# Identify pixels below the threshold
thresholds = (image[:,:,0] < rgb_threshold[0]) \
            | (image[:,:,1] < rgb_threshold[1]) \
            | (image[:,:,2] < rgb_threshold[2])
color_select[thresholds] = [0,0,0]

# Display the image                 
plt.imshow(color_select)
plt.show()

The result, color_select, is an image in which pixels that were above the threshold have been retained, and pixels below the threshold have been blacked out.

In the code snippet above, red_threshold, green_threshold, and blue_threshold are all set to 0, which implies all pixels will be included in the selection.

In the next quiz, you will modify the values of red_threshold, green_threshold, and blue_threshold until you retain as much of the lane lines as possible while dropping everything else. Your output image should look like the one below.
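To see what a working selection looks like without the actual road image, here is the same thresholding logic run on a small synthetic image: a gray background with a bright white stripe standing in for a lane line. The threshold value of 200 is a hypothetical choice that keeps only near-white pixels; the real quiz values depend on the actual photo.

```python
import numpy as np

# Synthetic 10x10 RGB image: gray background (100) with a white stripe (255).
image = np.full((10, 10, 3), 100, dtype=np.uint8)
image[:, 4:6, :] = 255  # the "lane line"

# Always copy before modifying, as noted above.
color_select = np.copy(image)

# Hypothetical thresholds: keep only near-white pixels.
red_threshold, green_threshold, blue_threshold = 200, 200, 200
rgb_threshold = [red_threshold, green_threshold, blue_threshold]

# Black out any pixel below the threshold in any channel.
thresholds = (image[:, :, 0] < rgb_threshold[0]) \
           | (image[:, :, 1] < rgb_threshold[1]) \
           | (image[:, :, 2] < rgb_threshold[2])
color_select[thresholds] = [0, 0, 0]
```

After this runs, the white stripe survives untouched and every gray background pixel is black, which is exactly the behavior the quiz asks you to reproduce on the road photo.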