10.2. Perspective Projection Model

The previous diagram (Thin Lens Camera Model) accurately depicts the geometry of a camera. However, it does not yield a convenient mathematical model to relate points in the real world to points in the image.

The perspective projection model allows us to relate object locations in the real world to the corresponding pixel locations in the image.


Perspective Projection Camera Model. In this diagram, the x and y axes are oriented as we normally view 2-D plots, so by the right hand rule the Z values of objects in the world are negative. Other texts use different axis conventions.

With the origin of the image coordinate system, (x, y), in the center of the image, the perspective projection equation relates the world locations to image locations.

( x, y ) = \left( f\,\frac{X}{Z}, f\,\frac{Y}{Z} \right)

Notice that given a location in the world, it is possible to determine the image location, but not the reverse. An image point defines a ray, which might intersect an object in the world at any distance away. Some knowledge of the geometry of the world is required to determine world locations from an image.
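The following is a minimal sketch of the projection equation, meant only to illustrate why the mapping cannot be reversed. The focal length, the point coordinates, and the helper name project are illustrative assumptions, and Z is taken as positive here; the sign depends on the axis convention of the diagram in use.

```matlab
% Perspective projection: (x, y) = (f*X/Z, f*Y/Z).
% All numbers below are assumed values chosen for illustration.
f = 800;                             % focal length in pixels (assumed)
project = @(P) f * P(1:2) / P(3);    % hypothetical helper, P = [X; Y; Z]

P1 = [0.5; 0.25; 2];                 % a world point
P2 = 3 * P1;                         % a farther point on the same ray

p1 = project(P1);                    % image location of P1
p2 = project(P2);                    % image location of P2

% Both world points land on the same image location, so the image
% point alone cannot tell us which of them was photographed.
disp([p1, p2])
```

With these values both columns displayed are (200, 100), which is the ambiguity described above.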

For film cameras, it is appropriate to think of f in units of linear distance, but for digital cameras, pixels are a more convenient unit for f. When f is given in linear units (usually mm) for a digital camera, the pixel pitch \rho (the linear distance between adjacent pixels) must also be known. The unit of \rho is usually mm/pixel. Divide the focal length (mm) by \rho (mm/pixel) to get the focal length in pixels.
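As a quick check of the unit conversion, here is a small sketch. The lens and sensor values are assumptions picked for round numbers, not the specifications of any particular camera.

```matlab
% Convert a focal length from mm to pixels.
f_mm = 8;                  % focal length in mm (assumed)
rho  = 0.01;               % distance between pixels in mm/pixel (assumed)
f_pixels = f_mm / rho;     % mm / (mm/pixel) = pixels
fprintf('f = %g pixels\n', f_pixels);
```

For these assumed values the focal length works out to 800 pixels.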

Another observation from perspective projection is that as Z (distance from the camera) increases, objects in the image get smaller. Thus parallel lines that are not perpendicular to the optical axis appear to come to a point at a long distance away.


Figure 11.4 from [RVC] shows that parallel railroad tracks appear to come to a point at a long distance away.

When objects are very far away, the X, Y, and Z can be huge. If the camera is moved, those numbers hardly change. This explains why the moon seems to follow you, why the north star (Polaris) is always north, and why you can tell time from the sun regardless of where you are.

10.2.1. When Z is Constant

In industrial robotics applications it is common to position the camera so that Z is constant. For Z to be constant, the axis of the camera must be perpendicular to a flat surface being photographed. Images from a camera positioned over a conveyor belt carrying products can meet this scenario, as can a picture taken directly in front of a building.

When Z is constant and known, the ratio \frac{f}{Z} in the perspective projection equation is constant for the whole image, so the size of objects can be measured from the distance (in pixels) between two points on the image. Here f must be expressed in pixels.

The distance between two points in an image is:

\Delta_i = \sqrt{(u_1 - u_2)^2 + (v_1 - v_2)^2}.

Then, the distance between the points in the world can be computed.

\Delta_i = \frac{f}{Z}\,\Delta_w

\Delta_w = \frac{Z}{f}\,\Delta_i

Note: The most accurate results are obtained from points near the optical axis (center) of the image.
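The two equations above can be sketched in a few lines of MATLAB. The focal length, camera distance, and pixel coordinates are assumed values for illustration only.

```matlab
% Measure a world distance from an image when Z is constant and known.
f = 800;                        % focal length in pixels (assumed)
Z = 1.5;                        % camera-to-surface distance in m (assumed)
uv1 = [220; 310];               % first image point (u, v), pixels (assumed)
uv2 = [520; 710];               % second image point (u, v), pixels (assumed)

delta_i = norm(uv1 - uv2);      % distance in the image, pixels
delta_w = (Z / f) * delta_i;    % distance in the world, same unit as Z
fprintf('%g pixels -> %g m\n', delta_i, delta_w);
```

With these numbers the image distance is 500 pixels, which corresponds to 0.9375 m on the surface.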

The ratio of the distances between points in the image and points in the world is also constant. Thus, if we know the distance between two points in the world, then the distance between two other points can be calculated from an image.

\frac{\Delta_{i1}}{\Delta_{w1}} = \frac{\Delta_{i2}}{\Delta_{w2}}
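When neither f nor Z is known, a reference object of known size in the same plane is enough. The sketch below solves the ratio for the unknown world distance; all measurements are assumed values.

```matlab
% Use a reference object of known size to measure another object
% lying in the same constant-Z plane.
delta_w1 = 0.10;    % known world length of the reference, m (assumed)
delta_i1 = 80;      % its length in the image, pixels (assumed)
delta_i2 = 260;     % image length of the object to measure (assumed)

% delta_i1/delta_w1 = delta_i2/delta_w2, solved for delta_w2
delta_w2 = delta_i2 * delta_w1 / delta_i1;
fprintf('object length = %g m\n', delta_w2);
```

Here the unknown object comes out to 0.325 m long.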

10.2.2. Projective Geometry

In the matrices used to relate points from an image to points in the physical world, we use homogeneous coordinates rather than Euclidean coordinates. In one sense, this may seem to be just a convenience that allows matrix multiplication by adding another row to the matrices, but it is really more than that. Homogeneous coordinates are used in the geometry system known as projective geometry, which describes the relationship between the physical world and images taken by cameras.


Urbino, Italy, The ideal city, by Piero della Francesca (1415–1492)

Projective geometry is concerned with how the world appears to us, whereas Euclidean geometry describes the actual dimensions of objects. The images that our eyes send to the brain are projective views. Our brain then translates the projective images to Euclidean space to give us our intuition about the dimensions of the objects that we see.

Artists were the first to recognize the importance of projective geometry. It has its origins in the early Italian Renaissance, particularly in the architectural drawings of Filippo Brunelleschi (1377–1446) and Leon Battista Alberti (1404–1472), who invented the method of perspective drawing. Italian fresco painter Piero della Francesca (1415–1492) described the mathematical model of projective geometry in his publication De Prospectiva Pingendi in 1478. Later, French mathematician Gérard Desargues (1591–1661) gave us the more rigorous mathematical models and equations of projective geometry that we use today. Two other 17th century mathematicians who contributed to our knowledge of projective geometry were Blaise Pascal and Philippe de La Hire.


Figure 11.3 from [RVC]. This drawing illustrates the concept of projective geometry as an artist should view a scene that they are drawing or painting. Notice that in this diagram, the x and y axes point in the same direction as u and v that we use with MATLAB. By the right hand rule, the Z values of objects in the world are positive, but the positive y values are below the principal point.

Here, we review a few math principles of projective geometry and homogeneous coordinates. Later, we will apply homogeneous coordinates to images.

Consider a normalized projective system where, per the above diagram, f = 1. Any point (X,\,Y,\,Z) on the ray that passes through the image (Euclidean) plane at point (x,\,y) projects to that same image point. Simply divide (X,\,Y) by Z to convert from homogeneous coordinates to Euclidean coordinates.

( x,\, y ) = \left(\frac{X}{Z},\ \frac{Y}{Z} \right)

To convert a point in Euclidean coordinates to homogeneous coordinates, simply append an additional coordinate with a value of 1. Homogeneous coordinates are scale invariant: multiplying every coordinate by the same nonzero constant c gives another point on the same ray, and the factor cancels in the conversion back to Euclidean coordinates.

( x,\, y ) = \left(\frac{cX}{cZ},\ \frac{cY}{cZ} \right)
= \left(\frac{X}{Z},\ \frac{Y}{Z} \right)
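The two conversions and the scale invariance can be tried with a short sketch. The helper names to_homogeneous and to_euclidean are hypothetical, defined here only for illustration.

```matlab
% Convert between Euclidean and homogeneous coordinates.
to_homogeneous = @(p) [p; 1];                       % append a 1
to_euclidean   = @(ph) ph(1:end-1) / ph(end);       % divide by last element

p  = [3; 4];                   % a Euclidean point (assumed values)
ph = to_homogeneous(p);        % [3; 4; 1]
q  = to_euclidean(2.5 * ph);   % scaling by c = 2.5 cancels out

disp([p, q])                   % both columns are the same point
```

Any nonzero scale factor in place of 2.5 gives the same Euclidean point.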

The 3-tuple homogeneous coordinate points are said to be in two dimensional projective space (\bm{\tilde{P}} \in \mathbb{P}^2), that is, they correspond to points on a plane in Euclidean space (\bm{P} \in \mathbb{R}^2).

Representing points and lines in homogeneous coordinates allows us to take advantage of some convenient properties.

Given two homogeneous points, \bm{\tilde{p}_1} and \bm{\tilde{p}_2}, the line through the points is the cross product of the points.

\bm{\tilde{l}_{12}} = \bm{\tilde{p}_1} \times \bm{\tilde{p}_2}

A line in homogeneous coordinates is also a 3-tuple. The point where two lines intersect is also found by a cross product.

\bm{\tilde{p}} = \bm{\tilde{l}_1} \times \bm{\tilde{l}_2}
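Both cross product properties can be checked with a small sketch that uses MATLAB's cross function. The four points are assumed values chosen so that the answer is easy to verify by hand: the lines y = x and x + y = 2 meet at (1, 1).

```matlab
% Lines through pairs of points, and the intersection of the lines,
% computed with cross products of homogeneous 3-tuples.
p1 = [0; 0; 1];   p2 = [2; 2; 1];    % points (0,0) and (2,2)
p3 = [0; 2; 1];   p4 = [2; 0; 1];    % points (0,2) and (2,0)

l12 = cross(p1, p2);      % line through p1 and p2, here y = x
l34 = cross(p3, p4);      % line through p3 and p4, here x + y = 2

ph = cross(l12, l34);     % homogeneous intersection point
p  = ph(1:2) / ph(3);     % Euclidean intersection point

% A point lies on a line when their dot product is zero.
on_line = dot(l12, p2) == 0;
disp(p')
```

If the two lines were parallel, the third element of the intersection would be zero and the final division would not yield a finite Euclidean point.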

See also

Data Analysis study guide for notes on cross products.

10.2.3. Camera Matrix

Here we discuss how to describe the perspective projection camera model with matrices such that the model is more useful as a software model.

In homogeneous coordinates, a point in the physical world is ideally projected onto an image with the following matrix multiplication.

\begin{bmatrix} zx\\ zy\\ z \end{bmatrix}
= \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0
  \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}


The Euclidean coordinates of a point (x, y) relative to the principal point are \left(\frac{zx}{z}, \frac{zy}{z}\right).
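As an illustration, the matrix multiplication above can be written directly in MATLAB. The focal length and world point are assumed values, and the variable name M for the 3x4 projection matrix is our own.

```matlab
% Ideal projection of a world point with the 3x4 projection matrix.
f = 0.015;                           % focal length in m (assumed)
M = [f 0 0 0;
     0 f 0 0;
     0 0 1 0];                       % projection matrix from the equation

P_tilde = [0.3; 0.4; 3; 1];          % homogeneous world point (assumed)
p_tilde = M * P_tilde;               % [z*x; z*y; z]
xy = p_tilde(1:2) / p_tilde(3);      % Euclidean image-plane coordinates, m

fprintf('x = %g m, y = %g m\n', xy(1), xy(2));
```

For these values the point lands 1.5 mm and 2 mm from the principal point, which matches f X/Z and f Y/Z computed by hand.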

This model is appropriate for a continuous (film) image plane. For digital cameras, however, the image plane is divided into a grid of light sensors (pixels). Each pixel in this grid is \rho_u wide and \rho_v high. The pixels of a typical digital camera are usually square, about 10 microns (10 \times 10^{-6} meters) on each side.

In MATLAB, the origin is in the upper left corner of the image, so we translate the image coordinates to shift the origin away from the principal point.


Figure 11.6 from [RVC]. This drawing is of the perspective projection camera model, but it also illustrates the concept of projective geometry.

To make the equations even more useful to machine vision applications, we can relate the position and orientation (pose) of the camera relative to a world coordinate frame.


Figure 11.5 from [RVC]. The camera may be translated and rotated relative to a world coordinate frame.

\tilde{p}=\begin{bmatrix} wu \\ wv \\ w \end{bmatrix}
 = \underbrace{\underbrace{
 \begin{bmatrix} \frac{f}{\rho_u} & 0 & u_0 \\
                      0 & \frac{f}{\rho_v} & v_0 \\
                      0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\
                0 & 0 & 1 & 0 \end{bmatrix}}_{\mathbf{K}}
{\begin{bmatrix} \mathbf{R} & t \\
       \mathbf{0}_{1\times 3} & 1\end{bmatrix}}^{-1}}_{\mathbf{C}}
\begin{bmatrix} X \\ Y \\ Z \\ 1\end{bmatrix}


The u_0 and v_0 terms allow us to use the upper left corner of the image as the origin. The coordinates of a point (u, v) are again found by converting from homogeneous coordinates to Euclidean coordinates, \left(\frac{wu}{w}, \frac{wv}{w}\right).

The product of the first two matrices is typically denoted by the symbol \bf{K}, and we refer to its elements as the intrinsic parameters. All the numbers in these two matrices are functions of the camera itself; they do not depend on where the camera is or where it is pointing. These numbers include the height and width of the pixels on the image plane, the coordinates of the principal point, and the focal length of the lens.

The third matrix describes the extrinsic parameters, which specify where the camera is and how it is oriented, but it says nothing about the parameters of the camera itself.

The product of these matrices is called the camera matrix, \bf{C}.

\mathbf{\tilde{p}} = \mathbf{C\,\tilde{P}}
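To make the roles of the three matrices concrete, here is a minimal sketch that assembles a camera matrix from assumed parameters and projects one point. Every number is an illustrative assumption. We also assume that the matrix containing R and t is the pose of the camera in the world frame, as the inverse in the equation suggests.

```matlab
% Build the camera matrix C from intrinsic and extrinsic parameters.
f     = 0.008;                 % focal length, m (assumed)
rho_u = 10e-6;  rho_v = 10e-6; % pixel width and height, m (assumed)
u0 = 640;  v0 = 512;           % principal point, pixels (assumed)

% Intrinsic part: product of the first two matrices in the equation.
K = [f/rho_u 0 u0; 0 f/rho_v v0; 0 0 1] * [eye(3) zeros(3,1)];

% Extrinsic part: camera pose in the world frame (assumed).
R = eye(3);                    % no rotation
t = [0; 0; -2];                % camera sits 2 m behind the world origin
T = [R t; zeros(1,3) 1];

C = K / T;                     % same as K * inv(T)

P_tilde = [0.1; -0.05; 3; 1];  % homogeneous world point (assumed)
p_tilde = C * P_tilde;         % [w*u; w*v; w]
uv = p_tilde(1:2) / p_tilde(3);
fprintf('u = %g, v = %g\n', uv(1), uv(2));
```

In the camera frame the point is 5 m away, so with f/\rho = 800 pixels it should appear near (656, 504). Moving the camera changes only T, while K stays the same.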

The camera matrix can also be found from a calibration procedure, which may be needed if some parameters are not known or if the lens causes any distortion to the image.

\tilde{p}=\begin{bmatrix} wu \\ wv \\ w \end{bmatrix}
 = \begin{bmatrix}
       c_{11} & c_{12} & c_{13} & c_{14} \\
       c_{21} & c_{22} & c_{23} & c_{24} \\
       c_{31} & c_{32} & c_{33} & c_{34}
   \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1\end{bmatrix}


The value of c_{34} is often set to 1, in which case, \bf{C} is said to be a normalized camera matrix.
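Because homogeneous coordinates are scale invariant, dividing the whole camera matrix by c_{34} does not change where points land in the image. The sketch below checks this with an assumed camera matrix; it requires c_{34} to be nonzero.

```matlab
% Normalize a camera matrix so that its (3,4) element equals 1.
C = [1600    0 1280 2560;
        0 1600 1024 2048;
        0    0    2    4];           % assumed camera matrix
Cn = C / C(3,4);                     % normalized camera matrix

P_tilde = [0.5; 0.5; 3; 1];          % homogeneous world point (assumed)
a = C  * P_tilde;   uv_a = a(1:2) / a(3);
b = Cn * P_tilde;   uv_b = b(1:2) / b(3);

disp([uv_a, uv_b])                   % the two columns agree
```

Both matrices place the point at (720, 592), since the common scale factor cancels in the conversion to Euclidean coordinates.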

The Machine Vision Toolbox (MVTB) has objects that model the projection of points onto the image plane of cameras with known parameters.


Given a camera matrix and a point in the world as below, find where the point will appear in the image.

>> C
C =
        512        -110           1         800
        512         512        -100        1600
          1           1           0           0
>> P
P =

>> P_tilde = [P;1]  % homogeneous coordinates
P_tilde =

>> p_tilde = C*P_tilde
p_tilde =
>> p = p_tilde(1:2)/p_tilde(3) % Cartesian coordinates
p =

The image plane point is at (156, 424).