Technology - Anatomy of a computer vision application

Stura specializes in computer vision technologies. Our target applications are people counting, footfall analysis, queue monitoring, and access control.

The three steps for monitoring people in video streams are detection, tracking, and re-identification.





Nowadays, the most effective person detectors use convolutional neural networks (CNNs). These models are trained on large datasets (e.g. COCO) and have proved effective in many situations.


Despite the power of CNNs, the detection process is never 100% accurate. Depending on the quality of the videos and the complexity of the scenes, some problems can occur:

Stura specializes in technologies for analyzing video streams that contain people, with applications in analytics (e.g. people counting) and surveillance (e.g. access control).

These applications are deceptively difficult: while it’s very easy for us humans to watch a video and immediately understand how many people are in it, it’s much more difficult (and in some cases impossible) for a machine to achieve the same result. For example, in a shopper journey application, three steps are necessary to extract the information of interest:

Step 1: Detection

Detection is the process of analyzing an image to see if it contains persons. The typical output of a detection algorithm is a set of boxes denoting the location of the persons in the image.
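As a minimal sketch of what such an output might look like in code, the block below models a detection as a box with a confidence score and computes intersection over union (IoU), a common measure of how much two boxes overlap. The `Box` layout, the `keep_confident` helper, and the 0.5 threshold are assumptions made for illustration, not the format of any particular detector.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical layout)."""
    x1: float  # left edge
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge
    score: float = 1.0  # detector confidence in [0, 1]

    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union: 0.0 for disjoint boxes, 1.0 for identical ones."""
    inter_w = min(a.x2, b.x2) - max(a.x1, b.x1)
    inter_h = min(a.y2, b.y2) - max(a.y1, b.y1)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def keep_confident(boxes: list[Box], min_score: float = 0.5) -> list[Box]:
    """Drop low-confidence boxes; 0.5 is an arbitrary example threshold."""
    return [b for b in boxes if b.score >= min_score]
```

Raising `min_score` removes more false detections but also discards more real people, so the threshold would normally be tuned per camera.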

Challenges with detection

The detection process is never 100% accurate. Depending on the performance of the algorithm, the quality of the images, and the complexity of the environment, some problems can occur:

Detections with CNN


Pose Estimation with GPU

Processing the video with a GPU also makes it possible to detect the pose of each person. The pose describes the position of each body part in space and can be visualized as a skeleton overlaid on the person. The advantages of using pose estimation for detection are threefold:

  1. It detects people more reliably (with better accuracy) when the body is partially occluded.

  2. When the feet are visible, the true position of the person can be calculated very accurately, since we know exactly where the feet rest on the floor.

  3. It’s possible to know which direction a person is facing and whether they are engaging in specific movements (e.g. extending an arm to grab something on a shelf).

Comparison between detection performed with bounding box (left) and detection performed with pose estimation (right). The pose estimation detection contains information about which direction a person is facing.
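The second and third advantages can be illustrated with a small sketch. It assumes a pose is a dictionary mapping keypoint names to pixel coordinates, with invisible keypoints simply missing; the keypoint names and both helper functions are hypothetical. The facing test is a crude heuristic that only distinguishes "towards the camera" from "away from it" in a non-mirrored image.

```python
from __future__ import annotations

from typing import Optional

Point = tuple[float, float]
Pose = dict[str, Point]  # keypoint name -> (x, y) in pixels; missing key = not visible


def foot_point(pose: Pose) -> Optional[Point]:
    """Midpoint of the visible ankles, used as the person's position on the floor."""
    ankles = [pose[k] for k in ("left_ankle", "right_ankle") if k in pose]
    if not ankles:
        return None  # feet occluded: no reliable floor position
    x = sum(p[0] for p in ankles) / len(ankles)
    y = sum(p[1] for p in ankles) / len(ankles)
    return (x, y)


def facing_camera(pose: Pose) -> Optional[bool]:
    """Heuristic: a person facing the camera shows their left shoulder on the
    right-hand side of the image (larger x) and their right shoulder on the left."""
    if "left_shoulder" not in pose or "right_shoulder" not in pose:
        return None
    return pose["left_shoulder"][0] > pose["right_shoulder"][0]
```

Mapping the foot point from pixels to floor coordinates would additionally require the camera calibration, which is outside the scope of this sketch.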

Step 2: Tracking

Tracking is the process of assigning a unique ID to each person visible in the video. Ideally, the ID stays associated with the same person for as long as they remain in the camera’s field of view, and is discarded when the person exits it.
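One simple way to picture this is a tracker that links each new box to the previous box it overlaps most. The sketch below is a deliberately naive illustration under that assumption, not the tracker used in any specific product: the class name, the `min_iou` value of 0.3, and the choice to drop a track as soon as it goes unmatched for one frame are all hypothetical.

```python
from __future__ import annotations

from itertools import count

BoxT = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: BoxT, b: BoxT) -> float:
    """Intersection over union of two boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


class GreedyIouTracker:
    """Keeps an ID on a person by matching boxes between consecutive frames."""

    def __init__(self, min_iou: float = 0.3) -> None:
        self.min_iou = min_iou
        self.tracks: dict[int, BoxT] = {}  # track ID -> last known box
        self._ids = count(1)

    def update(self, boxes: list[BoxT]) -> dict[int, BoxT]:
        # Score every (existing track, new box) pair, best overlap first.
        pairs = sorted(
            ((iou(prev, box), tid, i)
             for tid, prev in self.tracks.items()
             for i, box in enumerate(boxes)),
            reverse=True,
        )
        assigned: dict[int, BoxT] = {}
        used: set[int] = set()
        for overlap, tid, i in pairs:
            if overlap < self.min_iou:
                break  # remaining pairs overlap too little to be the same person
            if tid in assigned or i in used:
                continue
            assigned[tid] = boxes[i]
            used.add(i)
        # Boxes that matched no track are treated as people entering the scene.
        for i, box in enumerate(boxes):
            if i not in used:
                assigned[next(self._ids)] = box
        self.tracks = assigned  # unmatched tracks are discarded immediately
        return assigned
```

Calling `update` once per frame returns the current ID-to-box mapping. A practical tracker would usually tolerate a few missed frames before discarding an ID.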

Note: accurate tracking is necessary for computing basic metrics such as exposure and dwell time. Without tracking it would not be possible to know if the persons detected in each frame were already there or just entered the scene.
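To make the dependency on tracking concrete, here is a small sketch that derives dwell time from tracker output. It assumes each frame is summarized as a timestamp in seconds plus the set of track IDs visible in it; the `dwell_times` function and this input format are illustrative assumptions.

```python
from __future__ import annotations


def dwell_times(frames: list[tuple[float, set[int]]]) -> dict[int, float]:
    """Seconds between the first and last frame in which each track ID was seen."""
    first_seen: dict[int, float] = {}
    last_seen: dict[int, float] = {}
    for timestamp, ids in frames:
        for track_id in ids:
            first_seen.setdefault(track_id, timestamp)  # only set on first sighting
            last_seen[track_id] = timestamp
    return {tid: last_seen[tid] - first_seen[tid] for tid in first_seen}
```

If the tracker swaps or loses an ID halfway through a visit, the same person is counted as two short visits, which is why tracking errors directly distort this metric.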

Challenges with tracking

Tracking depends on the output of the detection process, which itself is not 100% accurate. Even with a good detection algorithm, accurately tracking people is a complex task that becomes even harder in crowded scenes, where many people come into close proximity and partially occlude each other. Typical problems with tracking are:

Step 3: Re-Identification

Re-identification is the process of assigning each person a unique ID that is valid across all the cameras of a venue. The re-identification operation is executed every time a person enters the field of view of a camera: the algorithm must decide whether this is a new person or someone previously detected by another camera. Where people can be re-identified accurately, it’s also possible to compute the total number of visitors in the venue and reconstruct their paths through the entire store.
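The "new or previously seen" decision can be sketched as a nearest-neighbor lookup. The example assumes each detected person has already been reduced to a numeric feature vector (how that vector is produced, e.g. from the face or from overall appearance, is outside this sketch); the `Gallery` class and the 0.8 similarity threshold are hypothetical.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Gallery:
    """Venue-wide store of one reference feature vector per known person."""

    def __init__(self, min_similarity: float = 0.8) -> None:
        self.min_similarity = min_similarity
        self.known: dict[int, list[float]] = {}  # global ID -> reference vector

    def identify(self, feature: list[float]) -> int:
        """Return the ID of the most similar known person, or register a new one."""
        best_id, best_sim = None, self.min_similarity
        for global_id, reference in self.known.items():
            sim = cosine(feature, reference)
            if sim >= best_sim:
                best_id, best_sim = global_id, sim
        if best_id is None:
            best_id = len(self.known) + 1  # nobody similar enough: new visitor
            self.known[best_id] = list(feature)
        return best_id
```

The number of entries in `known` at the end of the day is then an estimate of the total number of visitors, and it is only as good as the features and the threshold.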

Challenges with re-identification

As of today, the only reliable way to re-identify people across different cameras is facial recognition, which is typically not viable unless the cameras are installed at eye level. Re-identification without the face is very challenging. It may work with some success when the total number of people is limited (e.g. < 50 persons) and the layout of the cameras in the store is known. Knowing the camera locations allows the algorithm to estimate the a priori probability of a person walking from one camera to another in a given amount of time, and to use this information for more accurate re-identification.
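A rough sketch of how such a priori probabilities might be combined with appearance is shown below. Everything in it is an assumption for illustration: the camera names, the transition table with its time windows and probabilities, the `Sighting` record, and the rule of multiplying similarity by the prior.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Sighting:
    person_id: int
    camera: str        # camera where the person was last seen
    time: float        # time of that sighting, in seconds
    similarity: float  # appearance similarity to the new detection, in [0, 1]


# Hypothetical priors: (from_camera, to_camera) -> (min_seconds, max_seconds, probability)
TRANSITIONS = {
    ("entrance", "aisle_1"): (5.0, 60.0, 0.7),
    ("entrance", "checkout"): (20.0, 300.0, 0.3),
}


def transition_prior(src: str, dst: str, elapsed: float) -> float:
    """Prior probability of walking from src to dst in `elapsed` seconds."""
    if (src, dst) not in TRANSITIONS:
        return 0.0  # cameras not connected in the known layout
    min_s, max_s, probability = TRANSITIONS[(src, dst)]
    return probability if min_s <= elapsed <= max_s else 0.0


def best_match(candidates: list[Sighting], camera: str, now: float,
               min_score: float = 0.3) -> Optional[int]:
    """ID of the candidate with the highest similarity * prior, or None if all are weak."""
    best_id, best_score = None, min_score
    for s in candidates:
        score = s.similarity * transition_prior(s.camera, camera, now - s.time)
        if score > best_score:
            best_id, best_score = s.person_id, score
    return best_id
```

With this rule, a candidate who looks very similar but could not physically have reached the camera in the elapsed time is ruled out, which is the benefit the layout knowledge is meant to provide.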

Notes on Age and Gender Recognition

The Mediar team also expressed interest in age and gender recognition. Several solutions are available for this task; however, all of the libraries require that the persons’ faces be clearly visible and that the images be of good quality. At the moment, most of the videos provided as samples do not have the angle of view and quality necessary to detect the shoppers’ faces.