Technology - Anatomy of a computer vision application

Video analysis of pedestrian movements using pose estimation and heatmaps © Stura, Inc.

Stura specializes in computer vision technologies. Our target applications are people counting, footfall analysis, queue monitoring, and access control. This section provides some background information about these applications.

The three steps for monitoring people in video streams are:

Detection
Tracking
Re-Identification (Optional)

Detection, tracking and re-identification are the three building blocks for video analytics applications © Stura, Inc.

Step 1: Detection

Detection is the process of analyzing an image to determine whether it contains persons. The typical output of a detection algorithm is a set of bounding boxes marking the location of each person in the image.

Nowadays the most effective person detectors use convolutional neural networks (CNNs). These models are trained on large datasets (e.g. COCO) and have proved effective in many situations.

Challenges with detection

Despite the power of CNNs, the detection process is never 100% accurate. Depending on the quality of the videos and the complexity of the scenes, some problems can occur:

Missed Detections: some persons will appear in a video without a corresponding box. From the algorithm’s point of view, it is as if those persons were not there.
False Positives: some boxes will contain part of the background rather than a person. This often happens when the image contains shapes that resemble the outline of a person. Reflective walls can also result in false detections.
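
Both error types can be counted once a few frames have been annotated by hand. The sketch below is a minimal illustration, not Stura's evaluation code: it assumes boxes are given as (x_min, y_min, x_max, y_max) in pixels and that a detection matches an annotated person when their intersection over union (IoU) is at least 0.5. The function names and the threshold are hypothetical choices made for this example.

```python
from typing import List, Tuple

# Assumed box layout for this example: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_errors(
    detections: List[Box],
    ground_truth: List[Box],
    iou_threshold: float = 0.5,  # hypothetical matching threshold
) -> Tuple[int, int]:
    """Return (missed_detections, false_positives) for a single frame."""
    unmatched_truth = list(ground_truth)
    false_positives = 0
    for det in detections:
        # Pick the annotated person that overlaps this detection the most.
        best = max(unmatched_truth, key=lambda gt: iou(det, gt), default=None)
        if best is not None and iou(det, best) >= iou_threshold:
            unmatched_truth.remove(best)  # each person can be matched once
        else:
            false_positives += 1  # the box covers background
    # Annotated persons left without a box are missed detections.
    return len(unmatched_truth), false_positives


# Two annotated persons; the detector finds one and fires once on background.
truth = [(0, 0, 10, 20), (50, 0, 60, 20)]
found = [(1, 0, 11, 20), (100, 100, 110, 120)]
missed, false_pos = count_errors(found, truth)
```

Summing these counts over a representative set of frames gives a rough picture of how a detector behaves on a specific camera.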

We also note that detecting persons with a CNN is a resource-intensive process. Even on GPUs, the detection stage limits the throughput of the system. At Stura we specialize in custom-trained, fine-tuned CNN models, which we optimize for both accuracy and detection speed.

Pose Estimation with GPU

Some deep learning models are also capable of performing pose estimation. Pose estimation detects not only the persons but also the locations of their body parts. The advantages of using pose estimation are threefold:

It’s more accurate at detecting people when the body is partially occluded.
By detecting the position of the feet, we can estimate the person’s position (x,y) on the floor.
With pose information we can determine which direction a person is facing and whether they are engaging in specific movements. For example, we can detect when someone grabs an object from a shelf.
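
As an illustration of the second point, a foot position in the image can be mapped to the floor with a planar homography. This is one common approach and is assumed here; the text above does not say which method Stura uses. The keypoint names (left_ankle, right_ankle), the function names, and the 3x3 matrix are all hypothetical; in practice the matrix would come from calibrating the camera against known points on the floor.

```python
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def feet_midpoint(keypoints: Dict[str, Point]) -> Point:
    """Midpoint between the two ankle keypoints, in image pixels."""
    (lx, ly), (rx, ry) = keypoints["left_ankle"], keypoints["right_ankle"]
    return ((lx + rx) / 2.0, (ly + ry) / 2.0)


def image_to_floor(point: Point, homography: Sequence[Sequence[float]]) -> Point:
    """Apply a 3x3 homography that maps image pixels to floor coordinates."""
    u, v = point
    h = homography
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get floor (x, y).
    return (x / w, y / w)


# Hypothetical calibration: a pure scaling, used only to keep the example simple.
H = [
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
pose = {"left_ankle": (100.0, 400.0), "right_ankle": (120.0, 400.0)}
floor_xy = image_to_floor(feet_midpoint(pose), H)
```

The mapping is only as good as the calibration and the ankle keypoints, so the result is an estimate, and it degrades when the feet are occluded.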

Comparison between bounding boxes only (left) and pose estimation (right) © Stura, Inc.

Step 2: Tracking

Tracking is the process of assigning a unique ID to each person visible in the video. The ID stays with each person for as long as they remain in the camera’s field of view. Accurate tracking is necessary for footfall analysis. For example, tracking provides the data to calculate the dwell time of shoppers in specific areas.
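
The dwell-time calculation can be sketched as follows. This is a minimal example under stated assumptions: the tracker is assumed to output observations of the form (time in seconds, track ID, x, y) in floor coordinates, and the area of interest is assumed to be an axis-aligned rectangle. The names Observation, Zone, and dwell_times are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Observation = Tuple[float, int, float, float]  # (time_s, track_id, x, y)
Zone = Tuple[float, float, float, float]       # (x_min, y_min, x_max, y_max)


def in_zone(x: float, y: float, zone: Zone) -> bool:
    x_min, y_min, x_max, y_max = zone
    return x_min <= x <= x_max and y_min <= y <= y_max


def dwell_times(observations: Iterable[Observation], zone: Zone) -> Dict[int, float]:
    """Seconds each track spent inside the zone.

    Time is accumulated only between consecutive observations of the same
    track that are both inside the zone.
    """
    totals: Dict[int, float] = defaultdict(float)
    last_inside: Dict[int, float] = {}  # track_id -> time of previous in-zone observation
    for t, track_id, x, y in sorted(observations):
        if in_zone(x, y, zone):
            previous = last_inside.get(track_id)
            if previous is not None:
                totals[track_id] += t - previous
            last_inside[track_id] = t
        else:
            # The person left the zone: stop accumulating until they re-enter.
            last_inside.pop(track_id, None)
    return dict(totals)


shelf_area: Zone = (0.0, 0.0, 2.0, 2.0)
track_log = [
    (0.0, 1, 1.0, 1.0), (1.0, 1, 1.5, 1.0),  # track 1 inside for one second
    (2.0, 1, 5.0, 5.0),                      # track 1 steps out
    (3.0, 1, 1.0, 1.0), (4.0, 1, 1.0, 1.0),  # track 1 returns for one second
    (0.0, 2, 9.0, 9.0), (1.0, 2, 9.0, 9.0),  # track 2 never enters
]
result = dwell_times(track_log, shelf_area)
```

Because the totals are keyed by track ID, any tracking error that splits or merges IDs feeds directly into the dwell-time figures.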

Challenges with tracking

Tracking depends on the output of the detection process, which we know is not 100% accurate. Even with good detections, tracking people is a rather complex task. The problem becomes harder in crowded scenes with many people occluding each other. Typical problems with tracking are:

Multiple IDs assigned to the same person. The algorithm loses track of an individual and assigns them a new ID. This can happen, for example, when a person moves behind an obstacle. The data will incorrectly report there were two different people.
Same ID shared between different persons. The algorithm associates a previous ID with a different person. The switch can happen when a person moves in front of another one. The data will incorrectly report there was one person, when there were actually two.
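
The first failure mode is easy to reproduce with a deliberately simple tracker. The sketch below is not Stura's tracking algorithm: it is a hypothetical greedy tracker that links each detection to the track whose previous box overlaps it the most, and drops any track that has no detection in the current frame. The class name, box layout (x_min, y_min, x_max, y_max), and IoU threshold are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIouTracker:
    """Links detections to tracks frame by frame using box overlap only."""

    def __init__(self, iou_threshold: float = 0.3) -> None:
        self.iou_threshold = iou_threshold
        self.tracks: Dict[int, Box] = {}  # track ID -> box in the previous frame
        self.next_id = 1

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        # Score every (track, detection) pair, best overlap first.
        pairs = sorted(
            ((iou(box, det), tid, i)
             for tid, box in self.tracks.items()
             for i, det in enumerate(detections)),
            reverse=True,
        )
        assigned: Dict[int, Box] = {}
        used = set()
        for score, tid, i in pairs:
            if score < self.iou_threshold:
                break  # remaining pairs overlap too little to be the same person
            if tid in assigned or i in used:
                continue
            assigned[tid] = detections[i]
            used.add(i)
        # Any detection left over starts a new track with a fresh ID.
        for i, det in enumerate(detections):
            if i not in used:
                assigned[self.next_id] = det
                self.next_id += 1
        self.tracks = assigned  # tracks with no detection this frame are dropped
        return assigned


tracker = GreedyIouTracker()
frame1 = tracker.update([(0, 0, 10, 20)])  # person appears
frame2 = tracker.update([(1, 0, 11, 20)])  # same person, small movement
frame3 = tracker.update([])                # person hidden behind an obstacle
frame4 = tracker.update([(2, 0, 12, 20)])  # person reappears
ids_seen = [list(f) for f in (frame1, frame2, frame3, frame4)]
```

Here the same person ends up with two different IDs because a single frame without a detection ends the track. More robust trackers typically keep lost tracks alive for a while and use motion or appearance cues in addition to overlap, which reduces but does not eliminate these errors.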

At Stura we have extensive experience in designing tracking algorithms. We will adopt the solution that provides the best performance for your videos.

Step 3: Multi-camera re-identification (Optional)

Re-identification is the process of recognizing the same person moving across different cameras. For every new detection, the system has to decide whether this is a new visitor or a person who has been seen before. Successful re-identification enables computing the total number of visitors in the venue. It also makes it possible to reconstruct their path through the entire store (shopper journey).

Challenges with re-identification

As of today, facial recognition is the most reliable re-identification approach. Unfortunately, it is not viable unless cameras are placed at eye level. It is also not always an option for applications that need to be GDPR compliant.

Luckily there are approaches that enable re-identification without using facial images. These methods create anonymized descriptions (encoding vectors) of each person’s appearance. The encoding vectors can later be used to search for a specific person, or to automatically re-identify the same person across cameras.
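
A minimal sketch of how encoding vectors can be compared is shown below. It assumes the vectors have already been produced by some appearance model, and it uses cosine similarity with a fixed threshold, which is a common choice but not necessarily the one used by Stura. The gallery layout, the visitor IDs, the three-dimensional vectors, and the 0.8 threshold are hypothetical values chosen to keep the example short.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two encoding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def reidentify(
    query: Sequence[float],
    gallery: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # hypothetical; would be tuned per venue
) -> Optional[str]:
    """Return the ID of the most similar known visitor, or None for a new visitor."""
    best_id: Optional[str] = None
    best_score = threshold
    for visitor_id, vector in gallery.items():
        score = cosine_similarity(query, vector)
        if score >= best_score:
            best_id, best_score = visitor_id, score
    return best_id


# Encodings of visitors already seen by other cameras (toy 3-D vectors).
gallery = {
    "visitor-1": [1.0, 0.0, 0.0],
    "visitor-2": [0.0, 1.0, 0.0],
}
seen_before = reidentify([0.9, 0.1, 0.0], gallery)   # close to visitor-1
new_visitor = reidentify([0.0, 0.0, 1.0], gallery)   # unlike anyone in the gallery
```

The threshold trades off the two kinds of mistakes: set too low, different people are merged into one visitor; set too high, the same person is counted again on every camera.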

The technology for re-identification is still in the research stage. At Stura we can help you assess its accuracy and see how it performs in your venue.

Implementation Considerations

The what, when, where of computer vision applications.

What: Video Sources

The position and types of cameras have a big impact on the outcome of a computer vision project. New installations represent favorable cases since they offer flexibility in choosing:

Total number of cameras and their position (on the ceiling, on the walls, etc.).
Sensor technology: RGB, thermal, stereoscopic, etc.
Lenses (if applicable): narrow or wide angle.

It is also true that most venues are already equipped with CCTV security cameras. The resolution and placement of these cameras are often not optimal for computer vision, but it’s still possible to extract valuable information from those streams. At Stura we specialize in software that is fine-tuned to work with CCTV cameras.

When: Real-Time vs Offline Computation

Some applications (e.g. surveillance) demand real-time processing. In other instances the video streams can be processed offline. For example, data in the retail market is often analyzed over long time intervals, so the results are not needed immediately. When real-time data is not needed, video analytics can leverage offline processing, which can improve both accuracy and efficiency (less computation is needed).

Where: Edge, On Premises, Cloud

Video streams are information-rich media sources and analyzing them requires significant processing power.

In some applications, edge processing allows extracting data directly at the source. Small embedded (IoT) devices can be paired with cameras to process the stream locally. Edge processing improves system robustness and reliability. Also, since the videos never leave the device, it offers the highest level of data privacy.

Another option involves processing the videos on on-site servers. This common approach is easy to integrate into existing IT infrastructures. It also offers a good level of data privacy since the videos do not leave the company’s network.

Finally, a third approach is to process the videos in the cloud. Cloud computing offers great benefits in terms of flexibility and scalability. New servers can be quickly added (or removed) to match the processing requirements.