What can we get from modern computer vision?
Computer vision is ubiquitous. It is an exciting field of study that involves teaching machines to "see" and interpret the world around them using digital images or videos. For a long time, computer vision was a separate field of science and engineering. One of the key differentiators between modern computer vision and its earlier iterations is the shift from rule-based to data-driven approaches. With the advent of deep learning, computer vision has moved away from handcrafted features and towards end-to-end models that learn directly from raw data. This has allowed for more accurate and robust models, as well as the ability to scale up to large datasets. Thanks to this shift, and particularly to transformer-based architectures, computer vision now shares much common ground with natural language processing. That being said, there are some important differences.
There is a plethora of startups that build their success on computer vision techniques. For example, AI Clearing uses computer vision to provide automatic construction progress monitoring. Thanks to vision models paired with drone-based imagery, construction site managers can easily track the progress of massive construction sites, such as large solar farms or highways. Nomagic automates repeatable logistics tasks in warehouses using robotics. It uses computer vision to determine optimal pick positions for unknown products, scan barcodes, and place items in containers in an optimal way. Focal Systems provides a retail automation solution: its cameras automatically scan shop shelves so that stores have precise information about what’s on them, which makes it possible to manage the shop automatically. Another startup, ReSpoVision, builds a football-dedicated solution that automatically analyses TV match recordings and produces accurate game telemetry (such as ball positions or player velocities).
There are many computer vision tasks, and classifying all of them is a tedious job. One way to do so is by input data: photos (from a camera, UAV, satellite…), videos, point clouds, or various remote-sensing devices. Another important distinction is between generative and discriminative models. Discriminative models focus on analysing observed images, for example detecting objects or recognising faces. Generative models, on the other hand, are designed to create something new, such as realistic images or videos. Examples include Midjourney and DALL-E 2, both tools that generate images from textual prompts.
Generative computer vision tasks
Generative computer vision tasks involve creating something new, such as generating images or videos that resemble a set of training data. These tasks are typically tackled with generative models, which create new data based on patterns observed in the training data. Let’s start with text-to-image synthesis: generating an image based on a textual description. Think of DALL-E or Midjourney! Another common generative task is image synthesis, which produces new images similar to a set of training images. This can be used for a variety of applications, such as generating realistic images of people, animals, or objects for video games or virtual reality environments. Style transfer is closely related: it takes the style of one image and transfers it to another. For example, one could transfer the style of a famous painting onto a photograph. There’s no need to stay in two dimensions, either. With 3D object generation (or 3D reconstruction), one can create 3D models of objects from 2D images or videos, for instance to build virtual tours of buildings or to visualise complex scientific data in 3D.
Sometimes we don’t want to generate anything new; we just want to work with what we have. For instance, image inpainting fills in missing or damaged portions of an image. This is useful in applications such as image restoration or video editing, where damaged or missing parts of an image need to be reconstructed. Another such task, widely known from crime TV shows, is image super-resolution: generating a high-resolution image from a low-resolution one. This is useful wherever an improvement of image quality is required, such as in satellite imagery. Sometimes we want to work with visual content, but the target output is text. For instance, image captioning generates a textual description of an image or video. The closely related task of video summarisation produces a shorter, more concise version of a longer video by selecting and summarising keyframes or segments. It is used in applications such as news broadcasting, sports analysis, and security surveillance.
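To give a flavour of how inpainting can work, here is a minimal toy sketch (not a production algorithm): a diffusion-style fill that repeatedly replaces each missing pixel with the mean of its four neighbours. The function name and the 5×5 test image are made up for illustration.

```python
import numpy as np

def inpaint(img, mask, iters=50):
    """Toy diffusion inpainting: repeatedly replace each missing pixel
    (mask == 1) with the mean of its four neighbours."""
    out = img.copy()
    out[mask == 1] = 0.0          # wipe the "damaged" pixels
    known = mask == 0
    for _ in range(iters):
        padded = np.pad(out, 1, mode="edge")
        neigh = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                 padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        out = np.where(known, out, neigh)   # only update unknown pixels
    return out

# Damage the centre of a constant-grey image, then restore it:
img = np.full((5, 5), 0.7)
mask = np.zeros((5, 5), dtype=int)
mask[2, 2] = 1
restored = inpaint(img, mask)
```

Real inpainting models (e.g. deep generative ones) hallucinate plausible texture rather than just diffusing neighbours, but the idea of propagating information from known into unknown regions is the same.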
Discriminative computer vision tasks
One of the primary tasks of computer vision is image classification: categorising an image into predefined classes. For example, a computer vision system can be trained to classify images of animals into species such as cats, dogs, and birds. Another important task is semantic segmentation, which labels every pixel in an image with a corresponding class; essentially, it is classification at the pixel level. It can be used to separate the foreground and background of an image or to segment different objects in it. For example, in medical imaging, semantic segmentation can identify and segment tumours in MRI scans. Also, think of blurring your background on Zoom! Classification is not limited to images. Point cloud classification assigns points in a 3D point cloud to different classes. For instance, it can be used to identify different types of objects in a 3D map of a warehouse, or road markings and signs in a 3D map of a street.
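The "classification at the pixel level" idea can be made concrete with a deliberately trivial sketch: label each pixel foreground or background by thresholding its intensity. Real segmentation networks learn far richer per-pixel decisions, but the output format (one class per pixel) is the same. The function name and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

def segment_foreground(img, threshold=0.5):
    """Toy binary segmentation: pixel-wise class 1 (foreground)
    where intensity exceeds the threshold, else 0 (background)."""
    return (img > threshold).astype(np.uint8)

# A bright 2x2 square on a dark background:
img = np.zeros((6, 6))
img[2:4, 2:4] = 0.9
mask = segment_foreground(img)   # same shape as img, one label per pixel
```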
Object detection is another key task: identifying and localising objects within an image or video. It is crucial for applications such as security surveillance and autonomous vehicles. For instance, object detection algorithms can detect and track suspicious behaviour in a crowd of people, or identify and avoid obstacles on the road. Facial recognition, a special case, recognises and identifies individuals based on their facial features; it is used in security surveillance, identity verification, and social media tagging. Multi-object tracking follows multiple objects over time. It is commonly used in video analysis and surveillance, where the movements of multiple people or vehicles must be tracked in real time. A use case would be tracking customer behaviour in a store to improve store layouts, or identifying bottlenecks in warehouse operations.
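Detectors and trackers are commonly evaluated (and their predictions matched to ground truth) with intersection-over-union (IoU) between bounding boxes. A small self-contained sketch, assuming the common `(x1, y1, x2, y2)` corner format:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two partially overlapping boxes: intersection 4, union 28.
print(iou((0, 0, 4, 4), (2, 2, 6, 6)))  # 4/28 ≈ 0.143
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.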
There are many more tasks. Depth estimation computes the distance of objects from a camera or sensor; it is used in robotics, augmented reality, and autonomous vehicles. Pose estimation recovers the 3D position and orientation of a human body or object from a 2D image or video; it is used in sports analysis, motion capture, and virtual try-on. Edge detection identifies the edges or boundaries between objects in an image. It is useful for feature extraction and appears in applications such as face detection, image segmentation, and object recognition. These are just a few examples of the many computer vision tasks that exist, and the list is open-ended. Some tasks can’t be easily classified, such as optical character recognition (OCR): recognising text in an image or video and converting it into machine-readable text. OCR is commonly used in document scanning and digital archiving. As computer vision technology continues to improve, we can expect even more advanced applications and use cases in the future.
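Edge detection is one of the few tasks small enough to sketch in full. Below is the classic Sobel operator, a pre-deep-learning technique: convolve the image with two 3×3 kernels to estimate horizontal and vertical gradients, and take the magnitude. This is an unoptimised illustrative version (a real implementation would use vectorised convolution).

```python
import numpy as np

def sobel_edges(img):
    """Gradient magnitude via 3x3 Sobel kernels; img is a 2D float array."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T                                  # the vertical-gradient kernel
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))             # "valid" convolution, no padding
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx = np.sum(kx * patch)
            gy = np.sum(ky * patch)
            out[i, j] = np.hypot(gx, gy)       # gradient magnitude
    return out

# A vertical step edge: the response is strong only near the boundary.
img = np.zeros((5, 6))
img[:, 3:] = 1.0
edges = sobel_edges(img)
```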
Ok, so how can I use it in my company?
Knowing which task you need is an important step in building a computer vision pipeline, but there’s much more work to do. Once you have identified the specific tasks that can be addressed with computer vision, the next step is to develop a plan for building the pipeline. This typically involves selecting appropriate models and hardware, designing and training machine learning models, and integrating them into your existing systems and workflows. One can always start with pre-trained models and customise them for specific use cases. This approach reduces the time and cost of training new models from scratch, while still providing the benefits of computer vision technology. The smallest models fit on consumer-grade GPUs; quite often they are good enough for the task, with no need for huge expenditures on server rooms.
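One cheap way to customise a pre-trained model is a "linear probe": freeze the backbone and fit only a small linear head on its features. The sketch below fakes the frozen embeddings with two synthetic clusters (in a real pipeline you would extract them from, say, a pre-trained network’s penultimate layer) and fits the head by least squares; all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen embeddings from a pre-trained backbone:
feats_a = rng.normal(loc=-1.0, size=(50, 8))   # class 0 samples
feats_b = rng.normal(loc=+1.0, size=(50, 8))   # class 1 samples
X = np.vstack([feats_a, feats_b])
y = np.array([0] * 50 + [1] * 50)

# Linear probe: fit only a linear head (with bias) on the frozen features.
X1 = np.hstack([X, np.ones((100, 1))])         # append a bias column
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
preds = (X1 @ w > 0.5).astype(int)
accuracy = (preds == y).mean()
```

When the pre-trained features already separate your classes well, a probe like this can be competitive with full fine-tuning at a tiny fraction of the compute.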
The applications of computer vision are vast and varied, and we know this can be complex, especially given that the field is constantly changing and each new week brings new discoveries. At yellowShift, we love computer vision, and we would be happy to help you! We specialise in developing custom computer vision solutions for businesses of all sizes. Whether you need to automate a specific task, analyse large amounts of image or video data, or enhance the capabilities of an existing system, we can help. Our team of expert computer vision engineers and machine learning specialists has the skills and expertise to tackle even the most complex projects.