Saliency detection using the Vision framework

Original photo by Meritt Thomas on Unsplash
The salience (also called saliency) of an item is the state or quality by which it stands out from its neighbors. Saliency detection is considered to be a key attentional mechanism that facilitates learning and survival by enabling organisms to focus their limited perceptual and cognitive resources on the most pertinent subset of the available sensory data.

Wikipedia

In this article you will find information on:

  • How to find spots in the image containing something interesting, e.g. architecture, objects, animals, people, vegetation, and so on.
  • How to find parts of the image that are likely to draw attention.

Knowing this, you can help users with cropping, focus on meaningful content, find and highlight objects of interest, make attention-driven animations, add tracking to video feeds, and a lot more.

If you don't know how to work with the Vision framework yet I encourage you to check my previous vision-related articles listed in the Vision Framework series for more details.

I'm reusing the code from Barcode detection using Vision framework, so if you find something not explained in detail, please check that article.

This week I will introduce two new requests:

  • VNGenerateObjectnessBasedSaliencyImageRequest - which tells us where objects of interest are located.
  • VNGenerateAttentionBasedSaliencyImageRequest - which tells us which part of the image is likely to draw user attention.

We will work on these two images:

Photo by Robert Lukeman on Unsplash
Photo by Jonah Pettrich on Unsplash

Beautiful landscape with a waterfall (I have to visit Iceland someday!) and two birds sitting on a branch. Do they have anything in common?

Yes.

Even with the power of the Vision framework I uncovered in my previous articles, our application won't learn anything from them. They are a blank page, and we can't do anything to either help users with the task they are performing or make viewing these images nicer. No people, no cats or dogs, no barcodes, and we are helpless.

Until now.

Let's make VNGenerateObjectnessBasedSaliencyImageRequest first:

let saliencyRequest = VNGenerateObjectnessBasedSaliencyImageRequest()

And run the request:

func process(_ image: UIImage) {
    guard let cgImage = image.cgImage else { return }
    let saliencyRequest = VNGenerateObjectnessBasedSaliencyImageRequest()

    let requestHandler = VNImageRequestHandler(cgImage: cgImage,
                                               orientation: .init(image.imageOrientation),
                                               options: [:])

    visionQueue.async { [weak self] in
        do {
            try requestHandler.perform([saliencyRequest])
        } catch {
            print("Can't make the request due to \(error)")
        }

When this is done we receive the results:

guard let results = saliencyRequest.results as? [VNSaliencyImageObservation] else { return }

Note: In iOS 15 and above we don't need to map the results anymore. They come with the correct type and not arrays of Any.
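
For reference, on iOS 15 and above the guard can be reduced to something like this, because the request subclass exposes typed results:

guard let results = saliencyRequest.results else { return } // already [VNSaliencyImageObservation]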

A quick look at VNSaliencyImageObservation reveals salientObjects, an array of VNRectangleObservation. This means we can get the bounding boxes indicating the locations of objects.
But that's not all. This observation inherits from VNPixelBufferObservation, therefore we have access to a CVPixelBuffer. This is where the heat map is located.

A heat map (or heatmap) is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.

Wikipedia

First, let's handle what we know and did a few times before:

let rectangles = results
    .flatMap { $0.salientObjects?.map { $0.boundingBox.rectangle(in: image) } ?? [] }
    .map { CGRect(origin: $0.origin.translateFromCoreImageToUIKitCoordinateSpace(using: image.size.height - $0.size.height),
                  size: $0.size) }

We are getting bounding boxes from salientObjects. In the process, we project the normalized coordinates onto the image and then translate them from the Core Image coordinate space to the UIKit coordinate space. Please check Detecting body pose using Vision framework and Barcode detection using Vision framework for the details.
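
To make the conversion concrete, here is a quick worked example with made-up numbers (the helper extensions used below are listed at the end of the article):

let normalized = CGRect(x: 0.1, y: 0.2, width: 0.3, height: 0.4) // Vision's normalized space

// 1. Project into image coordinates, e.g. for a 1000 x 500 image:
let projected = VNImageRectForNormalizedRect(normalized, 1000, 500)
// -> (x: 100, y: 100, width: 300, height: 200), origin measured from the bottom-left

// 2. Flip the y axis to move from the Core Image (bottom-left origin)
//    to the UIKit (top-left origin) coordinate space:
let origin = projected.origin
    .translateFromCoreImageToUIKitCoordinateSpace(using: 500 - projected.height)
// -> (x: 100, y: 200)

let uiKitRect = CGRect(origin: origin, size: projected.size)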

Now it's time to extract the heat map. As I said before, it's located in pixelBuffer which, as the name implies, is a CVPixelBuffer. The first problem is how to get a UIImage from a pixel buffer. I made an extension to make this easier:

extension CVPixelBuffer {
    func makeImage() -> UIImage? {
        let ciImage = CIImage(cvImageBuffer: self)
        
        guard let cgImage = CIContext().createCGImage(ciImage, from: ciImage.extent) else { return nil }
        return UIImage(cgImage: cgImage)
    }
}

The first step is to make a CIImage from the CVPixelBuffer:

let ciImage = CIImage(cvImageBuffer: self)

The second step is to make a CGImage from the CIImage:

guard let cgImage = CIContext().createCGImage(ciImage, from: ciImage.extent) else { return nil }

And the third step is to finally make a UIImage from the CGImage:

return UIImage(cgImage: cgImage)

It's time to extract the heat map from the observation. Unlike salientObjects, there is one heat map per observation:

let heatMap = results.first?.pixelBuffer.makeImage()

This is what we get for the image of two birds. The 68x68 heat map image:

Knowing the size of the image, the size of the heat map is unexpected, but it's nothing we can't handle. We pass everything we get to the drawing function:

DispatchQueue.main.async {
    self?.imageView.image = image.draw(rectangles: rectangles,
                                       image: heatMap)
}

We need to make sure we are on the main thread before displaying the image returned by the drawing function.

I will paste the body of a drawing function in a moment. Before that, I want to show what we need to do to properly display the heat map:

image?.draw(in: CGRect(origin: .zero, size: size), blendMode: .hue, alpha: 1.0)

We pass the heat map as an optional image and then draw it. We use the size of the image to make sure the heat map is scaled appropriately, which deals with the 68x68 size problem we had. The blend mode is set to .hue to make the heat map areas stand out.
That's not all we can do once we have the heat map. We could cut the objects out of the photo and save them or paste them into another image, blur the background, make objects lose or gain color, and so on.
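
As a taste of that, here is a minimal sketch, not part of the demo app, that scales the heat map up and uses it as a Core Image mask to blur everything except the salient areas (the function name and parameters are mine):

import CoreImage
import CoreImage.CIFilterBuiltins
import UIKit

func blurringBackground(of image: UIImage, using heatMap: CVPixelBuffer) -> UIImage? {
    guard let cgImage = image.cgImage else { return nil }
    let input = CIImage(cgImage: cgImage)

    // Scale the small (e.g. 68x68) heat map up to the image size so it can act as a mask.
    let mask = CIImage(cvPixelBuffer: heatMap)
    let scaledMask = mask.transformed(by: CGAffineTransform(scaleX: input.extent.width / mask.extent.width,
                                                            y: input.extent.height / mask.extent.height))

    // A blurred copy of the whole image acts as the background.
    let blurred = input.applyingGaussianBlur(sigma: 10).cropped(to: input.extent)

    // Bright (salient) areas of the mask keep the sharp image, dark areas show the blurred copy.
    let blend = CIFilter.blendWithMask()
    blend.inputImage = input
    blend.backgroundImage = blurred
    blend.maskImage = scaledMask

    guard let output = blend.outputImage,
          let result = CIContext().createCGImage(output, from: input.extent) else { return nil }
    return UIImage(cgImage: result)
}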

This is the whole drawing function, similar to the one used for barcodes or animals:

extension UIImage {
    func draw(rectangles: [CGRect],
              image: UIImage?,
              strokeColor: UIColor = .primary,
              lineWidth: CGFloat = 2) -> UIImage? {
        let renderer = UIGraphicsImageRenderer(size: size)
        return renderer.image { context in
            draw(in: CGRect(origin: .zero, size: size))
            
            image?.draw(in: CGRect(origin: .zero, size: size), blendMode: .hue, alpha: 1.0)
            
            context.cgContext.setStrokeColor(strokeColor.cgColor)
            context.cgContext.setLineWidth(lineWidth)
            rectangles.forEach { context.cgContext.addRect($0) }
            context.cgContext.drawPath(using: .stroke)
        }
    }
}

Let's check the effect:

Two objects were detected and their shapes are described by the heat map. We additionally have the CGRects, which are convenient if we need the exact positions of the objects in the image.
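
For example, a hedged sketch of the "cut the objects out" idea using those rects could look like this (ignoring image scale and orientation for simplicity):

// `rectangles` are the UIKit-space rects computed earlier.
let crops: [UIImage] = rectangles.compactMap { rect in
    guard let cropped = image.cgImage?.cropping(to: rect) else { return nil }
    return UIImage(cgImage: cropped)
}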

Now we will tackle another case. Imagine we want to make an animated slideshow but we want to focus on the most interesting areas of any image. Not every image has clear objects that can be identified.

This is where VNGenerateAttentionBasedSaliencyImageRequest comes in. It's similar to VNGenerateObjectnessBasedSaliencyImageRequest, therefore the only change we need to make is to swap the request:

let saliencyRequest = VNGenerateAttentionBasedSaliencyImageRequest()

Let's see what we will get now for the image with birds:

And for the amazing waterfall in Iceland:

The results are different. This request shows us where people will most likely focus their attention when viewing these images.

If you are working with this request alone, you could simplify the code and take just the first observation from the results, because the attention-based request produces a single observation.
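
A minimal sketch of that simplification could look like this:

guard let observation = (saliencyRequest.results as? [VNSaliencyImageObservation])?.first else { return }

let heatMap = observation.pixelBuffer.makeImage()
let attentionRectangles = observation.salientObjects?
    .map { $0.boundingBox.rectangle(in: image) } ?? []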

Note: This request should always produce a meaningful response, unlike the previous one, which didn't see anything interesting in the waterfall image.

When AI starts the war, I will hide in Iceland.
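
Coming back to the slideshow idea: once we know which rect deserves attention, a simple "zoom to the interesting part" animation can be sketched like below. This is my own rough sketch, assuming attentionRect is already expressed in the image view's own coordinate space (for simplicity, imagine the image view matches the image size):

let scale = min(imageView.bounds.width / attentionRect.width,
                imageView.bounds.height / attentionRect.height)

// Combine a scale with a translation that moves the attention rect's center
// into the middle of the image view.
let zoom = CGAffineTransform(scaleX: scale, y: scale)
    .translatedBy(x: imageView.bounds.midX - attentionRect.midX,
                  y: imageView.bounds.midY - attentionRect.midY)

UIView.animate(withDuration: 3.0) {
    self.imageView.transform = zoom
}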

This is the whole code:

let visionQueue = DispatchQueue.global(qos: .userInitiated)
extension ImageProcessingViewController {
    func process(_ image: UIImage) {
        guard let cgImage = image.cgImage else { return }
        let saliencyRequest = VNGenerateObjectnessBasedSaliencyImageRequest()
        // let saliencyRequest = VNGenerateAttentionBasedSaliencyImageRequest()        
        // Choose which request to run ^
        
        let requestHandler = VNImageRequestHandler(cgImage: cgImage,
                                                   orientation: .init(image.imageOrientation),
                                                   options: [:])

        saveImageButton.isHidden = false
        visionQueue.async { [weak self] in
            do {
                try requestHandler.perform([saliencyRequest])
            } catch {
                print("Can't make the request due to \(error)")
            }

            guard let results = saliencyRequest.results as? [VNSaliencyImageObservation] else { return }
            
            let rectangles = results
                .flatMap { $0.salientObjects?.map { $0.boundingBox.rectangle(in: image) } ?? [] }
                .map { CGRect(origin: $0.origin.translateFromCoreImageToUIKitCoordinateSpace(using: image.size.height - $0.size.height),
                              size: $0.size) }
            
            let heatMap = results.first?.pixelBuffer.makeImage()
            
            DispatchQueue.main.async {
                self?.imageView.image = image.draw(rectangles: rectangles,
                                                   image: heatMap)
            }
        }
    }
}

extension UIImage {
    func draw(rectangles: [CGRect],
              image: UIImage?,
              strokeColor: UIColor = .primary,
              lineWidth: CGFloat = 2) -> UIImage? {
        let renderer = UIGraphicsImageRenderer(size: size)
        return renderer.image { context in
            draw(in: CGRect(origin: .zero, size: size))
            
            image?.draw(in: CGRect(origin: .zero, size: size), blendMode: .hue, alpha: 1.0)
            
            context.cgContext.setStrokeColor(strokeColor.cgColor)
            context.cgContext.setLineWidth(lineWidth)
            rectangles.forEach { context.cgContext.addRect($0) }
            context.cgContext.drawPath(using: .stroke)
        }
    }
}

extension CVPixelBuffer {
    func makeImage() -> UIImage? {
        let ciImage = CIImage(cvImageBuffer: self)
        
        guard let cgImage = CIContext().createCGImage(ciImage, from: ciImage.extent) else { return nil }
        return UIImage(cgImage: cgImage)
    }
}

extension CGRect {
    func rectangle(in image: UIImage) -> CGRect {
        VNImageRectForNormalizedRect(self,
                                     Int(image.size.width),
                                     Int(image.size.height))
    }
    
    var points: [CGPoint] {
        return [origin, CGPoint(x: origin.x + width, y: origin.y),
                CGPoint(x: origin.x + width, y: origin.y + height), CGPoint(x: origin.x, y: origin.y + height)]
    }
}

extension CGPoint {
    func translateFromCoreImageToUIKitCoordinateSpace(using height: CGFloat) -> CGPoint {
        let transform = CGAffineTransform(scaleX: 1, y: -1)
            .translatedBy(x: 0, y: -height);
        
        return self.applying(transform)
    }
}

If you want to play with Vision and see it for yourself, you can check the latest version of my vision demo application here. The example code is located in this file.

If you have any feedback, or just want to say hi, you are more than welcome to write me an e-mail or tweet to @tustanowskik

If you want to be up to date and always be first to know what I'm working on tap follow @tustanowskik on Twitter

Thank you for reading!

P.S. Saliency doesn't tell us anything about the contents of the identified areas. But this doesn't mean we can't do anything about it. We can get the coordinates and use CoreML to try to classify those regions, but that alone is material for a separate article.
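
For a taste of what that could look like, here is a hedged sketch that points the built-in classification request at one salient region by setting its regionOfInterest to the normalized bounding box (no custom Core ML model involved):

// `salientObject` is one of the VNRectangleObservations from salientObjects.
let classificationRequest = VNClassifyImageRequest()
classificationRequest.regionOfInterest = salientObject.boundingBox // normalized coordinates

let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
try? handler.perform([classificationRequest])

let topLabels = (classificationRequest.results as? [VNClassificationObservation])?
    .prefix(3)
    .map { "\($0.identifier): \($0.confidence)" }
print(topLabels ?? [])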

This article was featured in Awesome Swift #278 🎉
