Detecting body pose using Vision framework

Original photo by Thao Le Hoang on Unsplash (the dots were added by me and mark the body joints detected by Vision)

The Vision framework is a remarkable piece of technology that gives us access to all kinds of image and video detection and classification features. It can detect face landmarks, body pose, hand pose, barcodes, text, and more. For more information on Vision, visit our source of truth - The Documentation.

In this article I will cover how to go from here:

Photo by Thao Le Hoang on Unsplash

To here:

In other words, how to apply body pose detection, available since iOS 14 and macOS 11, and display the detected body points. Up to nineteen unique body points can be detected, which is enough to assess where the person is located in the image and what they are doing. There are limitations and the result is not always ideal, but based on my personal observations it's more than good enough.

Get the image

In order to start making Vision requests we first need an image to process. For the vision demo application, I'm using good ol' UIImagePickerController. Make sure to connect this action to a button:

@IBAction func didTapLoadImageButton(_ sender: UIButton) {
    saveImageButton.isHidden = true
    let imagePicker = UIImagePickerController()
    imagePicker.sourceType = .photoLibrary
    imagePicker.delegate = self
    present(imagePicker, animated: true)
}

Next, we need to handle the selected image:

extension ImageProcessingViewController: UIImagePickerControllerDelegate, UINavigationControllerDelegate {
    func imagePickerController(_ picker: UIImagePickerController,
                               didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey : Any]) {
        imageView.image = info[.originalImage] as? UIImage
        picker.dismiss(animated: true, completion: nil)
    }
}

With a few lines of code, we have a source of images for processing.

The Request

Let's do the import first:

import Vision

With Vision imported we are free to make all the supported requests and, to be honest, I was amazed at how easy this is.

Let's create a new function where we will put all Vision-related code:

func process(_ image: UIImage) { }

And call it right after we set the image to the image view. This will automatically trigger the processing for each picked image.
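
For reference, here is a minimal sketch of how that call site could look in the picker delegate (the property names come from the demo code above):

extension ImageProcessingViewController: UIImagePickerControllerDelegate, UINavigationControllerDelegate {
    func imagePickerController(_ picker: UIImagePickerController,
                               didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey : Any]) {
        let image = info[.originalImage] as? UIImage
        imageView.image = image
        picker.dismiss(animated: true, completion: nil)

        // Kick off Vision processing for every picked image.
        if let image = image {
            process(image)
        }
    }
}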

The next part is selecting what we want Vision to analyze. We are interested in what the people in the images are doing, therefore we pick VNDetectHumanBodyPoseRequest:

let bodyPoseRequest = VNDetectHumanBodyPoseRequest()

Then we create a request handler. Since we want to analyze still images we pick VNImageRequestHandler:

let requestHandler = VNImageRequestHandler(cgImage: cgImage,
                                          orientation: .init(image.imageOrientation),
                                          options: [:])

Vision doesn't operate on UIImages, therefore we need to get a CGImage from the UIImage first. Let's add a guard before anything else:

guard let cgImage = image.cgImage else { return }

The other problem is that we need to provide the image orientation, but as CGImagePropertyOrientation and not UIImage.Orientation. We need to do a simple conversion:

extension CGImagePropertyOrientation {
    init(_ uiOrientation: UIImage.Orientation) {
        switch uiOrientation {
            case .up: self = .up
            case .upMirrored: self = .upMirrored
            case .down: self = .down
            case .downMirrored: self = .downMirrored
            case .left: self = .left
            case .leftMirrored: self = .leftMirrored
            case .right: self = .right
            case .rightMirrored: self = .rightMirrored
            @unknown default:
                self = .up
        }
    }
}

We have everything in place now. Finally, it's time to make the request:

do {
    try requestHandler.perform([bodyPoseRequest])
} catch {
    print("Can't make the request due to \(error)")
}

Let's pause here for a minute.

The request handler can handle multiple requests at once, and this is the way it's meant to be used. If more than one detection needs to be processed, all the requests should be passed in the array that currently holds the lonely [bodyPoseRequest]. Vision is optimized to be faster this way, and it's not advised to create a separate request handler for each request.
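
As a quick illustration (not part of the demo app), if we also wanted face rectangles we would add a second request to the same perform call instead of creating another handler:

let bodyPoseRequest = VNDetectHumanBodyPoseRequest()
let faceRectanglesRequest = VNDetectFaceRectanglesRequest()

// One handler, one pass over the image, two detections.
do {
    try requestHandler.perform([bodyPoseRequest, faceRectanglesRequest])
} catch {
    print("Can't make the requests due to \(error)")
}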

The Results

When the detection is complete we can find the results in the results property of the request:

guard let results = bodyPoseRequest.results else { return }

In our case, they are of type VNHumanBodyPoseObservation. Results come in an array because Vision can detect more than one person in the image. Each detected person has a separate observation associated with it.
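
For example, a quick way to see how many people Vision found (an illustrative snippet, not taken from the demo app):

// One VNHumanBodyPoseObservation per detected person.
if let observations = bodyPoseRequest.results {
    print("Detected \(observations.count) person(s)")
}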

Since we have access to the observations, we want to check the location of each detected body joint. Vision stores this information in VNRecognizedPoint. First, for simplicity, let's take one observation:

guard let results = bodyPoseRequest.results,
      let result = results.first else { return }

We have two ways of obtaining the points from the observation. We can get the point for each joint separately:

// recognizedPoint(_:) can throw, hence the try?
let noseRecognizedPoint = try? result.recognizedPoint(.nose)

For the nose joint in the example image, the recognized point looks like this:

(lldb) po result.recognizedPoint(.nose)
[0.575886; 0.782616]

(lldb) po result.recognizedPoint(.nose).confidence
0.7972713

Or we can get points for a whole joint group, e.g. the torso:

let torsoRecognizedPoints = try? result.recognizedPoints(.torso)

(lldb) po result.recognizedPoints(.torso)
▿ 6 elements
  ▿ 0 : 2 elements
    ▿ key : VNHumanBodyPoseObservationJointName
      - _rawValue : neck_1_joint
    - value : [0.604436; 0.676590]
  ▿ 1 : 2 elements
    ▿ key : VNHumanBodyPoseObservationJointName
      - _rawValue : right_upLeg_joint
    - value : [0.593101; 0.413685]
  ▿ 2 : 2 elements
    ▿ key : VNHumanBodyPoseObservationJointName
      - _rawValue : root
    - value : [0.630210; 0.422992]
  ▿ 3 : 2 elements
    ▿ key : VNHumanBodyPoseObservationJointName
      - _rawValue : left_shoulder_1_joint
    - value : [0.666078; 0.672480]
  ▿ 4 : 2 elements
    ▿ key : VNHumanBodyPoseObservationJointName
      - _rawValue : left_upLeg_joint
    - value : [0.667319; 0.432300]
  ▿ 5 : 2 elements
    ▿ key : VNHumanBodyPoseObservationJointName
      - _rawValue : right_shoulder_1_joint
    - value : [0.542794; 0.680700]

We get two things from a VNRecognizedPoint:

  • A location that returns a CGPoint. Note that, as you can see, there is something off with these coordinates; we will come back to it later.
  • Confidence, which is how Vision informs us about the quality of the observation. It's between 0 and 1, and our ~0.8 means there is a high chance we know where the person's nose is located in the picture. You can also filter on it yourself, as in the sketch below.
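
If you want to be stricter, you can gate on the confidence yourself before trusting a point. A minimal sketch (the 0.5 threshold is my own choice, not a Vision recommendation):

if let nosePoint = try? result.recognizedPoint(.nose),
   nosePoint.confidence > 0.5 {
    // Only use joints Vision is reasonably sure about.
    print("Nose at \(nosePoint.location), confidence: \(nosePoint.confidence)")
}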

Show the results

When I was making my first requests I was curious how precise they were. Unfortunately, it's hard to make sense of a plain list of printed points. Vision was confident about the detection but I wasn't. I wanted to present the results visually to make them easy to assess and comprehend. Let's do the same now. First, we take a step back and, instead of taking the first observation, we take them all. We need all the points in one place in order to draw them for multiple people:

guard let results = bodyPoseRequest.results else { return }
        
let normalizedPoints = results.flatMap { result in
    result.availableJointNames
        .compactMap { try? result.recognizedPoint($0) }
        .filter { $0.confidence > 0.1 }
}

We compactMap the availableJointNames from each result to get the point for each joint with a confidence of more than 10%. We then flatMap the point arrays to have them all in a single output array.

Let's now get back to the issue with the coordinates I mentioned before. VNRecognizedPoint uses the normalized coordinate space, and in this form we can't use the points directly. We need to project them into image coordinates:

extension VNRecognizedPoint {
    func location(in image: UIImage) -> CGPoint {
        VNImagePointForNormalizedPoint(location,
                                       Int(image.size.width),
                                       Int(image.size.height))
    }
}

let points = normalizedPoints.map { $0.location(in: image) }

These points are ready for us to use. I made a simple extension that will draw the points into the image:

extension UIImage {
    func draw(points: [CGPoint],
              fillColor: UIColor = .white,
              strokeColor: UIColor = .black,
              radius: CGFloat = 15) -> UIImage? {
        // A scale of 0 means the device's main screen scale is used.
        let scale: CGFloat = 0
        UIGraphicsBeginImageContextWithOptions(size, false, scale)
        draw(at: CGPoint.zero)

        points.forEach { point in
            let path = UIBezierPath(arcCenter: point,
                                    radius: radius,
                                    startAngle: CGFloat(0),
                                    endAngle: CGFloat(Double.pi * 2),
                                    clockwise: true)
            
            fillColor.setFill()
            strokeColor.setStroke()
            path.lineWidth = 3.0
            
            path.fill()
            path.stroke()
        }

        let newImage = UIGraphicsGetImageFromCurrentImageContext()
        UIGraphicsEndImageContext()
        return newImage
    }
}

The only thing left for us to do is to replace the image in the image view:

self?.imageView.image = image.draw(points: points,
                                   fillColor: .primary,
                                   strokeColor: .white)

I have a primary color set in the vision demo application. You can use any color you like. Let's build the application and check the result:

Something is wrong, which means we must have missed something. After a closer look, the detected points start to make more sense, but they appear to be upside down. That's because the coordinate space in UIKit has its origin at the top-left corner, while in Core Image the origin is located at the bottom-left corner. We make the fix:

extension CGPoint {
    func translateFromCoreImageToUIKitCoordinateSpace(using height: CGFloat) -> CGPoint {
        let transform = CGAffineTransform(scaleX: 1, y: -1)
            .translatedBy(x: 0, y: -height)
        
        return self.applying(transform)
    }
}

And apply it:

let upsideDownPoints = normalizedPoints.map { $0.location(in: image) }

let points = upsideDownPoints
    .map { $0.translateFromCoreImageToUIKitCoordinateSpace(using: image.size.height) }

This time, after building and running, we finally see what Vision was telling us all along but we couldn't understand:

The person in the picture is in a martial-arts pose.

Well... this is not exactly true. Vision understands where the body joints are. Nothing more. But this is impressive nonetheless.

This is how our Vision handling code looks in the end. Note that I created a separate queue to make sure our request won't block the main thread:

private let visionQueue = DispatchQueue.global(qos: .userInitiated)

func process(_ image: UIImage) {
    guard let cgImage = image.cgImage else { return }

    visionQueue.async { [weak self] in
        let bodyPoseRequest = VNDetectHumanBodyPoseRequest()

        let requestHandler = VNImageRequestHandler(cgImage: cgImage,
                                                   orientation: .init(image.imageOrientation),
                                                   options: [:])
        do {
            try requestHandler.perform([bodyPoseRequest])
        } catch {
            print("Can't make the request due to \(error)")
        }

        guard let results = bodyPoseRequest.results else { return }

        let normalizedPoints = results.flatMap { result in
            result.availableJointNames
                .compactMap { try? result.recognizedPoint($0) }
                .filter { $0.confidence > 0.1 }
        }

        let upsideDownPoints = normalizedPoints.map { $0.location(in: image) }
        let points = upsideDownPoints
            .map { $0.translateFromCoreImageToUIKitCoordinateSpace(using: image.size.height) }

        DispatchQueue.main.async {
            self?.imageView.image = image.draw(points: points,
                                               fillColor: .primary,
                                               strokeColor: .white)
        }
    }
}

Remember to use DispatchQueue.main.async to make sure any UI modifications are done on the main thread.

If you want to play with Vision and see it for yourself you can check the latest version of my vision demo application here. If you want to make sure you have the same code as used in this article please use version 0.1.0.

This is my first article on Vision and more are yet to come. Stay tuned if you are interested in real-time detection, classification, Machine Learning, and more.

If you have any feedback, or just want to say hi, you are more than welcome to write me an e-mail or drop me a line on Twitter.

Thank you for reading!

This article was featured in SwiftLee Weekly 78 and iOS Dev Weekly #526 🎉

Kamil Tustanowski

I'm an iOS developer dinosaur who remembers times when Objective-C was "the only way", we did memory management by hand and whole iPhones were smaller than screens in current models.
Poland