Currently, I am working on a medical project which requires detection of Head Nods (in agreement), Head Shakes (in disagreement), and Head Rolls (Asian/East Indian head gesture for agreement) within a computer application.
Since I already work with the Kinect for Windows device, I figured it would be a perfect fit for this type of application.
This post explains how I built the library, the algorithm behind it, and how I used the Kinect device and the Kinect for Windows SDK to implement it.
Before we get into the guts of how this all works, let’s talk about why the Kinect is a perfect device for this type of application.
The Kinect v2.0 device has many capabilities. One of them lets the device capture a person’s face in 3-D, that is, in three dimensions:
Envision the Z-axis arrow pointing straight out towards you in one direction, and out towards the back of the monitor/screen in the other direction.
In Kinect terminology, this feature is called HD Face. With HD Face, the Kinect can track the eyes, mouth, nose, eyebrows, and other specific features of the face when a person looks towards the Kinect camera.
So envision a person’s face tracked in 3-D.
We can measure the height, width, and depth of a face. Not only can we measure 3-D values and coordinates on the various axes, but with a little math and engineering we can also measure movements and rotations over time.
Think about normal head movements for a second. We as humans twist and turn our heads for various reasons. One such reason is proper driving techniques. We twist and turn our heads when driving looking for other cars on the road. We look up at the skies on beautiful days. We look down on floors when we drop things. We even slightly nod our heads in agreement, and shake our heads in disgust.
Question: So from a technical perspective what does this movement look like?
Answer: When a person moves their head, the head rotates around a particular axis. It’s either the X, Y, or Z axis, or some combination of the three axes. This rotation is perceived from a point on the head. For our purposes, let’s look at the Nose as the point of perspective.
When a person Nods their head, the nose is rotated around the X-axis in a small up and down manner. A Head Nod therefore makes the Y-coordinate values of the Nose point oscillate up and down.
When a person Shakes their head, the nose is rotated around the Y-axis in a small left and right manner. A Head Shake therefore makes the X-coordinate values of the Nose point oscillate from side to side.
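To make the Nod versus Shake distinction concrete, here is a minimal Python sketch. It is not the library's code (which is built on the Kinect for Windows SDK); the function name and the idea of comparing coordinate ranges are my own illustrative assumptions. It simply checks which Nose coordinate varied more over a short run of frames.

```python
def dominant_motion(nose_xs, nose_ys):
    """Guess the gesture from which Nose coordinate varied more.

    nose_xs, nose_ys: Nose X and Y coordinates over a short run of frames.
    Hypothetical helper for illustration only.
    """
    x_range = max(nose_xs) - min(nose_xs)  # side-to-side travel
    y_range = max(nose_ys) - min(nose_ys)  # up-and-down travel
    if x_range == y_range:
        return "none"  # no dominant axis
    # A Nod moves the Nose mostly along Y, a Shake mostly along X.
    return "nod" if y_range > x_range else "shake"
```

A range comparison alone cannot tell a deliberate gesture from a slow drift of the head, which is why the real detection relies on frame states and thresholds.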
If we were to graph Nods and Shakes over time, their Y and X graphs would look like this:
Question: So great, we have a graph of Head Nods and Head Shakes… How do we get the Y, X and rotations from the head?
Answer: Luckily for us, the Kinect for Windows SDK provides engineers with the HD Face coordinates in 3-D. That is, we get the X, Y, and Z coordinates of a face. Thanks to linear algebra and vector math, we can derive the rotational data from this as well. HD Face gives us facial orientation and also head pivot data.
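As an illustration of the "vector math" step, here is a short Python sketch that converts a face orientation quaternion into rotation angles in degrees. It assumes the orientation arrives as a unit quaternion (x, y, z, w), and it uses one common quaternion-to-angle conversion; the function name is hypothetical and this is not necessarily the exact math the library uses.

```python
import math

def quaternion_to_degrees(x, y, z, w):
    """Convert a unit quaternion to (pitch, yaw, roll) in degrees.

    pitch: rotation about the X-axis (nod direction)
    yaw:   rotation about the Y-axis (shake direction)
    roll:  rotation about the Z-axis (roll direction)
    """
    pitch = math.atan2(2 * (y * z + w * x), w * w - x * x - y * y + z * z)
    # Clamp to guard against tiny floating point overshoot outside [-1, 1].
    yaw = math.asin(max(-1.0, min(1.0, 2 * (w * y - x * z))))
    roll = math.atan2(2 * (x * y + w * z), w * w + x * x - y * y - z * z)
    return tuple(math.degrees(a) for a in (pitch, yaw, roll))
```

For example, a quaternion representing a 30 degree rotation about the X-axis yields a pitch of roughly 30 degrees, with yaw and roll at zero.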
Question: Now we’re getting somewhere, so exactly how do you calculate Head Nods/Shakes/Rolls with the Kinect?
Answer: Well, it takes a little creativity, and some help from two researchers in Japan (Shinjiro Kawato and Jun Ohya), who figured out the mathematical formula to derive the head position deviations.
So my implementation is based in part on this paper. Instead of “Between the eyes”, I decided to use the Nose, since the Kinect readily gives me this information.
The implementation concept is simple.
First, let’s assume, based on the research paper, that a typical Nod/Shake/Roll lasts about 1 to 1.4 seconds.
Next, let’s take as fact that the Kinect device produces 30 frames per second. As long as a person is facing the camera, the majority of these frames will produce an HD Face frame for us (assuming at least roughly 15-20 fps).
Therefore if I capture about 1-1.5 seconds of frames, I can determine Head Rotations, pixel coordinates (X, Y and Z), derive rotation in angles, and store this data in a state machine for each measured frame.
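The capture window can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FrameSample` fields, the `record` helper, and the 1.5 second window length are hypothetical names and values chosen to match the description above, not the library's actual types.

```python
from collections import deque
from dataclasses import dataclass

FPS = 30              # Kinect frame rate stated above
WINDOW_SECONDS = 1.5  # assumed upper bound for one gesture

@dataclass
class FrameSample:
    """One measured HD Face frame (hypothetical record layout)."""
    nose_x: float
    nose_y: float
    nose_z: float
    pitch_deg: float
    yaw_deg: float
    roll_deg: float
    state: str = "Unknown"  # later set to Extreme, Stable, or Transient

# A bounded deque drops the oldest frame automatically once it is full.
window = deque(maxlen=int(FPS * WINDOW_SECONDS))

def record(sample):
    """Append a sample and return how many frames are currently held."""
    window.append(sample)
    return len(window)
```

At 30 frames per second this keeps at most 45 samples, so memory stays constant no matter how long the person sits in front of the sensor.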
I can then change states for each measured frame from “Extreme” to “Stable” to “Transient” based on the algorithms provided by Kawato and Ohya.
I then use a delayed 5-frame buffer to evaluate a set of states for the last 3 of the 5 buffered frames.
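Here is a rough Python sketch of how a frame could be labeled from a 5-frame window. It reflects my simplified reading of the Kawato and Ohya style rule, not the library's exact code: the center frame is "Stable" when the whole window barely moves, "Extreme" when the center is the window's maximum or minimum, and "Transient" otherwise. The function name and the tolerance value are assumptions.

```python
def classify_center(window, tolerance=1.0):
    """Label the center frame of an odd-length window of values.

    window: consecutive measurements (for example 5 nose angles).
    tolerance: assumed maximum spread for the window to count as still.
    """
    if len(window) % 2 == 0:
        raise ValueError("window length must be odd")
    lowest, highest = min(window), max(window)
    if highest - lowest <= tolerance:
        return "Stable"      # the head is essentially not moving
    center = window[len(window) // 2]
    if center == highest or center == lowest:
        return "Extreme"     # turning point of the motion
    return "Transient"       # moving between turning points
```

Because the center frame needs two frames after it before it can be labeled, this scheme naturally produces a short delay, which is one way to understand why the buffer is "delayed".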
Next thing I do is continue applying the algorithm from Kawato and Ohya to figure out when and precisely how to check for head nods/shakes/rolls inside my buffered frame states.
The mechanism to check is simple as well. If the current frame state changes from a non-stable state to “Stable”, then I evaluate for Nods/Shakes/Rolls.
The evaluation is also simple. During the evaluation process, if the previous frame states have more than 2 adjacent “Extreme” states, then I check whether all of those adjacent states have Nose rotation angles greater than a configurable threshold. By default my threshold is 1 degree. Depending on which axis it is (Y for Nods, X for Shakes, Z for Rolls), I raise an event that the appropriate head action occurred.
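The evaluation step can be sketched in Python as follows. This is a hedged illustration rather than the library's code: the function name is hypothetical, angles are assumed to be measured in degrees relative to the resting pose, and "adjacent" is interpreted as the "Extreme" frames found in the run of non-stable frames just before the head settles.

```python
def gesture_detected(states, angles, threshold_deg=1.0):
    """Return True if the newest frame completes a gesture on one axis.

    states: per-frame labels, oldest first ("Extreme", "Stable", "Transient").
    angles: per-frame rotation angles in degrees for the axis being checked.
    """
    # Only evaluate on a transition from a non-stable state to "Stable".
    if len(states) < 2 or states[-1] != "Stable" or states[-2] == "Stable":
        return False
    extremes = []
    # Walk backwards through the motion until the previous stable period.
    for state, angle in zip(reversed(states[:-1]), reversed(angles[:-1])):
        if state == "Stable":
            break
        if state == "Extreme":
            extremes.append(angle)
    # More than 2 extremes, each beyond the configurable threshold.
    return len(extremes) > 2 and all(abs(a) > threshold_deg for a in extremes)
```

Running this once per axis would let the caller decide whether to raise a Nod, Shake, or Roll event.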
Here’s a graphical view of the process flow:
Frame state depiction:
If you’re interested in testing out this library, please contact me here through this blog.
Here’s the library and a sample Windows 8.1 Store application using the library in action. In the picture below, I have updated the HD Face Basic XAML sample for visualization. As the HD Face mesh head nods and shakes, I show the confidence of a Head Nod or Head Shake. The left side shows KinectStudio playing a recorded clip of me testing the application.