Tuesday, February 28, 2012


Learning shape models for monocular human pose estimation from the Microsoft Xbox Kinect
James Charles and Mark Everingham
School of Computing
University of Leeds

The problem the authors are trying to address in this paper is that variation in human body shape makes 'pose estimation' (figuring out what pose the person is in) hard.  Most existing work doesn't accurately reflect human limbs because it uses cylindrical or conical representations, which don't have enough flexibility to fit the wide variety of body shapes.  The paper proposes a method to capture variation in limb shape with a 'generative model of shape': probabilistic shape templates for each limb are learnt from Kinect output by inferring a segmentation of silhouettes into limbs, and the pose is then inferred from a binary silhouette using those templates.

The algorithm is based on a Pictorial Structure Model (PSM) for the human body.  Ten nodes represent limbs, with edges connecting them.  The shape of each limb is modeled independently, and each limb is parameterized by its location and orientation.  The overall probability of a pose given a silhouette image is then calculated.  This alone won't give an entirely accurate pose though, because there are still outcomes where multiple limbs explain the same region of the silhouette or are missed entirely.
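
To make the PSM part concrete, here is a rough sketch (my own, not the authors' code) of how a candidate pose might be scored under a tree-structured pictorial structure model: each limb contributes a term for how well its shape template explains the silhouette, and each edge contributes a term for how plausible the relative placement of the two connected limbs is.  The limb names, stub functions, and pose representation below are assumptions for illustration.

```python
import numpy as np

# Ten limbs, connected in a tree rooted at the torso (names are assumptions).
LIMBS = ["torso", "head",
         "l_upper_arm", "l_lower_arm", "r_upper_arm", "r_lower_arm",
         "l_upper_leg", "l_lower_leg", "r_upper_leg", "r_lower_leg"]
EDGES = [("head", "torso"),
         ("l_upper_arm", "torso"), ("l_lower_arm", "l_upper_arm"),
         ("r_upper_arm", "torso"), ("r_lower_arm", "r_upper_arm"),
         ("l_upper_leg", "torso"), ("l_lower_leg", "l_upper_leg"),
         ("r_upper_leg", "torso"), ("r_lower_leg", "r_upper_leg")]

def limb_likelihood(silhouette, limb_config):
    """Assumed stub: how well this limb's (x, y, angle) configuration,
    overlaid as a shape template, explains the binary silhouette."""
    return 1e-3

def joint_prior(child_config, parent_config):
    """Assumed stub: how plausible the child limb's position/orientation is
    relative to its parent (e.g. a Gaussian over the connecting joint)."""
    return 1e-2

def log_posterior(pose, silhouette):
    """pose: dict mapping limb name -> (x, y, angle)."""
    score = sum(np.log(limb_likelihood(silhouette, pose[l])) for l in LIMBS)
    score += sum(np.log(joint_prior(pose[c], pose[p])) for c, p in EDGES)
    return score  # maximised over candidate poses
```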

By combining this PSM with sampling from templates learned from the Kinect, the shortcomings of relying on the PSM alone can be overcome.  The Kinect gives joint, depth, and silhouette data.  The learning process segments the data from the Kinect into limbs.  By cross-referencing this data with the PSM, higher accuracy can be achieved in pose estimation.
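
My guess at what the learning step looks like in code (not taken from the paper): given the Kinect joints and the silhouette for a frame, assign each silhouette pixel to the nearest 'bone' segment, producing a per-limb mask; averaging these masks over many frames would give the probabilistic shape templates.  The helper names and the nearest-bone rule are assumptions.

```python
import numpy as np

def point_to_segment_distance(p, a, b):
    """Distance from pixel p to the bone segment a-b (all 2D points)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / (np.dot(ab, ab) + 1e-9), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def segment_silhouette(silhouette, bones):
    """silhouette: HxW binary mask.  bones: {limb name: (joint_a_xy, joint_b_xy)}
    taken from the Kinect skeleton.  Returns a per-limb binary mask."""
    masks = {limb: np.zeros(silhouette.shape, bool) for limb in bones}
    ys, xs = np.nonzero(silhouette)
    for y, x in zip(ys, xs):
        p = np.array([x, y], float)
        nearest = min(bones, key=lambda l: point_to_segment_distance(p, *bones[l]))
        masks[nearest][y, x] = True
    return masks

# Averaging these per-limb masks over many frames, in a coordinate frame centred
# and oriented on each limb, would give a probabilistic shape template per limb.
```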





Tuesday, February 14, 2012



Recognizing Hand Gestures with Microsoft’s Kinect
Matthew Tang
Department of Electrical Engineering
Stanford University

This paper, like my first, also focuses on finding and recognizing hand gestures.  It attempts to differentiate between two main hand gestures, namely an open hand and a closed fist.  These gestures are demonstrated in a program that allows you to drag and drop virtual objects on the screen.

The algorithm employed is also somewhat similar to the first paper I read.  Because the resolution of the Kinect makes depth data alone an unreliable way to find the 'hand pixels', it is also helpful to check the RGB values of the pixels to determine if they belong to a hand.  A glove is an easy way to do this (the hand will be a solid, known color), but it is inconvenient for the user, so instead the RGB values are compared against common skin colors.  The method of determining the probability that a pixel is skin relies on adequate lighting, so performance decreases when the area is poorly lit.  Color balancing is used to compensate for this.  By integrating both depth and RGB data, you can get a much more accurate representation of the hand.
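
A minimal sketch of the kind of depth + skin-color fusion described above (not the paper's exact method); the histogram binning, depth range, and thresholds are assumptions of mine.

```python
import numpy as np

def skin_probability(rgb_image, skin_hist, bins=32):
    """Look up each pixel's color in a (bins x bins x bins) normalised
    skin-color histogram (assumed to be pre-computed)."""
    idx = rgb_image.astype(int) // (256 // bins)          # HxWx3 bin indices
    return skin_hist[idx[..., 0], idx[..., 1], idx[..., 2]]

def hand_mask(rgb_image, depth_image, skin_hist,
              depth_range=(400, 900), skin_thresh=0.5):
    """Pixels that fall in an expected hand depth range (mm, assumed values)
    AND look sufficiently skin-colored."""
    near = (depth_image > depth_range[0]) & (depth_image < depth_range[1])
    skin = skin_probability(rgb_image, skin_hist) > skin_thresh
    return near & skin
```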

Now that the hand has been found, you need to see if it is making a gesture.  The image of the hand is rotated based on the arm direction from the skeleton data.  A center of mass for the hand is calculated in order to center the hand in the middle of the image.  Now that the image is normalized, it can be analyzed to check if the hand is open or closed.  Gestures then follow naturally as transitions of the hand from open to closed or vice versa.
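
Here is roughly how I picture that normalization step, assuming OpenCV is available; the joint names, output size, and exact transform are my own choices, not the paper's.

```python
import cv2
import numpy as np

def normalise_hand(hand_mask, elbow_xy, hand_xy, out_size=64):
    """hand_mask: binary image of hand pixels.  elbow_xy, hand_xy: 2D joint
    positions projected from the skeleton (assumed available)."""
    ys, xs = np.nonzero(hand_mask)
    cx, cy = float(xs.mean()), float(ys.mean())          # centre of mass
    dx, dy = hand_xy[0] - elbow_xy[0], hand_xy[1] - elbow_xy[1]
    angle = np.degrees(np.arctan2(dy, dx)) + 90.0        # forearm -> vertical
    m = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
    m[0, 2] += out_size / 2 - cx                         # shift centroid to
    m[1, 2] += out_size / 2 - cy                         #   the image centre
    return cv2.warpAffine(hand_mask.astype(np.uint8), m,
                          (out_size, out_size), flags=cv2.INTER_NEAREST)
```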



Thursday, February 9, 2012


Full DOF tracking of a hand interacting with an object by
modeling occlusions and physical constraints

Iason Oikonomidis, Nikolaos Kyriazis, Antonis A. Argyros

In this paper the authors devise a method by which to estimate the full pose of a hand that is being occluded by some object.  They treat this as an optimization problem, and infer information from the fact that the hand and the object that is occluding the hand cannot occupy the same space.  Therefore, the position of the occluding object tells you a great deal about how the hand is positioned.

To get an idea of the position of the hand, the scene is broken down into a sequence of multiframes, a set of images taken from different cameras at the same point in time.  A joint hand-object model is used to represent the hand and the object occluding it.  The hand model uses 27 parameters, giving it 26 degrees of freedom.  The authors then attempt to "estimate the parameters that give rise to the hand-object configuration that (a) is most compatible to the image features present in multiframe M (Sec. 2.1) and (b) is physically plausible in the sense that two different rigid bodies cannot share the same physical space (interpenetration constraints)."  They use an edge map and a skin color map to differentiate between the hand and the object, with the assumption that the object will NOT be skin colored.  A fair amount of math ensues.
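
The flavour of that math, as I understand it, is a rendering-discrepancy term plus an interpenetration penalty, minimized over the joint hand+object parameters.  A hedged sketch with placeholder stubs (none of this is the authors' code) follows.

```python
def discrepancy(params, multiframe):
    """Assumed stub: mismatch between the rendered hand+object hypothesis and
    the observed edge / skin-color maps, summed over all cameras."""
    return 0.0

def interpenetration(params):
    """Assumed stub: penalty that grows when the hand and object models
    occupy the same physical space."""
    return 0.0

def objective(params, multiframe, lam=10.0):
    """Score to minimise over the joint hand+object parameter vector
    (27 hand parameters plus the object pose)."""
    return discrepancy(params, multiframe) + lam * interpenetration(params)

# The minimisation itself would be handled by a global, stochastic optimiser
# (I believe this group's related work uses particle swarm optimisation, but
# treat that as my recollection rather than something stated in this summary).
```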

On experiments with real image data, their system appears to identify the location and position of the hand quite well.  While there is no quantitative data, "visual inspection" shows that the accuracy is better than previous systems that this was tested against.

Their camera system actually consisted of 8 normal cameras mounted in a circle around the hand/object, rather than a Kinect, which I found out in the middle of reading the article / writing this entry.  However, it would be interesting to see how this might be adapted for use with a single Kinect rather than their 8 camera setup.  There would probably be more constraints, because it's possible for the hand to not be visible at all.  Likely you would have to have a certain amount of hand showing before you could make an accurate guess, but it might be doable.




Tuesday, February 7, 2012

Context-Aware 3D Gesture Interaction Based on Multiple Kinects

Maurizio Caon, Yong Yue, Julien Tscherrig, Elena Mugellini, Omar Abou Khaled

This paper presents research into using two Kinects simultaneously to let a user control his or her environment by pointing at "smart objects".  These smart objects are added to the environment and recognized by the system beforehand.  Different combinations of gesture (pointing) and posture (standing, sitting) cause different actions.  For example, sitting on the couch and pointing at a media player will turn on the TV, but standing and pointing at it will instead turn on the radio.

Each Kinect sends the skeleton data for each person it is tracking, represented in XML, to a central module.  This module then merges the skeletons (weighting data with more tracked joints more heavily) into a single 3D skeleton model.  Their gesture recognition is fairly simple: when the arm joints assume specific values, they project the arm into the space ahead of it to see if it intersects with any smart objects.
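
A small sketch of how such a pointing test could work (my interpretation, not the authors' implementation): cast a ray from the elbow through the hand and see whether it passes close to any registered smart object.  The object positions and the hit radius are made-up values.

```python
import numpy as np

# Hypothetical smart-object positions in room coordinates (metres).
SMART_OBJECTS = {"media_player": np.array([2.0, 1.0, 3.5]),
                 "lamp":         np.array([-1.5, 1.2, 2.0])}

def pointed_object(elbow, hand, objects=SMART_OBJECTS, radius=0.3):
    """elbow, hand: 3D joint positions from the fused skeleton (numpy arrays).
    Returns the name of the first object the elbow->hand ray passes near."""
    direction = (hand - elbow) / np.linalg.norm(hand - elbow)
    for name, pos in objects.items():
        to_obj = pos - hand
        along = np.dot(to_obj, direction)
        if along <= 0:
            continue                                  # object is behind the hand
        perp = np.linalg.norm(to_obj - along * direction)
        if perp < radius:                             # ray passes close enough
            return name
    return None
```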


Wednesday, February 1, 2012

Gesture Recognition Accuracy


The authors of this paper explore two different methods of hand detection.  The first method is skin detection based on RGB values from the color camera, and the second is based on depth information.

The skin based detector works by trying to identify which area of the image is the user's hand by using, you guessed it, skin color.  A 'generic skin color histogram' is used to give each pixel of the image a probability of being skin.  The 'difference' (change in intensity between the current pixel and the pixels to its sides) is also computed, to help find clumps of pixels of the same color.  The sections with the most skin color are scored higher and are more likely to be the hand location.  This method doesn't always accurately find the hand; in the example in the paper it instead classified a person's face in the background as the hand.
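
A quick sketch of the window-scoring idea, assuming we already have a per-pixel skin probability map from the generic skin color histogram; the window size and stride are arbitrary choices of mine.

```python
import numpy as np

def best_hand_window(skin_prob, win=48, stride=8):
    """skin_prob: HxW per-pixel skin probabilities (from the generic skin
    color histogram).  Returns the top-left (x, y) of the best window."""
    h, w = skin_prob.shape
    best_score, best_xy = -1.0, (0, 0)
    for y in range(0, max(h - win, 1), stride):
        for x in range(0, max(w - win, 1), stride):
            score = skin_prob[y:y + win, x:x + win].sum()
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy

# As noted above, a skin-only score can just as easily lock onto a face in the
# background, which is exactly the failure case the paper shows.
```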

The depth method finds abrupt changes in depth and uses that information to construct a binary image.  The five largest 'connected by depth' areas have their average depth calculated, and the section with the lowest mean depth is considered the person making the gesture.  The area is then filtered to determine the location of the hand.  The example of this method given in the paper did accurately find the hand.
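
A rough sketch in the spirit of the depth-based detector (not the paper's code), using SciPy's connected-component labelling; the edge threshold and the gradient-based 'connected by depth' test are assumptions.

```python
import numpy as np
from scipy import ndimage

def closest_region(depth, edge_thresh=30.0, top_k=5):
    """depth: HxW depth map in mm.  Returns a binary mask of the closest blob."""
    gy, gx = np.gradient(depth.astype(float))
    smooth = np.hypot(gx, gy) < edge_thresh       # suppress abrupt depth changes
    labels, n = ndimage.label(smooth)
    if n == 0:
        return np.zeros_like(smooth)
    sizes = ndimage.sum(smooth, labels, range(1, n + 1))
    biggest = np.argsort(sizes)[-top_k:] + 1      # up to five largest blobs
    means = [depth[labels == i].mean() for i in biggest]
    winner = biggest[int(np.argmin(means))]       # lowest mean depth = closest
    return labels == winner
```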

The depth-based hand detector also appeared to work the best in their experimental results, so perhaps that would be a good option for us to consider.


Source Link: http://vlm1.uta.edu/~athitsos/publications/doliotis_petra2011.pdf