Monday, April 16, 2012


Using Kinect and a Haptic Interface for Implementation of Real-Time Virtual Fixtures
Fredrik Rydén, Howard Jay Chizeck, Sina Nia Kosari, Hawkeye King and Blake Hannaford


This paper uses the depth feature of the Kinect camera to create real-time haptic virtual fixtures to aid with robotic surgery.  Essentially, these images serve to give the surgeon an indication of where and where not to cut.  This seems like a really good thing to have available to your surgeon, and it's cool that they are getting results using technology initially intended for video games.  The ability to come up with these virtual fixtures in real-time allows the doctors to compensate for movements and deformations during a surgical procedure.  Normally, these images are produced using a CT scan.  However, it's quite complicated to continuously CT scan during a surgery, so using the Kinect presents a more viable solution.

The haptic forces used in this paper are generated from a point cloud, which is the depth data taken from the Kinect.  This data is sent from the computer attached to the Kinect to a computer that is attached to the haptic device.  This lets the surgeon (or whoever is manning the device) see the effects that their "touch" will have.  An example given in the paper is that moving your hand up forces the haptic device to move along with the hand.
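
To get a feel for how a force could come out of a point cloud, here is a rough sketch of my own.  It uses a simple spring-like push away from any nearby points, which is not necessarily what the paper does.  The function name, the units (meters), and the `radius` and `stiffness` values are all assumptions made up for illustration.

```python
import math

def repulsive_force(tool, cloud, radius=0.05, stiffness=200.0):
    """Sum of spring-like forces pushing the tool away from nearby cloud points.

    tool: (x, y, z) position of the haptic tool tip (assumed to be in meters).
    cloud: iterable of (x, y, z) points from the depth camera.
    radius and stiffness are made-up tuning values, not taken from the paper.
    """
    fx = fy = fz = 0.0
    for px, py, pz in cloud:
        dx, dy, dz = tool[0] - px, tool[1] - py, tool[2] - pz
        dist = math.sqrt(dx * dx + dy * dy + dz * dz)
        if 0.0 < dist < radius:
            # The push grows linearly the deeper the tool gets inside the radius.
            scale = stiffness * (radius - dist) / dist
            fx += scale * dx
            fy += scale * dy
            fz += scale * dz
    return (fx, fy, fz)
```

Points farther away than `radius` contribute nothing, so the tool only feels a force when it gets close to the surface.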

While this doesn't directly relate to our project, I feel that it is a good example of the versatility of the Kinect, and how it has a myriad of applications (such as ours) that go beyond video games.




Thursday, April 5, 2012

Design a questionnaire

D H Stone

Another not-super-exciting blog, but since we're in the middle of coming up with evaluation material it's probably more useful to me than another hand tracking algorithm.  This paper is on: how to write a useful questionnaire.  It also came out of a medical journal, although I don't think that changes much because the concepts behind making a solid questionnaire are still the same.  Anyway, back on topic:

According to the author, a good questionnaire is "one that works".  Basically, this means that the respondent's answers can be analyzed without bias, error, or misrepresentation.  While there is not a strict set of rules to follow to achieve this, the author does give a set of guidelines he feels will help you toward this goal.

Questions should be:
* Appropriate - The questions give relevant information.
* Intelligible - The language in the questionnaire is understandable by the respondent.
* Unambiguous - The questions mean the same thing to both the respondent and inquirer.
* Unbiased - Equal chance for all answers, also avoid "Recall Bias" (memory based).
* Omnicompetent - Be able to handle as many responses as possible. (Use 'other' and 'don't know' categories)
* Appropriately coded - Make sure your categories are mutually exclusive.
* Piloted - Questionnaires should always be piloted to check for any errors or other faults.
* Ethical - Get consent, etc. (IRB)

The author also has a step-by-step guide on the actual design of the questionnaire, rather than just theory behind the questions, which should be useful when actually constructing the survey:

(1) Decide what data you need
(2) Select items for inclusion
(3) Design individual questions
(4) Compose wording
(5) Design layout
(6) Think about coding
(7) Prepare first draft and pretest
(8) Pilot and evaluate
(9) Perform survey



Tuesday, March 27, 2012


An examination of four user-based software evaluation methods 
Ron Henderson,  John Podd,  Mike Smith,  and Hugo Varela-Alvarez

Since the focus of the last few class periods has been evaluation, I decided to go ahead and read a paper on different methods rather than yet another hand tracking algorithm.  The paper was written in 1995, but focuses on evaluation methods that are still used (data logging, questionnaires, interviews, 'verbal protocol analyses').

For their study, the authors got a group of 148 people and had each of them use one of the evaluation methods to test one of three pieces of software (spreadsheet/word processor/database).  The subjects used the software and then applied the evaluation method.

Data Logging: internal software was used to log keystrokes with time samples, to be examined after the test.

Questionnaire: used a 7 point scale with 'not applicable' and 'don't understand' options.  Questions were over topics such as:

  • Program self-descriptiveness
  • User control of the program
  • Ease of learning the program
  • Completeness of the program
  • Correspondence with user expectations
  • Flexibility in task handling
  • Fault tolerance
  • Formatting

Open ended questions followed these asking about specific problems and calling for comments/suggestions.
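
Since we will probably need to score something like this ourselves, here is a small sketch of how one item on a 7 point scale could be summarized while keeping the two opt-out answers separate.  The `"NA"` and `"DU"` codes and the function name are my own assumptions, not anything from the paper.

```python
def summarize_item(responses):
    """Mean of the numeric 1-7 ratings, plus counts of the two opt-out answers.

    responses: list of ints 1..7, or the strings "NA" (not applicable)
    and "DU" (don't understand).  These codes are hypothetical.
    """
    ratings = [r for r in responses if isinstance(r, int) and 1 <= r <= 7]
    mean = sum(ratings) / len(ratings) if ratings else None
    return {
        "mean": mean,
        "n_rated": len(ratings),
        "n_not_applicable": responses.count("NA"),
        "n_dont_understand": responses.count("DU"),
    }
```

Keeping the opt-outs out of the mean matters, because a pile of "don't understand" answers says more about the wording of the question than about the software.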

Interview: Semi-structured format of scripted questions and following up on unique interviewee comments.

Verbal Protocol: Video taped users while they were evaluating the software.  Users were later asked to 'think aloud' as they watched the tapes play back.

Conclusions:

Data logging is nice because it's pretty much as objective as you can get, however, it's tedious to analyze.

Questionnaires can give vague results if the wording is not incredibly specific for each question, and it's difficult to make questionnaires that everybody will understand completely.

Interviews are good for getting relevant information quickly, but are subject to the problem of memory decay.

The verbal protocol method tends to be good at finding problem areas because it calls to memory when the user was having trouble with a particular exercise.  However, it's very time-consuming.

The authors note that using a combination of these methods will most likely give the best results, as they add unique contributions, but that using multiple methods is probably affected by diminishing returns, so just blindly adding more methods is not the best approach.




Thursday, March 22, 2012

Guidelines for Multimodal User Interface Design

Leah M. Reeves
Jennifer Lai
James A. Larson
Sharon Oviatt
T. S. Balaji
Stéphanie Buisine
Penny Collings
Phil Cohen
Ben Kraal
Jean-Claude Martin
Michael McTear
TV Raman
Kay M. Stanney
Hui Su
Qian Ying Wang


Communications of the ACM

Since I'm doing a large amount of the UI programming for our project, I thought I'd make a bit of a topic switch and read up on some UI research.  The paper I read focused on multimodal UI design.



According to the paper, there are six major categories for guidelines.  These are:

  • Requirements Specification
    • Design for a broad range of users
    • Privacy/Security Issues
  • Designing Multimodal Input and Output
    • Maximize human cognitive/physical abilities
    • Integrate input methods in a way compatible with user preference/system functionality/context
  • Adaptivity
    • Adapt to the needs of your users (Ex: Gesture input!)
  • Consistency
    • Make it look consistent, use common features
  • Feedback
    • Users should be aware of which inputs are available
    • Users should be notified of alternative interaction options
  • Error Prevention/Handling
    • Provide clearly marked exits from tasks
    • Allow undoing of commands
    • If an error occurs, permit users to switch to a different modality

The authors do note that more research needs to be done to get a better grasp of which combinations of input and output methods are the most intuitive/effective, since the population that these decisions affect is so broad.  They also say that new techniques for error handling and adaptivity should be explored.

These guidelines will be useful to keep in mind as we create the interface for our project, especially since the Kinect is multimodal.




Source: http://delivery.acm.org/10.1145/970000/962106/p57-reeves.pdf?ip=128.194.247.31&acc=ACTIVE%20SERVICE&CFID=91378723&CFTOKEN=92548257&__acm__=1332440500_e7a95379e3a0a0cd7ffc5c29f7d7138f

Thursday, March 8, 2012

Manipulator and object tracking for in-hand 3D object modeling
Michael Krainin, Peter Henry, Xiaofeng Ren and Dieter Fox
The International Journal of Robotics Research 2011 30: 1311 originally published online 7 July 2011

In this paper, the authors write about using a PrimeSense depth camera (functionally the same as a Kinect, with RGB and depth info) to create an algorithm which allows robots to create a 3D model of an unknown object.

The process relies on finding the correct alignment between the object and the robot in each sensor frame.  Prior work has used only the manipulator or the object being modeled.  However, by using RGB-D and encoder data from both the robot's manipulator and the object being modeled, the authors can achieve a much more accurate alignment.

Several steps are taken in the algorithm to ensure accuracy for the model.  Most are somewhat complicated but I'll try to sum them up here in a few words.
  • Kalman filtering helps 'maintain temporal consistency' between input frames and also provides estimates of uncertainty by keeping track of the manipulator's joint angles and the rotation of the object, among other things.
  • Articulated Iterative Closest Point tracking is used to estimate 'joint angles by attempting to minimize an error function over two point clouds'. 
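
To make the Kalman filtering idea a bit more concrete, here is a minimal sketch for a single number, such as one joint angle.  The paper tracks a much bigger state, so this is only the basic predict/update cycle under my own assumptions (a state that is expected to stay constant, and made-up parameter names).

```python
def kalman_step(estimate, variance, measurement, process_var, measurement_var):
    """One predict/update cycle of a scalar Kalman filter.

    Assumes a constant-state model: the value is expected to stay the same
    between frames, and only the uncertainty grows.
    """
    # Predict: same estimate, but uncertainty grows by the process noise.
    variance = variance + process_var
    # Update: blend prediction and measurement according to their uncertainties.
    gain = variance / (variance + measurement_var)
    estimate = estimate + gain * (measurement - estimate)
    variance = (1.0 - gain) * variance
    return estimate, variance
```

The returned variance is the 'estimate of uncertainty' part: it shrinks every time a measurement comes in and grows again between frames.
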
For the actual modeling:
  • 'Surfels' are used as the representation for easy addition of points and removing superfluous points, as well as dealing with occlusion and updating the model.
  • Loop closure to connect pieces of the model.  This involves 'maintaining a graph whose nodes are a subset of the surfels in the object model'.  Edges show that both nodes were visible in the frame, and allow for computation of connected components.
  • Object re-grasping is exactly what it sounds like.  Since some parts of the object will be occluded by itself or by the manipulator, you have to look at it from multiple angles.  It's really complicated.  Put the object down, then pick it back up in a different orientation.
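
The connected components part of the loop closure step is easy to sketch.  Below, each frame is just a set of surfel ids that were visible together, which is my own simplification, and the function name is made up.

```python
from collections import defaultdict

def connected_components(frames):
    """Group surfel ids into components.

    frames: list of sets of surfel ids that were visible together in one frame
    (a hypothetical input format, not the paper's data structure).
    """
    graph = defaultdict(set)
    for visible in frames:
        ids = list(visible)
        for a in ids:
            graph[a]  # make sure a node exists even if it has no neighbors
            for b in ids:
                if a != b:
                    graph[a].add(b)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        # Depth-first search from every node that has not been visited yet.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components
```

Two surfels end up in the same component if there is a chain of frames linking them, even if they were never visible at the same time.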



While this doesn't really influence our work that much, it was pretty interesting to read.


Tuesday, February 28, 2012


Learning shape models for monocular human pose estimation from the Microsoft Xbox Kinect
James Charles and Mark Everingham
School of Computing
University of Leeds

The problem the authors are trying to address in this paper is that due to variation in human body shape, it's hard to perform 'pose estimation', i.e. figuring out what pose the person is in.  Most existing work doesn't accurately reflect human limbs because it uses cylindrical or conical representations, which don't have enough flexibility to fit the wide variety of body shapes.  The paper proposes a method to capture variation in limb shape by first inferring the pose from a binary silhouette, using a 'generative model of shape'.  Then, models of probabilistic shape templates for each limb are learnt from Kinect output by inferring segmentation of silhouettes into limbs.

The algorithm is based on a Pictorial Structure Model (PSM) for the human body.  Ten nodes represent limbs in the body, with edges connecting them.  The shape of each limb is independent, and each is parameterized based on its location and orientation.  The overall probability of a pose given a silhouette image is then calculated.  This alone won't give an entirely accurate pose though, because there are still outcomes where multiple limbs are in the same position or missed entirely.

By combining this PSM with sampling from templates learned from the Kinect, the shortcomings of relying on the PSM alone can be overcome.  The Kinect gives joint, depth, and silhouette data.  The learning process segments the data from the Kinect into limbs.  By cross referencing this data with the PSM, higher accuracy can be achieved in pose estimation.





Tuesday, February 14, 2012



Recognizing Hand Gestures with Microsoft’s Kinect
Matthew Tang
Department of Electrical Engineering
Stanford University

This paper, like my first, also focuses on finding and recognizing hand gestures.  It attempts to differentiate between two main hand gestures, namely an open hand and a closed fist.  These gestures are demonstrated in a program that allows you to drag and drop virtual objects on the screen.

The algorithm employed is also somewhat similar to the first paper I read.  Because the resolution of the Kinect makes using depth data an unreliable method for finding the 'hand pixels', it is also helpful to check the RGB value of the pixels to determine if they are a hand.  A glove is an easy way to do this (the hand will be a solid, known color), but makes it inconvenient for the user, so the RGB is compared against common skin colors.  The method of determining the probability that a pixel is skin relies on adequate lighting, so performance decreases when the area is poorly lit.  Color balancing is used to compensate for this.  By integrating both depth and RGB data, you can get a much more accurate representation of the hand.
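
Here is a rough sketch of what combining the two sources could look like per pixel.  It assumes the skin probabilities have already been computed and that the hand's depth comes from the skeleton tracker.  The thresholds and names are my own guesses, not values from the paper.

```python
def hand_mask(depth, skin_prob, hand_depth, depth_tol=0.1, skin_thresh=0.5):
    """Mark pixels that are both near the expected hand depth and skin colored.

    depth, skin_prob: 2D lists of the same shape (meters and probabilities).
    hand_depth: depth of the hand joint (assumed to come from the skeleton data).
    depth_tol and skin_thresh are made-up tuning values.
    """
    mask = []
    for depth_row, skin_row in zip(depth, skin_prob):
        mask.append([
            abs(d - hand_depth) <= depth_tol and p >= skin_thresh
            for d, p in zip(depth_row, skin_row)
        ])
    return mask
```

A pixel has to pass both checks, so a skin colored face in the background or a sleeve at the right depth would each get thrown out.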

Now that the hand has been found, you need to see if it is making a gesture.  The image of the hand is rotated based on the arm from the skeleton data.  A center of mass for the hand is calculated in order to center the hand in the middle of the image.  Now that the image is somewhat regulated, it can be analyzed to check if the hand is open or closed.  Gestures extend naturally from this by being the transition of the hand from open to closed or vice versa.
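
The centering step can be sketched like this, assuming the hand has already been reduced to a 2D boolean mask.  The rotation step is left out, and both function names are hypothetical.

```python
def center_of_mass(mask):
    """Average (row, col) of the True pixels in a 2D boolean mask."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    n = len(coords)
    return (sum(r for r, _ in coords) / n, sum(c for _, c in coords) / n)

def recenter(mask):
    """Shift the mask so its center of mass lands on the middle pixel."""
    com = center_of_mass(mask)
    if com is None:
        return [row[:] for row in mask]
    rows, cols = len(mask), len(mask[0])
    dr = (rows - 1) // 2 - round(com[0])
    dc = (cols - 1) // 2 - round(com[1])
    out = [[False] * cols for _ in range(rows)]
    for r, row in enumerate(mask):
        for c, v in enumerate(row):
            # Pixels shifted outside the image are simply dropped.
            if v and 0 <= r + dr < rows and 0 <= c + dc < cols:
                out[r + dr][c + dc] = True
    return out
```

Once every hand image is centered the same way, comparing an open hand against a closed fist becomes a much more uniform problem.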



Thursday, February 9, 2012


Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints

Iason Oikonomidis, Nikolaos Kyriazis, Antonis A. Argyros

In this paper the authors devise a method by which to estimate the full pose of a hand that is being occluded by some object.  They treat this as an optimization problem, and infer information from the fact that the hand and the object that is occluding the hand cannot occupy the same space.  Therefore, the position of the occluding object tells you a great deal about how the hand is positioned.

To get an idea of the position of the hand, the scene is broken down into a sequence of multiframes, each of which is a set of images taken from different cameras at the same point in time.  A joint hand-object model is used to represent the hand and the object occluding it.  The hand model uses 27 parameters, giving it 26 degrees of freedom.  The authors then attempt to "estimate the parameters that give rise to the hand-object configuration that (a) is most compatible to the image features present in multiframe M (Sec. 2.1) and (b) is physically plausible in the sense that two different rigid bodies cannot share the same physical space (interpenetration constraints)."  They use an edge map and a skin color map to differentiate between the hand and the object, with the assumption that the object will NOT be skin colored.  A fair amount of math ensues.
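
To show how an interpenetration constraint can be folded into an optimization objective, here is a toy sketch.  It approximates both the hand and the object with spheres, which is my own simplification.  The penalty weight and the function names are also assumptions.

```python
import math

def penetration_depth(sphere_a, sphere_b):
    """Overlap distance between two spheres given as (x, y, z, radius)."""
    ax, ay, az, ar = sphere_a
    bx, by, bz, br = sphere_b
    dist = math.dist((ax, ay, az), (bx, by, bz))
    return max(0.0, ar + br - dist)

def objective(image_error, hand_spheres, object_spheres, weight=10.0):
    """Image discrepancy plus a penalty for hand/object overlap (lower is better).

    image_error stands in for part (a) of the quote and is computed elsewhere.
    weight is a made-up value that trades off the two terms.
    """
    penalty = sum(
        penetration_depth(h, o) for h in hand_spheres for o in object_spheres
    )
    return image_error + weight * penalty
```

A hand pose that matches the images but pokes through the object gets a worse score than one that matches slightly less well but stays outside it.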

On experiments with real image data, their system appears to identify the location and position of the hand quite well.  While there is no quantitative data, "visual inspection" shows that the accuracy is better than previous systems that this was tested against.

Their camera system actually consisted of 8 normal cameras mounted in a circle around the hand/object, rather than a Kinect, which I found out in the middle of reading the article / writing this entry.  However, it would be interesting to see how this might be adapted for use with a single Kinect rather than their 8 camera setup.  There would probably be more constraints, because it's possible to not see the hand at all.  Likely you would have to have a certain amount of hand showing before you could make an accurate guess, but it might be doable.




Tuesday, February 7, 2012

Context-Aware 3D Gesture Interaction Based on Multiple Kinects

Maurizio Caon, Yong Yue, Julien Tscherrig, Elena Mugellini, Omar Abou Khaled

This paper presents research into using two Kinects simultaneously to let a user control his or her environment by pointing at "smart objects".  These smart objects are added to the environment and recognized by the system beforehand. Different combinations of gestures (pointing), and posture (standing, sitting) cause different actions.  For example, sitting on the couch and pointing at a media player will turn on the TV, but standing and pointing at it will instead turn on the radio.

Each Kinect sends the skeleton data, represented in XML, for each person it is tracking to a central module.  This module then merges the skeleton data (weighting the data with more joints more heavily) into a single 3D skeleton model.  Their gesture recognition is fairly simple: when the arm joints assume specific values, they project the arm into the space ahead of it to see if it intersects with any smart objects.
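
The pointing test can be sketched as a ray check.  I'm assuming the arm direction is taken from the elbow to the hand and that each smart object is approximated by a sphere.  Neither detail is spelled out in my notes, so treat the names and the geometry as hypothetical.

```python
import math

def points_at(elbow, hand, target, target_radius):
    """True if the ray from elbow through hand passes within target_radius of target.

    elbow, hand, target: (x, y, z) positions in the same coordinate frame.
    """
    direction = [h - e for h, e in zip(hand, elbow)]
    length = math.sqrt(sum(d * d for d in direction))
    if length == 0.0:
        return False
    direction = [d / length for d in direction]
    to_target = [t - h for t, h in zip(target, hand)]
    # Distance along the ray to the point closest to the target.
    along = sum(a * b for a, b in zip(to_target, direction))
    if along < 0.0:
        return False  # the object is behind the hand
    closest = [h + along * d for h, d in zip(hand, direction)]
    return math.dist(closest, target) <= target_radius
```

A bigger `target_radius` makes an object easier to hit, which would probably be needed for small objects that are far away.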


Wednesday, February 1, 2012

Gesture Recognition Accuracy


The authors of this paper explore two different methods of hand detection.  The first method is skin detection based on RGB values from the color camera, and the second is based on depth information.

The skin based detector works by trying to identify which area of the image is the user's hand by using, you guessed it, skin color.  A 'generic skin color histogram' is used to give each pixel of the image a probability of being skin.  The 'difference' (change in intensity between the current pixel and the pixels to its side) is also computed, to help find clumps of pixels of the same color.  The sections with the most skin color are scored higher, and are more likely to be the hand location.  This method doesn't always accurately find the hand.  In the example in the paper, it instead classified a person's face in the background as the hand.
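
Here is a small sketch of the histogram lookup and the 'score the sections' idea.  The bin count, the histogram format, and the fixed-size window search are my own assumptions.  The 'difference' cue is left out.

```python
def skin_probability(pixel, histogram, bins=8):
    """Look up P(skin) for an (r, g, b) pixel in a coarse color histogram.

    histogram maps (r_bin, g_bin, b_bin) to a probability; missing bins count as 0.
    """
    step = 256 // bins
    key = tuple(channel // step for channel in pixel)
    return histogram.get(key, 0.0)

def best_window(prob_map, size):
    """Top-left corner of the size x size window with the highest total skin score."""
    rows, cols = len(prob_map), len(prob_map[0])
    best, best_pos = -1.0, None
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            score = sum(prob_map[r + i][c + j] for i in range(size) for j in range(size))
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos
```

This also shows why a face can win: nothing in the score knows the difference between a hand and any other patch of skin.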

The depth method finds abrupt changes in depth and uses that information to construct a binary image.  The five largest 'connected by depth' areas have their average depth calculated, and the section with the lowest mean depth is considered the person making the gesture.  The area is then filtered to determine the location of the hand.  The example of this method given in the paper did accurately find the hand.
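
Below is a rough sketch of the region step.  It grows regions with a flood fill that stops at abrupt depth jumps, then picks the closest of the largest regions.  The flood fill, the `jump` threshold, and the function name are my own stand-ins for whatever the paper actually does.

```python
from collections import deque

def nearest_region(depth, jump=0.1, keep=5):
    """Return the pixels of the closest of the `keep` largest depth-connected regions.

    depth: 2D list of depth values.  Neighboring pixels are joined when their
    depths differ by less than `jump` (a made-up threshold).
    """
    rows, cols = len(depth), len(depth[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0]:
                continue
            seen[r0][c0] = True
            queue, region = deque([(r0, c0)]), []
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols and not seen[nr][nc]
                            and abs(depth[nr][nc] - depth[r][c]) < jump):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append(region)
    largest = sorted(regions, key=len, reverse=True)[:keep]
    # The region with the lowest mean depth is taken to be the gesturing person.
    return min(largest, key=lambda reg: sum(depth[r][c] for r, c in reg) / len(reg))
```

Keeping only the largest regions first is what stops a tiny noisy blob close to the camera from being picked as the user.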

The depth hand detector also appeared to work the best in their experimental results, so perhaps that would be a good option for us to consider.


Source Link: http://vlm1.uta.edu/~athitsos/publications/doliotis_petra2011.pdf