Gesture Coding Manual


This manual documents our methods for labeling speech-accompanying gestures. We are interested in the grouped-ness of gesture units as described by Adam Kendon, focusing on the boundaries of what Kendon 1980 calls “Parts.” Some of our terminology is different from Kendon’s terminology, and this helps us to better define the characteristics of speech-accompanying gestures.

  • Gesture phases
    These are similar to Kendon’s G-phrases. We distinguish phases from phrases because we use the term “phrases” to refer to groups of SDGs. Gesture phases are the components that make up an SDG.
  • Stroke-defined gesture [SDG]
    Like Kendon’s G-Units, but we require each one to contain exactly one stroke.
  • Preparation phase
    A moving phase that starts from a full or incomplete rest position. It precedes the stroke phase.
  • Pre-stroke hold
    This phase occurs only after a preparation phase. It is a non-moving phase immediately before the stroke phase, the main action.
  • Stroke phase
    This is the main action of the gesture unit, our SDGs. Kendon describes it as a “distinct peaking of effort” based on Rudolf Laban’s definition of effort in dance theory.
  • Post-stroke hold
    Following a stroke phase, the hand or hands are not relaxed. There is intent either to relax or to start another gesticulation.
  • Recovery phase
    Similar to Kendon’s recovery, this phase (also called relaxation in this manual) is a motion toward a relaxed state.
  • Relaxed phase
    A state where the hands are relaxed, either partially or fully. Hands tend to be non-moving or to have imperceptible motion. Included in the SDG for data analysis purposes.
  • Non-counted gestures
    These are motions that do not accompany speech, where the hands may be engaged in self-touch or in various combinations of twitching or swaying.

Perceptual Gesture Grouping [PGG]
Similar to Kendon 1980’s description of Parts which group G-Units, PGGs are groups of SDGs. We call them “perceptual” due to how we labeled them (using our perception, and without sound). They may be grouped into higher levels of PGGs, with level 1, PGG1, being a grouping of SDGs, and level 2, PGG2, being a grouping of PGG1s, and so forth.

The grouped-ness of PGGs follows Kendon 1980’s characterization of how G-Phrases are collected into G-Units or Parts: “G-Phrase 1 and G-Phrase 2 are grouped into Part 1 because they are very similar in form and in the space they make use of. G-Phrase 3 is regarded as belonging to a separate Part, in this case because it is enacted by a different limb. In other examples where the gesticulation is confined to one limb only, distinct Parts are recognized if the limb moves to an entirely new spatial area for enactment, or if it engages in a sharply distinctive movement pattern.” (Kendon 1980)

After labeling Perceptual Gesture Groupings [PGGs], we identify three major features, or kinematic dimensions, that, when changed, aid in determining the boundaries of PGGs. These dimensions are hand shape, location with respect to the body, and trajectory shape. Hand shape and trajectory shape describe the “form” of the gesticulation, and location of the hands refers to Kendon 1980’s “the space they make use of.” We do not take into account the handedness of the SDGs. In gesticulation, the two hands move in unison or with focus on a single dominant hand. In cases of asynchrony, the hands are still doing similar things, and we have never seen in our corpus the left and right hands execute different active gesticulations that differ in these three dimensions.

The cornerstone of our research is the quantification of these dimensions, in particular calculating the amount of change that occurs from one hand shape to the next, from one location to the next, or from one trajectory shape to the next. In our quantification method, the more perceptually different two values are, the larger the numerical difference. This manual does not currently cover the quantification methods and procedures. We aim to provide them when we publish our findings; the methodology stays consistent with the end goal of quantification that goes beyond counting occurrences.

We supplement our methods with various tools and programming scripts that help align the annotations so everything lines up perfectly for quantification analysis. This methodology allows wiggle room for being a few video frames off in the annotations, which speeds up the labeling process without compromising on accuracy.


We use ELAN, a video annotation tool created by the Max Planck Institute, to label our video samples. To download ELAN and learn more about how to use it, see the ELAN website. We will also provide tips on using ELAN for each feature labeled. PGGs are labeled perceptually, and a beginner can do it with little direction. Each gesture phase goes into its own tier, and the tiers are later assembled into a single tier via scripts. Kinematic dimensions are labeled at the level of detail each one allows. For example, hand shape may stay the same across multiple SDGs and PGGs, so you only have to label that segment once. Location changes frequently and may change from the beginning of a stroke phase to its end.
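
The step of assembling per-phase tiers into a single tier can be sketched as follows. This is a minimal illustration, not our actual script: the tier names, the millisecond time unit, and the `merge_phase_tiers` helper are assumptions made for the example, and the input is whatever list of start and end times you export from ELAN.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    start_ms: int
    end_ms: int
    value: str


def merge_phase_tiers(tiers: dict[str, list[tuple[int, int]]]) -> list[Annotation]:
    """Combine per-phase tiers into one time-ordered tier.

    Each annotation's value becomes the name of the tier it came from.
    """
    merged = [
        Annotation(start, end, phase)
        for phase, ranges in tiers.items()
        for start, end in ranges
    ]
    # NamedTuples sort by start time first, then end time, then value.
    return sorted(merged)


# Hypothetical example data: one (start, end) pair per annotation, in ms.
tiers = {
    "preparation": [(0, 300)],
    "stroke": [(300, 700), (1200, 1500)],
    "post-stroke hold": [(700, 1200)],
    "recovery": [(1500, 1900)],
}
single_tier = merge_phase_tiers(tiers)
for ann in single_tier:
    print(ann.start_ms, ann.end_ms, ann.value)
```

Using the tier name as the annotation value means labelers never have to type a phase name by hand.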

  • Labeler
    Someone who annotates or labels the gesture features
  • Labels, annotations
    Individual annotated tokens
  • Annotation value
    The text or label of the annotation
  • Annotation range
    The start to end of an annotation

Tips for annotation value

  • Use a question mark at the end if you are unsure
  • Use forward slash “/” when unable to decide between two annotation values, with the more likely one first
  • Label for the main gesturing hand for each gesture stroke. Usually the other hand is relaxed, dragging along, or doing a similar motion. We’ve never seen two hands execute completely different gestures unless you’re recording people patting their heads and rubbing their bellies simultaneously.

Tips for Workflow

  • Explore the ELAN preferences. Some things like “center on selected annotation” can be turned off so that the annotation you just selected doesn’t suddenly jump to the center of the screen.
  • Use the horizontal zoom slider to zoom in for better detail or out to see the bigger picture.
  • After you’ve labeled where the gesture phases are, copy the strokes tier, delete the annotation values in the copied tier, and use that for labeling other dimensions that describe what’s going on. (Referentiality, kinematics, etc.)

Perceptual Gesture Groupings (PGGs)

Series of gestures could be called repeated hits, beats, or shakes, with increasing speed and decreasing time separation between each stroke. In our experience, the boundaries between them are not so clear cut. Despite that, it is still very easy to perceive where the groupings are. In taking out the difficult decision-making process, we were able to focus on labeling the perceptual gesture groupings, and do it quickly and efficiently. With this process in place, we were able to notice higher level groupings of groupings.

We recommend starting off with perceptual phrase groupings. These are the easiest and fastest to label with very little training, and can provide fast insights to anyone new to gesture research. You can start by labeling the largest groups in one tier, and then work down into smaller and smaller perceptual groups in different tiers before you get to individual gesture strokes. The grouping tier directly above the strokes is considered PGG level 1.

Use your perception to label gesture strokes that appear to group together. The annotation range does not have to be 100% accurate so long as every gesture stroke in the group is entirely contained within the time boundaries.

Leave the annotation value blank. Use ELAN’s “Label and Number Annotations” tool under the Tier menu option. This will give you an ID to reference later. The number of gesture strokes contained within each PGG is an important measure to us, and is computed in post-processing of the data.
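
As an illustration of that post-processing step, the sketch below counts the strokes that fall entirely inside each PGG. The function name, the tuple layout, and the sample times are assumptions made for this example, not our actual script.

```python
def count_strokes_per_pgg(pggs, strokes):
    """Map each PGG ID to the number of strokes fully inside its time range.

    pggs:    list of (pgg_id, start_ms, end_ms)
    strokes: list of (start_ms, end_ms)
    """
    counts = {}
    for pgg_id, pgg_start, pgg_end in pggs:
        counts[pgg_id] = sum(
            1
            for s_start, s_end in strokes
            if pgg_start <= s_start and s_end <= pgg_end
        )
    return counts


# Hypothetical example data, times in milliseconds.
pggs = [("PGG1-1", 0, 2000), ("PGG1-2", 2100, 5000)]
strokes = [(100, 400), (600, 900), (1500, 1900), (2300, 2800), (4000, 4600)]
stroke_counts = count_strokes_per_pgg(pggs, strokes)
print(stroke_counts)  # {'PGG1-1': 3, 'PGG1-2': 2}
```

Because the check is containment rather than exact boundary matching, a PGG annotation that is drawn a little too wide still produces the right count.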

If the gesture phrase groupings appear to fall into higher level groupings, feel free to add more tiers. We normally used PGG1 for the smallest groups and PGG2 for larger groups. PGG3 occurs rarely and would depend on the speaker and duration of the video (we labeled only 13 in a 30-minute sample).


Because perception of gesture grouping levels can vary from one annotator to the next, we used two annotators for each sample, with a consensus labeling round for our final annotations. Variations in labeling usually occur across levels of PGGs. Disagreements are counted only when the boundaries of the larger annotation cut across the boundaries of the smaller annotation. We look at the labeler who annotated longer PGGs (containing more strokes), then check whether the initial SDG of each of those PGGs is also an initial SDG for the second labeler, and whether the last SDG of the PGG is also a last SDG for the second labeler. If not, it counts as a disagreement.

Sometimes what one labeler considers an SDG is labeled by another labeler as a small movement (too slight to be counted as an SDG, but the information is still captured). A PGG disagreement that includes such a movement at its beginning or end is counted as a 50% disagreement rather than a full disagreement.
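
The disagreement rule described above can be sketched in code. This is one possible reading, not a definitive implementation: it assumes each PGG is reduced to the index of its first and last SDG, and that the 50% case applies whenever a disagreeing PGG starts or ends on an SDG the other labeler marked as a small movement. The function and variable names are hypothetical.

```python
def score_disagreements(longer_pggs, other_pggs, small_movements=frozenset()):
    """Sum disagreements for the labeler with the longer PGGs.

    longer_pggs, other_pggs: lists of (first_sdg, last_sdg) index pairs.
    small_movements: SDG indices the other labeler marked as small movements.
    """
    other_firsts = {first for first, _ in other_pggs}
    other_lasts = {last for _, last in other_pggs}
    total = 0.0
    for first, last in longer_pggs:
        if first in other_firsts and last in other_lasts:
            continue  # both boundaries line up: agreement
        if first in small_movements or last in small_movements:
            total += 0.5  # assumed reading of the 50% rule
        else:
            total += 1.0
    return total


# Hypothetical example: SDGs are numbered 1..13 in time order.
longer = [(1, 5), (6, 8), (9, 12)]
other = [(1, 3), (4, 5), (6, 7), (9, 10), (11, 13)]
total = score_disagreements(longer, other, small_movements={8})
print(total)  # 1.5
```

In the example, the first PGG agrees, the second is a half disagreement because SDG 8 was only a small movement for the other labeler, and the third is a full disagreement because the other labeler's last group runs past SDG 12.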

Gesture Strokes and Phases

Gesture strokes are described as the peak of effort in a G-Phrase in Kendon 1980: “A phrase of gesticulation, or G-Phrase is distinguished for every phase in the excursionary movement in which the limb, or part of it, shows a distinct peaking of effort – ‘effort’ here used in the technical sense of Rudolf Laban (Dell 1970). Such an effort peak, or less technically, such a moment of accented movement, is termed the stroke of the G-Phrase.” We interpret this to say that a gesture can include various phases, including preparation, stroke, hold, and recovery. The stroke phase is necessary for identification of the movement as a gesticulation unit. The other phases are not always used. The phases we labeled are: preparation phase, pre-stroke hold, stroke, post-stroke hold, relaxation, and relaxed. Thus we separated Kendon’s proposed recovery phase into two parts: relaxation, where the hand is in motion, and relaxed, where the hand has stopped moving. This helped disambiguate some of the questions we had and provided more detail to our investigations.

Annotation range:

It can be difficult to label the range of gesture strokes. The gesture may be too slow, too fast, or too blurry. In the case of blurry video: our video frame rate is 30 frames per second, and though that is much too low for automatic motion capture, it did allow us to use the frames where the hands become clear as boundary markers for the annotation.

Progression of blurry to clear video frames:

[Images: three video frames, labeled blurry, less blurry, and cleared up]

Annotation value:

Each gesture phase has its own tier. The annotation value is created with the “Label and Number Annotations” tool under the Tier menu option.


If you are able to use motion tracking, use the calculated velocity and acceleration to help determine the start and end of the gesture strokes.
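
As a rough sketch of how tracked data could support this, the code below computes frame-to-frame hand speed and returns the frame spans where the hand moves faster than a threshold. It uses speed only (acceleration is left out for brevity), the threshold and sample positions are made up, and the spans are candidates for a labeler to review, since a moving span may also include preparation or recovery.

```python
import math

FPS = 30  # frame rate of our video samples


def speeds(positions, fps=FPS):
    """Frame-to-frame hand speed, in position units per second."""
    return [math.dist(p, q) * fps for p, q in zip(positions, positions[1:])]


def moving_spans(positions, threshold, fps=FPS):
    """Return (start_frame, end_frame) spans where speed exceeds threshold."""
    spans = []
    start = None
    for i, v in enumerate(speeds(positions, fps)):
        if v > threshold and start is None:
            start = i  # motion begins at frame i
        elif v <= threshold and start is not None:
            spans.append((start, i))  # hand has settled by frame i
            start = None
    if start is not None:
        spans.append((start, len(positions) - 1))
    return spans


# Hypothetical tracked (x, y) positions of one hand, one per video frame.
track = [(0, 0), (0, 0), (0, 1), (0, 3), (0, 6), (0, 6), (0, 6)]
candidate_spans = moving_spans(track, threshold=15)
print(candidate_spans)  # [(1, 4)]
```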

Hand Shape

Our hand shapes are labeled independently of their orientation. This not only limits the number of hand shape labels but also makes them easier to quantify.

Annotation range:

Hand shape annotation range usually spans multiple gesture strokes based on the speakers we have seen.

Annotation value:

The value is usually a single capital letter abbreviated from the word that best describes the hand shape. For example, we use “F” for “Fist,” “O” for “Open,” and “R” for “Relaxed.” For uncertainty between two hand shapes, we use a 5-point spectrum: both hand shape abbreviations with a number from 1 to 5 in between, for example C4O. The first letter indicates the shape the labeler initially saw (here, Cup), and the number indicates how close the shape is to the second hand shape (here, Open). C4O means the shape is more like Open than Cup. It is equivalent to O2C, so the order does not matter.
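
Because C4O and O2C describe the same shape, a script that compares hand shape values needs to normalize them first. The sketch below is one way to do that. It assumes that reversing the order maps the number n to 6 - n (inferred from the C4O and O2C example), it orders the two letters alphabetically as an arbitrary convention, and it does not handle the “?” and “/” markers from the annotation tips. The function name is hypothetical.

```python
import re

# Single-letter abbreviations listed in this manual.
HAND_SHAPES = set("DGFROCKAPHQTLSWJI")

_BLEND = re.compile(r"^([A-Z])([1-5])([A-Z])$")


def canonical_hand_shape(value):
    """Normalize a hand shape value so that equivalent blends compare equal."""
    if value in HAND_SHAPES:
        return value
    match = _BLEND.match(value)
    if (
        not match
        or match[1] not in HAND_SHAPES
        or match[3] not in HAND_SHAPES
        or match[1] == match[3]
    ):
        raise ValueError(f"unrecognized hand shape value: {value!r}")
    first, n, second = match[1], int(match[2]), match[3]
    if first > second:
        # Swap the letters and mirror the number on the 1-5 scale.
        first, second, n = second, first, 6 - n
    return f"{first}{n}{second}"


print(canonical_hand_shape("C4O"))  # C4O
print(canonical_hand_shape("O2C"))  # C4O
```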

Check the library of hand shapes at the end of this manual. It contains images of the different hand shapes as a reference.

Abbr  Description
D     Deictic, pointing
G     Gun
F     Fist
R     Relaxed
O     Open, fingers spread outwards
C     Cup or claw
K     Knife, fingers straight together and flat
A     Angled, fingers flat together, bent 90 degrees above palm
H     Hole, making an empty cylinder shape
Q     Okay, iconic “okay” sign
T     Two, twice, peace sign
L     Loose fist, relaxed shape with fingers curled in
S     Steepled, two open hands with fingertips touching
W     Wall, two open hands with fingertips touching, forming a vertical barrier
J     Jailed, like wall, but with fingers spaced out
I     Intertwined, two cup-like hands clasped, with fingers intertwined


We do not annotate hand shapes for regions where there are no strokes. Occasionally there may be a single gesture stroke that has a hand shape distinct from its neighbors. When both hands have different hand shapes, and the non-gesturing hand’s hand shape changes, the annotation value would refer to the gesturing hand. In the case when both hands have different shapes, include “RH” for right hand and “LH” for left hand.


These hand shapes will vary a bit across different speakers. One speaker’s default “Cup” handshape may be more rounded than another speaker’s “Cup” handshape. As we’re not doing an inter-speaker comparison, we use the default labels. If you are doing an inter-speaker comparison of handshapes, we recommend going through the video and taking screenshots of the different handshapes used.


Location

Location refers to where the hands are with respect to the body.

Location Diagram

Annotation range:

Every gesture stroke has a start location and an end location, even if they are the same. Sets of small gestures may stay in the same location for the duration of the set. For large gestures that cover more space, also label the extremes in between the start and end locations.

Annotation value:

Use the mapping grid for the values. Location is annotated as (right_hand_y),(right_hand_x);(left_hand_y),(left_hand_x). The y-values can have half values. For example, “3.5” would refer to the middle of the torso. Annotating the right hand first makes the labeling process faster.
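
For post-processing, a location value in this format can be split into its four parts. The sketch below assumes the parentheses in the format above are placeholders, so a value looks like “3.5,2;3,4”, although it tolerates literal parentheses too. Since the grid's x labels are not spelled out here, x is kept as text. The names are hypothetical.

```python
from typing import NamedTuple


class HandLocation(NamedTuple):
    y: float  # vertical grid value; half steps such as 3.5 are allowed
    x: str    # horizontal grid value, kept as written


def parse_location(value):
    """Split 'ry,rx;ly,lx' into (right_hand, left_hand) locations."""
    cleaned = value.replace("(", "").replace(")", "")
    right_text, left_text = cleaned.split(";")

    def parse_hand(text):
        y_text, x_text = text.split(",")
        y = float(y_text)
        if (y * 2) % 1 != 0:
            raise ValueError(f"y must be a whole or half value: {y_text!r}")
        return HandLocation(y, x_text.strip())

    return parse_hand(right_text), parse_hand(left_text)


right, left = parse_location("3.5,2;3,4")
print(right)  # HandLocation(y=3.5, x='2')
print(left)   # HandLocation(y=3.0, x='4')
```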


If you have tracked data of where the hands are for your video sample, do the location labeling anyway. Location labeling may be automated by using the tracked data and a script to chunk the ranges based on the grid above. Be sure that your tracked data is adjusted for the speaker’s torso movements.

Slight movements that are visible may not be captured as having a location change because the trajectory motion does not travel far enough to qualify as a substantial change.

Trajectory Shape

Trajectory shape labeling describes the path shape of the gesture stroke. Following the active hand movement, you can see the motion trace out a straight path, curved path, or, if looking at multiple gesture strokes, a looping path. Labeling for trajectory shape is usually straightforward, but does have nuances to pay careful attention to. For example, the looping trajectory shape must apply to multiple consecutive strokes, otherwise if a movement path looks like a “loop” with surrounding strokes being non-looping, the one in question is labeled as having a curved trajectory shape.
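
The rule that a looping label must apply to multiple consecutive strokes lends itself to a simple consistency check after labeling. The sketch below flags any stroke labeled “L” whose neighbors are both non-looping, so a labeler can review it and probably relabel it as curved. It assumes the labels are listed in time order and ignores pauses or group boundaries between neighboring strokes. The function name is hypothetical.

```python
def isolated_loops(labels):
    """Return indices of strokes labeled 'L' with no adjacent 'L' stroke."""
    flagged = []
    for i, label in enumerate(labels):
        if label != "L":
            continue
        prev_is_loop = i > 0 and labels[i - 1] == "L"
        next_is_loop = i + 1 < len(labels) and labels[i + 1] == "L"
        if not (prev_is_loop or next_is_loop):
            flagged.append(i)
    return flagged


# Hypothetical trajectory labels for seven consecutive strokes.
labels = ["S", "L", "C", "L", "L", "L", "S-diag"]
to_review = isolated_loops(labels)
print(to_review)  # [1]
```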

The straight trajectory shape is usually a vertical down or up motion. It can occasionally be diagonal (in most examples we’ve seen, the motion moves upwards and out) or horizontal, usually outwards. Keep in mind that these variations are usually infrequent, perhaps because they take more energy to carry out, and may occur more often for a particular speaker.

Especially when deciding between labels for curved or the variations of straight-horizontal and straight-diagonal, be sure to review the video snippet containing the entire SDG with all its phases and then evaluate the influences of the preparation and recovery phases on your perception of the stroke trajectory phase. Take them into consideration as contextual clues but remember that the trajectory shape label refers to the stroke phase, not to any of the other phases. It is always helpful to look at the larger context, the other SDGs around the stroke you are labeling.

Now, looping strokes and successive curved strokes can be difficult to tell apart. Keep in mind that our stroke-labeling convention is to cut the continuous looping into individual “loops” at the end of the loop, usually at the lowest point in the motion, and label them each as a stroke.

The main difference between curved and looping is that there is a pause, hold phase, or break in the movement path between successive curved strokes, whereas these do not occur between successive looping strokes. Another helpful cue is the preparation phase, which may occur before each curved stroke, but only before a set of successive looping strokes. Likewise with the recovery phase, which may occur after each curved stroke, but only at the end of a set of successive looping strokes.

To label, play back the video segment of the stroke and then the entire SDG or a longer segment that includes the stroke to verify the trajectory shape. Label the trajectory shape, and make sure it refers to the stroke, not the preparation phase or the recovery phase. As a reminder, gesture kinematic labeling is carried out with the audio muted.

Annotation range:

To facilitate labeling, copy the stroke tier, rename it for trajectory shape labeling, and delete the annotation values before you start labeling. There’s an ELAN feature for this. In this way, you can focus on the labeling and not have to also decide where to label.

Annotation value:

Use these labels:

Abbr Description
S Straight
C Curved
L Looping
S-horiz Straight, horizontal
S-diag Straight, diagonal

Hand Shape Library

Handshapes are quantified by 3 measures: curl, finger spread, and intention. Curl describes how much the fingers are curled in, with 5 being most curled in a fist shape, and 1 being least curled in the Open shape. Spread refers to the distance between fingers, or how far they are from each other. Open has the most spread at 5, and Knife has no spread at 1. Intention describes a qualitative intent behind the hand shape. Q, or “Okay” handshape has intent at 5, and Relaxed has an intent at 1. Each handshape has a unique set of numbers for Curl, Spread, and Intent. This allows them to be quantified for later analysis.

Name         Abbr  Description
Deictic      D     index finger pointing
Gun          G     index finger and thumb out
Fist         F     all fingers curled in
Relaxed      R     all fingers loose with no intention
Open         O     all fingers spread outward
Cup          C     all fingers curled midway, claw-like, high intention
Knife        K     fingers straight and flat
Angled       A     like Knife, but bent
Pursed       P     fingers pointed together
Hole         H     fingers curved in, forming cylinder shape
Okay         Q     iconic OK shape
Two          T     index and middle finger pointed out
Loose        L     relaxed shape with fingers curled in
Steepled     S     two open hands, fingertips touching
Wall         W     two open hands forming a wall or barrier
Jailed       J     two hands forming a barrier with fingers not pressed against each other
Intertwined  I     two cup hands clasped, with fingers intertwined


Ada Ren at ada.inspired(at)gmail(dot)com