Thursday, December 13, 2007

Ink Features

Ink Features for Diagram Recognition - Rachel Patel, Beryl Plimmer, John Grundy, Ross Ihaka


Patel et al. statistically determine which of the various features used in sketch recognition are most useful for distinguishing shapes from text. They drew from a possible set of 46 features, including both features previously used in sketch recognition and newly created ones. First they collected sample sketches from 26 people for nine diagram types, each of which included shapes and text, and computed each feature from these samples. Next, a statistical analysis was conducted to determine the most useful features. Using statistical partitioning, the most useful feature was chosen iteratively to build a decision tree. At each node in the tree, a binary test is performed on the feature that best separates the two classes. For each outcome of the test, if the decision is not "good enough," the next best feature for that branch is chosen and another test is formed. Surprisingly, the tree is only six levels deep and employs only eight features. When compared to the Microsoft Ink and InkKit dividers, the new method outperformed both in shape classification on training data, though it did worse than Microsoft Ink in text classification (Ink appears to lean strongly towards classifying strokes as text); on test data it was again better on shapes. On all sets the text recognition rate was close to that of the InkKit divider. Overall, the new recognizer shows a considerable improvement in the accuracy of dividing text from shapes. The eight features used are: time till next stroke, speed till next stroke, distance from last stroke, distance to next stroke, bounding box width, perimeter to area, amount of ink inside, and total angle.
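
The resulting tree amounts to nested binary tests on stroke features. A minimal sketch of the idea, with invented thresholds and only a few of the eight features standing in for the learned tree (the paper derives both the tests and the thresholds by statistical partitioning):

```python
# Toy decision tree for shape/text division. Feature names follow the
# paper's list; the thresholds and tree shape here are illustrative
# assumptions, not the learned values.

def classify_stroke(features):
    """Classify a stroke as 'text' or 'shape' with nested binary tests.

    `features` is a dict keyed by feature name.
    """
    # Root test: text strokes tend to be written in quick succession.
    if features["time_to_next_stroke"] < 0.25:        # seconds (assumed)
        # Small, dense strokes on this branch are likely letters.
        if features["bounding_box_width"] < 40:       # pixels (assumed)
            return "text"
        return "shape"
    # Isolated strokes with a low perimeter-to-area ratio look like shapes.
    if features["perimeter_to_area"] < 0.1:
        return "shape"
    return "text"

print(classify_stroke({"time_to_next_stroke": 0.1,
                       "bounding_box_width": 20,
                       "perimeter_to_area": 0.5}))    # text
```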

Discussion - An overall accuracy rate would be useful in addition to the shape-only and text-only accuracies. The overall accuracy of the new method depends on how heavily the datasets were weighted towards shapes or text. If diagrams are shape heavy, the new recognizer obviously performs better, but if text is predominant, the others would be better. The distribution of text versus shapes may be difficult to predict a priori for many domains. Alternately, a combination recognizer, perhaps using a voting scheme or boosting, could be used to gain the advantages of the highly accurate text classifiers along with the accuracy of the shape recognizers.
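
The voting idea is easy to sketch; the classifiers below are toy stand-ins rather than the actual dividers:

```python
# Majority-vote combination of stroke classifiers. The three classifiers
# here are invented placeholders for dividers like Microsoft Ink or InkKit.
from collections import Counter

def vote(classifiers, stroke):
    """Return the label the majority of classifiers agree on."""
    labels = [classify(stroke) for classify in classifiers]
    return Counter(labels).most_common(1)[0][0]

# Toy classifiers that disagree on some strokes:
always_text  = lambda s: "text"
always_shape = lambda s: "shape"
by_width     = lambda s: "text" if s["width"] < 40 else "shape"

print(vote([always_text, always_shape, by_width], {"width": 25}))  # text
```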

Speech and Sketching

Speech and Sketching: An Empirical Study of Multimodal Interaction - A. Adler and R. Davis

Adler and Davis conduct a series of user studies to determine how a user would interact with a system combining both sketch and speech input, one that moves towards an interactive, two-way communication between system and user. Previous studies were often limited to a set of commands, one-way interaction, annotation (no sketching), or a set of fixed sketch symbols. Rather than attempting to set up a complex Wizard of Oz experiment, the experimenter sat face to face with the user, taking the role of an idealized recognition system, responding to the user's verbal questions and modifying drawings as appropriate. Users were asked to sketch and describe drawings from four different domains. The recorded conversations were analyzed qualitatively in five areas: sketching, language, multimodal interaction, questions, and comments. Sketching was divided into groups by stroke count and by amount of ink. By stroke count, 52% of strokes were creation, 40% writing, 5% selection, and 3% modification, while by amount of ink, 63% was creation, 21% writing, 8% selection, and 8% modification. Three colors were used on average, and users frequently changed colors. Color was used to refer to previously sketched sections, to create groups, or to reflect the real-life color of a sketched object. They found language was often ungrammatical and repetitive, with the repeated phrases being key to the discussion and related to what the user was currently sketching. Multimodal interaction often linked the order of speaking and drawing, or combined writing an annotation or label with saying it. Users often finished sketching something they had spoken about before speaking again. When asked questions, the users clarified the sketch and were encouraged to speak more freely. When commenting, the user often described what they were planning to do or remarked on the quality of the sketch.
Qualitatively, users were found to begin speaking before drawing if they were speaking a phrase, but to begin drawing before speaking if they were speaking a single word.

Wednesday, December 12, 2007

Three main concerns

Three main concerns in sketch recognition and an approach to addressing them - James V. Mahoney and Markus P. J. Fromherz

Mahoney and Fromherz examine the segmentation of strokes into lines and curves useful for recognition. For the classification of strokes into the components of stick figures, segmentation ambiguity (the lack of a 1-1 mapping between strokes and possible segments) often occurs due to small gaps between strokes, overlapping strokes, crossing strokes, or interaction with background strokes. They seek to correct these problems by creating multiple interpretations, either joining or dividing strokes to produce alternate candidate segments. These alternate interpretations can greatly improve the subgraph matching recognition method used to classify components. Their method uses two stages to determine possible alternate graphs. The first is graph elaboration, composed of five parts. Proximity linking joins small gaps. Link closure replaces connections or intersections with links. Virtual junction splitting divides a stroke near a non-connected endpoint and inserts links. Spurious segment jumping inserts links around small, possibly erroneous strokes. Lastly, continuity tracing adds a connected interpretation for segments that could connect smoothly.
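
Proximity linking, the first elaboration step, can be sketched as an endpoint-distance test. The stroke representation (lists of points) and the gap threshold below are assumptions for illustration:

```python
# Toy proximity linking: propose a link wherever an endpoint of one stroke
# lies within `gap` of an endpoint of another stroke.
import math

def proximity_links(strokes, gap=5.0):
    """Return (i, j, p, q) tuples linking endpoint p of stroke i to
    endpoint q of stroke j when the gap between them is small."""
    links = []
    for i, a in enumerate(strokes):
        for j, b in enumerate(strokes):
            if i >= j:
                continue
            for p in (a[0], a[-1]):          # endpoints of stroke i
                for q in (b[0], b[-1]):      # endpoints of stroke j
                    if math.hypot(p[0] - q[0], p[1] - q[1]) <= gap:
                        links.append((i, j, p, q))
    return links

# Two near-collinear segments separated by a 3-unit gap:
s1 = [(0, 0), (10, 0)]
s2 = [(13, 0), (25, 0)]
print(proximity_links([s1, s2]))  # one link, between (10, 0) and (13, 0)
```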

The second stage attempts to classify strokes. First, strokes are preprocessed and divided at corners, then passed through the first stage, creating an input graph of connected segments and segment interpretations. Next, the input is compared to models of sketches, each describing a connected graph for that class, including optional segments. This is done using subgraph matching: the input graph is checked for a subgraph whose components match the components of the model graph, as well as matching characteristics of the model components (relative length, etc.). Subgraph matching is solved by translating it to a constraint satisfaction problem (CSP). The CSP has several constraints not translated from the graph, including 1-1 matching of data to model components, minimum total link length, optimal proportions, and maximal part count (as many non-optional parts as possible assigned to data components).
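
Subgraph matching as constraint satisfaction can be sketched with a small backtracking search. This keeps only the hard structural constraints (1-1 matching, edge preservation); the paper's soft constraints (link length, proportions, part count) would be added as an objective on top:

```python
# Toy subgraph-matching CSP: assign each model node a distinct data node so
# that every model edge maps onto some data edge. Backtracking search.

def match(model_edges, data_edges, model_nodes, data_nodes):
    """Return a model->data node assignment embedding the model graph in
    the data graph, or None if no embedding exists."""
    data_edge_set = {frozenset(e) for e in data_edges}

    def extend(assignment):
        if len(assignment) == len(model_nodes):
            return dict(assignment)
        m = model_nodes[len(assignment)]
        for d in data_nodes:
            if d in assignment.values():
                continue                      # enforce 1-1 matching
            assignment[m] = d
            # Check every model edge whose endpoints are both assigned.
            if all(frozenset((assignment[u], assignment[v])) in data_edge_set
                   for u, v in model_edges
                   if u in assignment and v in assignment):
                result = extend(assignment)
                if result:
                    return result
            del assignment[m]
        return None

    return extend({})

# Model: a path a-b-c; data: a triangle 0-1-2, which contains that path.
print(match([("a", "b"), ("b", "c")], [(0, 1), (1, 2), (2, 0)],
            ["a", "b", "c"], [0, 1, 2]))
```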

Monday, December 10, 2007

Constellation

Constellation Models for Sketch Recognition - D. Sharon and M. van de Panne

Sharon and van de Panne use a constellation model to recognize groups of shapes. Individual objects are represented by a feature vector consisting of the x and y coordinates of the center of the bounding box and the length and angle of the bounding box's diagonal. Pairs of parts are represented by a feature vector consisting of the relative change in x and y coordinates, the minimum distance between an endpoint of the first object and any point in the second, and the minimum distance between an endpoint of the second object and any point in the first. However, since this model does not scale well, parts are subdivided into two groups, mandatory and optional. Individual features are computed on all parts, while pair features are computed only if one of the pair is a mandatory part. Recognition is performed in two stages: first mandatory parts are found, then unlabeled strokes are fit to optional parts. Models are learned using probabilistic methods. From labeled examples, the mean and covariance are determined for the feature vectors of each modeled class. Strokes are labeled as various mandatory and optional parts by determining the maximum likelihood labeling. Using branch-and-bound, the algorithm assigns mandatory labels in sequence to the stroke that is most likely to bear each label. After training on 50-60 examples for each class, recognition is tested on several examples and generally runs in a relatively short time.
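
The maximum-likelihood labeling step can be sketched with independent Gaussians per feature (a simplification of the full covariance model; the part names and learned statistics below are invented for illustration):

```python
# Toy maximum-likelihood part labeling over single-part features
# (center x, center y, diagonal length, diagonal angle).
import math

def log_likelihood(x, mean, var):
    """Log-likelihood of feature vector x under independent Gaussians."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))

def best_label(features, models):
    """Pick the part label whose learned Gaussian gives the stroke's
    feature vector the highest likelihood."""
    return max(models, key=lambda lbl: log_likelihood(features, *models[lbl]))

# Hypothetical learned (mean, variance) per part of some sketch class:
models = {"head": ([50, 20, 30, 0.8], [25, 25, 16, 0.05]),
          "body": ([50, 80, 90, 1.5], [25, 100, 64, 0.05])}
print(best_label([52, 22, 28, 0.75], models))  # head
```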

Discussion - Recognition accuracy is surprisingly missing, while the results focus on the time taken during successful recognitions. It is also unclear whether all five models are considered when assigning labels to a new sketch, or if only the model corresponding to the class of the sketch is used. If the latter, users must identify what kind of sketch has been drawn before the system can label the parts, at which point labeling the components themselves doesn't seem like too strenuous an additional task.

Perception

Perceptually Based Learning of Shape Descriptions for Sketch Recognition - Olya Veselova and Randall Davis

Veselova and Davis seek to learn to recognize sketches based on ideas derived from human perception. After low level recognition, higher level shapes are constructed based on constraints. These constraints are based on how people perceive similar shapes. People base similarity on several "singularities" determined previously by Goldmeier, including verticality, horizontality, straightness, parallelism, and others. These properties are translated into constraints. In addition, humans place a greater emphasis on some of these singularities, so the constraints are ranked according to priority. Priorities are adjusted based on three global properties: obstruction, tension lines, and grouping. Obstruction depends on the amount of ink between two constrained objects; the more ink, the lower the constraint is ranked. Tension lines form a "balance" in the shape and are oriented along a grid system. Grouping is determined in two ways: connectedness and subshapes (similar to simple shapes making more complex shapes in LADDER). Finally, when compared to user-determined similarity, the system achieved 77% agreement with the majority vote on similarity. The system performed even better on shapes on which users were in high agreement.
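
The priority-adjustment idea can be sketched as a discounted ranking. The weights and the 1/(1+k) obstruction discount below are illustrative assumptions, not the paper's actual scheme:

```python
# Toy constraint ranking: each constraint has a base perceptual priority
# (singularities like verticality rank high), discounted by how much ink
# obstructs the constrained objects. The discount rule is an assumption.

def rank_constraints(constraints):
    """Order constraints by priority after discounting for obstruction."""
    def effective(c):
        return c["priority"] / (1.0 + c["obstruction"])
    return sorted(constraints, key=effective, reverse=True)

cs = [{"name": "vertical", "priority": 1.0, "obstruction": 0},
      {"name": "parallel", "priority": 0.9, "obstruction": 2},
      {"name": "straight", "priority": 0.8, "obstruction": 0}]
print([c["name"] for c in rank_constraints(cs)])
# ['vertical', 'straight', 'parallel'] -- obstruction demotes 'parallel'
```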

Discussion: As noted in the final section of the paper, the constraints are grouped into broad classes, leaving a good deal of wiggle room for similarity. Also, the authors note that the system lacks global symmetry constraints, a key singularity described earlier in the paper; such a constraint would seem to be one of the more important ones to add. Also lacking is a negation constraint; however, this is a problem in many constraint systems.

Assistance

Naturally Conveyed Explanations of Device Behavior - Michael Oltmans, Randall Davis

Assistance combines sketching with verbal descriptions of the sketch. First the user draws a sketch of a physical system. Next, the user switches to description mode and describes how the components interact, both verbally and by sketching symbols. From these descriptions, Assistance develops a model of what each component does, how it will affect other components, and how the system of components operates.

Next, Oltmans provides an example input to describe how Assistance interprets the description. First, a verbal description is divided into clauses. These clauses provide information about a specific object and how it should move. Additionally, this information can be chained to objects that are connected to a described object. Conditionals linking clauses are inferred to represent causal relationships between objects. Drawn arrows clarify the meaning of the motion described verbally, giving it direction within the sketch. Causal interactions can also be inferred from the use of transitive verbs or from internal models, such as the interaction between a pulley and objects connected to a string looped over the pulley. To demonstrate understanding, Assistance can answer questions about how a component falls into the sequence of action and what it will do, in addition to providing inferred representations of the components.

Oltmans then details the implementation of Assistance. The sketch is preprocessed by Assist to find objects and drawn connections, as well as description arrows. ViaVoice processes and parses the verbal description. These are fed to Assistance, which translates them into propositional statements that are fed to a forward-chaining rule system and a truth maintenance system. After Assist finds objects and their connections to each other or to the fixed plane, Assistance determines how they could potentially move. ViaVoice input is translated into a constrained grammar syntax that understands motions, conditions, and propulsions. Reference gestures such as arrows and points link the verbal descriptions to the objects they describe. To translate utterances into events, the voice input is filtered through a parse tree that determines subject, verb, and direct object. Then the referenced objects are determined and the motions and interactions assigned. Additionally, reference gestures must be associated with verbal directives. When an ambiguous verbal reference is made ("this"), it must be accompanied by a distinguishing reference gesture, and an ordering of gestures can be maintained for multiple vague references within a verbal phrase. Next, arrows are translated into events, and multiple event representations are merged. Assistance then finds implicit and explicit causal links, inferring them from events or finding voiced conditionals. Finally, forward chaining and truth maintenance attempt to determine a consistent causal chain that is compatible with the user's intent. Events with no definite cause trigger a search through plausible causes (explicit versus implicit causal links).
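
The forward-chaining step can be sketched in a few lines: apply rules until no new propositions appear. The propositions and rules below are invented examples, not Assistance's actual rule base:

```python
# Minimal forward chaining over propositional statements, in the spirit of
# Assistance's rule system. Rules are (set-of-premises, conclusion) pairs.

def forward_chain(facts, rules):
    """Repeatedly fire rules whose premises all hold until a fixed point."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in facts and premises <= facts:
                facts.add(conclusion)
                changed = True
    return facts

rules = [({"pushes(lever, ball)"}, "moves(ball)"),
         ({"moves(ball)", "on(ball, ramp)"}, "rolls_down(ball)")]
derived = forward_chain({"pushes(lever, ball)", "on(ball, ramp)"}, rules)
print(sorted(derived))  # includes the chained conclusion rolls_down(ball)
```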

Discussion - While the description of Assistance is straightforward, it lacks evidence in the form of user studies that would suggest whether users find the combination of sketching and voice useful and more intuitive than other options, such as simply sketching. Some users may find the voice descriptions somewhat less useful if they use arrows to specify direction of motion. However, Assistance appears to be a first step toward a powerful combination of interface techniques.

Gross, Do

Ambiguous Intentions: a Paper-like Interface for Creative Design - Mark D Gross, Ellen Yi-Luen Do

Gross and Do begin by describing three characteristics used in what they term diagramming. The first is abstraction, which can take the form of a simple, abstract figure representing a more complex one, or a complex figure being recognized as an instance of the simple one. Second is ambiguity, which represents postponed decisions. Rather than resolve all ambiguity, the authors suggest that some ambiguity is intentional and a system should not be overly concerned with resolving it. The last is imprecision, allowing the user to approximate without focusing on fine details.

Next, related work is discussed before moving to implementation details of their system, called Cocktail Napkin. Users can draw using a variety of simulated tools, which may be pressure sensitive. As glyphs are recognized, a text label may be applied to show recognition. Also, strokes are not cleaned by default. Additionally, editing tools are provided and two-user collaboration is supported.

The authors next detail configurations. Configurations are highly similar to shapes in LADDER: they are composed of simpler shapes along with geometric constraints, and may be represented by an abstract symbol. Configuration recognition triggers after the user pauses for 5 seconds. Constraints are initially generated from an example sketch and are modifiable by the user, either by direct changes to the constraints or by providing additional examples. Finally, the user may provide an abstract symbol to represent the configuration.

Configurations may have different meanings depending on context. Additionally, basic shapes may have multiple interpretations. Cocktail Napkin maintains these multiple interpretations until additional information is provided to resolve the ambiguity. This information can come in several ways. First is overtracing, in which the user redraws the glyph more definitely and replaces the old. Alternately, if one of the interpretations is part of a larger configuration, the ambiguous glyph is resolved to the correct part of the configuration. Context can be either user specified or implicitly specified when a glyph or configuration unique to some context is found.

As the user draws, constraints are defined for recognized glyphs. As the user edits the drawing, Cocktail Napkin attempts to maintain the constraints between glyphs. Like configurations, constraints can be context dependent, such as connected for graphs or adjacent in a floorplan. By adjusting the constraints the user can incrementally refine the sketch.

Next, the authors move to implementation details. A low-level recognizer attempts to classify glyphs. First the glyph's bounding box is determined, assigned to a fuzzy class (tall, wide, etc.), and divided into a 3x3 grid. The path sequence through the grid is found, as is the number of corners. Along with total stroke count, these features are compared to the allowable ranges of template glyphs. First, candidate classes are narrowed by path sequence. Next, if the other features fall within allowable ranges, the glyph is assigned potential classes: if only one class is found it is assigned certainty 1, or 2 if there are multiple. Next, rotations of the path sequence are examined, and classes and certainties are assigned. Dots and lines are handled as special cases. If no match is found, one- and two-off paths are considered and assigned certainties 3 and 4. Alternately, the glyph can be user-specified as an example of a class, and the template is updated. Configurations are recognized using LISP functions representing the constraints of a configuration, and the component glyphs are replaced by the configuration. As specific configurations are found, the system transfers between contexts, with past contexts maintained in a context chain. Each context contains its own set of glyph templates, configurations, spatial constraints, and mappings to other contexts. New glyphs are matched within the current context if possible, moving to other contexts in the chain if necessary.
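
The grid path sequence is simple to sketch. Cell numbering (0-8, row-major) and the equal-thirds split are assumptions about the implementation:

```python
# Toy version of Cocktail Napkin's low-level glyph code: divide the glyph's
# bounding box into a 3x3 grid and record the sequence of cells the stroke
# passes through, collapsing consecutive repeats.

def path_sequence(points):
    """Return the list of 3x3 grid cells (0-8, row-major) visited by a
    stroke given as a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    w = max(max(xs) - x0, 1e-9)   # avoid division by zero for thin glyphs
    h = max(max(ys) - y0, 1e-9)
    seq = []
    for x, y in points:
        col = min(int(3 * (x - x0) / w), 2)
        row = min(int(3 * (y - y0) / h), 2)
        cell = 3 * row + col
        if not seq or seq[-1] != cell:
            seq.append(cell)
    return seq

# A diagonal stroke from top-left to bottom-right of its bounding box:
print(path_sequence([(0, 0), (5, 5), (10, 10)]))  # [0, 4, 8]
```

Note that this cell sequence changes under rotation, which is exactly the weakness raised in the discussion below.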

When the system was presented to users, they at first attempted to just draw without using recognition, intent on completing the drawing. Additionally, most users found the recognition echo annoying.

Discussion - The low-level grid system would seem to have several flaws. Primarily, rotations of glyphs would require storing a large number of paths, as would local details. For example, consider a triangle. It could be pointed upward or downward, or its legs could form an L. Each has a different path, and all must be stored under the triangle template. Also, the context transitioning is not well detailed.