Modeling guidance and recognition in categorical search: bridging human and computer object detection
Although object detection methods have been widely studied, the performance of state-of-the-art object detectors still lags far behind that of humans. Humans effortlessly perform object detection tasks across many object categories hundreds of times a day. The main effort in the computer vision community is aimed at improving the performance of object detectors, while comparatively little research has been done on understanding how humans perform object detection. In this thesis, we analyze the relationship between human behavior and computer vision object detection methods on both the guidance and the recognition tasks.

In our experiment, human observers searched for a categorically-defined teddy bear or butterfly target among non-targets rated as having HIGH, MEDIUM, or LOW visual similarity to the target classes. Actual targets showed very strong search guidance, measured by the first fixated objects. Guidance to non-target objects was also in proportion to their visual similarity to the target: high-similarity objects were first fixated the most and low-similarity objects the least.

We design several computational experiments. First, we propose a computational model that uses C2 features and SVMs in the context of the Target Acquisition Model (TAM) to model human behavior in an object detection task. The eye movement behavior of our computational model matched human behavior almost perfectly, showing strong guidance to targets and the same pattern of first fixations on target-similar objects. We conclude that categorical search is guided, and that driving this guidance are visual similarity relationships that can be quantified in terms of distance from an SVM classification boundary. Second, we train and evaluate computational vision models for object category recognition and compare their output to human behavior.
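The idea of quantifying visual similarity as distance from an SVM classification boundary can be sketched as follows. This is an illustrative sketch, not the thesis code: the random feature vectors stand in for the C2 features used in the actual model, and the class means are invented for demonstration.

```python
# Illustrative sketch: rank objects by "target-likeness" using their signed
# score relative to an SVM decision boundary, as the abstract describes.
# The random 50-D features are hypothetical stand-ins for C2 features.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical training set: 100 target exemplars vs. 100 non-targets.
X = np.vstack([rng.normal(1.0, 1.0, (100, 50)),
               rng.normal(-1.0, 1.0, (100, 50))])
y = np.array([1] * 100 + [0] * 100)

svm = LinearSVC(C=1.0).fit(X, y)

# decision_function gives w.x + b: the signed distance from the separating
# hyperplane up to a scale factor. Larger values = more "target-like",
# so such objects would be predicted to attract the first fixation.
scene_objects = rng.normal(0.0, 1.5, (3, 50))  # e.g. three distractors
similarity = svm.decision_function(scene_objects)
ranking = np.argsort(-similarity)  # most target-similar object first
```

Under this account, a HIGH-similarity distractor simply sits closer to the target side of the boundary than a LOW-similarity one, which is enough to reproduce the graded first-fixation pattern described above.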
Some algorithms do well at predicting which object humans will fixate first, but the features that perform best for classification differ from those that predict human behavior most closely. This is a critical question for developing visual search algorithms that produce perceptually meaningful results. In addition, we demonstrate that the information available in the fixation behavior of subjects is often sufficient to decode the category of their search target, essentially reading a person's mind by analyzing what they look at, using a technique that we refer to as behavioral decoding. Our results show that we can predict an observer's search target from their fixation pattern using two SVM-based classifiers, especially when one of the distractors was rated as being visually similar to the target category. These findings have implications for the visual similarity relationships underlying search guidance and distractor rejection, and demonstrate the feasibility of using these relationships to decode a person's task or goal.