Object Recognition


Introduction

Animals and people need to recognise other animals, people and objects that are important for their survival. At the most basic level, this means recognising things to eat, things that may eat them, places to inhabit and potential mates.

Additionally, on a social level, animals and people need to recognise members of their family or social group. We would not be able to survive without this ability.

Challenges of Object Recognition

Object recognition poses many challenges. The main one is that a single 3D object produces a huge array of different 2D retinal images, depending on the angle it is viewed from, its distance from the eye and the lighting it is under. There are also many different varieties of the same type of object, such as pens (see the picture below). Despite the many different appearances they can take, we know that they are all pens of some type.

4 pens on a white background. A pink marker, a white branded pen, an orange felt tip pen and a blue biro.

All of these are seen as pens, even though they all look very different

Therefore we need to correctly recognise various 2D images of the same type of 3D object as belonging to the same thing. It is also important to categorise together images created by different types of the same thing, be it pens, dogs, cats, books etc. This grouping of "different appearance but same type of object" is known as STIMULUS EQUIVALENCE. However, we still need to be able to distinguish the differences between the different types of the same object.

 

Object Recognition Mechanisms

Animals are thought to use two broad kinds of object recognition mechanism: simple and complex. We shall now look at both in more detail.

 

Simple Object Recognition Mechanism

Primitive animals, such as insects and fish, detect "key stimuli": features, such as a motion or colour pattern, that are common to all images of the same object. This is not ideal for image recognition as it is an inflexible process that can be easily fooled. For example, Tinbergen (1951) demonstrated that male sticklebacks recognise rival males by their red bellies. He also found that they will respond aggressively to a crude fish model as long as it has a red belly (see image below), because the simple recognition mechanism they use identifies the crude model as a rival male stickleback.

Photograph of a stickleback in a fish tank with a fake stickleback toy with a red belly.

Tinbergen's experiment with a crude fish model causing the same reaction as if a real male stickleback was present
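
The inflexibility of key-stimulus recognition can be illustrated with a minimal sketch. The class, field names and the three example sightings are hypothetical and chosen only for illustration; the point is that the response depends on a single cue and ignores everything else.

```python
from dataclasses import dataclass


@dataclass
class Sighting:
    """A simplified description of something the fish sees (hypothetical fields)."""
    label: str
    has_red_belly: bool
    looks_like_fish: bool


def key_stimulus_response(sighting: Sighting) -> str:
    """Respond to the single key stimulus only; every other cue is ignored."""
    return "attack" if sighting.has_red_belly else "ignore"


rival = Sighting("real rival male", has_red_belly=True, looks_like_fish=True)
crude_model = Sighting("crude model with red belly", has_red_belly=True, looks_like_fish=False)
no_red = Sighting("realistic fish without red belly", has_red_belly=False, looks_like_fish=True)

for s in (rival, crude_model, no_red):
    print(f"{s.label}: {key_stimulus_response(s)}")
```

Because `looks_like_fish` never enters the decision, the crude model triggers the same response as the real rival.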

Complex Object Recognition Mechanisms

 

Other animals, especially primates and humans, have a more flexible recognition process in which many different stimuli can lead to recognition of the same object. This makes it harder to fool, and the mechanism is mainly based upon learning. It involves the recognition and discrimination of subtle criteria.

 

Humans do respond to key stimuli in some circumstances, such as babies innately choosing to watch and follow face-like patterns over non-face-like patterns.

 

This complex recognition mechanism does require learning, essentially of which configurations of stimuli are equivalent and which are distinct from one another. Babies first learn to identify their parents under all lighting and viewing conditions before starting to learn other members of the family, friends, neighbours, pets and objects. This process continues into school, where recognition of letters and numbers also begins to develop.

 

Object recognition improves with practice and is something that can be refined. Observers can be trained to identify faces from a noisy picture, and birdwatchers can train themselves to identify camouflaged birds and quickly distinguish between bird species. The effect of this training can last for years, which suggests that there must be some cortical plasticity in adults as well as in children (remember that cortical plasticity is a change in neural responses to a stimulus).

 

In order to recognise objects, humans need:

·       An adequate visual system

·       Previous recognition experience

·       Ability to recall/remember previous experiences of an object

·       Adequate amount of the object to be seen to recognise it

 

Recognition thus improves by having knowledge of the object and by practice due to the learning effect.

Alphanumeric Recognition (Letters and Numbers)

Alphanumeric recognition is much simpler than 3D object recognition, but only early work has been done in this field. There are several theories as to how this recognition actually occurs.

 

A visual demonstration of alphanumeric recognition in the 5 key stages

Demonstration of the template matching process

Template matching is where a template of each letter and number is stored in long-term memory. Incoming patterns are first standardised and normalised, so that the image is the correct size and orientation, and are then matched against the templates. This can be seen in the figure above.

There are several problems with the template matching theory, and the main one is ambiguity. This arises because of the many different fonts available: some letters and numbers may look like a different letter or number, thus confusing the visual system. An example of this is found below, where an R may look like an A and vice versa. In some cases this ambiguity is exploited: because some numbers can be read as letters, they are used to spell words on personalised number plates.

An image of a black A and R appear written over a grey R and A, demonstrating how easily similar letters can be misidentified

Ambiguity can arise in the template matching theory

Furthermore, hundreds of templates would be required to cover the different forms of each letter. Applying this to 3D objects, each object looks different from every angle, so thousands of templates would be required to recognise a single 3D object. As there are millions of 3D objects, a billion or more templates would be required to recognise them all.
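
As a rough check on these orders of magnitude, the arithmetic can be written out. The exact counts below are assumptions chosen only to match "thousands" and "millions"; they are not measured values.

```python
# Illustrative orders of magnitude only; both counts are assumptions.
templates_per_object = 1_000    # "thousands" of views of one 3D object
number_of_objects = 1_000_000   # "millions" of 3D objects

total_templates = templates_per_object * number_of_objects
print(f"{total_templates:,} templates")
```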

This theory will only work well in constrained cases, such as in automated systems for reading number plates or cheque reading machines.  It is therefore extremely unlikely that this is the object recognition mechanism in place in the human brain.
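
A constrained case of template matching can be sketched as follows. This is a minimal illustration, not a description of any real number plate reader: the 5 x 5 templates, the two-letter alphabet and the pixel-agreement score are all assumptions, and only position and size are normalised (orientation is not handled).

```python
def parse(rows):
    """Turn rows of '#'/'.' characters into a grid of 0/1 pixels."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def normalise(grid, size=5):
    """Crop to the bounding box of the ink, then rescale to size x size
    using nearest-neighbour sampling (standardises position and size)."""
    rows = [r for r, row in enumerate(grid) if any(row)]
    cols = [c for c in range(len(grid[0])) if any(row[c] for row in grid)]
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    height, width = bottom - top + 1, right - left + 1
    return [
        [grid[top + (r * height) // size][left + (c * width) // size]
         for c in range(size)]
        for r in range(size)
    ]


def similarity(a, b):
    """Fraction of pixels on which two equally sized grids agree."""
    cells = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(x == y for x, y in cells) / len(cells)


# Hypothetical stored templates, one per character.
TEMPLATES = {
    "T": parse(["#####", "..#..", "..#..", "..#..", "..#.."]),
    "L": parse(["#....", "#....", "#....", "#....", "#####"]),
}


def recognise(image):
    """Return the template label that best matches the normalised image."""
    standardised = normalise(image)
    return max(TEMPLATES, key=lambda label: similarity(standardised, TEMPLATES[label]))


# A larger "T", drawn off-centre on a 12 x 12 canvas.
big_t = parse([
    "............",
    "..##########",
    "..##########",
    "......##....",
    "......##....",
    "......##....",
    "......##....",
    "......##....",
    "......##....",
    "......##....",
    "......##....",
    "............",
])
print(recognise(big_t))
```

The sketch only works because the input uses the same letterform as the stored template. A different font would need its own template, which is the scaling problem described above.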

 

Feature Analysis

 

Feature analysis uses a combination of features to identify and recognise objects. Selfridge (1959) proposed the pandemonium model, which uses feature, cognitive and decision demons (image below) to recognise and identify the object.

A column of "feature demons" with particular features within, 26 cognitive demons of different letters and a singular decision demon. This is explained in more detail in the accompanying text.

Feature Demons are perhaps similar to simple cortical cells, and each responds to a single feature within the image (e.g. a horizontal line or an acute angle).

 

Cognitive Demons react to a combination of feature demon outputs and each cognitive demon represents a particular letter.

 

The Decision Demon determines which cognitive demon (or letter) is shouting the loudest and selects that letter as the most appropriate one.
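
The three layers of demons can be sketched in a few lines. The feature inventory and the scoring rule are assumptions made for illustration (in particular, penalising unexpected features is a choice made here, not part of the description above).

```python
# Hypothetical feature inventory: which features each letter is built from.
LETTER_FEATURES = {
    "A": {"oblique_left", "oblique_right", "horizontal_bar"},
    "H": {"vertical_left", "vertical_right", "horizontal_bar"},
    "T": {"vertical_centre", "horizontal_top"},
    "L": {"vertical_left", "horizontal_bottom"},
}


def feature_demons(image_features):
    """Each feature demon fires (1) if its single feature is present, else 0."""
    all_features = set().union(*LETTER_FEATURES.values())
    return {f: int(f in image_features) for f in all_features}


def cognitive_demons(feature_outputs):
    """Each cognitive demon 'shouts' louder the more of its letter's features
    are firing; features it does not expect reduce the shout (an assumption)."""
    shouts = {}
    for letter, wanted in LETTER_FEATURES.items():
        hits = sum(feature_outputs[f] for f in wanted)
        extras = sum(v for f, v in feature_outputs.items() if f not in wanted)
        shouts[letter] = (hits - extras) / len(wanted)
    return shouts


def decision_demon(shouts):
    """Select the letter whose cognitive demon is shouting the loudest."""
    return max(shouts, key=shouts.get)


seen = {"vertical_left", "vertical_right", "horizontal_bar"}
shouts = cognitive_demons(feature_demons(seen))
print(decision_demon(shouts))
```

Note that the sketch takes the list of features as given, so it sidesteps the question of how the features themselves are detected in the image.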

There are still problems with this model. It requires the features to be identified first (which in effect requires mini-templates), and it does not fit well with current knowledge of cortical cells: cortical cells are not detectors of a single type of feature but filters that respond to a band of orientations, spatial frequencies and movement directions.

This would also mean that it would confuse the letter "T" with "" as they both have the same features and thus would trigger the same cortical response.

It also does not provide information about differences in the type of letter shown. The model has to be able to group all of the "A"s together but also be able to recognise the different fonts of the letter A. However, this pandemonium model may fit in well with contextual evidence such as that in the picture below. The central letter is actually identical in both the top and the bottom word, but is seen as an "H" in the top word and an "A" in the bottom word. This gives evidence that top-down processing could sensitise the demons that are responsive to a contextually sensible perception.

 

A drawing of the word "The Cat" but the H of the word THE is the same as the A of the word CAT

The central letter is identical in both the top and bottom words but perceived as different
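
One way to picture this top-down sensitisation is as a contextual boost added to otherwise tied bottom-up evidence. The sketch below is an assumption-laden toy: the evidence values, the size of the boost and the two-word lexicon are all made up for illustration.

```python
# Bottom-up evidence for the ambiguous middle character: H and A tie.
BOTTOM_UP = {"H": 0.5, "A": 0.5}

# A tiny hypothetical lexicon standing in for the reader's word knowledge.
LEXICON = {"THE", "CAT"}


def contextual_boost(before, after, candidate, boost=0.3):
    """Sensitise a letter's demon if it would complete a known word."""
    return boost if before + candidate + after in LEXICON else 0.0


def perceive(before, after):
    """Combine bottom-up evidence with top-down context and pick the winner."""
    shouts = {
        letter: evidence + contextual_boost(before, after, letter)
        for letter, evidence in BOTTOM_UP.items()
    }
    return max(shouts, key=shouts.get)


print(perceive("T", "E"))  # prints H
print(perceive("C", "T"))  # prints A
```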

Structural Model

 

The structural model describes the obligatory features of the structure of the object. For instance, it states that the letter "T" must have a vertical line that supports a horizontal line towards its centre (see the image below).

4 T shapes, the first with a tall spine, the second a normal T, the third T has a mildly elongated top and the final T has an excessively long top

Several identifiable forms of the letter "T"

All of the letters in the image above are of the letter "T", as they all fit the description for the letter "T" in the structural model. All the letters have a straight, vertical line supporting a straight horizontal line. This can extend to all letters of the alphabet, numbers and foreign characters to allow for their recognition, as each alphanumerical item will have its own set of structural instructions.
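
A structural description of this kind can be sketched as a test on the relationships between strokes rather than on their exact sizes. The stroke representation, the coordinate convention and the `centre_tolerance` value are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stroke:
    """A straight stroke from (x1, y1) to (x2, y2); y increases upwards."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def is_vertical(self):
        return self.x1 == self.x2 and self.y1 != self.y2

    @property
    def is_horizontal(self):
        return self.y1 == self.y2 and self.x1 != self.x2


def is_letter_t(strokes, centre_tolerance=0.25):
    """Check the structural description of 'T': one vertical stroke whose
    top end supports one horizontal stroke near that stroke's centre."""
    verticals = [s for s in strokes if s.is_vertical]
    horizontals = [s for s in strokes if s.is_horizontal]
    if len(strokes) != 2 or len(verticals) != 1 or len(horizontals) != 1:
        return False
    stem, bar = verticals[0], horizontals[0]
    left, right = sorted((bar.x1, bar.x2))
    top_of_stem = max(stem.y1, stem.y2)
    meets_bar = top_of_stem == bar.y1 and left < stem.x1 < right
    # Offset of the junction from the bar's centre, as a fraction of bar length.
    offset = abs(stem.x1 - (left + right) / 2) / (right - left)
    return meets_bar and offset <= centre_tolerance


tall_t = [Stroke(0, 0, 0, 10), Stroke(-1, 10, 1, 10)]   # tall spine, short top
wide_t = [Stroke(0, 0, 0, 2), Stroke(-6, 2, 6, 2)]      # short spine, long top
letter_l = [Stroke(0, 0, 0, 4), Stroke(0, 0, 3, 0)]     # bar at the bottom
print(is_letter_t(tall_t), is_letter_t(wide_t), is_letter_t(letter_l))
```

Both T shapes pass despite very different proportions, because only the relationship between the two strokes is checked.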

 

3D Object Recognition

Marr and Hildreth's computational model (1980) and the feature integration theory (as proposed by Treisman in 1980) are both object recognition models, but focus more on image analysis than the process of recognition. The following theories combine the image analysis with perceptual organisation.

 

Marr and Nishihara Model

The Marr and Nishihara theory states that objects are broken down into 3D primitives, which they describe as "generalised cones". These are objects with a cross section of constant shape but of variable size along an axis (see the image below).

 

It should be noted that along the axis, all of the cross sections are of constant shape surrounding the axis, but are of different sizes.
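
The definition can be made concrete with a small sketch that sweeps a fixed cross-section along an axis while changing only its size. The straight axis, the circular cross-section and the function names are assumptions made for illustration.

```python
import math


def unit_circle(n_points=8):
    """A fixed cross-section shape: points on a circle of radius 1."""
    return [(math.cos(2 * math.pi * k / n_points),
             math.sin(2 * math.pi * k / n_points)) for k in range(n_points)]


def generalised_cone(cross_section, size_at, length, n_slices):
    """Sweep a cross-section of constant shape along a straight z-axis,
    scaling it by size_at(t) for t running from 0 to 1."""
    slices = []
    for i in range(n_slices):
        t = i / (n_slices - 1)
        scale = size_at(t)
        slices.append([(scale * x, scale * y, t * length) for x, y in cross_section])
    return slices


def radius(slice_points):
    """Distance of the first point of a slice from the axis."""
    x, y, _ = slice_points[0]
    return math.hypot(x, y)


# Cylinder: constant size. Cone: size shrinks linearly to a point.
cylinder = generalised_cone(unit_circle(), lambda t: 1.0, length=5.0, n_slices=6)
cone = generalised_cone(unit_circle(), lambda t: 1.0 - t, length=5.0, n_slices=6)

print([round(radius(s), 2) for s in cylinder])
print([round(radius(s), 2) for s in cone])
```

Swapping `unit_circle()` for another outline, or the lambda for another size function, gives other generalised cones without changing the sweep itself.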

 

Marr and Nishihara added that most natural forms achieved by growth are made up of generalised cones. The model first looks at the overall configuration of the object and then finds the axis of further generalised cones at a more local level.

A cylindrical shape with a central axis

These 3D models with different levels of specificity are "3D model descriptions". When a 3D model description matches one in the catalogue of 3D model descriptions held in human memory, the object is recognised. Top-down processing then refines the analysis of the image.

 

Biederman's Recognition by Components Theory

 

Biederman modified the theory to cover more than just generalised cones.  He stated that complex 3D objects could be broken down into 36 basic 3D shapes called geometric ions ("geons" for short).  He then stated that objects are recognised by their specific combinations of geons.

 

Geons can be identified from any viewpoint or even if they are partially obscured.  Therefore the theory has "viewpoint invariance", meaning as long as you recognise the geons then you will recognise the object.  However, in cases where basic geons are obscured or hidden due to a strange viewing angle, then recognition is more difficult. An example of this would be looking down at an hourglass and just seeing the circular component instead of the two conical glass chambers. 
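
Recognition by combinations of geons, and the difficulty caused by hidden geons, can be sketched as a simple overlap score. The geon names, the three stored objects and the scoring rule are hypothetical and are not taken from Biederman's own list.

```python
# Hypothetical geon descriptions: (geon type, where it is attached).
OBJECT_GEONS = {
    "mug": {("cylinder", "body"), ("curved_tube", "side")},
    "bucket": {("cylinder", "body"), ("curved_tube", "top")},
    "torch": {("cylinder", "body"), ("cone", "end")},
}


def recognise_by_geons(visible_geons):
    """Return the best-matching stored objects and their overlap score,
    where the score is the fraction of an object's geons that are visible."""
    scores = {name: len(geons & visible_geons) / len(geons)
              for name, geons in OBJECT_GEONS.items()}
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if score == best), best


full_view = {("cylinder", "body"), ("curved_tube", "side")}
obscured_view = {("cylinder", "body")}   # the second geon is hidden

print(recognise_by_geons(full_view))
print(recognise_by_geons(obscured_view))
```

With both geons visible there is a single clear winner; with one geon hidden, all three stored objects tie, which mirrors the harder recognition described for unusual viewing angles.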

Evidence for Geon-Based Recognition

There are several factors that provide evidence for geon-based recognition.  We will now look at these in more detail.

 

Repetition Priming

Repetition priming example where the second presentation shows similar features to allow recognition

Demonstration of repetition priming

Repetition priming is where subjects will name an object more quickly the second time, if they have previously been shown the same object. If the second example were to use different geons, then the priming effect would be reduced even if the images are of the same type of object. If the same geons are preserved then priming is unaffected by changes in size, small changes in viewpoint or in location.

 

This is demonstrated in the figure above.

 

Object Matching

 

In 1993, Cooper and Biederman asked subjects whether two successively presented objects were the same type of object or different. They found that when geons were maintained, the subjects responded faster and more accurately, finding the task relatively easy. When the geons were changed, the subjects were slower and found the task much harder. When the geon types were maintained but the sizes of the geons were altered, both accuracy and speed were much less affected. Again this is best demonstrated visually, as in the image below.

Demonstration of object matching

 

Neurophysiological Evidence

It has also been found that neurons in area IT (inferotemporal cortex) of the monkey respond selectively to simple geometric shapes, which may correspond to the forms of the geons. Changing the viewing distance or the viewing angle of the simple geometric shape did not affect the output of the cell.

 

View-Based Object Recognition Theories

 

There are several theories that are based upon viewpoint. Biederman's theory and the theory by Marr and Nishihara both suggest that the recognition process is unaffected by viewpoint. In everyday experience, however, recognition is often much quicker and more accurate when the object is viewed from a familiar viewpoint.

 

View-based recognition theories suggest that we store multiple viewpoints of objects in our memories but place more emphasis on certain orientations. On encountering an object, we look for the most similar neural representation in memory, and if no identical representation is available, one is interpolated from the closest match.

 

We will have a preferred (canonical) viewpoint for every object and this is the viewpoint at which recognition is the fastest. If we train the visual system to recognise an object at different angles, the speed of recognition improves.
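
The effect of stored views on recognition speed can be pictured with a toy cost model. This is only a sketch: the linear relationship and the values of `base_ms` and `ms_per_degree` are made-up assumptions, not measured data, and the model captures only the distance to the closest stored view rather than the interpolation itself.

```python
def angular_distance(a, b):
    """Smallest difference between two viewing angles, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def recognition_time(view_angle, stored_views, base_ms=400.0, ms_per_degree=2.0):
    """Toy cost model: time grows with the distance to the closest stored view.
    base_ms and ms_per_degree are illustrative numbers only."""
    nearest = min(angular_distance(view_angle, v) for v in stored_views)
    return base_ms + ms_per_degree * nearest


stored = [0, 90]                         # 0 degrees plays the role of the canonical view
before = recognition_time(180, stored)   # unfamiliar view, far from any stored view
stored.append(180)                       # "training" stores the new view
after = recognition_time(180, stored)
print(before, after)
```

In this model recognition is fastest at a stored view, and adding a trained view removes the extra cost for that angle.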

 

Unfortunately, these theories have some limitations, as they all assume that we are motionless, whereas movement is actually very important in object recognition because it provides additional information about the object. We also recognise patterns of movement associated with people and objects to aid in their recognition (such as recognising somebody by their walk).

 

These models only explain recognition of basic classes of objects. Identifying and distinguishing between different faces, breeds of animal or types of pen requires a more complete explanation.

 

THIS CONCLUDES THE UNIT ON OBJECT RECOGNITION
