The human visual attention system is fascinating in how it guides the gaze through a scene, whether in a movie or in the supermarket aisle as we walk through it. To appreciate why it is so interesting, consider that the human eye has only a tiny region that it sees in really high resolution. This area is called the fovea; it is about 0.3 mm across and covers a region about the size of a thumbnail held at arm's length from you. Everything outside this region is seen in low resolution. Yet when we look around, we see everything in what appears to be perfect detail. How is this possible?
Well, this is possible because your eyes are constantly moving and filling in the gaps in the information you have about the world. These movements are called saccades, and they give different parts of the scene or image time under the fovea's high-precision image processing. Although the fovea covers only about 1% of the retina, it takes up more than 50% of the processing power in the human visual cortex.
So the big question in computer vision is: what is the control mechanism behind these movements? If we can model it accurately, we open up fantastic applications in robotics, medicine, design, art and many more. We could optimize the computational resources on a robot so that it only processes the most relevant image data. We could use it to better design ads and product placements in a shop to ensure they catch the shopper's eye.
There are two main schools of thought behind the control of the human gaze:
- There is something salient within the scene that draws the attention. For example, bright colours and yellows tend to draw attention more than dull colours, or motion draws us to pay attention to something that is moving while everything else is still (see the sketch after this list).
- The eyes move with the intent to learn more about the scene itself. We don't know what is there, so the brain needs to fill in that part of its world model by gathering information about it. Here eye movement is controlled by task and intent.
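
To make the first, bottom-up idea concrete, here is a minimal saliency-style sketch in Python with OpenCV and NumPy. It is purely illustrative and not part of anything I built – the colour/motion weighting is arbitrary – but it shows how "bright, saturated or moving" can be turned into a single map whose peak is a candidate gaze target.

```python
import cv2
import numpy as np

def saliency_map(frame_bgr, prev_gray=None):
    """Crude bottom-up saliency: bright/saturated colour plus frame-to-frame motion."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    # Bright, saturated regions score highly (both channels normalised to [0, 1]).
    colour_score = (hsv[:, :, 1] / 255.0) * (hsv[:, :, 2] / 255.0)

    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    if prev_gray is not None:
        # Absolute frame difference as a cheap stand-in for motion saliency.
        motion_score = cv2.absdiff(gray, prev_gray).astype(np.float32) / 255.0
    else:
        motion_score = np.zeros_like(colour_score)

    # Equal weighting is arbitrary; a real model would tune or learn this.
    return 0.5 * colour_score + 0.5 * motion_score, gray

def most_salient_point(sal):
    """Return the (x, y) pixel at the peak of the saliency map."""
    y, x = np.unravel_index(np.argmax(sal), sal.shape)
    return int(x), int(y)
```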
So I decided to capture data on how my own gaze is drawn to things happening in a scene. To do this I built this rather stylish contraption – a bike helmet with an LCD screen at the front and a webcam pointing at my eye. Okay, not very stylish, but it's a very cheap prototype.

That little black clip is holding the CMOS chip of a webcam pointed straight at my left eye. Below is a screenshot of what it sees. Note that it isn't a very crisp image: there are regions of shadow and light, as well as a reflection of the external light source from the retina itself.

I then used the awesome software written by the guys at OpenEyes. They had written it to work with infrared cameras (which I didn't have), so I tweaked the software to add a pre-processing step that works with natural light rather than infrared. Infrared makes life so much easier.
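
The gist of that pre-processing was along these lines. This is only a hedged sketch of the kind of natural-light clean-up involved, assuming a dark-pupil threshold approach; the blur size and threshold value are illustrative, not tuned values, and the result still goes on to the pupil tracker proper.

```python
import cv2

def preprocess_eye_frame(frame_bgr, pupil_thresh=40):
    """Illustrative natural-light clean-up before pupil detection.

    Smooths noise, equalises contrast, then thresholds the darkest
    region as a pupil candidate (values here are not tuned).
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)   # suppress sensor noise and small glints
    gray = cv2.equalizeHist(gray)    # spread shadows and highlights apart
    # Under natural light the pupil is usually the darkest blob in the image.
    _, pupil_mask = cv2.threshold(gray, pupil_thresh, 255, cv2.THRESH_BINARY_INV)
    return gray, pupil_mask
```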
Then, after some calibration (which maps the tracked eye position to the spot I was looking at on the LCD screen in front of me), it was time to sit back and watch some videos while recording where my gaze went.
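
Calibration itself boils down to fitting a mapping from tracked eye coordinates to screen coordinates. Here is a minimal sketch of one common way to do this – a second-order polynomial fit by least squares – though I'm not claiming this is exactly what the OpenEyes code does internally.

```python
import numpy as np

def fit_calibration(eye_pts, screen_pts):
    """Fit a second-order polynomial mapping eye (x, y) -> screen (x, y).

    eye_pts, screen_pts: (N, 2) arrays of corresponding samples collected
    while fixating known targets on the LCD.
    """
    x, y = eye_pts[:, 0], eye_pts[:, 1]
    # Design matrix with constant, linear, cross and quadratic terms.
    A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    coeffs, *_ = np.linalg.lstsq(A, screen_pts, rcond=None)
    return coeffs  # shape (6, 2)

def apply_calibration(coeffs, eye_xy):
    """Map a single tracked eye position to a predicted screen position."""
    x, y = eye_xy
    features = np.array([1.0, x, y, x * y, x**2, y**2])
    return features @ coeffs  # (screen_x, screen_y)
```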
Once I had the data, I looked at whether there were some simple algorithms that could predict the gaze on the acquired data. I came up with 4 very simple algorithms that attempted to predict where the gaze would be at a particular point in the scene. Without going into details, they were (rough sketches of the first three follow the list):
- Maintain – tries to maintain the current trajectory of the gaze as it moves across the scene.
- Skin – tries to find regions in the scene that are likely to be skin tones (and therefore people), using a pixel-level classifier trained incrementally using Ripple Down Rules.
- Motion – tries to find regions of the highest degree of motion with respect to previous frames.
- RDR – a Ripple Down Rules based predictor (i.e. an incrementally trained decision tree) that tries to pick between the three predictors above depending on the scene and historical properties.
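
For the curious, here are rough sketches of the first three predictors. These are simplified stand-ins, not the actual code I ran – in particular, the Skin sketch uses a fixed HSV range rather than the incrementally trained Ripple Down Rules classifier, and the RDR meta-predictor isn't sketched at all since it depends on incrementally acquired rules.

```python
import cv2
import numpy as np

def predict_maintain(prev_gaze, prev_prev_gaze):
    """Maintain: linearly extrapolate the previous gaze trajectory."""
    dx = prev_gaze[0] - prev_prev_gaze[0]
    dy = prev_gaze[1] - prev_prev_gaze[1]
    return prev_gaze[0] + dx, prev_gaze[1] + dy

def predict_skin(frame_bgr):
    """Skin: centroid of skin-coloured pixels (fixed HSV range as a simple
    stand-in for the incrementally trained classifier)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([25, 180, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        h, w = mask.shape
        return w // 2, h // 2  # no skin found: fall back to frame centre
    return int(xs.mean()), int(ys.mean())

def predict_motion(frame_gray, prev_gray):
    """Motion: location of the strongest frame-to-frame change."""
    diff = cv2.absdiff(frame_gray, prev_gray)
    diff = cv2.GaussianBlur(diff, (21, 21), 0)  # favour coherent moving regions
    y, x = np.unravel_index(np.argmax(diff), diff.shape)
    return int(x), int(y)
```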
Below is a screenshot of the system in action. The big yellow plus (+) is where I was actually focused while watching the movie. Note that none of this was visible to me while I was capturing my own gaze. Then, when I replayed the video, I let the 4 algorithms attempt to predict my gaze, marked by the smaller pluses and boxes annotated with the algorithm numbers.

You can watch the short video for yourself here:
Although no single algorithm was good enough to predict the gaze, there were some intuitive indications that Motion and Skin tended to do better. Then again, this might be biased because the scene I had used had people standing around and talking, with only the odd bit of movement. The most interesting part was working out how to create a training paradigm for RDR with temporal data. The underlying processing pipeline was also interesting and later gave rise to my ProcessNet work [book chapter, PKAW paper].
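
One simple way to quantify "doing better" – not necessarily the criterion I used, since my impressions were largely intuitive – is the mean pixel distance between each predictor's output and the recorded gaze, frame by frame. A minimal sketch (the variable names in the usage comment are hypothetical):

```python
import numpy as np

def mean_gaze_error(predicted, recorded):
    """Mean Euclidean distance (pixels) between predicted and recorded gaze.

    predicted, recorded: (N, 2) arrays of per-frame gaze coordinates.
    """
    diff = np.asarray(predicted, dtype=float) - np.asarray(recorded, dtype=float)
    return float(np.linalg.norm(diff, axis=1).mean())

# Hypothetical usage, one score per predictor:
# errors = {name: mean_gaze_error(preds[name], recorded_gaze) for name in preds}
```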