This paper describes how the performance of Sticky by Tobii’s webcam eye tracking algorithm has been validated through live eye tracking studies using online panels. Anonymous respondents from the US, India, UK, France, Germany and Sweden were exposed through the browser on their desktop to a randomized list of stimuli images with known fixed locations on their screens while their gaze was recorded through their front facing cameras.
The study included 116 usable sessions and 68 unusable sessions, for a usable rate of 63%. The results from this study show that Sticky’s average radial gaze error in a real-world (non-lab) environment is on average 6% of the screen. This error is composed of a 4% error in the x-direction (as a percentage of screen width) and a 7% error in the y-direction (as a percentage of screen height). The error is defined in relation to the stimulus area as described in the results section and measured across a range of participant desktop setups.
This is more than accurate enough to differentiate which elements people are looking at on a web page, package, image, advertisement, or other stimulus. The resulting conclusion is that fully automated desktop eye tracking using online panels can produce acceptable quality results for research and optimization of content and advertising.
To measure the accuracy of Sticky’s eye tracking across desktop setups, the experiment was conducted in a realistic, non-lab environment. Respondents were tasked with focusing on 20 predetermined images, with each image displayed for 3 seconds. The predetermined images were sized at 10% of the screen height and displayed on top of a white background.
A small face was shown as the stimulus, with a number counting down from 3 to 1. Participants were told to look at the image.
Panelists were recruited through an online panel and took the test on their own desktop setups. At the time of recruitment they were located in the United States, India, the UK, Germany, France and Sweden.
To start the test, participants clicked on a link which was presented by the online panel company which distributed the test. The test took about 1.5 minutes. Presentation of stimuli was controlled by the Sticky system with timing, randomization, instructions and calibration all happening in an automated fashion.
Data for the test was automatically streamed to the Sticky cloud as it was recorded. It was then processed for completeness and the eye tracking calibration quality was validated. Sessions that did not match quality standards were removed.
Validation measures how much the gaze prediction differs from the ground-truth value at the validation points; if the difference exceeds a threshold value of 0.18, the session is considered unusable. There is also validation based on real-time head detection: every 4 seconds during the experiment the system checks whether the head is missing in 80% of the frames, and if so the participant is rejected. Together these form an automatic validation of the accuracy of each session.
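The two checks above can be sketched in a few lines of Python. This is a hypothetical illustration of the described logic, not Sticky's actual implementation; the function names, data layout and the 15 Hz default frame rate are assumptions.

```python
# Hypothetical sketch of the two session-validation checks described above;
# names and data layout are assumptions, not Sticky's actual API.

VALIDATION_ERROR_THRESHOLD = 0.18   # max allowed gaze-vs-ground-truth difference
HEAD_MISSING_RATIO = 0.80           # reject if head missing in 80% of frames...
CHECK_WINDOW_SECONDS = 4            # ...within any 4-second window

def session_is_usable(validation_errors, head_detected_flags, frame_rate_hz=15):
    """Return True if a session passes both validation checks.

    validation_errors: per-validation-point differences between predicted
        gaze and ground truth (normalized screen units).
    head_detected_flags: one boolean per recorded frame.
    """
    # Check 1: gaze prediction error on the validation points.
    mean_error = sum(validation_errors) / len(validation_errors)
    if mean_error > VALIDATION_ERROR_THRESHOLD:
        return False

    # Check 2: head detection in every 4-second window of frames.
    window = int(CHECK_WINDOW_SECONDS * frame_rate_hz)
    for start in range(0, len(head_detected_flags), window):
        chunk = head_detected_flags[start:start + window]
        missing = sum(1 for detected in chunk if not detected)
        if missing >= HEAD_MISSING_RATIO * len(chunk):
            return False
    return True
```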
The data is then accessible through the Sticky API and web portal for easy analysis and access, both on the individual and aggregated levels.
Gaze error metric importance
Accuracy, the average difference between the real gaze position and the measured gaze position, is one of the most important metrics for benchmarking the performance of an eye tracking solution. A system with good accuracy provides more valid data, as it truthfully describes the location of a person’s gaze on a screen. It is therefore important to measure a system’s accuracy in order to evaluate the relevance of the eye tracking data the solution produces.
Gaze error metric
The error for a single estimated gaze point is calculated as the distance in percentage between the gaze point and the closest boundary edge of the corresponding image. Each image was displayed for 3 seconds and the gaze error was calculated for each session for the last 2 seconds, allowing time for the participant to fixate on the image after it loaded.
This metric was chosen because of the long duration the participants were asked to fixate and the high probability that they would not gaze only at the center of the images, which would otherwise result in an underestimate of the system’s accuracy. This metric was also deemed more useful for determining the minimum AOI size to use within studies.
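The per-gaze-point error described above can be computed as the distance from the gaze point to the stimulus rectangle, zero when the point lands inside the image. The sketch below assumes coordinates normalized to [0, 1] of the screen; the exact normalization Sticky uses is not specified here.

```python
# Illustrative computation of the per-gaze-point error metric: the distance
# from a gaze point to the closest boundary edge of the stimulus image,
# zero if the point lands inside the image. Coordinates are assumed to be
# normalized to [0, 1] of the screen (an assumption for this sketch).
import math

def gaze_error(gaze_x, gaze_y, img_left, img_top, img_right, img_bottom):
    """Distance from (gaze_x, gaze_y) to the image rectangle, 0 if inside."""
    dx = max(img_left - gaze_x, 0.0, gaze_x - img_right)
    dy = max(img_top - gaze_y, 0.0, gaze_y - img_bottom)
    return math.hypot(dx, dy)

# A gaze point inside the image has zero error:
print(gaze_error(0.5, 0.5, 0.45, 0.45, 0.55, 0.55))   # 0.0
# A point 0.1 to the right of the image edge has an error of ~10% of screen:
print(gaze_error(0.65, 0.5, 0.45, 0.45, 0.55, 0.55))
```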
The following tables present the accuracy when using the recommended sample size of 40 participants.
Accuracy statistics for areas of interest (AOI)
Gaze heat maps
The following tables examine the effect of sample size on accuracy.
Convergence with larger sample sizes
At 40 sessions, the confidence interval of the mean radial error is ± 0.9%. Thus, if this test is repeated with 40 respondents or more, there is a 95% chance that the resulting mean radial error is 6.0% ± 0.9%.
At 40 sessions, the standard deviation (SD) of the mean time viewed is 0.04 seconds for an area of interest that occupied 20% of the screen height. This gives a confidence interval in the mean time viewed of ± 0.08 seconds. Thus, if this test is repeated with 40 respondents or more with this AOI, there is a 95% chance that the resulting mean time viewed is 0.7 seconds ±0.08 seconds.
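The confidence intervals quoted above follow from standard arithmetic (1.96 standard errors under a normal approximation). In the sketch below, the 3% radial-error SD comes from the results section of this paper; note that for time viewed, the text quotes the SD of the mean (the standard error) directly, so its CI is just 1.96 × 0.04 s.

```python
# Sketch of how the 95% confidence intervals above are derived
# (1.96 standard errors, normal approximation). SD values are from the text.
import math

def ci95_half_width(sample_sd, n):
    """Half-width of a 95% CI for the mean, normal approximation."""
    return 1.96 * sample_sd / math.sqrt(n)

# Radial error: session-level SD of 3% of screen, n = 40 -> roughly +/- 0.9%.
print(round(ci95_half_width(3.0, 40), 1))

# Time viewed: the 0.04 s quoted is already the SD of the mean (standard
# error), so the half-width is simply 1.96 * 0.04 ~= 0.08 s.
print(round(1.96 * 0.04, 2))
```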
In general, the experimental method assumes that all respondents were gazing only at the center of the 10% viewport image during the last 2 seconds of each stimulus. The method also assumes that respondents adhered to the experimental procedure and did not significantly change their environment during the experiment; for example, no excessive head movement or changing from an indoor to an outdoor setting. Gaze points where respondents occasionally looked elsewhere, and sessions where the participant did not fully follow the behavioural instructions, have not been removed from the dataset. Those gaze points will of course impact the mean errors.
Therefore, the conclusion can be made that the potential accuracy of the algorithm is higher than reported; however, this white paper reports on the effective accuracy likely to be seen when using the product.
Minimal AOI size
One of the main motivations of this white paper is to recommend minimum AOI sizes within experiments; however, this is difficult because device viewport sizes and resolutions change significantly between participants. To simplify recommendations, this white paper expresses minimum AOI size as a percentage of the smaller screen dimension (assumed to be the height for desktop experiments). In live usage we aim to accurately register at least 70% of all respondents gazing at a particular area of interest and capture approximately one third of the theoretical maximal time spent on that area. To achieve this with Sticky’s algorithm, the minimum area of interest in an experiment should be at least 20% of the screen height.
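The recommendation can be turned into pixels for a given participant viewport. A quick illustration, where the 0.20 factor comes from the recommendation above and the function name is hypothetical:

```python
# Turning the "20% of the smaller screen dimension" recommendation into
# pixels for a given participant viewport (function name is illustrative).

def min_aoi_side_px(viewport_w, viewport_h, fraction=0.20):
    """Minimum recommended AOI side length in pixels."""
    return fraction * min(viewport_w, viewport_h)

# For a common 1920x1080 desktop viewport, the minimum AOI side is 216 px:
print(min_aoi_side_px(1920, 1080))  # 216.0
```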
Minimal sample size
Due to the reasons discussed in the methodology section above, in an experiment we can expect the accuracy of several sessions to be decreased by participant behaviour. We therefore recommend using a minimum of 40 participant sessions. Further details of why 40 participants are recommended can be found in the appendix (Recommended number of experiment participants).
Confidence in the mean error
The average radial gaze error in Sticky’s algorithm is 6% of the screen, with a 3% standard deviation in the sample (116 sessions).
Furthermore, at 40 sessions, there is a confidence interval in the mean radial error of ± 0.9 %. Thus, if this test is repeated with 40 respondents or more, there is a 95% chance that the resulting average radial error is 6% ± 0.9%.
Sticky’s algorithm automatically estimates the quality of each session and removes sessions not suitable for eye tracking. The most common reasons for this are:
- Poor respondent attention: calibration dots are not followed, or stimuli are not looked at.
- Poor lighting conditions: too dark, or too strong background light.
- Large head movements.
- Light reflections in glasses.
- Poor video resolution due to a slow internet connection.
Weakness of Sticky’s algorithm
The largest weakness of Sticky’s algorithm is its lower robustness compared to a hardware IR eye tracker in a controlled environment. However, this is a problem with all existing webcam eye trackers, and controlled environments are often not good representations of normal behavior. The lower robustness can be compensated for with oversampling, since recording sessions is much cheaper using webcams and online panels.
Providing gaze data streams
The Sticky platform does not apply any fixation filter to resolve gaze data; we deliver raw gaze data. Fixation filters used and evaluated in the past have proven to be of limited use, since front-facing cameras provide too few data points (on average 15 Hz). However, with a relevant sample size (approximately 40 usable sessions) the technology produces a normal data distribution whose error can be characterized by an average deviation (a deviation normally measured in degrees, but also as a percentage of the screen or in pixels). We do apply a median filter, a type of noise-reduction filtering, to our predicted raw gaze output.
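A median filter of the kind mentioned above can be sketched in a few lines. This is a minimal illustration applied to one gaze coordinate; the window size and edge handling are assumptions, not Sticky's actual parameters.

```python
# Minimal sliding-window median filter of the kind described above, applied
# to one gaze coordinate; window size and edge handling are illustrative.
import statistics

def median_filter(samples, window=3):
    """Replace each sample with the median of its surrounding window."""
    half = window // 2
    filtered = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        filtered.append(statistics.median(samples[lo:hi]))
    return filtered

# A single-sample spike (0.95) is suppressed while the signal is preserved:
print(median_filter([0.50, 0.51, 0.95, 0.52, 0.53]))
```

Applied independently to the x and y streams, this removes single-frame outliers without smearing gaze across time the way an averaging filter would.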
The Sticky platform only counts AOI hits if the gaze data is within the AOI boundaries. If the gaze is outside the AOI it won’t be allocated to the AOI during the analysis.
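The hit rule above amounts to a point-in-rectangle test per gaze sample. A simple illustration, where the rectangle representation and names are assumptions for the sketch:

```python
# Simple illustration of the AOI hit rule described above: a gaze sample
# counts toward an AOI only if it falls within the AOI's boundaries.
# The rectangle representation and names are assumptions for this sketch.

def aoi_hit(gaze_x, gaze_y, aoi):
    """True if the gaze point lies inside the AOI rectangle (inclusive)."""
    left, top, right, bottom = aoi
    return left <= gaze_x <= right and top <= gaze_y <= bottom

banner = (0.10, 0.05, 0.90, 0.25)    # hypothetical AOI in normalized coords
print(aoi_hit(0.50, 0.10, banner))   # True  - inside, counted
print(aoi_hit(0.50, 0.40, banner))   # False - outside, not allocated
```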
The data provided by the platform is based on sessions meeting a minimum quality threshold, i.e. all sessions have passed a data validation check where significant outliers are identified and removed (classified as “unusable”).
Recommended number of experiment participants
Providing recommendations on participant sample size is complicated: increasing the sample size increases statistical significance, while the accuracy requirements of a specific experiment and the quality of the recorded sessions mean one size does not fit all. Here we have based our general recommendation on two things.
- Accuracy requirements we have placed on our sample metrics.
- Quantitative evidence from this white paper on the benefits of increasing sample size.
A minimum threshold was put on the confidence the system should give us across two different sample statistics: the radial error and the time viewed. The confidence interval in the radial error was required to be less than 1% of the screen, and the confidence interval in the time viewed to be less than 5% of the total 2-second viewable time. These requirements were met at 40 sessions, with a 95% confidence interval in mean radial error of 6% ± 1% and a 95% confidence interval in the time viewed of 0.6 ± 0.08 seconds.
By plotting the confidence in the above sample statistics for different sample sizes, we can see the relative benefit of increasing the sample size. If an individual experiment requires higher confidence, you may want to increase the sample size; however, Sticky’s recommendation is that, given the increased cost of larger sample sizes, there are diminishing returns after 40 participants for most studies.
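The diminishing-returns argument can be made concrete by tabulating the 95% CI half-width for the mean radial error at different sample sizes, using the 3% session-level SD reported in this paper (the half-width shrinks only with the square root of n):

```python
# Rough illustration of diminishing returns: the 95% CI half-width for the
# mean radial error shrinks with 1/sqrt(n), using the 3% session-level SD
# reported in this paper.
import math

sd = 3.0  # radial-error SD across sessions, in % of screen
for n in (10, 20, 40, 80, 160):
    half_width = 1.96 * sd / math.sqrt(n)
    print(f"n={n:3d}: +/- {half_width:.2f}% of screen")
```

Doubling the sample from 40 to 80 sessions narrows the interval by only about 0.3 percentage points, which is the quantitative basis for the 40-session recommendation.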
What is a suitable margin to avoid or minimize bleed-over?
Eye tracking of any kind will always produce data with a certain level of inaccuracy. This can result from participant gaze behaviour, changes in posture, light conditions, wearing glasses, and the ability of the eye tracking algorithms to cope with all these variations. It means that the system will record multiple gaze point locations, rather than a single one, when a person is staring at the same spot.
When you combine AOI measurements for a crowded stimulus such as a shelf, and your AOI boundaries are relatively close to each other, it is likely that you will have false positive and false negative hits, i.e. some gaze points will end up in another AOI or be recorded outside of the target AOI. The following article provides an interesting read on this topic: https://marketingexperiments.com/a-b-testing/type-i-ii-errors-defined
Awareness of this challenge helps you decide which research approach to choose: wanting to capture all, or most, of the gaze plots would lead you to a “Sensitive” approach, whereas ensuring that only relevant plots within an AOI are captured would lead you to a “Selective” approach. A careful choice between these two alternatives gives you a strategy for how to build and set up your AOIs.
(Start watching 28 minutes into the program).
The separation of AOIs is in any case a very relevant and interesting question. There is no standard way to calculate the separation of stimuli, or how the data can be manipulated once recorded. Instead, you aim to create AOIs and experiments that repeatedly, at a high level, generate consistent results.
The same webinar referred to above shows an approach for deciding the shape and separation of AOIs based on the average error, and for how you should place your stimuli (products).
Study setup and stimuli used
After the initial eye calibration, the main study consisted of 20 consecutive stimuli, in randomized order, each showing a small image on top of a white background. Each stimulus was displayed for 3 seconds. The stimuli were randomly chosen from 7 different face stimuli, and their screen locations were distributed to cover the entire available browsing area. This formed the basis for calculating the error in the eye tracking.
Link to experiment: