Information Retrieval for Children: Search Behavior and Solutions

I was awarded the degree of PhD (cum laude) after successfully defending my PhD thesis at the University of Twente on February 14, 2014.

The first contribution of this thesis provides a large-scale characterization of the search behavior of young users. The problems they face when they search for information on the web, the topics they search for and the online activities that motivate search were explored in detail and contrasted against the search behavior of adult users. The results presented in this thesis have important implications for the development of search tools for young users and for the design of educational literacy. Two central problems were identified in the search process of young users: (1) difficulty representing their information needs with keyword queries, and (2) difficulty exploring the list of results.

We found that focused queries are often required to access high quality content for young users with modern search engines. However, young users were found to submit queries that lack the specificity needed to retrieve content that is suitable for them, which leads to frustration during the search process. This observation motivates the second contribution of this thesis. We propose novel query recommendation methods to improve the chances of young users finding content that is suitable and on topic. Concretely, we present an effective biased random walk based on information gain metrics. This method is combined with topical and specialized features designed for the information domain of young users. We show that our query suggestions outperform by a large margin not only related query recommendation methods but also the query suggestions offered by the search services available today.
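To make the idea of a biased random walk over candidate queries more tangible, here is a minimal Python sketch. It is not the method from the thesis: the query graph, the `bias` weights (a hypothetical stand-in for an information-gain style score), the restart probability and all function names are assumptions chosen purely for illustration.

```python
def biased_random_walk(graph, bias, start, restart=0.15, iterations=50):
    """Approximate visiting probabilities of a walk with restart from `start`.

    graph: dict mapping a query to a list of neighbouring queries.
    bias:  dict mapping a query to a positive weight (assumed stand-in for
           an information-gain style score); heavier neighbours are visited
           more often.
    """
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    prob = {n: 0.0 for n in nodes}
    prob[start] = 1.0
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for node, mass in prob.items():
            if mass == 0.0:
                continue
            nbrs = graph.get(node, [])
            total = sum(bias.get(n, 1.0) for n in nbrs)
            if not nbrs or total == 0:
                nxt[start] += mass  # dead end: jump back to the start query
                continue
            nxt[start] += restart * mass  # restart step
            for n in nbrs:
                # Move to a neighbour with probability proportional to its bias.
                nxt[n] += (1 - restart) * mass * bias.get(n, 1.0) / total
        prob = nxt
    return prob


def suggest(graph, bias, query, k=3):
    """Return the k queries most visited by the walk, excluding the start."""
    scores = biased_random_walk(graph, bias, query)
    ranked = sorted((n for n in scores if n != query),
                    key=lambda n: (-scores[n], n))
    return ranked[:k]


# Hypothetical toy data: a short query and three more focused reformulations.
toy_graph = {
    "dogs": ["dog games", "dog breeds for kids", "dogs for sale"],
    "dog games": ["dogs"],
    "dog breeds for kids": ["dogs"],
    "dogs for sale": ["dogs"],
}
toy_bias = {
    "dogs": 1.0,
    "dog games": 2.0,
    "dog breeds for kids": 3.0,
    "dogs for sale": 0.5,
}

toy_scores = biased_random_walk(toy_graph, toy_bias, "dogs")
toy_suggestions = suggest(toy_graph, toy_bias, "dogs")
print(toy_suggestions)
```

In this toy setting the suggestions are simply ordered by their bias weight, because every candidate is one step away from the start query. On a larger graph the walk would also reward candidates that are reachable through several paths.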

With respect to the second difficulty, we found that young users have a strong click bias: results ranked at the bottom of the result list are rarely clicked. This behavior greatly hampers their navigation and their exploration of results. It also reduces the chances of young users finding suitable information, since appropriate content for this audience is ranked, on average, at lower positions in the result list than content aimed at the average web user.

The third contribution of this thesis aims at helping young users to improve their chances of finding appropriate content and to ease the exploration of results. For this purpose, we envisage an aggregated search system in which parents, teachers and young users add search services with content of interest to young audiences. We propose a test collection with a large number of verticals with moderated content, a carefully selected set of search queries and vertical relevance judgments. We also provide novel methods of vertical selection in this information domain based on social media and on the estimation of the amount of content that is appropriate for young users in each vertical. We show that our methods outperform state-of-the-art vertical selection methods in this information domain. We also show in a case study with children aged 9 to 10 that result pages derived from the proposed collection are preferred over the result pages provided by modern search engines. We provide evidence that the interaction with and exploration of results are improved with result pages built using this collection, even though the users in this case study were unaware of the differences between the types of pages displayed to them.

This thesis is concluded by providing concrete follow-up research directions and by suggesting other information domains that can potentially benefit from the methods proposed in the thesis. My thesis is available online here.

Analysis of Search and Browsing Behavior of Young Users on the Web

by Sergio Duarte Torres, Ingmar Weber and Djoerd Hiemstra 

In this journal paper we extend the study presented in the paper What and How Children Search on the Web in two directions. First, we provide a more detailed analysis of the topics that children search for on a state-of-the-art search engine, using a novel classification based on fine-grained topics derived from the categories of the Yahoo! Answers service. The findings of this analysis allow us to provide concrete recommendations for the development of modern IR systems for young users in specific age ranges.


Second, we employed toolbar logs from the Yahoo! search engine to characterize the browsing behavior of young users, particularly to understand the activities on the Internet that trigger search. We quantified the proportion of browsing and search activity in the toolbar sessions, and we estimated the likelihood that a user carries out a search on the Web vertical or on the multimedia verticals (i.e., videos and images) given that the previous event is another search event or a browsing event. We found that certain groups of young users are more likely to carry out multimedia search and that certain browsing events, such as visits to knowledge-related websites (e.g. Wikipedia), are more likely to trigger web search.
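As an illustration of the kind of estimate described above, the following Python sketch computes the likelihood of each event type given the previous event in a session by simple counting. The event labels, the session layout and the function name are hypothetical and do not reflect the actual toolbar log format or the estimation procedure used in the paper.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next event | previous event) by counting consecutive pairs.

    sessions: list of sessions, each a list of event labels in time order.
    """
    counts = defaultdict(Counter)
    for events in sessions:
        for prev, nxt in zip(events, events[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(following.values()) for nxt, c in following.items()}
        for prev, following in counts.items()
    }


# Hypothetical toy sessions with three assumed event types.
toy_sessions = [
    ["browse", "web_search", "web_search", "browse"],
    ["browse", "multimedia_search", "browse", "web_search"],
]

toy_transitions = transition_probabilities(toy_sessions)
for prev, following in sorted(toy_transitions.items()):
    for nxt, p in sorted(following.items()):
        print(f"P({nxt} | {prev}) = {p:.2f}")
```

Computing the same table separately per age group would allow a comparison such as the one reported above, for example whether a browsing event is more often followed by multimedia search for one group than for another.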

Published in ACM Transactions on the Web (TWEB), March 2014, Volume 8, Issue 2. Read the paper.

What and how children search on the web

by Sergio Duarte Torres and Ingmar Weber 

The Internet has become an important part of the daily life of children as a source of information and leisure activities. Nonetheless, given that most of the content available on the web is aimed at the general public, children are constantly exposed to inappropriate content, either because the language goes beyond their reading skills, because their attention span differs from that of grown-ups, or simply because the content is not targeted at children, as is the case with ads and adult content. In this work we employed a large query log sample from a commercial web search engine to identify the struggles and search behavior of users ranging from children aged 6 to young adults aged 18. Concretely, we hypothesized that the large and complex volume of information to which children are exposed leads to ill-defined searches and to disorientation during the search process. To test this hypothesis, we quantified their search difficulties based on query metrics (e.g. fraction of queries posed in natural language), session metrics (e.g. fraction of abandoned sessions) and click activity (e.g. fraction of ad clicks). We also used the search logs to retrace stages of child development. Concretely, we looked for changes in user interests (e.g. distribution of topics searched), language development (e.g. readability of the content accessed) and cognitive development (e.g. sentiment expressed in the queries) among children and adults. We observed that these metrics clearly indicate an increased level of confusion and a higher share of unsuccessful search sessions among children. We also found a clear relation between the reading level of the clicked pages and demographic characteristics of the users, such as age and the average educational attainment of the zone in which the user is located. Read the paper.
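The three example metrics mentioned in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the session layout, the click labels and especially the heuristic for spotting natural-language queries (a leading question word or a trailing question mark) are hypothetical and are not the definitions used in the paper.

```python
# Assumed heuristic vocabulary for detecting natural-language queries.
QUESTION_WORDS = {"what", "how", "why", "who", "where", "when", "which"}


def looks_like_natural_language(query):
    """Heuristic: the query starts with a question word or ends with '?'."""
    words = query.lower().split()
    return bool(words) and (words[0] in QUESTION_WORDS
                            or query.strip().endswith("?"))


def fraction(part, whole):
    """Return part / whole, or 0.0 when there is nothing to count."""
    return part / whole if whole else 0.0


def search_difficulty_metrics(sessions):
    """Compute query, session and click metrics over a list of sessions.

    Each session is a dict with "queries" (list of strings) and
    "clicks" (list of labels, here "ad" or "organic").
    """
    queries = [q for s in sessions for q in s["queries"]]
    clicks = [c for s in sessions for c in s["clicks"]]
    return {
        "natural_language_query_fraction": fraction(
            sum(looks_like_natural_language(q) for q in queries), len(queries)),
        # A session without any click is treated as abandoned (an assumption).
        "abandoned_session_fraction": fraction(
            sum(1 for s in sessions if not s["clicks"]), len(sessions)),
        "ad_click_fraction": fraction(
            sum(1 for c in clicks if c == "ad"), len(clicks)),
    }


# Hypothetical toy log.
toy_log = [
    {"queries": ["how do volcanoes erupt", "volcano"], "clicks": ["organic"]},
    {"queries": ["games"], "clicks": []},
    {"queries": ["why is the sky blue?", "sky"],
     "clicks": ["ad", "organic", "organic"]},
    {"queries": ["horses"], "clicks": []},
]

toy_metrics = search_difficulty_metrics(toy_log)
print(toy_metrics)
```

Grouping sessions by the age of the user before calling `search_difficulty_metrics` would give one set of values per age group, which is the kind of comparison the abstract describes.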