Information retrieval for children: Search Behavior and Solutions

I was awarded a PhD (cum laude) after successfully defending my thesis at the University of Twente on February 14, 2014.

The first contribution of this thesis is a large-scale characterization of the search behavior of young users. The problems they face when they search for information on the web, the topics they search for, and the online activities that motivate search were explored in detail and contrasted with the search behavior of adult users. The results presented in this thesis have important implications for the development of search tools for young users and for the design of educational literacy. Two central problems were identified in the search process of young users: (1) difficulty representing their information needs with keyword queries, and (2) difficulty exploring the list of results.

We found that focused queries are often required to access high-quality content for young users with modern search engines. However, young users were found to submit queries that lack the specificity needed to retrieve content that is suitable for them, which leads to frustration during the search process. This observation motivates the second contribution of this thesis. We propose novel query recommendation methods to improve the chances that young users find content that is suitable and on topic. Concretely, we present an effective biased random walk based on information gain metrics. This method is combined with topical and specialized features designed for the information domain of young users. We show that our query suggestions outperform, by a large margin, not only related query recommendation methods but also the query suggestions offered by the search services available today.

With respect to the second difficulty, we found that young users have a strong click bias: results ranked at the bottom of the result list are rarely clicked. This behavior greatly hampers their navigation and exploration of results. It also reduces the chances that young users find suitable information, since appropriate content for this audience is ranked, on average, at lower positions in the result list than content aimed at the average web user.

The third contribution of this thesis aims at helping young users improve their chances of finding appropriate content and at easing the exploration of results. For this purpose, we envisage an aggregated search system in which parents, teachers and young users add search services with content of interest for young audiences. We propose a test collection with a large number of verticals with moderated content, a carefully selected set of search queries, and vertical relevance judgments. We also provide novel methods of vertical selection in this information domain, based on social media and on the estimation of the amount of content that is appropriate for young users in each vertical. We show that our methods outperform state-of-the-art vertical selection methods in this information domain. We also show in a case study with children aged 9 to 10 years old that result pages derived from the proposed collection are preferred over the result pages provided by modern search engines. We provide evidence showing that the interaction with and exploration of results are improved with result pages built using this collection, even though the users in this case study were unaware of the differences between the types of pages displayed to them.

The thesis concludes with concrete follow-up research directions and suggests other information domains that could benefit from the proposed methods. My thesis is available online here.

Analysis of Search and Browsing Behavior of Young Users on the Web

by Sergio Duarte Torres, Ingmar Weber and Djoerd Hiemstra 

In this journal paper we expanded the study presented in the paper What and How Children Search on the Web in two directions. Firstly, we provide a more detailed analysis of the topics that children search for on a state-of-the-art search engine, using a novel classification based on fine-grained topics derived from the categories of the Yahoo! Answers service. The findings obtained through this analysis allow us to provide concrete recommendations for the development of modern IR systems for young users in specific age ranges.


Secondly, we employed toolbar logs from the Yahoo! search engine to characterize the browsing behavior of young users, particularly to understand the activities on the Internet that trigger search. We quantified the proportion of browsing and search activity in the toolbar sessions, and we estimated the likelihood that a user carries out a search on the Web vertical or on the multimedia verticals (i.e., videos and images) given that the previous event is another search event or a browsing event. We found that certain groups of young users are more likely to carry out multimedia search and that certain browsing events, such as visits to knowledge-related websites (e.g., Wikipedia), are more likely to trigger web search.
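As a rough illustration of the kind of estimate described above, the following minimal sketch computes the relative frequency of each event type given the previous event type. The event labels, the toy sessions and the function name are assumptions made for this example; they are not taken from the paper or from the toolbar logs.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next event | previous event) by relative frequency.

    `sessions` is a list of sessions, each an ordered list of event labels.
    """
    counts = defaultdict(Counter)
    for events in sessions:
        # Count every pair of consecutive events within a session.
        for previous, following in zip(events, events[1:]):
            counts[previous][following] += 1
    return {
        previous: {
            following: n / sum(counter.values())
            for following, n in counter.items()
        }
        for previous, counter in counts.items()
    }


# Hypothetical toy sessions (illustrative labels only).
sessions = [
    ["browse", "web_search", "browse"],
    ["browse", "multimedia_search", "multimedia_search"],
    ["web_search", "web_search", "browse"],
]
probs = transition_probabilities(sessions)
print(probs["browse"])
```

In this toy data, a browsing event is followed by a web search in half of the observed transitions. A real analysis would also need to handle session boundaries and sparse counts.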

Published in ACM TWEB, March 2014, Volume 8, Issue 2. Read the paper.

Query Recommendation in the Domain of Information for Children

by Sergio Duarte Torres, Djoerd Hiemstra, Ingmar Weber, Pavel Serdyukov. 

Children represent an increasing share of web users. Among the key problems that hamper their search experience are their limited vocabulary, their difficulty in choosing the right keywords, and the inappropriateness of general-purpose query suggestions. In this journal paper, we expanded the biased random walk introduced in our paper Query Recommendation for Children by combining the score of the random walk with topical and language modeling features to further emphasize the child-related aspects of the query suggestions.


We evaluate our methods using a large sample of queries submitted by children (from the Yahoo! Search logs). We show that our method outperforms, by a large margin, the query suggestions of modern search engines and state-of-the-art query suggestions based on random walks.
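To give an idea of what a biased random walk over a query graph can look like, here is a minimal sketch of a random walk with restart in which each transition is reweighted by a per-query bias. It is a generic illustration, not the method of the paper: the toy graph, the bias values (standing in for some child-orientedness score), the restart probability and the function name are all assumptions.

```python
def biased_random_walk(graph, bias, start, restart=0.15, iterations=50):
    """Score queries with a random walk with restart from `start`.

    graph: dict mapping each query to a dict {neighbor: edge weight};
           every neighbor must also be a key of `graph`.
    bias:  dict mapping a query to a positive weight; transitions into
           queries with a higher bias become more likely.
    """
    nodes = list(graph)
    transitions = {}
    for u in nodes:
        # Reweight each outgoing edge by the bias of its target, then normalize.
        weighted = {v: w * bias.get(v, 1.0) for v, w in graph[u].items()}
        total = sum(weighted.values())
        transitions[u] = (
            {v: w / total for v, w in weighted.items()} if total > 0 else {}
        )

    scores = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iterations):
        updated = {n: (restart if n == start else 0.0) for n in nodes}
        for u in nodes:
            if not transitions[u]:
                # A query without outgoing edges sends its mass back to the start.
                updated[start] += (1 - restart) * scores[u]
                continue
            for v, p in transitions[u].items():
                updated[v] += (1 - restart) * scores[u] * p
        scores = updated
    return scores


# Hypothetical toy query graph and bias values.
graph = {
    "dinosaurs": {"dinosaurs for kids": 1.0, "dinosaur fossils auction": 1.0},
    "dinosaurs for kids": {"dinosaurs": 1.0},
    "dinosaur fossils auction": {"dinosaurs": 1.0},
}
bias = {"dinosaurs for kids": 3.0}
scores = biased_random_walk(graph, bias, start="dinosaurs")
print(sorted(scores, key=scores.get, reverse=True))
```

With equal edge weights, the bias alone makes the child-oriented suggestion score higher than the other neighbor of the start query.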

Published in JASIST, February 2014. Read the paper.

Vertical Selection in the Information Domain of Children

by Sergio Duarte Torres, Djoerd Hiemstra and Theo Huibers 

In this paper we explore vertical selection methods in aggregated search in the specific domain of topics for children between 7 and 12 years old. We built a test collection consisting of 25 verticals, 3.8K queries, and relevance assessments that map relevant verticals to queries for a large sample of these queries. We gather relevance assessments by envisaging two aggregated search systems: one in which the Web vertical is always displayed, and one in which each vertical is assessed independently of the Web vertical. We show that both approaches lead to a different set of relevant verticals and that the former is prone to a bias toward visually oriented verticals. In the second part of this paper we estimate the size of the verticals for the target domain. We show that employing the global size and domain-specific size estimates of the verticals leads to significant improvements when using state-of-the-art methods of vertical selection. We also introduce a novel vertical and query representation based on tags from social media, and we show that its use leads to significant performance gains. Read the paper.

This paper has been nominated for the best student paper award at JCDL 2013.

Cross-lingual alignment and completion of Wikipedia templates

by Gosse Bouma, Sergio Duarte Torres and Zahurul Islam. 

For many languages, the size of Wikipedia is an order of magnitude smaller than the English Wikipedia. We present a method for cross-lingual alignment of template and infobox attributes in Wikipedia. The alignment is used to add and complete templates and infoboxes in one language with information derived from Wikipedia in another language. We show that alignment between English and Dutch Wikipedia is accurate and that the result can be used to expand the number of template attribute-value pairs in Dutch Wikipedia by 50%. Furthermore, the alignment provides valuable information for normalization of template and attribute names and can be used to detect potential inconsistencies. Read the paper.

Visual Exploration of Health Information for Children

by Frans van der Sluis, Sergio Duarte Torres, Djoerd Hiemstra, Betsy van Dijk, Frea Kruisinga 

Children experience several difficulties retrieving information using current Information Retrieval (IR) systems. In particular, children struggle to find the right keywords to construct queries, given their lack of domain knowledge. This problem is even more critical in the specialized health domain. In this work we present a novel method to address this problem using a cross-media search interface in which the textual data is searched through visual images. This solution aims to solve the recall and recognition problem, which is salient for health information, by replacing the need for a vocabulary with the easy task of recognising the different body parts. Read the paper.

A Novel Image Encryption Scheme Based on a Generalized Chinese Remainder Theorem

by Sergio Duarte Torres, David Becerra Romero, Luis Niño and Yoan Pinzon.

In this paper, a novel method for image encryption based on a Generalized Chinese Remainder Theorem (GCRT) is presented. The proposed method is based on the work developed by Jagannathan et al. Some modifications are proposed in order to increase the method’s encryption quality and its robustness against attacks. Specifically, the inclusion of a vector to reduce the segment pixel space and a GCRT algorithm are proposed. The vectors are generated randomly, which allows them to be used as private keys together with the unrestricted key values generated by the GCRT algorithm. A system in which the RGB channels are encrypted independently is also analyzed. Experiments carried out to validate the proposed model gave very promising results. Read the paper.
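For readers unfamiliar with the underlying number theory, the sketch below solves a system of congruences whose moduli need not be pairwise coprime, which is one common way of generalizing the Chinese Remainder Theorem. It is not the encryption scheme of the paper, and the GCRT variant used there may differ; the function names and the sample values are assumptions made for illustration.

```python
def extended_gcd(a, b):
    """Return (g, s, t) such that g = gcd(a, b) and a*s + b*t == g."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t


def generalized_crt(residues, moduli):
    """Solve x = r_i (mod m_i) for positive moduli that may share factors.

    Returns (x, lcm) with 0 <= x < lcm, or None if the system has no solution.
    """
    x, m = 0, 1
    for r, n in zip(residues, moduli):
        g, s, _ = extended_gcd(m, n)
        if (r - x) % g != 0:
            return None  # The congruences contradict each other.
        lcm = m // g * n
        # m*s is congruent to g modulo n, so adding m*s*((r - x)/g) keeps
        # x unchanged modulo m while moving it onto residue r modulo n.
        x = (x + m * s * ((r - x) // g)) % lcm
        m = lcm
    return x, m


print(generalized_crt([2, 4], [4, 6]))  # x = 2 (mod 4) and x = 4 (mod 6)
print(generalized_crt([1, 2], [4, 6]))  # inconsistent system
```

The first call returns the pair (10, 12): 10 leaves remainder 2 modulo 4 and remainder 4 modulo 6, and the solution is unique modulo 12. The second call returns None because no integer is both odd and congruent to 2 modulo 6.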

A Model for Resource Assignment to Transit Routes in Bogota Transportation System Transmilenio

by Sergio Duarte Torres, David Becerra Romero and Luis Niño.

In this work, a model based on genetic algorithms, queueing theory and graph theory for route planning in a mass transportation system is presented. The most important features of the proposed approach are: i) the modeling of the Americas line in the mass transportation system Transmilenio in Bogota; ii) data preprocessing using graph theory to characterize the shortest routes between all possible combinations of source and destination stations; iii) the optimization of travel time by route assignment using genetic algorithms; and iv) the simulation of events using the Poisson and Erlang distributions, corresponding to bus arrivals at specific stations and to users' waiting times. An experimental methodology was developed to validate the proposed approach. Read the paper (in Spanish).
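The preprocessing step that characterizes the shortest routes between all pairs of stations can be sketched with the Floyd-Warshall algorithm. The abstract does not say which shortest-path algorithm was used, so this is only one possible choice; the station names, the travel times and the assumption that links are symmetric are made up for the example.

```python
import math


def all_pairs_shortest_times(stations, travel_times):
    """Floyd-Warshall over an undirected station graph.

    travel_times: dict mapping (station_a, station_b) to a travel time.
    Returns a nested dict dist[a][b] with the shortest travel time.
    """
    dist = {
        a: {b: (0 if a == b else math.inf) for b in stations}
        for a in stations
    }
    for (a, b), t in travel_times.items():
        # Assumption: a link can be travelled in both directions.
        dist[a][b] = min(dist[a][b], t)
        dist[b][a] = min(dist[b][a], t)
    for k in stations:
        for i in stations:
            for j in stations:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Hypothetical stations and travel times in minutes.
stations = ["A", "B", "C", "D"]
travel_times = {("A", "B"): 3, ("B", "C"): 4, ("A", "C"): 10, ("C", "D"): 2}
dist = all_pairs_shortest_times(stations, travel_times)
print(dist["A"]["D"])
```

Here the direct link from A to C (10 minutes) is slower than going through B (7 minutes), so the shortest time from A to D is 9 minutes. The resulting table could then be looked up by an optimizer instead of being recomputed for every candidate route assignment.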

A novel ab-initio genetic-based approach for protein folding prediction

by Sergio Duarte Torres, David Becerra, Luis Niño and Yoan Pinzon. 

In this paper, a model based on genetic algorithms for protein folding prediction is proposed. The most important features of the proposed approach are: i) heuristic secondary structure information is used in the initialization of the genetic algorithm; ii) an enhanced 3D spatial representation called cube-octahedron is used, and an expansion technique is proposed in order to reduce the computational complexity and spatial constraints; iii) geometric features are preprocessed to characterize the cube-octahedron using twelve basic vectors to define the nodes. Additionally, biological information (torsion angles, bond angles and secondary structure conformations) was preprocessed through an analysis of all possible combinations of the basic vectors that satisfy the biological constraints defined by the spatial representation; and iv) hashing techniques were used to improve the computational efficiency. The preprocessed information was stored in hash tables, which are used intensively by the genetic algorithm. Experiments carried out to validate the proposed model gave very promising results. Read the paper.
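As an illustration of the twelve basic vectors, the sketch below assumes the standard cuboctahedron coordinates, that is, all permutations of (±1, ±1, 0), and uses them to place a chain on the lattice and check that it does not revisit a node. The coordinate convention, the function names and the move encoding are assumptions; the paper may define its vectors and constraints differently.

```python
from itertools import product


def cuboctahedron_vectors():
    """Return the 12 vectors that are permutations of (+-1, +-1, 0)."""
    vectors = set()
    for zero_axis in range(3):
        for a, b in product((1, -1), repeat=2):
            v = [a, b]
            v.insert(zero_axis, 0)  # Put the zero on the chosen axis.
            vectors.add(tuple(v))
    return sorted(vectors)


def place_chain(moves, vectors):
    """Place chain nodes by following one basic vector per move index."""
    position = (0, 0, 0)
    positions = [position]
    for index in moves:
        step = vectors[index]
        position = tuple(p + s for p, s in zip(position, step))
        positions.append(position)
    return positions


def is_self_avoiding(positions):
    """A valid conformation never occupies the same lattice node twice."""
    return len(set(positions)) == len(positions)


VECTORS = cuboctahedron_vectors()
print(len(VECTORS))
print(is_self_avoiding(place_chain([11, 11], VECTORS)))
print(is_self_avoiding(place_chain([11, 0], VECTORS)))
```

In the sorted list, index 11 is (1, 1, 0) and index 0 is (-1, -1, 0), so the move sequence [11, 0] returns to the origin and is rejected, while [11, 11] is a valid straight segment. A genetic algorithm could use a sequence of such indices as its chromosome, with lookups of precomputed valid combinations kept in a dictionary.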