It was a well-known fact that the current practice of publishing research results in robotics made it extremely difficult not only to compare results of different approaches, but also to assess the quality of the research presented by the authors. Though this may not be the case for purely theoretical articles, when researchers claim that their particular algorithm or system is capable of achieving some performance, those claims are typically unverifiable, either because the system is unique to them or because of a lack of experimental details, including working hypotheses. Papers published in robotics journals and generally considered good would often not meet the minimum requirements in domains in which good practice calls for a detailed section describing the materials and experimental methods that support the authors' claims. This is, of course, partly due to the very nature of robotics research: reported results are tested by solving a limited set of specific examples in different types of scenarios, using different underlying software libraries and incompatible problem representations, and implemented by different people using different hardware, including computers, sensors, arms, grippers...
This state of affairs cannot be changed in the short term, but in the last four years some steps have been taken in the right direction by studying the ways in which research results in robotics can be better reported, assessed and compared. In this context, EURON has played an important role by fostering systematic benchmarking and good experimental practice in robotics research. The long-term benefits of these efforts are evident: not only will they raise the overall quality of research results, but they will also improve publication opportunities for EU-based research, thereby increasing the international visibility of European research and leading to rapid adoption of new research results by application developers and the robotics industry.
Curiously enough, the situation described above coexisted with the fact that some of the most popular organized events in robotics are related to comparative research: the successful robot competitions organized in recent years are a way of comparing the performance of the competing systems by means of very well-defined rules and metrics. The organization of these scientific competitions has proven to be a quick way to attract substantial research efforts and to rapidly produce high-quality working solutions.
When considering the role of EURON in relation to these issues, setting up a task force to define a set of gold standards in robotics by itself was not considered a feasible approach given the limited resources available. To mention just one well-known example: DARPA and NSF funded a study about a very particular field in robotics, namely human-robot interaction (HRI). Even for this reduced field, over sixty representatives from academia, government and industry participated in the study, and one of the recommendations regarding actions for the next 5 years concluded that the HRI field is still too new to set milestones or benchmarks [Burke et al. 04]. Even so, some grand challenges were proposed. Grand challenges are interesting as long-term goals, but they are usually vaguely described, derived from a roadmap of the field, and not very useful for measuring progress or comparing results. Nevertheless, benchmarks could be conceived as a way of measuring progress toward a grand challenge.
Defining a benchmark, even a sound and valid one, could be an easy task if it is taken just as an academic exercise. Defining a successful benchmark is something completely different. A benchmark can be considered successful if the community uses it extensively in publications, conferences and reports as a way of measuring and comparing results. To put it in a few words: a benchmark is successful if and only if it is widely accepted by the community at which it is targeted.
This kind of success is somewhat difficult to predict, but some of the following considerations may help:
Reaching consensus does take time: proposing good benchmarks for the community to accept is a long process that requires the participation of many people in many subcommunities within robotics. Consequently, the role of EURON in this context was not to define benchmarks, but rather to propose and encourage:
Since this work has been carried out in the framework of FET Beyond Robotics and IST Cognition Unit, we have been more concerned with non-industrial scenarios. This report builds on previous work developed in EURON I [Dillmann 04], in which benchmarks in industry were discussed. In general the industrial situation seems to be different, since industry can provide the resources to measure whatever features it desires in a robot. Industrial players can not only develop their own benchmarks, but have even organized competitions: a famous example is the one held in March 1996, when Ford U.S.A. organized a competition for an order of 400 welding robots, with the result that the KUKA robots could solve the benchmark problems considerably faster and more smoothly than the robots of the main competitor. KUKA won this contract and since then all Ford European plants have been equipped exclusively with KUKA robots [Ford Competition].
In order to attain the above-mentioned goals, namely successful robotics benchmarks in the medium term, the following actions were identified:
More concretely, we planned a series of discussions and refinements in parallel actions, similar to the procedure described in [Burke et al. 04], as a continuous process of convergence towards consensus in order to ensure community-wide acceptance:
This website describes the up-to-date results of action (a) above in the form of the final version of an exhaustive, detailed survey and inventory of existing efforts in comparative research: competitions, benchmarks, challenges, repositories, conferences, etc. This is the result of a long, ongoing process of information gathering, with material either obtained from different sources or kindly provided by a number of persons. Most of this information was previously unavailable, scattered across different sources or mixed with irrelevant material. After a process of selection and rewriting, it is regularly updated and made available to the community on this website.
Actions (b) and (c) have been addressed through a number of meetings, workshops, and discussions, whether in person or by email. The main overall result is a considerable increase in the awareness of the importance of robotics benchmarking in Europe. This has resulted in a number of ongoing initiatives in Europe towards defining benchmarking scenarios, the current results of which are included in this report. The degree of completion varies among the different initiatives: some have already been available to the community for months, whereas others are more embryonic, as described in the corresponding sections. Some of them are based on simulations only, and data sets are made available defining objects, robots and scenarios in standard format descriptions. When moving beyond simulations to real hardware in the real world, computer data sets are not enough and various solutions are put forward. Some of them are based on specific hardware that is shared by remote access, whereas others describe experimental protocols to be shared in the verification of diverse approaches to the same problem.