Revenue management (RM) is a complicated business process that can best be described as the control of sales (using prices, restrictions, or capacity), usually with software as a tool to aid decisions. RM software can play a merely informative role, supplying formatted and summarized data to analysts, who use it to make control decisions (setting a price or allocating capacity to a price point); or, at the other extreme, it can play a deeper role and automate the decision process completely. The RM models and algorithms in the academic literature by and large concentrate on the latter, completely automated, level of functionality. A firm considering a new RM model or RM system needs to evaluate its performance. Academic papers justify the performance of their models using simulations, in which customer booking requests are generated according to some process and model, and the revenue performance of the algorithm is compared to that of an alternate set of algorithms. Such simulations, while an accepted part of the academic literature and a genuine source of research insight, often lack credibility with management. Even methodologically they are usually flawed, as they test only "within-model" performance and say nothing about the appropriateness of the model in the first place. Even simulations that test against alternate models or competition are limited by the inherent need to fix some model as the universe for their testing. These problems are exacerbated for RM models that attempt to model customer purchase behavior or competition, as the right models for competitive actions or customer purchases remain somewhat of a mystery, or at least there is no consensus on their validity. How, then, to validate a model? Put another way, we want to show that a particular model or algorithm is the cause of a certain improvement to the RM process compared to the existing process.
We take care to emphasize that we want to establish the said model as the cause of the performance, and to compare against an (incumbent) process rather than against an alternate model. In this paper we describe a "live" testing experiment that we conducted at Iberia Airlines on a set of flights. A set of competing algorithms controlled a set of flights during adjacent weeks, and their behavior and results were observed over a relatively long period of time (9 months). In parallel, a group of control flights was managed using the traditional mix of manual and algorithmic control (the incumbent system). Such "sandbox" testing, while common at many large internet search and e-commerce companies, is relatively rare in the revenue management area. Sandbox testing has an indisputable model of customer behavior (real customers making real purchases), but the experimental design and the analysis of results are less clear. In this paper we describe the philosophy behind the experiment, the organizational challenges, and the design and setup of the experiment, and we outline the analysis of the results. This paper complements a related, more technical paper that describes the econometrics and statistical analysis of the results.