A rigorous evaluation of five global Chemistry-Transport Models and two Chemistry-Climate Models operated by several different groups in Europe was performed. The models were compared with trace gas observations from a number of research aircraft measurement campaigns during the four-year period 1995-1998. Whenever possible, the models were run over the same four-year period and, at each simulation time step, the instantaneous tracer fields were interpolated to all coinciding observation points. This approach allows for a very close comparison with observations and fully accounts for the specific meteorological conditions during the measurement flights. This is important considering the often limited availability and representativeness of such trace gas measurements. A new extensive database including all major research and commercial aircraft measurements between 1995 and 1998, as well as ozone soundings, was established specifically to support this type of direct comparison. Quantitative methods were applied to judge model performance, including the calculation of average concentration biases and the visualization of correlations and root-mean-square (RMS) errors in the form of so-called Taylor diagrams. We present the general concepts applied, the structure and content of the database, and an overall analysis of model skill over four distinct regions. These regions were selected to represent various atmospheric conditions and to cover geographical domains large enough that sufficient observations are available for comparison. The comparison of model results with the observations revealed specific problems for each individual model. This study points to the further improvements needed and serves as a benchmark for re-evaluating the models once such improvements have been made. In general, all models show deficiencies with respect to both mean concentrations and vertical gradients of important trace gases, including ozone, CO and NOx at the tropopause.
Excessive two-way mixing across the tropopause is suggested to be the main reason for differences between simulated and observed CO and ozone values. The generally poor correlations between simulated and measured NOx values suggest that, in particular, the NOx input by lightning and the convective transport from the polluted boundary layer are still not well described by current parameterizations, which may lead to significant differences in the spatial and seasonal distribution of NOx in the models. Simulated OH concentrations, on the other hand, were found to be in surprisingly good agreement with measured values.
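The quantitative measures named above (average concentration bias, correlation, and the RMS error displayed in a Taylor diagram) can be illustrated with a minimal sketch. It assumes model values have already been interpolated to the observation points, so that `model` and `obs` are paired samples. The function name `taylor_statistics`, the dictionary keys, and the use of population (divide-by-n) standard deviations are illustrative assumptions, not the procedure used in the study.

```python
import math


def taylor_statistics(model, obs):
    """Return the bias and the three quantities a Taylor diagram displays.

    `model` and `obs` are equal-length sequences of paired concentrations
    (hypothetical inputs: model values sampled at the observation points).
    """
    if len(model) != len(obs) or len(obs) < 2:
        raise ValueError("need at least two paired model/observation values")

    n = len(obs)
    mean_model = sum(model) / n
    mean_obs = sum(obs) / n

    # Average concentration bias: mean model value minus mean observation.
    bias = mean_model - mean_obs

    # Anomalies about the respective means.
    anom_model = [m - mean_model for m in model]
    anom_obs = [o - mean_obs for o in obs]

    # Population standard deviations (divide by n), an assumption made here
    # so that the law-of-cosines relation below holds exactly.
    std_model = math.sqrt(sum(a * a for a in anom_model) / n)
    std_obs = math.sqrt(sum(a * a for a in anom_obs) / n)
    if std_model == 0.0 or std_obs == 0.0:
        raise ValueError("correlation is undefined for a constant series")

    # Pearson correlation coefficient between model and observations.
    covariance = sum(a * b for a, b in zip(anom_model, anom_obs)) / n
    correlation = covariance / (std_model * std_obs)

    # Centered (bias-removed) RMS difference.
    centered_rmsd = math.sqrt(
        sum((a - b) ** 2 for a, b in zip(anom_model, anom_obs)) / n
    )

    return {
        "bias": bias,
        "std_model": std_model,
        "std_obs": std_obs,
        "correlation": correlation,
        "centered_rmsd": centered_rmsd,
    }
```

A Taylor diagram can show all three statistics in a single plot because they satisfy centered_rmsd² = std_model² + std_obs² − 2 · std_model · std_obs · correlation. The bias is removed before the RMS difference is computed and is therefore reported separately.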