The use of wearable inertial sensors for physical activity classification has increased dramatically over the last decade, owing to their versatility, small form factor, and low power requirements. Consequently, various systems have been developed to automatically classify activities of daily life. However, the scope and implementation of such systems are limited to laboratory-based investigations. Furthermore, these systems are not directly comparable because their designs differ widely (e.g., number of sensors, sensor placement, data collection environments, data processing techniques, feature sets, classifiers, cross-validation methods). Hence, the aim of this study is to propose a fair and unbiased benchmark for the field-based validation of three existing systems, highlighting the gap between laboratory and real-life conditions. For this purpose, three representative state-of-the-art systems are chosen and implemented to classify the physical activities of twenty older subjects (76.4 ± 5.6 years). Performance in classifying four basic activities of daily life (sitting, standing, walking, and lying) is analyzed in controlled and free-living conditions. To observe how laboratory-based systems perform in field-based conditions, we trained the activity classification systems on data recorded in a laboratory environment and tested them in real-life conditions in the field. The findings show that the performance of all systems trained on laboratory data deteriorates markedly when they are tested in real-life conditions, highlighting the need to train and test classification systems in the real-life setting. Moreover, we tested the sensitivity of the chosen systems to window size (from 1 s to 10 s); the results suggest that overall accuracy decreases with increasing window size. Finally, to evaluate the impact of the number of sensors on performance, the chosen systems were modified to use only the sensing unit worn at the lower back. As in the multi-sensor setup, the results indicate a substantial degradation in performance when laboratory-trained systems are tested in the real-life setting, and this degradation is larger than in the multi-sensor setup. Still, the performance of the single-sensor approach, when trained and tested on real-life data, can be acceptable (with an accuracy above 80%).
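The window-size analysis mentioned above can be illustrated with a minimal sketch. It is not the implementation of any of the three benchmarked systems: the function names `segment` and `accuracy`, the use of non-overlapping windows, and the majority-vote rule for assigning one label per window are all assumptions made for illustration only.

```python
from collections import Counter


def segment(samples, labels, fs_hz, window_s):
    """Split a labelled recording into non-overlapping windows.

    Assumptions (not taken from the source): windows do not overlap, a
    trailing partial window is dropped, and each window receives the most
    frequent label among its samples (ties go to the label seen first).
    Returns a list of (window_samples, window_label) pairs.
    """
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    size = int(round(fs_hz * window_s))  # window length in samples
    if size <= 0:
        raise ValueError("window must contain at least one sample")
    windows = []
    for start in range(0, len(samples) - size + 1, size):
        chunk = samples[start:start + size]
        chunk_labels = labels[start:start + size]
        majority = Counter(chunk_labels).most_common(1)[0][0]
        windows.append((chunk, majority))
    return windows


def accuracy(true_labels, predicted_labels):
    """Fraction of windows whose predicted label equals the true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)
```

In a laboratory-to-field evaluation of the kind described here, the laboratory recordings would be segmented with `segment` to build the training set, the field recordings segmented with the same `window_s` to build the test set, and `accuracy` computed on the field windows for each window size from 1 s to 10 s. The sampling rate passed as `fs_hz` is a placeholder to be set to that of the actual sensors.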