C. Chamzas, M. Lippi, M. C. Welle, A. Varava, A. Marino, L. E. Kavraki, and D. Kragic, “State Representations in Robotics: Identifying Relevant Factors of Variation using Weak Supervision,” in NeurIPS, 3rd Robot Learning Workshop: Grounding Machine Learning Development in the Real World, 2020.
Representation learning allows planning actions directly from raw observations. Variational Autoencoders (VAEs) and their modifications are often used to learn latent state representations from high-dimensional observations, such as images of the scene. This approach uses the similarity between observations in image space as a proxy for the similarity between the underlying states of the system. We argue that, despite some successful implementations, this approach is not applicable in the general case, where observations contain task-irrelevant factors of variation. We compare different methods of learning latent representations for a box stacking task and show that models with weak supervision, such as Siamese networks with a simple contrastive loss, produce more useful representations for the final downstream manipulation task than traditionally used autoencoders.
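To make the weak-supervision idea concrete, the following is a minimal sketch of a margin-based contrastive loss of the kind commonly paired with Siamese networks. It is not taken from the paper: the function name `contrastive_loss`, the `margin` parameter, and the exact form of the loss (squared distance for pairs labeled as the same state, squared hinge on the margin for pairs labeled as different states) are assumptions chosen for illustration, and the latent codes are assumed to have already been produced by some encoder.

```python
import math


def contrastive_loss(pairs, margin=1.0):
    """Average margin-based contrastive loss over labeled pairs (illustrative sketch).

    pairs:  iterable of (z_a, z_b, same_state), where z_a and z_b are latent
            codes (sequences of floats) and same_state is True when the two
            observations are labeled as showing the same underlying state.
    margin: hypothetical hyperparameter; pairs labeled as different states are
            penalized only while their latent distance is below this value.
    """
    total = 0.0
    count = 0
    for z_a, z_b, same_state in pairs:
        dist = math.dist(z_a, z_b)  # Euclidean distance in latent space
        if same_state:
            # Pull codes of the same state together.
            total += dist ** 2
        else:
            # Push codes of different states apart, up to the margin.
            total += max(0.0, margin - dist) ** 2
        count += 1
    if count == 0:
        raise ValueError("at least one pair is required")
    return total / count


# Two pairs at latent distance 0.5: one labeled same-state, one different-state.
example_pairs = [
    ((0.0, 0.0), (0.3, 0.4), True),
    ((0.0, 0.0), (0.3, 0.4), False),
]
example_loss = contrastive_loss(example_pairs, margin=1.0)
```

The point of the sketch is that the supervision signal is only a binary same-state or different-state label per pair, so two images that differ in task-irrelevant factors can still be pulled together in latent space, which a reconstruction loss in image space does not encourage.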