0) Motivation, Objectives and Related Work:
1) Global Framework:
Figure 1. Given a monocular image and its predicted 3D pose, a human can easily tell whether the prediction is anthropometrically plausible or not (as shown in b), based on the perceived image-pose correspondence and the range of human poses permitted by articulation.
We simulate this human perception with an adversarial learning framework: the discriminator is trained to distinguish ground-truth poses (c) from the poses predicted by the pose estimator (a, b), which in turn is pushed to generate plausible poses even on unannotated in-the-wild data.
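The adversarial setup described in this caption can be illustrated with a small sketch. It assumes the standard binary cross-entropy GAN objective, which may differ from the loss actually used in the paper; the function names (`bce`, `discriminator_loss`, `estimator_adversarial_loss`) are hypothetical, and the discriminator outputs are taken to be probabilities that a pose is ground truth.

```python
import math


def bce(prob, label):
    """Binary cross-entropy for one probability and a 0/1 label."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def discriminator_loss(scores_real, scores_pred):
    """Assumed objective: score ground-truth poses as 1, predicted poses as 0."""
    real_term = sum(bce(p, 1.0) for p in scores_real) / len(scores_real)
    pred_term = sum(bce(p, 0.0) for p in scores_pred) / len(scores_pred)
    return real_term + pred_term


def estimator_adversarial_loss(scores_pred):
    """Assumed objective: the pose estimator wants its poses scored as 1."""
    return sum(bce(p, 1.0) for p in scores_pred) / len(scores_pred)


if __name__ == "__main__":
    # Illustrative discriminator outputs, not data from the paper.
    real = [0.9, 0.8]
    pred = [0.2, 0.1]
    print("discriminator loss:", discriminator_loss(real, pred))
    print("estimator adversarial loss:", estimator_adversarial_loss(pred))
```

Because the estimator's adversarial term needs no 3D annotation, it can in principle be evaluated on unannotated in-the-wild images, which is the property the caption points to.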
Figure 2. The multi-source architecture. It takes three information sources: the image, the geometric descriptor, and the heatmaps and depth maps. The three sources are embedded separately and then concatenated to decide whether the input is a ground-truth pose or an estimated pose.
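The embed-then-concatenate structure in this caption can be sketched as follows. This is a toy illustration in plain Python, not the paper's network: each source is assumed to be already flattened into a feature vector, each branch is a single randomly initialised linear layer with ReLU, and the class name `MultiSourceDiscriminator`, the layer sizes, and the sigmoid head are all assumptions made for the example.

```python
import math
import random


def linear(x, weights, bias):
    """Apply a dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def make_layer(rng, n_in, n_out):
    """Random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiSourceDiscriminator:
    """Hypothetical three-branch discriminator: embed, concatenate, classify."""

    def __init__(self, image_dim, geometry_dim, maps_dim, embed_dim=8, seed=0):
        rng = random.Random(seed)
        self.image_branch = make_layer(rng, image_dim, embed_dim)
        self.geometry_branch = make_layer(rng, geometry_dim, embed_dim)
        self.maps_branch = make_layer(rng, maps_dim, embed_dim)
        self.head = make_layer(rng, 3 * embed_dim, 1)

    def embed(self, image_feat, geometry_feat, maps_feat):
        """Embed each source separately, then concatenate the embeddings."""
        sources = [(image_feat, self.image_branch),
                   (geometry_feat, self.geometry_branch),
                   (maps_feat, self.maps_branch)]
        fused = []
        for features, (weights, bias) in sources:
            fused.extend(relu(linear(features, weights, bias)))
        return fused

    def __call__(self, image_feat, geometry_feat, maps_feat):
        """Return the probability that the input pose is ground truth."""
        fused = self.embed(image_feat, geometry_feat, maps_feat)
        weights, bias = self.head
        logit = linear(fused, weights, bias)[0]
        return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    disc = MultiSourceDiscriminator(image_dim=16, geometry_dim=6, maps_dim=12)
    score = disc([0.5] * 16, [0.1] * 6, [0.3] * 12)
    print("probability of ground-truth pose:", score)
```

Keeping the branches separate until the concatenation lets each source use an embedding suited to its own format before the joint decision is made.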
References: