Tools

This page offers tools for recording, processing, annotating, and reconstructing human motions from the dataset. All source code and supporting documents are available in our GitHub repository, complete with a detailed installation guide. In summary:
(1) Recorder provides instructions, code, and documents on constructing a recording framework that supports high-quality synchronized streaming, saving, and visualization of human-human throw&catch activities using multiple sensors.
(2) Processor converts the raw data captured with the proposed recording framework (particularly the sensors) into processed data in commonly used formats and aligns all data across different modalities.
(3) Annotator offers an interactive interface that enables users to visually validate and annotate each recorded throw&catch activity with a hierarchy of semantic and dense labels.
(4) Human motion constructor constructs and optimizes human motions using the multi-modal data streams in the dataset, and retargets the constructed human motions to multiple robots and multi-fingered hands.
(5) Hand joint position extraction extracts hand joint positions from the Euler joint angles captured by the StretchSense MoCap Pro (SMP) Gloves and defined bone lengths.
(6) Visualization offers an interactive interface that enables users to visually browse the synchronized frames of all data streams simultaneously, much like playing a video.

Data Processing

The processor synchronizes all data streams captured from the different sensors and converts the raw data into processed data in commonly used formats, as listed below. All data, both raw and processed, are stored alongside a variety of supporting files and organized in a hierarchical manner. We refer users to the data processing documentation, the data file explanation, and our technical paper for a detailed introduction to the data hierarchy and the content of each data file.

| Device | Raw Data | File | Processed Data | File |
|---|---|---|---|---|
| ZED | Left-eye and right-eye RGB videos | .SVO | RGB images | .PNG |
| | | | Depth maps (unnormalized) | .NPY |
| | | | Depth images (normalized) | .PNG |
| Event | Binary events in EVT3.0 format | .RAW | Events (x, y, p, t) | .CSV |
| | Sensor settings for recording | .BIAS | Event images | .JPG |
| MoCap Pro | Sensors' readings and hand joint angles | .CSV | Hand joint motion | .CSV |
| | Hand calibration parameters | .CAL | Hand joint positions | .JSON |
| | 3D animation visualization | .FBX | | |
| | Metadata of the recording | .JSON | | |
| OptiTrack | Local and global transformation matrices | .CSV | Body motion in throw-catch frame | .CSV |
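
As a quick illustration of working with these processed formats, the sketch below loads a depth map and the event stream for one take. The specific file names used here (e.g., 000001.npy, events.csv) are hypothetical placeholders; consult the data explanation for the actual layout.

import numpy as np
import pandas as pd
import cv2

# Hypothetical paths for illustration only -- see the data explanation for the real layout.
depth = np.load("data/002870/processed/rgbd0/depth/000001.npy")        # unnormalized depth map
depth_img = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)  # normalized preview image
events = pd.read_csv("data/002870/processed/event/events.csv")         # columns: x, y, p, t
print(depth.shape, events.columns.tolist())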


We suggest users visit our GitHub repository for more detailed, step-by-step instructions on processing the data from scratch. Briefly, users can follow the three steps below to process the raw data:

Step 1: Fetch the Raw Data. You can access all recorded raw data from Dropbox. The raw data of each recorded throw&catch activity is compressed in a .zip file.

Step 2: Extract the Raw Data. We provide a scripted data extractor to unzip the packaged raw data and organize all raw files in an appropriate data hierarchy, as previously mentioned.

python src/extract.py --srcpath raw_data_path --tarpath your_path

--srcpath is where the downloaded raw data is stored. --tarpath is the target path where you want to extract the raw data.

Step 3: Process the Extracted Data. After the data has been extracted and organized appropriately, run the processor:

python src/process.py --datapath your_path/data

--datapath is where your extracted data is located.

We refer users to the data processing documentation for complete technical details on how we process the multi-modal and cross-device raw data. Additionally, take a look at the data file explanation and our technical paper for a detailed introduction to the data hierarchy and the content of each involved data file.

Annotator

The annotator enables users to visually segment and annotate each recorded activity with a hierarchy of semantic and dense annotations. Overall, it provides:
(1) An interactive annotation interface: The interface displays all synchronized streams in an interactive window, as shown below, allowing users to inspect and annotate each recording frame by frame.
(2) A variety of annotation operations: Users can easily interact with the interface to label temporal segmentation and annotations using keyboard operations.
(3) An information panel: The annotation results are displayed in real-time in the information panel. Any modifications to the annotation results are immediately saved in the corresponding annotation file.


We recommend that users read the comprehensive annotation guide for a detailed explanation of the annotator and its usage, as well as the dataset annotation.

Human Motion Construction and Retargeting

I. Coarse human pose estimation

We use mmhuman3d to estimate the coarse human poses.
(1) Installation: please follow the official getting started guide for installation.
(2) H2TC data: download the H2TC data and fetch the rgbd0 image folder.
(3) SMPL-related resources and pretrained models: SMPL v1.0 and several other resources from mmhuman3d are needed in this step:

J_regressor_extra.npy, J_regressor_h36m.npy, smpl_mean_params.npz, and the pretrained models

Download the resources and arrange them as follows:

    mmhuman3d
    ├── mmhuman3d
    ├── docs
    ├── tests
    ├── tools
    ├── configs
    └── data
        ├── gmm_08.pkl
        ├── body_models
        │   ├── J_regressor_extra.npy
        │   ├── J_regressor_h36m.npy
        │   ├── smpl_mean_params.npz
        │   └── smpl
        │       ├── SMPL_FEMALE.pkl
        │       ├── SMPL_MALE.pkl
        │       └── SMPL_NEUTRAL.pkl
        ├── pretrained
        │   └── spin_pretrain.pth
        └── static_fits
            ├── coco_fits.npy
            ├── h36m_fits.npy
            ├── lspet_fits.npy
            ├── lsp_fits.npy
            ├── mpii_fits.npy
            └── mpi_inf_3dhp_fits.npy

To extract human poses from the input images or video with human tracking and pose estimation, first `cd mmhuman3d` and then run:

python demo/estimate_smpl.py configs/spin/resnet50_spin_pw3d.py data/checkpoints/spin_pretrained.pth --multi_person_demo --tracking_config demo/mmtracking_cfg/deepsort_faster-rcnn_fpn_4e_mot17-private-half.py --input_path L:/h2tc_dataset/002870/processed/rgbd0 --show_path vis_results/002870.mp4 --smooth_type savgol --speed_up_type deciwatch --draw_bbox --output vis_results/

The estimated human poses will be stored in vis_results/inference_result.npz in SMPL format.
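
As a quick sanity check of the estimation output, you can inspect the saved file with the snippet below. The stored keys are not documented here and depend on the mmhuman3d version, so treat this as a minimal sketch.

import numpy as np

# Minimal inspection of the coarse pose estimates; key names depend on the mmhuman3d version.
result = np.load("vis_results/inference_result.npz", allow_pickle=True)
print("stored keys:", list(result.keys()))
for key in result.keys():
    value = result[key]
    print(key, getattr(value, "shape", type(value)))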

II. Multi-modal based human pose optimization

  (1) Installation:

conda create -n pose python=3.7
conda activate pose
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1 -c pytorch
pip install matplotlib opencv-python scikit-learn trimesh Pillow pyrender pyglet==1.5.15 tensorboard git+https://github.com/nghorbani/configer torchgeometry==0.1.2 smplx==0.1.28

  (2) Download the smplh model
  (3) File structure:

pose_reconstruction_frommm
|--config.py
|--fitting_utls.py
|--h2tc_fit_dataset_mm.py
|--motion_optimizer.py
|--run_fitting_mm.py
|--smplh_male.npz       # smplh model (male)
|--fit_h2tc_mm.cfg      # config file

  (4) Run the multi-modal optimizer to optimize the human poses with the OptiTrack data and the glove hand poses.

python pose_reconstruction_frommm/run_fitting_mm.py @./fit_h2tc_mm.cfg --data-path <h2tc_takeid_processed_folder> --mmhuman <mmhuman_file> --out <out_pose_folder>

<h2tc_takeid_processed_folder>: the processed folder, like root/002870/processed/rgbd0.

<mmhuman_file>: the coarse pose file extracted from Coarse human pose estimation, like root/vis_results/inference_result.npz

<out_pose_folder>: folder path to save the optimization pose results. The output meshes are saved in <out_pose_folder>/body_meshes_humor. Optimized human poses are saved in <out_pose_folder>/results_out/stage2_results.npz

  * Optimization Algorithm

Due to inevitable visual occlusions, the mmhuman3d results are coarse, especially for the arms and hands. The multi-modal data collected in our dataset, including OptiTrack tracking, glove hand poses, and RGB images, can therefore be used to optimize the mmhuman3d results.

Given the coarse mmhuman3d pose estimate $\mathcal{M_{mmh}}$, the OptiTrack head and hand tracking points $\mathcal{H}$, and the glove hand poses $\Theta_{hand}$, we aim to recover accurate human poses $\mathcal{M_{acc}}$. Our optimization objective is:

\[\min_{\mathcal{M_{acc}}} \mathcal{C_{trk}} + \mathcal{C_{wri}} + \mathcal{C_{smo}}\]

The OptiTrack term $\mathcal{C_{trk}}$ measures how well the posed body model matches the OptiTrack points \(\mathbf{P}_t = \left \{ \mathbf{d}_t^i \right \}_{i=0}^{3}\) for the head and two-hand points at each frame $t$. We use the corresponding mesh vertices $\mathbf{V}_t$ (index 411 for the head OptiTrack data, 5459 for the right hand, and 2213 for the left hand) to compute

\[\mathcal{C_{trk}} =\lambda _{trk}\sum_{t=0}^{T}\sum_{i=0}^{3} \mathop{\min}_{\mathbf{v}_t^i}\left \| \mathbf{v}_t^i- \mathbf{d}_t^i \right \| ^2\]

The wrist cost $\mathcal{C_{wri}}$ is used to disambiguate the right/left wrist pose guided by the hand tracking information. Meanwhile, this cost also contributes to recovering accurate whole-arm poses even under severe occlusions. We use the hand OptiTrack poses \(\mathbf{O}_t^{hand} = \left \{ \mathbf{o}_t^h \right \}_{h=0}^{1}\) to calculate the right ($h=0$) and left ($h=1$) wrist losses. It is formulated as

\[\mathcal{C_{wri}} =\lambda _{wri}\sum_{t=0}^{T}\sum_{h=0}^{1}\left \| {\mathbf{v}_{wri}}_t^h-\mathbf{o}_t^h \right \| ^2\]

where ${\mathbf{v}_{wri}}_t^h$ is the SMPLH right/left wrist pose.

Independent frame-by-frame pose estimation always causes temporal inconsistency. The regularization term $\mathcal{C_{smo}}$ is used to enforce the smoothness of the recovered motion and keep it plausible. The smoothness term encourages 3D joint consistency. It is formulated as

\[\mathcal{C_{smo}}= \sum_{t=1}^{T} (\lambda_{jp}\sum_{j=1}^{N} \left \| \mathbf{J}_t^j - \mathbf{J}_{t-1}^j \right \|^2 +\lambda_{bl}\sum_{i=1}^{B} ( l_t^i - l_{t-1}^i )^2 )\]

$\mathbf{J}_t^j$ is the position of joint $j$ at time $t$. The bone lengths $l_t^i$ are computed from the joint positions $\mathbf{J}_t$ at each step.

For the hand poses, the two hands' poses $\Theta_{hand}$ are already captured by the gloves in each recording, and we map them directly to the SMPLH hand poses.

We initialize the optimization with the mmhuman3d poses. All $\lambda$ are weights that control the contribution of each term.
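
For illustration, the sketch below shows how the three cost terms could be assembled in PyTorch. The tensor shapes, the wrist-pose representation, and the weights are assumptions made for this example and do not reproduce our exact implementation in run_fitting_mm.py.

import torch

# Illustrative sketch of the objective C_trk + C_wri + C_smo (not the exact implementation).
# verts:    (T, V, 3) posed SMPLH vertices per frame
# joints:   (T, N, 3) posed SMPLH joint positions per frame
# trk_pts:  (T, 3, 3) OptiTrack points per frame (head, right hand, left hand)
# wri_pose / opti_pose: (T, 2, 3) SMPLH wrist poses and hand OptiTrack poses (assumed axis-angle)
# bones:    (B, 2) long tensor of joint-index pairs defining the bones
HEAD, RHAND, LHAND = 411, 5459, 2213   # mesh vertex indices given above

def objective(verts, joints, trk_pts, wri_pose, opti_pose, bones,
              lam_trk=1.0, lam_wri=1.0, lam_jp=1.0, lam_bl=1.0):
    # OptiTrack term: the selected mesh vertices should match the OptiTrack points.
    tracked = verts[:, [HEAD, RHAND, LHAND], :]
    c_trk = lam_trk * ((tracked - trk_pts) ** 2).sum()

    # Wrist term: the SMPLH wrist poses should match the hand OptiTrack poses.
    c_wri = lam_wri * ((wri_pose - opti_pose) ** 2).sum()

    # Smoothness term: penalize frame-to-frame changes in joint positions and bone lengths.
    bone_len = (joints[:, bones[:, 0]] - joints[:, bones[:, 1]]).norm(dim=-1)   # (T, B)
    c_smo = (lam_jp * ((joints[1:] - joints[:-1]) ** 2).sum()
             + lam_bl * ((bone_len[1:] - bone_len[:-1]) ** 2).sum())
    return c_trk + c_wri + c_smo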

III. Retargeting
III.1 SMPLH pose sequence -> general animation format (.fbx)

  (1) Installation:

a. Install [Python FBX](https://download.autodesk.com/us/fbx/20112/fbx_sdk_help/index.html?url=WS1a9193826455f5ff453265c9125faa23bbb5fe8.htm,topicNumber=d0e8312). 

b. Open `SMPL-to-FBX-main` and `pip install -r requirements.txt`

  (2) SMPLX fbx:

Download the [SMPLX fbx model](https://smpl.is.tue.mpg.de) for Unity. Keep the female model `smplx-female.fbx` and the male model `smplx-male.fbx`.

  (3) The file structure would be like:

SMPL-to-FBX-main
|--Convert.py
|--SMPLXObject.py
|--FbxReadWriter.py
|--<smplh_pose>/
|  |--*.npz
|--<fbx_path>/
|  |--smplx-female.fbx
|  |--smplx-male.fbx
|--<output_path>/

  (4) Run

python Convert.py --input_motion_base <smplh_pose> --fbx_source_path <fbx_path>/smplx-female.fbx --output_base <output_path>

to start converting. The animation file will be saved in <output_path>. You can open it via Blender or Unity 3D.

III.2 Retargeting

We use Unity 3D (2022.3.17) to demonstrate the retargeting. Please check the tutorial video first, and then follow the steps below:

  (1) Model rigging: given a mesh model, bind the mesh vertices to bones.
  (2) Specifying the corresponding skeleton joints between rigged models A and B: Unity 3D solves this automatically after the rigged models are set to the Humanoid animation type.
  (3) Animation: please follow the tutorial video above. The skinning algorithm used by Unity 3D is Linear Blend Skinning (LBS); a minimal sketch is given below.
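
For reference, linear blend skinning deforms each mesh vertex by a weighted combination of its bones' transforms. The sketch below is a generic, self-contained illustration of the idea and is not Unity 3D's internal implementation.

import numpy as np

def linear_blend_skinning(rest_verts, weights, bone_transforms):
    # rest_verts:      (V, 3) vertices in the rest pose
    # weights:         (V, J) skinning weights, each row summing to 1
    # bone_transforms: (J, 4, 4) bone transforms relative to the rest pose
    verts_h = np.concatenate([rest_verts, np.ones((len(rest_verts), 1))], axis=1)   # homogeneous (V, 4)
    blended = np.einsum("vj,jab->vab", weights, bone_transforms)                    # per-vertex blended transform
    skinned = np.einsum("vab,vb->va", blended, verts_h)                             # apply transform to each vertex
    return skinned[:, :3]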

Four examples of motion construction and retargeting using our dataset are shown below:

take one take two take three take four

Hand joint position extraction

We extract hand joint positions (i.e., their XYZ 3D locations) from the Euler joint angles captured by the StretchSense MoCap Pro gloves and the bone lengths, using forward kinematics. We provide a set of default bone lengths:

finger_length['thumb'] = [None, 0.25, 0.11, 0.06]
finger_length['index'] = [0.34, 0.15, 0.08, 0.06]
finger_length['middle'] = [0.33, 0.15, 0.10, 0.07]
finger_length['ring'] = [0.31, 0.13, 0.10, 0.06]
finger_length['pinky'] = [0.3, 0.08, 0.06, 0.06]

but users can supply their own bone lengths to adapt to their specific needs and scenarios.
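
To illustrate the forward kinematics involved, the sketch below chains per-joint Euler rotations along a single finger to obtain joint positions. The rotation order and rest-pose bone direction are assumptions made for the example; the actual conventions implemented in plot_motion.py may differ.

import numpy as np
from scipy.spatial.transform import Rotation as R

def finger_joint_positions(euler_angles_deg, bone_lengths, origin=np.zeros(3)):
    # euler_angles_deg: (J, 3) Euler angles per finger joint, in degrees
    # bone_lengths:     (J,)   length of the bone following each joint
    # Assumes an 'xyz' rotation order and bones pointing along +y in the rest pose.
    positions = [origin]
    rot = R.identity()
    for angles, length in zip(euler_angles_deg, bone_lengths):
        rot = rot * R.from_euler("xyz", angles, degrees=True)    # accumulate the parent rotations
        positions.append(positions[-1] + rot.apply([0.0, length, 0.0]))
    return np.array(positions)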

Specifically, hand joint position extraction has been integrated into plot_motion.py, which will be called by process.py during data extraction and processing. This means hand joint positions will be automatically extracted and saved along with other extracted and processed data from the raw zip files.

In addition, we provide a script, src/utils/extract_hand_joint_positions.py, that allows users to extract or adjust hand joint positions separately from the main data processing pipeline.

To run src/utils/extract_hand_joint_positions.py, users need to specify the root folder path of the take (e.g., data/001000):

python extract_hand_joint_positions.py --data_root path/to/the_take 

In both cases, the extracted hand joint positions are saved as JSON files, with the left-hand joint positions in path/to/the_take/processed/left_hand_joint_positions.json and the right-hand ones in path/to/the_take/processed/right_hand_joint_positions.json.
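
To work with these files programmatically, the snippet below loads the left-hand file and reports its contents; it assumes only the key-value layout described in the next paragraph (frame-number keys mapping to lists of joint positions).

import json

# Load the extracted left-hand joint positions for one take.
with open("path/to/the_take/processed/left_hand_joint_positions.json") as f:
    left_hand = json.load(f)

frame_ids = sorted(left_hand.keys(), key=int)
first_frame = left_hand[frame_ids[0]]
print(f"{len(frame_ids)} frames, {len(first_frame)} joint positions in the first frame")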

20 joint positions are saved per frame. The JSON file contains key-value entries, where each key is a frame number and each value is a list of joint positions; from index 0 to index 19, the saved joint positions are ordered as indicated below:

Visualization

The first step in using our visualization tool is to prepare the processed data, which can be done with the provided processor. Alternatively, for a quick browse, we offer the processed data of several sample takes, which can be downloaded directly from here. In the end, you should have the data stored under a path similar to this: PARENT_PATH/data/take_id/processed.

Now you can run the following command to launch the visualization tool:

python src/visualize.py --datapath PARENT_PATH/data --take take_id --speed 120

The argument --take specifies the ID of the take to be visualized; if it is not set, the first take under the given path is loaded. --speed specifies the FPS at which the frames of the streams are played.

Once the interface is launched, you can navigate the visualization through the following operations:

  (1) space: play/pause the videos of all streams
  (2) right arrow: pause the video if playing and step forward to the next frame
  (3) left arrow: pause the video if playing and step back to the previous frame

Below is an example of visualizing a take: