Final Project Report

Jian Yu, Xiaoyu Zhu, Xinzhe Wei, Zimo Fan

Our video (click it!):

Final video

1 Abstract

Our project aimed to create high-fidelity 3D head avatars using a combination of neutral 3D Gaussians and a fully learned MLP-based deformation field. Our method can capture intricate dynamic details while maintaining expression precision. Moreover, to ensure the stability and convergence of the training procedure, we devise a well-designed initialization strategy guided by geometry, leveraging an implicit SDF and Deep Marching Tetrahedra.

2 Technical Approach

We begin by removing the background from each image and simultaneously estimating the 3DMM model, 3D facial landmarks, and expression coefficients for every frame. During the initialization phase (Section 2.1), we reconstruct a neutral geometry based on Signed Distance Fields (SDF). Additionally, we refine a deformation MLP and a color MLP on the training data to create a guidance model. Subsequently, we initialize the neutral Gaussians using the neutral mesh extracted through DMTet, while the deformation and color MLPs are inherited from this stage.

[Figure: preprocessing results on Jian Yu's face.]

Moving on to the training phase in Section 2.2, we utilize the dynamic generator to deform the neutral Gaussians to the target expression, leveraging the driving expression coefficients as conditions.

Ultimately, given a specified camera perspective, the expressive Gaussians are rendered into a feature map. This feature map then serves as input to a convolutional super-resolution network tasked with generating high-resolution avatar images. The entire model is optimized under supervision from multi-view RGB videos.

2.1 Geometry-guided Initialization

While we are familiar with how to initialize neural networks, Gaussians are quite different: with random initialization, training struggles to converge.

To overcome this, we propose utilizing an implicit Signed Distance Field (SDF) representation and Deep Marching Tetrahedra (DMTet) to reconstruct a neutral mesh for Gaussian position initialization. Furthermore, rough optimization is applied to the color and deformation MLPs.
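As an illustration, a minimal PyTorch sketch of such an implicit SDF is given below. The network width, depth, activation, and the `SDFNetwork` name are our assumptions, not the reference implementation (which also uses positional encoding); DMTet then extracts the zero level set of this field as the neutral mesh:

```python
import torch
import torch.nn as nn

class SDFNetwork(nn.Module):
    """Minimal implicit SDF: maps 3D points to signed distances.
    A hypothetical sketch, not the reference implementation."""
    def __init__(self, hidden: int = 256, depth: int = 4):
        super().__init__()
        layers, in_dim = [], 3
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.Softplus(beta=100)]
            in_dim = hidden
        layers.append(nn.Linear(hidden, 1))  # one signed-distance value per point
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

sdf = SDFNetwork()
points = torch.rand(1024, 3) * 2 - 1   # query points in [-1, 1]^3
distances = sdf(points)                # (1024, 1); zero level set = surface
```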

We build a deformation MLP and a color MLP to initialize the networks $f_{col}^{exp}$, $f_{col}^{pose}$, $f_{def}^{exp}$, $f_{def}^{pose}$. During training, constraints are introduced to prevent overfitting:

  1. Offset regularization $L_{offset}$ penalizes non-zero displacements, preventing the network from learning a global constant offset;

  2. Landmark regularization $L_{lmk}$ restricts SDF values near the 3D landmarks to be close to zero, ensuring the landmarks lie on the mesh surface;

  3. Laplacian regularization $L_{lap}$ maintains mesh smoothness.

The overall loss function is formulated as:

$$L_{total} = L_{RGB} + \lambda_{sil} L_{sil} + \lambda_{def} L_{def} + \lambda_{offset} L_{offset} + \lambda_{lmk} L_{lmk} + \lambda_{lap} L_{lap} \tag{1}$$

$$L_{RGB} = \|I_{r,g,b} - I_{gt}\|_1, \quad L_{sil} = \mathrm{IoU}(M, M_{gt}), \quad P = P_0 + f_{def}^{exp}(P_0, \theta), \quad L_{def} = \|P - P_{gt}\|_1 \tag{2}$$

where each $\lambda$ represents the weight of its term. The MLPs, together with the neutral 3D landmarks $P_0$, are jointly optimized until convergence.
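To make Eqs. (1)-(2) concrete, below is a hedged PyTorch sketch of the initialization objective; the tensor names, the soft-IoU form of $L_{sil}$, and the weight values are placeholder assumptions, not the reference implementation:

```python
import torch

def init_loss(I_rgb, I_gt, M, M_gt, P0, offsets, P_gt, sdf_at_lmk, lap,
              w_sil=0.1, w_def=1.0, w_offset=0.01, w_lmk=0.1, w_lap=100.0):
    """Sketch of Eqs. (1)-(2); every input is a hypothetical tensor:
    I_rgb/I_gt rendered vs. ground-truth images, M/M_gt soft silhouettes,
    P0 neutral 3D landmarks, offsets = f_def^exp(P0, theta), P_gt target
    landmarks, sdf_at_lmk SDF values at the landmarks, lap per-vertex
    Laplacian magnitudes. All weights are placeholders."""
    L_rgb = (I_rgb - I_gt).abs().mean()
    inter = (M * M_gt).sum()
    union = (M + M_gt - M * M_gt).sum().clamp(min=1e-8)
    L_sil = 1.0 - inter / union                   # soft-IoU silhouette loss
    P = P0 + offsets                              # P = P0 + f_def^exp(P0, theta)
    L_def = (P - P_gt).abs().mean()
    L_offset = offsets.norm(dim=-1).mean()        # discourage large displacements
    L_lmk = sdf_at_lmk.abs().mean()               # landmarks pinned to surface
    L_lap = lap.mean()                            # mesh smoothness
    return (L_rgb + w_sil * L_sil + w_def * L_def
            + w_offset * L_offset + w_lmk * L_lmk + w_lap * L_lap)
```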

Then, we use the optimized results to initialize the attributes $\{f_{col}^{exp}, f_{col}^{pose}, f_{def}^{exp}, f_{def}^{pose}\}$; for the remaining attributes, we simply follow the original 3DGS initialization strategy.

2.2 Training

2.2.1 Training Pipeline

To compute the loss, in each iteration we first generate the head avatar, then render a 32-channel feature image at 512 resolution, $I_C \in \mathbb{R}^{512 \times 512 \times 32}$. Thirdly, we feed this image to a super-resolution network $\Psi$, generating an image $I_{hr} \in \mathbb{R}^{2048 \times 2048 \times 3}$ that shows more detail. All parameters, including $\{X_0, F_0, Q_0, S_0, A_0\}$, $\{f_{col}^{exp}, f_{col}^{pose}, f_{def}^{exp}, f_{def}^{pose}, f_{att}^{exp}, f_{att}^{pose}\}$, and $\Psi$, are optimized jointly.
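As a sketch only, one such iteration could look like the following; `avatar`, `renderer`, `sr_net`, the batch keys, and `photometric_loss` (sketched in Section 2.2.2 below) are hypothetical names, not the reference code:

```python
import torch

def training_step(avatar, renderer, sr_net, batch, optimizer):
    """One hypothetical training iteration: deform the neutral Gaussians,
    splat a 32-channel 512x512 feature image, super-resolve to 2048x2048,
    then backpropagate the supervision loss."""
    gaussians = avatar(batch["expression"], batch["pose"])  # expressive Gaussians
    I_c = renderer(gaussians, batch["camera"])              # (32, 512, 512) feature map
    I_hr = sr_net(I_c.unsqueeze(0))[0]                      # (3, 2048, 2048) image
    loss = photometric_loss(I_hr, I_c[:3], batch["image_gt"])  # Eq. (3) below
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```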

2.2.2 Loss Function

$$L = \|I_{hr} - I_{gt}\|_1 + \lambda_{vgg}\,\mathrm{VGG}(I_{hr}, I_{gt}) + \lambda_{lr}\|I_{lr} - I_{gt}\|_1 \tag{3}$$

Here each $\lambda$ is a weight, $I_{lr}$ denotes the first three channels of the 32-channel image $I_C$, $\mathrm{VGG}$ is the VGG perceptual loss, and $I_{gt}$ is the ground truth.
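A minimal sketch of Eq. (3) follows; downsampling the ground truth for the low-resolution term, the optional `vgg` module, and the weight values are our assumptions:

```python
import torch
import torch.nn.functional as F

def photometric_loss(I_hr, I_lr, I_gt, vgg=None, w_vgg=0.1, w_lr=1.0):
    """Sketch of Eq. (3). I_lr holds the first three channels of the
    32-channel feature image I_C; I_gt is the ground-truth image."""
    loss = (I_hr - I_gt).abs().mean()                     # L1 on the SR output
    if vgg is not None:                                   # perceptual term
        loss = loss + w_vgg * vgg(I_hr, I_gt)
    I_gt_lr = F.interpolate(I_gt.unsqueeze(0), size=I_lr.shape[-2:],
                            mode="bilinear", align_corners=False)[0]
    return loss + w_lr * (I_lr - I_gt_lr).abs().mean()    # L1 on the raw render
```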

2.3 Avatar Representation

Our goal is to generate a dynamic head avatar controlled by expression coefficients. To achieve this, we represent the head avatar as dynamic 3D Gaussians conditioned on expressions. To accommodate dynamic changes, we incorporate expression coefficients and head pose as inputs to the head avatar model, which then outputs the position and other attributes of the Gaussians accordingly.

This is the pipeline of the paper we refer to:

[Figure: pipeline of the reference paper (Gaussian Head Avatar).]

Firstly, we build a neutral Gaussian model with expression-independent attributes $\{X_0, F_0, Q_0, S_0, A_0\}$: $X_0 \in \mathbb{R}^{N \times 3}$ is the position of the Gaussians in the canonical space, $F_0 \in \mathbb{R}^{N \times 128}$ holds the point-wise feature vectors as their inherent attribute, $Q_0 \in \mathbb{R}^{N \times 4}$ is the neutral rotation, $S_0 \in \mathbb{R}^{N \times 3}$ is the scale of the Gaussians, and $A_0 \in \mathbb{R}^{N \times 1}$ is the opacity. Then, using an MLP-based deformation field, we generate a dynamic head avatar conditioned on the expression coefficients: $\{X, C, Q, S, A\} = \phi(X_0, F_0, Q_0, S_0, A_0; \theta, \beta)$, where $\theta$ represents the expression coefficients and $\beta$ is the head pose. During training, we optimize the parameters of the model to minimize the difference between the generated head avatar and the ground truth.
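As a sketch, these attributes can be stored as learnable tensors; the zero initialization below is a placeholder (in practice they are initialized from the neutral mesh of Section 2.1):

```python
import torch
import torch.nn as nn

class NeutralGaussians(nn.Module):
    """Expression-independent attributes of the N neutral Gaussians."""
    def __init__(self, N: int):
        super().__init__()
        self.X0 = nn.Parameter(torch.zeros(N, 3))    # canonical positions
        self.F0 = nn.Parameter(torch.zeros(N, 128))  # point-wise feature vectors
        self.Q0 = nn.Parameter(torch.zeros(N, 4))    # neutral rotations (quaternions)
        self.S0 = nn.Parameter(torch.zeros(N, 3))    # anisotropic scales
        self.A0 = nn.Parameter(torch.zeros(N, 1))    # opacities
```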

Next, we explain how each attribute in $\{X, C, Q, S, A\}$ is updated through $\phi$.

For $X$, we assume the displacement is related to both the expression and the head pose, captured by $f_{def}^{exp}$ and $f_{def}^{pose}$ respectively. We add their contributions to the neutral position:

$$X = X_0 + \lambda_{exp}(X_0) f_{def}^{exp}(X_0, \theta) + \lambda_{pose}(X_0) f_{def}^{pose}(X_0, \beta) \tag{4}$$

The two $\lambda$ terms weight the expression and head pose contributions. We compute $\lambda_{exp}$ as follows:

$$\lambda_{exp}(x) = \begin{cases} 1, & \mathrm{dist}(x, P_0) < t_1 \\ \dfrac{t_2 - \mathrm{dist}(x, P_0)}{t_2 - t_1}, & \mathrm{dist}(x, P_0) \in [t_1, t_2] \\ 0, & \mathrm{dist}(x, P_0) > t_2 \end{cases} \tag{5}$$

and $\lambda_{pose}(x) = 1 - \lambda_{exp}(x)$. Here $\mathrm{dist}(x, P_0)$ is the minimum distance between $x$ and the 3D landmarks $P_0$, and $t_1$ and $t_2$ are predefined hyperparameters, with the length of the head normalized to approximately 1.
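A small sketch of Eq. (5); the threshold values below are placeholders:

```python
import torch

def lambda_exp(x, P0, t1=0.15, t2=0.25):
    """Eq. (5): blend weight from the distance to the nearest 3D landmark.
    Clamping a linear ramp reproduces the three cases: 1 below t1,
    a linear falloff on [t1, t2], and 0 above t2."""
    d = torch.cdist(x, P0).min(dim=-1).values   # dist(x, P0) for each Gaussian
    return ((t2 - d) / (t2 - t1)).clamp(0.0, 1.0)

# lambda_pose is the complement: lambda_pose = 1 - lambda_exp(x, P0)
```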

Instead of using only this linear falloff, we also tried a sigmoid-like weighting function (see the sketch after the list below). Our reasons:

  1. In our experience with masking and weight painting, the skinning weight matrix is typically very sparse, with surface vertices overwhelmingly influenced by their nearest bone. Driving Gaussians with a mesh is analogous to driving a mesh with a skeleton.

  2. It may also better match the body's natural deformation.
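As one sigmoid-like alternative, a smoothstep keeps the same thresholds but saturates more sharply; the sketch below shares the placeholder assumptions of the linear version:

```python
import torch

def lambda_exp_smooth(x, P0, t1=0.15, t2=0.25):
    """Hypothetical sigmoid-shaped variant of Eq. (5): smoothstep on the
    normalized distance, so weights saturate toward 0/1 more sharply,
    mimicking the sparse skinning weights described above."""
    d = torch.cdist(x, P0).min(dim=-1).values     # distance to nearest landmark
    s = ((t2 - d) / (t2 - t1)).clamp(0.0, 1.0)    # normalized linear weight
    return s * s * (3.0 - 2.0 * s)                # smoothstep: exact 0/1 at the ends
```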

For $C$, the color is directly predicted by the two color MLPs:

$$C = \lambda_{exp}(X_0) f_{col}^{exp}(F_0, \theta) + \lambda_{pose}(X_0) f_{col}^{pose}(X_0, \beta) \tag{6}$$

For $Q$, $S$ and $A$, we similarly use another two MLPs to predict the rotation, scale and opacity:

$$\{Q, S, A\} = \{Q_0, S_0, A_0\} + \lambda_{exp}(X_0) f_{att}^{exp}(F_0, \theta) + \lambda_{pose}(X_0) f_{att}^{pose}(Q_0, \beta) \tag{7}$$

Lastly, we transform from canonical space to world space. Only the direction-related variables need to change, so we have:

$$\{X', Q'\} = T(\{X, Q\}, \beta), \tag{8}$$
$$\{C', S', A'\} = \{C, S, A\} \tag{9}$$
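Putting Eqs. (4)-(7) together, the dynamic generator $\phi$ could be sketched as follows. The `neutral` and `mlps` containers and their call signatures are assumptions, reusing `lambda_exp` from the earlier sketch:

```python
import torch

def deform(neutral, mlps, theta, beta, P0):
    """Sketch of the dynamic generator phi combining Eqs. (4)-(7).
    `neutral` holds {X0, F0, Q0, S0, A0} and `mlps` holds the six MLPs;
    both containers are hypothetical."""
    X0, F0 = neutral.X0, neutral.F0
    w_exp = lambda_exp(X0, P0).unsqueeze(-1)      # Eq. (5), per-Gaussian weight
    w_pose = 1.0 - w_exp                          # lambda_pose = 1 - lambda_exp
    # Eq. (4): displace the canonical positions
    X = X0 + w_exp * mlps.f_def_exp(X0, theta) + w_pose * mlps.f_def_pose(X0, beta)
    # Eq. (6): predict color directly
    C = w_exp * mlps.f_col_exp(F0, theta) + w_pose * mlps.f_col_pose(X0, beta)
    # Eq. (7): residual update of rotation, scale, and opacity as one vector
    QSA = torch.cat([neutral.Q0, neutral.S0, neutral.A0], dim=-1) \
        + w_exp * mlps.f_att_exp(F0, theta) + w_pose * mlps.f_att_pose(neutral.Q0, beta)
    Q, S, A = QSA.split([4, 3, 1], dim=-1)
    # Eqs. (8)-(9): a rigid transform T(., beta) would then map X and Q
    # to world space, while C, S, A stay unchanged.
    return X, C, Q, S, A
```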

2.4 Problems

2.5 Lessons Learned

For the final project, our team extensively studied the relevant content. First, we explored mesh-based human body models, researching the classic SMPL work from the Max Planck Institute. SMPL appears to be a complex work, with many abstract mathematical notations and a lot of prerequisite knowledge in computer graphics. However, combining the knowledge we learned in the course, we gradually understood that it is essentially a mesh model implemented with blend shapes, applying the influence of shape and pose to a standard template through linear blend skinning (LBS). The following image is the slide we prepared for our discussion. These are the footprints of our learning process👣👣👣.

Next is the novel scene representation technique, 3D Gaussian Splatting (3DGS). 3DGS expresses a scene by optimizing many small Gaussian distributions and renders them via splatting, a rasterization technique. Each Gaussian carries not only a mean (position) and covariance (size and shape) but also an opacity (α) and spherical harmonic coefficients representing view-dependent three-channel color. Since 3DGS renders by rasterization rather than ray tracing, and the original authors parallelized the rendering process across many threads in CUDA, rendering efficiency is high enough for real-time performance. Moreover, from an effectiveness perspective, 3DGS exhibits a strong capability for scene representation.
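For intuition, here is the per-pixel compositing rule that splatting implements (a sketch of the math only, not the parallel CUDA rasterizer; the single-pixel shapes are our simplification):

```python
import torch

def composite(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing for one pixel: each depth-sorted
    Gaussian contributes its color weighted by its alpha times the
    transmittance accumulated in front of it. Shapes: (K, 3) and (K,)."""
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)  # prod_{j<i}(1 - a_j)
    weights = alphas * transmittance
    return (weights.unsqueeze(-1) * colors).sum(dim=0)         # final pixel RGB
```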

We tried to reproduce the latest CVPR work. Even though it was difficult, it was fun. We created COOL avatars and found our joy in code and graphics.

When making adjustments to parameters, thorough deliberation is paramount, particularly given the extensive training times characteristic of a model of such magnitude. Deviating from the optimal direction could lead to significant time wastage, as rectifying errors and rerunning experiments can be exceptionally time-consuming.

Furthermore, it's imperative to craft a comprehensive and flexible plan with generous time allocations. Unforeseen complications are inevitable, and they often demand more time than initially anticipated. For example, delving into the intricacies of the original codebase can prove to be far more challenging than initially imagined. Additionally, there can be a considerable disparity between the current state of the art in research papers and the knowledge base at our disposal. Such disparities may necessitate additional time for learning and adaptation, underscoring the importance of a well-padded timeline.

3 Results

We first use images and camera parameters to model the head on an existing dataset:

Then we convert it into a Gaussian model:

We use another person's head to reenact our avatar:

The right side is the trained avatar; the left side performs a new movement, and the right head mimics the left one.

4 References

Xu, Yuelang, et al. "Gaussian head avatar: Ultra high-fidelity head avatar via dynamic Gaussians." arXiv preprint arXiv:2312.03029 (2023).

Gerig, Thomas, et al. "Morphable face models-an open framework." 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018). IEEE, 2018.

https://github.com/YuelangX/Gaussian-Head-Avatar

https://github.com/YuelangX/Multiview-3DMM-Fitting/tree/main

https://flame.is.tue.mpg.de/

5 Contributions from each team member

Jian Yu: Do research on the topic, Reproduction of possible work, Try to make our own dataset, Polish the final report

Xinzhe Wei: Do research on the topic, Reproduction of possible work, Train on the existing dataset, Polish the final report

Xiaoyu Zhu: Do research on the topic, Reproduction of possible work, Try to make our own dataset, Polish the final report

Zimo Fan: Do research on the topic, Reproduction of possible work, Drafting of the first version of the final report, Polish the final report