Team42 Milestone

Video | Slides

About Proposal Feedback

Thanks to our TA, Mingyang Wang, for the questions and suggestions on our project proposal. Here are our answers:

Our Progress

We originally intended to implement text-driven 3D human generation. We found a very relevant paper, HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting, and planned to reproduce it. However, after two to three days of effort, we realized that the repository had serious problems: the environment configuration led to dependency conflicts, and some plugins could not be installed on our machine.

We then turned our attention to ECON, a work that uses no NeRF or Gaussian methods. Instead, it recovers the entire reconstruction from a monocular image (no text prompt) through a series of steps such as estimating normal and depth maps, combined with the SMPL prior for the human body. We ran this work successfully, and here are some results:

On the left are the input images, and on the right are the reconstructed humans viewed from all angles.
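The normals-to-depth step is the easiest part of this pipeline to make concrete. Below is a minimal, self-contained sketch of classic Frankot-Chellappa integration, which recovers a relative depth map from a normal map. ECON itself uses a more sophisticated, SMPL-guided normal-integration scheme, so treat this purely as an illustration of the idea:

```python
import numpy as np

def depth_from_normals(normals: np.ndarray) -> np.ndarray:
    """Integrate an HxWx3 unit-normal map into a relative depth map
    (Frankot-Chellappa, illustration only -- not ECON's implementation)."""
    nz = np.clip(normals[..., 2], 1e-3, None)    # avoid division by zero
    p = -normals[..., 0] / nz                    # dz/dx implied by the normals
    q = -normals[..., 1] / nz                    # dz/dy implied by the normals
    h, w = p.shape
    u, v = np.meshgrid(np.fft.fftfreq(w) * 2 * np.pi,
                       np.fft.fftfreq(h) * 2 * np.pi)
    denom = u**2 + v**2
    denom[0, 0] = 1.0                            # keep the DC term finite
    z_hat = (-1j * u * np.fft.fft2(p) - 1j * v * np.fft.fft2(q)) / denom
    z_hat[0, 0] = 0.0                            # depth is defined up to an offset
    return np.real(np.fft.ifft2(z_hat))
```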

However, this method is more "traditional" (it is actually quite recent, but scene-reconstruction methods such as Gaussian splatting seem more up to date). We then looked at Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars, which uses Gaussian splatting to reconstruct animated character scenes. We successfully ran this work:

Above is the result of reconstructing the animated human body.
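Mechanically, methods in this family typically bind a set of canonical Gaussians to a skeleton and move their centers with linear blend skinning (LBS). The sketch below is our own illustration under that assumption, not the paper's code; all names and shapes are placeholders:

```python
import torch

def lbs_deform_gaussians(mu: torch.Tensor,       # (N, 3) canonical Gaussian centers
                         weights: torch.Tensor,  # (N, J) per-Gaussian skinning weights
                         joint_tf: torch.Tensor  # (J, 4, 4) posed joint transforms
                         ) -> torch.Tensor:
    """Move canonical Gaussian centers to the current pose via LBS (illustrative)."""
    mu_h = torch.cat([mu, torch.ones_like(mu[:, :1])], dim=-1)  # homogeneous (N, 4)
    blended = torch.einsum("nj,jab->nab", weights, joint_tf)    # per-Gaussian (4, 4)
    posed = torch.einsum("nab,nb->na", blended, mu_h)           # transform centers
    return posed[:, :3]

# Toy usage: identity joint transforms leave the centers unchanged.
mu = torch.randn(1000, 3)
weights = torch.softmax(torch.randn(1000, 24), dim=-1)
joint_tf = torch.eye(4).expand(24, 4, 4)
assert torch.allclose(lbs_deform_gaussians(mu, weights, joint_tf), mu, atol=1e-5)
```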

We then turned our attention to another work, Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians, which focuses on reconstructing a dynamic head avatar that can be driven by multi-view input videos. Our tests show that the results reach high fidelity. The pipeline is shown here:

They first optimize the guidance model, which consists of a neutral mesh, a deformation MLP, and a color MLP, yielding an expressive mesh head avatar. The mesh and the MLPs then serve as the initialization of the Gaussian model and the dynamic generator, respectively. The optimization process is shown below.
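To make the guidance model's structure concrete, here is a minimal sketch of our reading of it: a learnable neutral mesh whose vertices are displaced by a deformation MLP and colored by a color MLP, both conditioned on expression coefficients. This is not the authors' code; every dimension and name is an illustrative assumption:

```python
import torch
import torch.nn as nn

class MeshHeadGuidance(nn.Module):
    """Sketch of the guidance model: neutral mesh + deformation/color MLPs."""
    def __init__(self, n_verts: int = 5000, expr_dim: int = 64, hidden: int = 256):
        super().__init__()
        # Learnable neutral (expressionless) head mesh vertices.
        self.neutral_verts = nn.Parameter(torch.zeros(n_verts, 3))
        # Deformation MLP: (vertex position, expression) -> per-vertex offset.
        self.deform_mlp = nn.Sequential(
            nn.Linear(3 + expr_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))
        # Color MLP: (vertex position, expression) -> per-vertex RGB.
        self.color_mlp = nn.Sequential(
            nn.Linear(3 + expr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, expr: torch.Tensor):       # expr: (expr_dim,)
        e = expr.expand(self.neutral_verts.shape[0], -1)
        x = torch.cat([self.neutral_verts, e], dim=-1)
        verts = self.neutral_verts + self.deform_mlp(x)   # expression-deformed mesh
        colors = self.color_mlp(x)                        # per-vertex colors
        return verts, colors
```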

Then they use the Gaussian model and the dynamic generator to generate an expressive Gaussian head, which is fed to a super-resolution up-sampling network that renders a 2K RGB image; this is compared with the ground truth to compute the loss. The first 7% of the process is shown below:

Shown here are two rendered results: the trained raw Gaussian head (left) and the high-resolution output (right). The results reach high fidelity.
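The second stage can be summarized as the following training step. This is a hedged sketch of our understanding, not the repository's code: `render_fn`, `sr_net`, and the exact loss terms are stand-ins, and all tensors are assumed to be (B, 3, H, W) images:

```python
import torch
import torch.nn.functional as F

def training_step(render_fn, sr_net, expr, camera, gt_2k, optimizer):
    """One optimization step: splat Gaussians, super-resolve, compare to 2K GT."""
    optimizer.zero_grad()
    raw = render_fn(expr, camera)      # low-resolution splatted Gaussian image
    pred_2k = sr_net(raw)              # up-sample to a 2K RGB image
    # Supervise both resolutions (a plausible choice; the paper's exact losses differ).
    gt_low = F.interpolate(gt_2k, size=raw.shape[-2:], mode="bilinear",
                           align_corners=False)
    loss = F.l1_loss(pred_2k, gt_2k) + F.l1_loss(raw, gt_low)
    loss.backward()
    optimizer.step()
    return loss.item()
```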

After training, the Gaussian avatar can be reenacted with expression coefficients, as shown below.
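Reenactment after training is essentially inference with the frozen networks: novel expression coefficients (e.g. tracked from a driving video) are pushed through the dynamic generator frame by frame. A hedged sketch with assumed names:

```python
import torch

@torch.no_grad()
def reenact(generator, sr_net, expr_sequence, camera):
    """Render one frame per driving expression code (illustrative names only)."""
    frames = []
    for expr in expr_sequence:          # expression coefficients, one per frame
        raw = generator(expr, camera)   # expression-conditioned Gaussian render
        frames.append(sr_net(raw))      # up-sample to the final high-res image
    return torch.stack(frames)
```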

We searched for more literature related to text-driven 3D human generation on platforms such as Google Scholar and found that, since text-driven generation is still relatively new and 3D Gaussian Splatting has only just been proposed, there is very little work combining the two, which makes it difficult to build on. We therefore considered doing something that does not use text-driven generation, but rather monocular/multi-view reconstruction, where there is far more prior work to draw on.


Reflection and Update Plan