GAN-Based Sign Language Synthesis Model
May - Dec 2021 (@Undergraduate Research Program, Seoul National University)


Summary
A GAN-based framework that synthesizes sign language videos by transferring a sign language interpreter's motions onto the speaker while preserving the speaker's identity and facial expressions, so that deaf viewers can follow the signing without looking away from the speaker.

Problem & Motivation

Traditional sign language interpretation displays interpreters in small corner windows, creating accessibility barriers for deaf viewers who must shift attention between the speaker and interpreter. According to Korea’s National Institute of Korean Language, 53% of users cited “small screen size” as the primary barrier to understanding sign language interpretation.

Architecture

Overall architecture of proposed model

  • Pose Extraction: The OpenPose library extracts 113 keypoints per frame (54 facial, 50 hand, and 9 body landmarks), which serve as the skeleton representation
  • Generator: A U-Net with skip connections for detail preservation, taking a speaker image and a skeleton sequence as input
  • Discriminator: A PatchGAN that processes pairs of consecutive frames to encourage temporal consistency
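
As a rough illustration of the data handling implied by the list above, the sketch below splits a flat list of 113 keypoints into the three landmark groups and builds the consecutive frame pairs a pair-based discriminator would consume. It is a minimal sketch under stated assumptions, not the project's code: the group ordering (face, hands, body), the overlapping pairing scheme, and the names `GROUP_SIZES`, `split_keypoints`, and `consecutive_pairs` are all hypothetical. Only the group sizes come from the description.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]

# Group sizes come from the text above; the ordering (face, hands, body)
# is an assumption made for this sketch.
GROUP_SIZES = {"face": 54, "hands": 50, "body": 9}


def split_keypoints(points: Sequence[Point]) -> Dict[str, List[Point]]:
    """Split a flat list of 113 (x, y) keypoints into named landmark groups."""
    expected = sum(GROUP_SIZES.values())
    if len(points) != expected:
        raise ValueError(f"expected {expected} keypoints, got {len(points)}")
    groups: Dict[str, List[Point]] = {}
    start = 0
    for name, size in GROUP_SIZES.items():
        groups[name] = list(points[start:start + size])
        start += size
    return groups


def consecutive_pairs(frames: Sequence) -> List[tuple]:
    """Pair each frame with its successor: (f0, f1), (f1, f2), ...

    Overlapping pairs are an assumption; the text only says the
    discriminator sees consecutive frame pairs.
    """
    return list(zip(frames, frames[1:]))
```

For a clip of N frames this pairing yields N - 1 pairs, so every frame except the first and last appears in two pairs.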

Results

Qualitative results

In qualitative evaluation, the model produced better results than the GestureGAN baseline, preserving facial features more faithfully and rendering finger details more accurately. Training also converged faster, which is attributed to the additional temporal information from consecutive frames.

Resources