Ravi Shankar

I am currently a senior research engineer at Qualcomm on the Multimedia R&D team, developing on-device models for speech processing. Prior to joining Qualcomm, I was a PhD candidate in the Department of Electrical and Computer Engineering (ECE) at Johns Hopkins University, where I worked on expressive speech resynthesis.

At JHU, I primarily worked on generative modeling of prosody for expressive and emotional speech synthesis. My work lies at the intersection of speech signal processing, statistical modeling, and deep learning. I was advised by Dr. Archana Venkataraman (PI, NSA Lab @ JHU). I completed my undergraduate degree in Electronics and Electrical Engineering at IIT Guwahati, where I worked on keyword spotting for low-resource languages under the supervision of Dr. S.R.M. Prasanna. During my PhD, I also received a master's degree in Applied Math and Statistics at JHU. I have twice been the recipient of the MINDS fellowship for working on the frontiers of machine learning and data science, and I have also received the ECE graduate fellowship at JHU and the DAAD-WISE fellowship for a research internship in Germany.

Email  /  CV  /  Google Scholar  /  Twitter  /  GitHub

profile photo
Research

I am interested in unsupervised and supervised learning, graphical modeling, and signal processing. My research centers on understanding and manipulating prosodic information to alter emotion perception in human speech. Below are my papers, published in conferences and journals or currently under review:

Re-ENACT: Reinforcement Learning for Emotional Speech Generation using Actor-Critic Strategy
Ravi Shankar, Archana Venkataraman
arXiv Preprint: 2408.01892
Paper

In this work, we propose the first method to modify the prosodic features of a given speech signal using an actor-critic reinforcement learning strategy. Our approach uses a Bayesian framework to identify contiguous segments of importance that link segments of a given utterance to human perception of emotion.
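
To make the actor-critic idea concrete, here is a minimal PyTorch sketch of one policy-gradient update, where an actor scores candidate speech segments to modify and a critic estimates the state value. All names, dimensions, and the reward are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal actor-critic sketch (illustrative; not the paper's implementation).
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, feat_dim, n_segments):
        super().__init__()
        self.actor = nn.Linear(feat_dim, n_segments)   # segment-selection policy
        self.critic = nn.Linear(feat_dim, 1)           # state-value estimate

    def forward(self, state):
        policy = torch.distributions.Categorical(logits=self.actor(state))
        return policy, self.critic(state)

model = ActorCritic(feat_dim=64, n_segments=10)
state = torch.randn(1, 64)                  # pooled prosodic features (dummy)
policy, value = model(state)
action = policy.sample()                    # pick a segment to modify
reward = torch.tensor([1.0])                # e.g., an emotion-classifier score (dummy)
advantage = reward - value.squeeze(-1)      # one-step advantage estimate
actor_loss = -(policy.log_prob(action) * advantage.detach()).mean()
critic_loss = advantage.pow(2).mean()
(actor_loss + critic_loss).backward()       # one joint gradient step
```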

Manipulating Emotions: Generative Modeling of Prosody for Emotional Speech Synthesis
Ravi Shankar
Johns Hopkins University
Thesis

In this thesis, we devise prosody modeling techniques to inject emotion into neutral speech. We develop supervised algorithms for F0, energy, and rhythm modification, followed by unsupervised approaches that combine probabilistic graphical models with neural networks as density estimators.

A Closer Look at Wav2Vec2 Embeddings for On-device Single-channel Speech Enhancement
Ravi Shankar, Ke Tan, Buye Xu, Anurag Kumar
ICASSP 2024
Paper

In this work, we propose different ways of using Wav2Vec2 embeddings for single-channel speech enhancement. Our study shows that in constrained settings (low memory, poor SNR, causality), SSL embeddings fail to provide information that improves the enhancement task.
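
For reference, a minimal sketch of extracting Wav2Vec2 embeddings with the Hugging Face transformers library, which could then be fused with spectral features in an enhancement front-end. The model checkpoint and fusion step are assumptions for illustration; the paper's exact setup may differ.

```python
# Extract Wav2Vec2 features as auxiliary input for an enhancement model
# (illustrative sketch; not the paper's exact configuration).
import torch
from transformers import Wav2Vec2Model

w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
w2v.eval()

waveform = torch.randn(1, 16000)            # 1 second of 16 kHz audio (dummy)
with torch.no_grad():
    emb = w2v(waveform).last_hidden_state   # (1, frames, 768)

# A causal, low-memory system would consume `emb` frame by frame,
# e.g., concatenated with spectral features before a small enhancement net.
print(emb.shape)
```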

Adaptive Duration Modification of Speech using Masked Convolutional Networks and Open-Loop Time Warping
Ravi Shankar, Archana Venkataraman
ISCA SSW12, 2023
code / Paper

We propose the first method to adaptively modify the duration of a given speech signal. Our approach uses a Bayesian framework to define a latent attention map that links frames of the input and target utterances.
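
As a toy illustration of an attention map linking input and target frames, the sketch below builds a soft alignment from frame similarities. This is a generic formulation with dummy features, not the paper's Bayesian, masked-convolution model.

```python
# Sketch of a soft alignment map between input and target frames
# (illustrative; the paper uses a Bayesian latent attention formulation).
import torch
import torch.nn.functional as F

src = torch.randn(120, 80)   # input utterance: 120 frames of 80-dim features (dummy)
tgt = torch.randn(150, 80)   # target utterance: 150 frames (dummy)

sim = tgt @ src.t() / src.shape[1] ** 0.5   # (150, 120) scaled similarity
attn = F.softmax(sim, dim=1)                # each target frame attends over input frames

# The per-row argmax gives a hard duration mapping from target to input frames:
alignment = attn.argmax(dim=1)
```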

A Diffeomorphic Flow-based Variational Framework for Multi-speaker Emotion Conversion
Ravi Shankar, Hsi-Wei Hsieh, Nicolas Charon, Archana Venkataraman
IEEE/ACM Transactions on Audio, Speech and Language Processing, 2022
code / arXiv

In this work, we extensively study the cycle-consistency loss in the context of the Cycle-GAN model. We identify some of its major shortcomings and propose a new loss function to address these pitfalls.
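
For context, the standard cycle-consistency loss under study is the L1 reconstruction error after a round trip through both generators. The sketch below shows this baseline formulation (with dummy linear generators), not the new loss the paper proposes.

```python
# Standard Cycle-GAN cycle-consistency loss (the baseline being analyzed,
# not the paper's proposed replacement).
import torch

def cycle_consistency_loss(x, g_ab, g_ba):
    """L1 reconstruction error after a round trip A -> B -> A."""
    return torch.mean(torch.abs(g_ba(g_ab(x)) - x))

# Dummy generators standing in for trained networks:
g_ab = torch.nn.Linear(80, 80)
g_ba = torch.nn.Linear(80, 80)
x = torch.randn(4, 80)
loss = cycle_consistency_loss(x, g_ab, g_ba)
```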

A Comparative Study of Data Augmentation Techniques for Deep Learning Based Emotion Recognition
Ravi Shankar, Abdouh Harouna, Arjun Somayazulu, Archana Venkataraman
arXiv

We comprehensively study different types of data augmentation procedures in the context of speech emotion recognition. Our study spans multiple neural architectures and datasets for an unbiased comparison.
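
One representative augmentation from this family is additive noise at a controlled SNR; a minimal numpy sketch is below. The function and its parameters are illustrative assumptions, as the paper compares several augmentation procedures.

```python
# Additive noise at a target SNR, a common speech augmentation
# (illustrative example of one procedure among those compared).
import numpy as np

def add_noise_at_snr(speech, snr_db, rng):
    noise = rng.standard_normal(speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

clean = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))  # dummy signal
noisy = add_noise_at_snr(clean, snr_db=10, rng=np.random.default_rng(0))
```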

Non-parallel Emotion Conversion using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator
Ravi Shankar, Jacob Sager, Archana Venkataraman
Interspeech, 2020
code / arXiv

We improve the Cycle-GAN by adding a KL divergence penalty on the conditional density in addition to the cycle-consistency loss.
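
A KL penalty of this kind is easy to compute in closed form when the densities are diagonal Gaussians; the sketch below shows that standard formula. The Gaussian parameterization here is an assumption for illustration, not the paper's exact conditional model.

```python
# Closed-form KL divergence between two diagonal Gaussians, the kind of term
# that can regularize a conditional density alongside a cycle-consistency loss
# (illustrative; not the paper's exact parameterization).
import torch

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()

mu_q, logvar_q = torch.zeros(4, 16), torch.zeros(4, 16)
mu_p, logvar_p = torch.randn(4, 16), torch.zeros(4, 16)
penalty = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```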

Multi-Speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network
Ravi Shankar, Hsi-Wei Hsieh, Nicolas Charon, Archana Venkataraman
Interspeech, 2020
code / arXiv

We propose a chained encoder-decoder-predictor network that uses latent variable regularization to mediate conversion from one emotion to another in speech.

Weakly Supervised Syllable Segmentation by Vowel-Consonant Peak Classification
Ravi Shankar, Archana Venkataraman
Interspeech, 2019

We identify vowel and consonant peaks in the loudness profile of speech to carry out syllable segmentation.
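
The starting point for such an approach is peak picking on a loudness envelope; a minimal scipy sketch is below. The envelope and peak thresholds are dummy assumptions, and the paper's contribution (classifying peaks as vowel vs. consonant) is not shown.

```python
# Peak picking on a loudness envelope as a first step toward syllable
# segmentation (illustrative; the vowel/consonant classifier is not shown).
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
loudness = np.convolve(rng.random(500), np.ones(20) / 20, mode="same")  # dummy envelope

# Candidate syllable nuclei: prominent local maxima, spaced apart in time.
peaks, _ = find_peaks(loudness, prominence=0.05, distance=15)
```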

VESUS: A Crowd-Annotated Database to Study Emotion Production and Perception in Spoken English
Jacob Sager, Ravi Shankar, Jacob Reinhold, Archana Venkataraman
Interspeech, 2019 (Oral)
dataset

The VESUS corpus contains 250 phrases spoken by 10 different actors in 5 emotion categories. The objective is to study the factors underlying emotion perception in a lexically controlled setting.

A Multi-Speaker Emotion Morphing Model Using Highway Networks and Maximum Likelihood Objective
Ravi Shankar, Jacob Sager, Archana Venkataraman
Interspeech, 2019 (Oral)
code

We propose a perturbation model for F0 and energy prediction using a highway network. The model is trained to maximize the likelihood of the error in an EM framework.
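
For readers unfamiliar with highway networks, a single layer mixes a learned transform H(x) with the identity path through a gate T(x). The sketch below is the standard formulation with illustrative dimensions, not the paper's full perturbation model.

```python
# A single highway layer: y = T(x) * H(x) + (1 - T(x)) * x
# (standard formulation; dimensions are illustrative).
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)   # nonlinear transform
        self.T = nn.Linear(dim, dim)   # transform gate

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return t * torch.relu(self.H(x)) + (1.0 - t) * x

layer = HighwayLayer(64)
f0_features = torch.randn(8, 64)       # dummy F0/energy features
out = layer(f0_features)
```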

Automated Emotion Morphing in Speech Based on Diffeomorphic Curve Registration and Highway Networks
Ravi Shankar, Hsi-Wei Hsieh, Nicolas Charon, Archana Venkataraman
Interspeech, 2019
code

We use diffeomorphic registration to model the target emotion's F0 contour. It serves as a regularization technique for better approximation of the target F0 range.
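
As a toy intuition for curve registration, the sketch below resamples an F0 contour along a smooth, strictly monotonic time warp built from positive increments. This is only a one-dimensional caricature; the paper uses proper diffeomorphic curve registration.

```python
# Toy smooth, monotonic warp of an F0 contour (a caricature of registration;
# the paper's diffeomorphic framework is considerably more general).
import numpy as np

f0 = 120 + 30 * np.sin(np.linspace(0, np.pi, 100))   # dummy F0 contour

# Strictly positive increments guarantee an invertible (monotonic) warp:
increments = np.exp(0.3 * np.sin(np.linspace(0, 2 * np.pi, 100)))
warp = np.cumsum(increments)
warp = (warp - warp[0]) / (warp[-1] - warp[0]) * (len(f0) - 1)

f0_warped = np.interp(np.arange(len(f0)), warp, f0)  # resample along the warp
```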

Spoken Keyword Detection Using Joint DTW-CNN
Ravi Shankar, Vikram C.M., S.R.M Prasanna
Interspeech, 2018 (Oral)

In this paper, we propose a randomized DTW method coupled with a convolutional network to identify the presence or absence of a keyword.
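
The building block here is dynamic time warping between a keyword template and a search segment; a plain numpy version is sketched below. The randomization and the CNN classifier from the paper are not shown, and the feature dimensions are dummy assumptions.

```python
# Plain DTW between a keyword template and a search segment (illustrative;
# the paper randomizes DTW and feeds the result to a CNN).
import numpy as np

def dtw_cost(a, b):
    """Cumulative DTW cost between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
template, segment = rng.random((40, 13)), rng.random((60, 13))  # dummy MFCCs
score = dtw_cost(template, segment)
```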

Spoken term detection from continuous speech using ANN posteriors and image processing techniques
Ravi Shankar, Arpit Jain, Deepak K.T., Vikram C.M., S.R.M Prasanna
NCC, 2016

We propose a sequence of morphological operations to refine the DTW matrix for easier keyword spotting.
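
To illustrate the idea, the sketch below applies a morphological opening to a binarized frame-similarity matrix so that diagonal match streaks survive while isolated matches are removed. The threshold and structuring element are assumptions, not the paper's exact pipeline.

```python
# Morphological cleanup of a binarized similarity matrix (illustrative;
# the threshold and structuring element are assumptions).
import numpy as np
from scipy.ndimage import binary_opening

rng = np.random.default_rng(0)
dtw_matrix = rng.random((100, 100))          # dummy frame-similarity matrix
mask = dtw_matrix > 0.8                      # keep only strong matches

# Opening with a diagonal structuring element removes isolated matches while
# preserving diagonal streaks that correspond to keyword occurrences.
diag = np.eye(3, dtype=bool)
cleaned = binary_opening(mask, structure=diag)
```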

Service

Teaching Assistant, Probabilistic Machine Learning (EN.520.651), Fall 2021, 2022

Reviewer, ICML 2024

Reviewer, NeurIPS 2022, 2023

Reviewer, UAI 2023

Reviewer, ICLR 2022, 2024

Reviewer, Interspeech 2021, 2022, 2023, 2024

Reviewer, CISS 2021


This website's template is adapted from here: source code.