"The truly unique feature of our language is not its ability to transmit information about men and lions. Rather, it’s the ability to transmit information about things that do not exist at all. As far as we know, only Sapiens can talk about entire kinds of entities that they have never seen, touched or smelled." - Yuval Noah Harari
Hey there, I am an MSc student at the University of Alberta working on Reinforcement Learning and Artificial Intelligence. I am currently co-supervised by Adam White and Marlos Machado, and affiliated with the RLAI Lab and the Alberta Machine Intelligence Institute (Amii). My long-term research goal is to define and understand the computational principles behind intelligence.
My current MSc research is centered on understanding how an agent in a complex environment perceives a stream of observations. In a real-world setting, an agent’s immediate observation is not very informative (non-Markovian), and the true state of the world is always vastly larger than the agent. In my research, I’m exploring how an agent can build short- and long-term memories from a stream of observations, which can then be used to generate the agent’s perception at a given time (an approximate Markov state). The agent then uses its immediate perception for prediction or control.
Previously, I worked with IBM Cloud as an ML Engineer and collaborated with IBM Research on various research projects in representation learning and deep learning. I’m also an avid coder, and have experience deploying several machine learning algorithms at scale at IBM and Kone.
Contact: spramanik [at] ualberta [dot] ca, email [at] subho [dot] in
MSc in Computer Science (thesis-based, fully funded), 2021 - 2023
University of Alberta
B.Tech in Computer Science and Engineering, 2015 - 2019
Vellore Institute of Technology
In this paper, we propose a multi-task learning-based framework that utilizes a combination of self-supervised and supervised pre-training tasks to learn a generic document representation. We design the network architecture and the pre-training tasks to incorporate the multi-modal document information across text, layout, and image dimensions and allow the network to work with multi-page documents. We showcase the applicability of our pre-training framework on a variety of different real-world document tasks such as document classification, document information extraction, and document retrieval. We conduct exhaustive experiments to compare performance against different ablations of our framework and state-of-the-art baselines. We discuss the current limitations and next steps for our work.
The Transformer is a widely used neural network architecture, especially for language understanding. We introduce an extended and unified architecture that can be used for tasks involving a variety of modalities like image, text, videos, etc. We propose a spatio-temporal cache mechanism that enables learning the spatial dimension of the input in addition to the hidden states corresponding to the temporal input sequence. The proposed architecture further enables a single model to support tasks with multiple input modalities as well as asynchronous multi-task learning, thus we refer to it as OmniNet. For example, a single instance of OmniNet can concurrently learn to perform the tasks of part-of-speech tagging, image captioning, visual question answering and video activity recognition. We demonstrate that training these four tasks together results in a model about three times smaller while retaining the performance in comparison to training them individually. We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering. This illustrates the generalization capacity of the self-attention mechanism on the spatio-temporal cache present in OmniNet.
We perform text normalization, i.e. the transformation of words from the written to the spoken form, using a memory-augmented neural network. With the addition of a dynamic memory access and storage mechanism, we present a neural architecture that will serve as a language-agnostic text normalization system while avoiding the kind of unacceptable errors made by the LSTM-based recurrent neural networks. By successfully reducing the frequency of such mistakes, we show that this novel architecture is indeed a better alternative. Our proposed system requires significantly less data, training time and compute resources. Additionally, we perform data up-sampling, circumventing the data sparsity problem in some semiotic classes, to show that sufficient examples in any particular class can improve the performance of our text normalization system. Although a few occurrences of these errors still remain in certain semiotic classes, we demonstrate that memory-augmented networks with meta-learning capabilities can open many doors to a superior text normalization system.
Previously:
Primarily assigned as an AI/ML Developer for IBM App Connect:
Actively collaborating with IBM Research:
Intern at the IBM Watson TRIRIGA Building Insights team.
Responsibilities include:
Selected from among hundreds of competitors in the Kone IBM hackathon for a two-month sponsorship to Kone in Finland as a visiting researcher.
Mentors: Dr. Olli Mali, Jani Hautakorpi