<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.adityachinchure.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.adityachinchure.com/" rel="alternate" type="text/html" hreflang="en-US" /><updated>2025-12-01T12:34:53-08:00</updated><id>https://www.adityachinchure.com/feed.xml</id><title type="html">Aditya Chinchure</title><subtitle>Graduate Student at University of British Columbia</subtitle><author><name>Aditya Chinchure</name><email>aditya10@cs.ubc.ca</email></author><entry><title type="html">TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models</title><link href="https://www.adityachinchure.com/tibet/" rel="alternate" type="text/html" title="TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models" /><published>2024-07-09T13:29:20-07:00</published><updated>2024-07-09T13:29:20-07:00</updated><id>https://www.adityachinchure.com/tibet</id><content type="html" xml:base="https://www.adityachinchure.com/tibet/"><![CDATA[<p>Text-to-Image (TTI) generative models have shown great progress in the past few years in terms of their ability to generate complex and high-quality imagery. At the same time, these models have been shown to suffer from harmful biases, including exaggerated societal biases (e.g., gender, ethnicity), as well as incidental correlations that limit such models’ ability to generate more diverse imagery. In this paper, we propose a general approach to study and quantify a broad spectrum of biases, for any TTI model and for any prompt, using counterfactual reasoning. Unlike other works that evaluate generated images on a predefined set of bias axes, our approach automatically identifies potential biases that might be relevant to the given prompt, and measures those biases. 
In addition, our paper extends quantitative scores with post-hoc explanations in terms of semantic concepts in the images generated. We show that our method is uniquely capable of explaining complex multi-dimensional biases through semantic concepts, as well as the intersectionality between different biases for any given prompt. We perform extensive user studies to illustrate that the results of our method and analysis are consistent with human judgements.<br />
<a href="https://tibet-ai.github.io">Website and Data</a></p>]]></content><author><name>Aditya Chinchure</name><email>aditya10@cs.ubc.ca</email></author><category term="publication" /><summary type="html"><![CDATA[We propose a general approach to study and quantify a broad spectrum of biases, for any TTI model and for any prompt, using counterfactual reasoning. Unlike other works that evaluate generated images on a predefined set of bias axes, our approach automatically identifies potential biases that might be relevant to the given prompt, and measures those biases.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.adityachinchure.com/assets/posts/tibet-bias/tibet.png" /><media:content medium="image" url="https://www.adityachinchure.com/assets/posts/tibet-bias/tibet.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models</title><link href="https://www.adityachinchure.com/multicultural/" rel="alternate" type="text/html" title="From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models" /><published>2024-06-28T13:29:20-07:00</published><updated>2024-06-28T13:29:20-07:00</updated><id>https://www.adityachinchure.com/multicultural</id><content type="html" xml:base="https://www.adityachinchure.com/multicultural/"><![CDATA[<p>Despite recent advancements in vision-language models, their performance remains suboptimal on images from non-western cultures due to underrepresentation in training datasets. Various benchmarks have been proposed to test models’ cultural inclusivity, but they have limited coverage of cultures and do not adequately assess cultural diversity across universal as well as culture-specific local concepts. 
To address these limitations, we introduce the GlobalRG benchmark, comprising two challenging tasks: retrieval across universals and cultural visual grounding. The former task entails retrieving culturally diverse images for universal concepts from 50 countries, while the latter aims at grounding culture-specific concepts within images from 15 countries. Our evaluation across a wide range of models reveals that the performance varies significantly across cultures – underscoring the necessity for enhancing multicultural understanding in vision-language models.<br />
<a href="https://arxiv.org/abs/2407.00263">ArXiv</a></p>]]></content><author><name>Aditya Chinchure</name><email>aditya10@cs.ubc.ca</email></author><category term="publication" /><summary type="html"><![CDATA[We introduce the GlobalRG benchmark, comprising two challenging tasks, retrieval across universals and cultural visual grounding, to test VL models' cultural inclusivity.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.adityachinchure.com/assets/posts/multicultural-vl-bench/multicultural.png" /><media:content medium="image" url="https://www.adityachinchure.com/assets/posts/multicultural-vl-bench/multicultural.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">DE-TensoRF: Data-efficient and fast NeRFs</title><link href="https://www.adityachinchure.com/de-tensorf/" rel="alternate" type="text/html" title="DE-TensoRF: Data-efficient and fast NeRFs" /><published>2023-04-28T19:29:20-07:00</published><updated>2023-04-28T19:29:20-07:00</updated><id>https://www.adityachinchure.com/de-tensorf</id><content type="html" xml:base="https://www.adityachinchure.com/de-tensorf/"><![CDATA[<p>Developed DE-TensoRF, a model that can render 3D objects with as few as 3 images, in under 15 minutes on a single GPU. We achieved the highest grade in our class, and the project led to collaborations with Dr. Helge Rhodin’s research group.</p>]]></content><author><name>Aditya Chinchure</name><email>aditya10@cs.ubc.ca</email></author><category term="academic" /><summary type="html"><![CDATA[Developed DE-TensoRF, a model that can render 3D objects with as few as 3 images, in under 15 minutes on a single GPU. We achieved the highest grade in our class, and the project led to collaborations with Dr. 
Helge Rhodin’s research group.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.adityachinchure.com/assets/posts/de-tensorf/detensorf.gif" /><media:content medium="image" url="https://www.adityachinchure.com/assets/posts/de-tensorf/detensorf.gif" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">VisualCOMET+: Visual Commonsense Generation &amp;amp; its incorporation into a Multimodal Topic Modeling algorithm</title><link href="https://www.adityachinchure.com/viscomet+/" rel="alternate" type="text/html" title="VisualCOMET+: Visual Commonsense Generation &amp;amp; its incorporation into a Multimodal Topic Modeling algorithm" /><published>2022-12-09T18:29:20-08:00</published><updated>2022-12-09T18:29:20-08:00</updated><id>https://www.adityachinchure.com/viscomet+</id><content type="html" xml:base="https://www.adityachinchure.com/viscomet+/"><![CDATA[<p>The task of commonsense knowledge generation is largely limited to the language domain, with models such as COMET (for explicit knowledge) and GPT-3 (for implicit knowledge). Moreover, VisualCOMET, a commonsense generation model that utilizes the visual context, is limited to three people-centric relations. Since commonsense generation on entire scenes, or parts of a scene, can be helpful in several downstream multimodal tasks, including VQA and topic modeling, we propose a general-purpose visual commonsense generation model, VisualCOMET+, by extending VisualCOMET with four diverse inference relations. Using the clue-rationale pairs from a visual abductive reasoning dataset, we train our commonsense generation model by creating groundtruth structured commonsense triplets. Then, we show that we can get coherent and more diverse topics by incorporating generated commonsense inferences and visual features into a novel multimodal topic modeling algorithm, Multimodal CTM. <br />
<a href="https://drive.google.com/file/d/1_HxrSJzDZKj1uDm7irCx3_9KnXlT7My-/view?usp=sharing">Report</a></p>]]></content><author><name>Aditya Chinchure</name><email>aditya10@cs.ubc.ca</email></author><category term="academic" /><summary type="html"><![CDATA[Developed an extension to VisualCOMET to generate general-purpose commonsense knowledge from images. Showed improvements on coherence and diversity scores of a novel topic modelling algorithm using the generated knowledge]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.adityachinchure.com/assets/posts/commonsense-gen/viscomet+.png" /><media:content medium="image" url="https://www.adityachinchure.com/assets/posts/commonsense-gen/viscomet+.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge</title><link href="https://www.adityachinchure.com/vlc-bert/" rel="alternate" type="text/html" title="VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge" /><published>2022-10-24T10:29:20-07:00</published><updated>2022-10-24T10:29:20-07:00</updated><id>https://www.adityachinchure.com/vlc-bert</id><content type="html" xml:base="https://www.adityachinchure.com/vlc-bert/"><![CDATA[<p>We present a new Vision-Language-Commonsense transformer model, VLC-BERT, that incorporates contextualized knowledge using Commonsense Transformer (COMET) to solve Visual Question Answering (VQA) tasks that require commonsense reasoning. VLC-BERT outperforms existing models that utilize static knowledge bases, and the article provides a detailed analysis of which questions benefit from the contextualized commonsense knowledge from COMET.<br />
<a href="https://openaccess.thecvf.com/content/WACV2023/papers/Ravi_VLC-BERT_Visual_Question_Answering_With_Contextualized_Commonsense_Knowledge_WACV_2023_paper.pdf">Paper</a> | <a href="https://github.com/aditya10/VLC-BERT">Github</a> | <a href="https://arxiv.org/abs/2210.13626">ArXiv</a></p>]]></content><author><name>Aditya Chinchure</name><email>aditya10@cs.ubc.ca</email></author><category term="publication" /><summary type="html"><![CDATA[We present a new Vision-Language-Commonsense transformer model, VLC-BERT, that incorporates contextualized knowledge using Commonsense Transformer (COMET) to solve Visual Question Answering (VQA) tasks that require commonsense reasoning.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.adityachinchure.com/assets/posts/vlc-bert/vlc-bert.png" /><media:content medium="image" url="https://www.adityachinchure.com/assets/posts/vlc-bert/vlc-bert.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Universal Machine Learning API</title><link href="https://www.adityachinchure.com/universal-ml-api/" rel="alternate" type="text/html" title="Universal Machine Learning API" /><published>2022-04-28T19:29:20-07:00</published><updated>2022-04-28T19:29:20-07:00</updated><id>https://www.adityachinchure.com/universal-ml-api</id><content type="html" xml:base="https://www.adityachinchure.com/universal-ml-api/"><![CDATA[<p><strong>Universal Machine Learning API</strong> <br />
A powerful Python API template, built on Flask, for plug-and-play use with machine learning models. <br />
<em>Technologies used: Python, with the Flask API package</em> <br />
<a href="https://medium.com/technonerds/a-production-grade-machine-learning-api-using-flask-gunicorn-nginx-and-docker-part-1-49927238befb">Blog</a> | <a href="https://github.com/aditya10/flask-ml-api">Github</a></p>]]></content><author><name>Aditya Chinchure</name><email>aditya10@cs.ubc.ca</email></author><category term="other" /><summary type="html"><![CDATA[A powerful Python API template, built on Flask, for plug-and-play use with machine learning models.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.adityachinchure.com/assets/posts/vl-bert-graph/vl-bert-graph.png" /><media:content medium="image" url="https://www.adityachinchure.com/assets/posts/vl-bert-graph/vl-bert-graph.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">VL-BERT-Graph: Graph-enhanced Transformers for Referring Expressions Comprehension</title><link href="https://www.adityachinchure.com/vl-bert-graph/" rel="alternate" type="text/html" title="VL-BERT-Graph: Graph-enhanced Transformers for Referring Expressions Comprehension" /><published>2022-04-28T19:29:20-07:00</published><updated>2022-04-28T19:29:20-07:00</updated><id>https://www.adityachinchure.com/vl-bert-graph</id><content type="html" xml:base="https://www.adityachinchure.com/vl-bert-graph/"><![CDATA[<p>We explore a simple method to incorporate inter-token relationships in a Transformer before performing any training, using graphs with edge features. In VL-BERT-Graph, we generate a fully-connected graph of input tokens where the edges represent similarity between the tokens, obtained using GloVE and CLIP. We then use a message-passing GNN to incorporate these features into the input tokens or the output encoding of the model, and train the Transformer with edge-feature attention masks. <br />
<a href="https://lrjconan.github.io/UBC-EECE571F-DL-Structures/assets/sample_reports_2021_W2/report_05.pdf">Report</a></p>]]></content><author><name>Aditya Chinchure</name><email>aditya10@cs.ubc.ca</email></author><category term="academic" /><summary type="html"><![CDATA[Incorporated Graph Neural Networks in a visual-linguistic Transformer]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.adityachinchure.com/assets/posts/vl-bert-graph/vl-bert-graph.png" /><media:content medium="image" url="https://www.adityachinchure.com/assets/posts/vl-bert-graph/vl-bert-graph.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Investigating extensions to VLC-BERT and comparing it with GPT-3</title><link href="https://www.adityachinchure.com/vlc-bert-ext/" rel="alternate" type="text/html" title="Investigating extensions to VLC-BERT and comparing it with GPT-3" /><published>2022-04-20T19:29:20-07:00</published><updated>2022-04-20T19:29:20-07:00</updated><id>https://www.adityachinchure.com/vlc-bert-ext</id><content type="html" xml:base="https://www.adityachinchure.com/vlc-bert-ext/"><![CDATA[<p>Visual Question Answering with commonsense reasoning is a challenging task that requires models to understand the image, the question, and contextualized commonsense knowledge to assist with the reasoning required to arrive at an answer. In our work, we propose extensions to VLC-BERT, aimed at solving two drawbacks of the model by identifying potential words in the input sequence that may answer the question using a Pointer Generator, and incorporating additional image information in the form of object tags from an object detection model. Our evaluation shows that the Pointer Generator and object detection models help achieve higher scores on the OK-VQA dataset. Furthermore, we generate answers using GPT-3 and incorporate them into VLC-BERT. 
Our error analysis on GPT-3 and VLC-BERT models highlights that GPT-3 contains valuable implicit commonsense and factual knowledge that is beneficial to our model. <br />
<a href="https://drive.google.com/file/d/1eH1TtFI5QLS78mf7wWRcrUyZO6T3_N0a/view?usp=sharing">Report</a></p>]]></content><author><name>Aditya Chinchure</name><email>aditya10@cs.ubc.ca</email></author><category term="academic" /><summary type="html"><![CDATA[This project extends VLC-BERT with pointer generator networks and object detection models. Furthermore, we compare the performance of VLC-BERT with GPT-3 on the OK-VQA dataset.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.adityachinchure.com/assets/posts/vlc-bert-extended/vlc-bert-extended.png" /><media:content medium="image" url="https://www.adityachinchure.com/assets/posts/vlc-bert-extended/vlc-bert-extended.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Learning faster Genetic Algorithms with dynamic mutation power</title><link href="https://www.adityachinchure.com/growing-agents/" rel="alternate" type="text/html" title="Learning faster Genetic Algorithms with dynamic mutation power" /><published>2021-12-03T18:29:20-08:00</published><updated>2021-12-03T18:29:20-08:00</updated><id>https://www.adityachinchure.com/growing-agents</id><content type="html" xml:base="https://www.adityachinchure.com/growing-agents/"><![CDATA[<p>Policy Gradient (PG) methods and Genetic Algorithms (GA) are used to train Reinforcement Learning agents to perform a particular task in an environment by maximizing the received reward. In the context of this assignment, both techniques aim to approximate a policy function that, given a state, produces a policy to pick the best action to maximize reward. Here, the policy function used is a deep neural network. In this project, I implement a PG method, REINFORCE, and a simple GA method to solve the Lunar Lander (LunarLander-v2) environment in OpenAI Gym. 
I propose two modifications to the GA method: an improved fitness function with which the GA can solve the task in about 50 generations, and a novel dynamic mutation power technique that helps the model solve the task in 30 generations. <br />
<a href="https://youtu.be/BmMubRYbuQM">Video</a> | <a href="https://drive.google.com/file/d/1bsnn7sfDrHZSZhAxYsIKkBmPxyytd4Xk/view?usp=sharing">Report</a></p>]]></content><author><name>Aditya Chinchure</name><email>aditya10@cs.ubc.ca</email></author><category term="academic" /><summary type="html"><![CDATA[This project introduces a dynamic mutation power modification to the GA method, to solve the Lunar Lander environment on OpenAI Gym in 30 generations.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.adityachinchure.com/assets/posts/growing-agents/ga.png" /><media:content medium="image" url="https://www.adityachinchure.com/assets/posts/growing-agents/ga.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">A Summary of Recent Text Summarization Techniques</title><link href="https://www.adityachinchure.com/text-summarization/" rel="alternate" type="text/html" title="A Summary of Recent Text Summarization Techniques" /><published>2020-12-03T18:29:20-08:00</published><updated>2020-12-03T18:29:20-08:00</updated><id>https://www.adityachinchure.com/text-summarization</id><content type="html" xml:base="https://www.adityachinchure.com/text-summarization/"><![CDATA[<p>In this project paper, we surveyed recent text summarization techniques, evaluating existing extractive and abstractive models. We studied the metrics and datasets used to evaluate the latest models and examined upcoming abstractive techniques. Finally, we highlighted future pathways for text summarization and suggested areas for improvement. <br />
<a href="https://drive.google.com/file/d/1ayX-OSNrvvJsNsnVA_16JzFIVmI0NnoB/view">Report</a></p>]]></content><author><name>Aditya Chinchure</name><email>aditya10@cs.ubc.ca</email></author><category term="academic" /><summary type="html"><![CDATA[In this project paper, we surveyed recent text summarization techniques, evaluating existing extractive and abstractive models. We studied the metrics and datasets used to evaluate the latest models and examined upcoming abstractive techniques. Finally, we highlighted future pathways for text summarization and suggested areas for improvement.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.adityachinchure.com/assets/posts/text-summarization/text-summarization.png" /><media:content medium="image" url="https://www.adityachinchure.com/assets/posts/text-summarization/text-summarization.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>