Visual Question Answering

Date: April 2023

Location: Arlington

This project is a Visual Question Answering (VQA) system implemented using a combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models. The goal of the system is to analyze images and answer questions related to the content of those images.

Here is a breakdown of the major components and steps in the project:

Data Loading and Preprocessing: The code first checks if preprocessed data (question lists, image lists, answer lists, and images) exists. If not, it reads and processes a JSON dataset containing questions, image IDs, and multiple-choice answers.
Images are loaded, and attention maps are generated for each question-image pair.
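
A minimal sketch of this step, assuming a pickle cache file, a flat list of JSON annotation records, and JPEG images named after their image IDs; the file names, field names, and 224x224 input size are illustrative assumptions, and the attention-map generation is omitted since its method is not described here:

```python
import json
import os
import pickle

import numpy as np
from PIL import Image

CACHE_PATH = "preprocessed_data.pkl"   # hypothetical cache file name

def load_or_build_dataset(json_path, image_dir):
    """Reuse cached preprocessed data if present; otherwise parse the raw annotations."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)

    # Parse the JSON dataset: one record per question, with an image ID and answer.
    with open(json_path) as f:
        annotations = json.load(f)
    questions = [a["question"] for a in annotations]
    image_ids = [a["image_id"] for a in annotations]
    answers = [a["multiple_choice_answer"] for a in annotations]

    # Load and resize each referenced image.
    images = [
        np.array(Image.open(os.path.join(image_dir, f"{img_id}.jpg")).resize((224, 224)))
        for img_id in image_ids
    ]

    data = (questions, image_ids, answers, images)
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(data, f)
    return data
```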

Word Embedding and Tokenization: The project tokenizes the textual questions and converts them into integer sequences using the Keras Tokenizer.
The vocabulary size is determined, and padded word sequences are generated for training; an embedding layer later maps these tokens to word vectors.
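
Continuing the sketch, the tokenization step with the Keras Tokenizer could look roughly like this (the maximum question length and the variable names are illustrative):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_QUESTION_LEN = 25  # assumed maximum question length

tokenizer = Tokenizer()
tokenizer.fit_on_texts(questions)                    # question strings from the previous step
vocab_size = len(tokenizer.word_index) + 1           # +1 for the reserved padding index
sequences = tokenizer.texts_to_sequences(questions)  # words -> integer IDs
question_data = pad_sequences(sequences, maxlen=MAX_QUESTION_LEN, padding="post")
```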

Target Labels and Categorical Encoding: Answers are assigned unique labels and converted into categorical format using one-hot encoding.
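
A sketch of that answer encoding using Keras's to_categorical (the answer_to_label mapping is an assumed helper name, not necessarily the project's):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# Assign each distinct answer a unique integer label, then one-hot encode.
answer_vocab = sorted(set(answers))
answer_to_label = {ans: idx for idx, ans in enumerate(answer_vocab)}
labels = np.array([answer_to_label[a] for a in answers])
y = to_categorical(labels, num_classes=len(answer_vocab))
```
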
Dataset Normalization and Shuffling: Images are normalized and shuffled for training.
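
Normalization and shuffling can be as simple as scaling pixel values and applying a single permutation so that images, questions, and labels stay aligned (variable names carried over from the sketches above):

```python
import numpy as np

images = np.asarray(images, dtype="float32") / 255.0   # scale pixels to [0, 1]

# Shuffle all three arrays with the same permutation to keep the pairs aligned.
perm = np.random.permutation(len(images))
images, question_data, y = images[perm], question_data[perm], y[perm]
```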

Model Architecture: A CNN model is defined to encode image features.
An LSTM model is defined to encode question features.
Both models are then combined using a fusion layer.
The final model is compiled using categorical cross-entropy loss and the Adam optimizer.
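
Expressed in the Keras functional API, the architecture described above might look roughly like this; the layer counts, filter sizes, and unit sizes are assumptions rather than the project's exact configuration:

```python
from tensorflow.keras.layers import (Concatenate, Conv2D, Dense, Embedding,
                                     Flatten, Input, LSTM, MaxPooling2D)
from tensorflow.keras.models import Model

# Image encoder: a small CNN over the 224x224 RGB input.
image_input = Input(shape=(224, 224, 3))
x = Conv2D(32, (3, 3), activation="relu")(image_input)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation="relu")(x)
x = MaxPooling2D((2, 2))(x)
image_features = Dense(256, activation="relu")(Flatten()(x))

# Question encoder: word embedding followed by an LSTM.
question_input = Input(shape=(MAX_QUESTION_LEN,))
q = Embedding(input_dim=vocab_size, output_dim=128)(question_input)
question_features = LSTM(256)(q)

# Fusion layer: merge the two modalities and classify over the answer vocabulary.
fused = Concatenate()([image_features, question_features])
fused = Dense(512, activation="relu")(fused)
output = Dense(len(answer_vocab), activation="softmax")(fused)

model = Model(inputs=[image_input, question_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```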

Training the Model: If pre-trained weights are not found, the model is trained on the provided dataset for 50 epochs, with model checkpoints saved.
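
A sketch of that training logic, assuming a hypothetical checkpoint file name and a batch size of 32:

```python
import os
from tensorflow.keras.callbacks import ModelCheckpoint

WEIGHTS_PATH = "vqa_weights.h5"   # hypothetical checkpoint file name

if os.path.exists(WEIGHTS_PATH):
    model.load_weights(WEIGHTS_PATH)          # reuse pre-trained weights if available
else:
    checkpoint = ModelCheckpoint(WEIGHTS_PATH, monitor="loss", save_best_only=True)
    model.fit([images, question_data], y,
              epochs=50, batch_size=32,
              callbacks=[checkpoint])
```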

Testing and Evaluation: The trained model is tested on a separate set of images, questions, and answers.
Accuracy is calculated and printed.
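
Evaluation on a held-out test set, prepared with the same preprocessing as the training data (the test_* names are illustrative), could look like this:

```python
# model.evaluate returns the loss followed by the compiled metrics (here, accuracy).
loss, accuracy = model.evaluate([test_images, test_question_data], test_y)
print(f"Test accuracy: {accuracy:.4f}")
```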

Prediction and Visualization: The model is used to predict answers for test questions, and attention maps are generated for a subset of images.
Predicted and actual answers, along with attention maps, are displayed for validation.
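
A rough sketch of the prediction and display loop; the attention-map overlay itself is omitted here because its computation is not detailed above:

```python
import matplotlib.pyplot as plt
import numpy as np

predictions = model.predict([test_images, test_question_data])
label_to_answer = {idx: ans for ans, idx in answer_to_label.items()}

# Show a few test images with predicted vs. actual answers.
for i in range(5):
    predicted = label_to_answer[int(np.argmax(predictions[i]))]
    actual = label_to_answer[int(np.argmax(test_y[i]))]
    plt.imshow(test_images[i])
    plt.title(f"Predicted: {predicted} | Actual: {actual}")
    plt.axis("off")
    plt.show()
```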

Overall, this project demonstrates the implementation of a VQA system that combines both visual and textual information for answering questions related to images. The use of CNNs for image processing and LSTMs for textual data showcases a fusion approach to tackle the multi-modal nature of the task. The attention maps provide insights into the areas of the image that the model focuses on when generating predictions.
