Introduction

GazooResearch helps medical researchers build and curate large medical databases. It has tools for every step of the research process: creating and using standardized data dictionaries, extracting clinical information, reviewing patient data for accuracy, and analyzing data. It also provides data for the development of artificial intelligence algorithms.

Figure 1. GazooResearch Overview

Installation

Requirements

  1. Read/write and internet privileges
  2. MacOS 13/14/15 or Windows 10/11

To install on MacOS, go to the GazooResearch website (https://gazooresearch.com) and click the MacOS download button to get the installable .dmg file. To install on Windows, download the application from the Microsoft Store.

Figure 1. Website Installation Page

Creating A Database

A Gazoo database is a single file with the suffix ".gazoo", much like an Excel workbook ending in ".xlsx". You can save this encrypted database file on your local computer, or in your organization's encrypted OneDrive.

To create a new database follow these steps:

  1. Open GazooResearch.app
  2. Click "New Database", then choose the location and name of the database.
  3. Set and confirm the database's password.
  4. Click the "Create Database" button.

You should see a new file named: "<database name>.gazoo".

Figure 1. Create A Database

Uploading Documents

You can upload documents via the Upload Documents tab.

Click on "Choose Files".

You can upload multiple documents at once. The documents must be either pdf, jpeg, or png.

During the uploading process the documents are encrypted and saved in the .gazoo file.

You can view the uploaded documents in the "Review Documents" tab.

Figure 2. Uploading Documents
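
Before uploading a large folder, it can help to check which files are in an accepted format. The following is a minimal sketch, not part of GazooResearch: the function name `split_by_uploadable` and the file names are hypothetical, and the suffix set only contains the three formats named above (whether the ".jpg" spelling is accepted is not stated here).

```python
from pathlib import Path

# Formats the text above says can be uploaded directly.
UPLOADABLE_SUFFIXES = {".pdf", ".jpeg", ".png"}


def split_by_uploadable(paths):
    """Split file paths into (uploadable, needs_conversion) by suffix."""
    uploadable, needs_conversion = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in UPLOADABLE_SUFFIXES:
            uploadable.append(path.name)
        else:
            needs_conversion.append(path.name)
    return uploadable, needs_conversion


# Hypothetical file names, for illustration only.
ok, convert = split_by_uploadable(["ct-report.pdf", "scan.PNG", "notes.docx"])
print(ok)       # files that can be selected with "Choose Files"
print(convert)  # files that need converting to pdf first
```

Files in the second list can be converted with the tool described in the Format Converter chapter.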

Data Structure

GazooResearch's data structure has 4 layers: 1) Diagnosis, 2) Tag, 3) Field, and 4) Value.

Let's illustrate why 4 layers are needed with an example. Suppose we want to study the correlation between blood sugar and kidney function in patients with diabetes. We want to collect serum glucose levels and serum creatinine levels (Figure 1).

Figure 1. Lab Result

At first glance, you might think you only need to store one number for glucose and another for creatinine: 79 for glucose, and 0.5 for creatinine. Unfortunately, this alone is not enough information for your analysis. In reality, glucose has 3 fields: 1) Date, 2) Value, and 3) Units. Creatinine has the same 3 fields.

Finally, since we are studying diabetes, we also have a base layer, which is the ICD-10 code for diabetes.

Figure 2. GazooResearch's Data Structure Example

When defining data structures in GazooResearch, we allow for the 4 layers as shown in the video below.

Figure 3. Data Dictionary

Data Analysis

Within the data analysis Python modules you will see this same 4-layer data structure.

{'icd10':str, 'tag': str, 'field':str, 'value': [str, str, ...]},
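
To make the four layers concrete, here is a small sketch of how the glucose result from the lab example could be written in this shape. It is an illustration only: the type name `GazooRecord`, the ICD-10 code "e11", the date, and the units are assumptions, and only the glucose value 79 comes from the example above.

```python
from typing import List, TypedDict


class GazooRecord(TypedDict):
    """One row of the four-layer structure (hypothetical type name)."""
    icd10: str        # layer 1: diagnosis
    tag: str          # layer 2: tag
    field: str        # layer 3: field
    value: List[str]  # layer 4: value(s), stored as strings


# Assumed ICD-10 code, date, and units, for illustration only.
glucose: List[GazooRecord] = [
    {"icd10": "e11", "tag": "glucose", "field": "date", "value": ["2024-01-15"]},
    {"icd10": "e11", "tag": "glucose", "field": "value", "value": ["79"]},
    {"icd10": "e11", "tag": "glucose", "field": "units", "value": ["mg/dL"]},
]

# Values are strings, so convert them before doing arithmetic.
glucose_level = float(glucose[1]["value"][0])
print(glucose_level)
```

One tag therefore expands into several rows, one per field, all sharing the same diagnosis and tag.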

Data Dictionary

Your database has a basic "Patient Level Information" data dictionary already loaded. This has information like patient "id", "last-name", "first-name", "dob", "sex", "document", "death".

Figure 1. Patient Level Information

The information that is relevant to the patient.

Creating a Data Dictionary

You can either download a pre-made data dictionary from the Data Dictionary Hub, or create a custom data dictionary to fit your exact needs.

Data Dictionary Hub

GazooResearch will host a variety of pre-made data dictionaries.

Custom Data Dictionary

In this example, we are going to create a data dictionary to study prostate cancer. First, search for and select the ICD-10 diagnosis for prostate cancer (c61).

We will want to extract PSA values, so let's create a tag for PSA:

  1. Create a tag named "psa". For each tag you can add a description, so that other people will understand what the acronym 'psa' means.
  2. Add the fields for the tag. In the case of PSA the fields are 1) Date, 2) Value, and 3) Units. First add the date (timepoint) field, then the value field.
  3. Lastly, add the units field. Different hospitals may report PSA using different units (ng/ml, g/dL), so units is a categorical variable that can take one of two values. Use the "List Field" option, name the list "units", and give it the two possible values: ng/ml or g/dL.
  4. Save the data dictionary.

Now when you are annotating documents, you'll see the "c61:psa" tag, along with the three fields we created: 1) Date, 2) Value, and 3) Units.

Figure 3. Data Dictionary

Multi-Lesion Data

In some datasets a patient may have more than one instance of the cancer, and the researcher may want to analyze the data at the patient level as well as the lesion level. GazooResearch supports this type of data aggregation and analysis.

Examples:

  1. Skin Cancer: One patient can have multiple independent skin cancers.
  2. Lung metastases: One patient can have multiple lung metastases, each being treated differently (e.g. SBRT, Wedge, RFA, Cryotherapy...).
  3. Prostate Cancer: One prostate may have multiple independent prostate cancer foci, each with a different Gleason score.

Example: Multi-Target

In this example we are monitoring the size of lung metastases through time. One patient can have multiple lung metastases.

Multi Lesion Target-Id

Data Dictionary

We'll create a data dictionary using the ICD-10 code "c34 - Malignant neoplasm of the bronchus and lung". The tag will be named "size", indicating the size of the lung nodules. The tag's fields are 1) Date, 2) Size, 3) Units, and 4) target-id.
Click the "Add Target-Id" button to create a special field named "target-id", and add a description.

Annotate Data

In the background we have uploaded 3 very basic CT scan reports and annotated basic information such as patient id (i.e. MRN) and document.

Start annotating the size tag in the documents. Under the "target-id" field, create a unique name for that target. We'll use "lul-ant", which stands for "Left Upper Lobe - Anterior Segment". When we annotate the other lung lesion, we'll use a different "target-id" name: "rul-ant". When we navigate to another document and annotate the tag "size", the tag box shows which target-ids this patient already has. We will enter "rul-ant". When you navigate to the Timelines tab, you'll see that GazooResearch has aggregated the target-ids together. You can see that this one lung metastasis grew over the 6 months.
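
The idea behind the target-id can be sketched in a few lines of Python: measurements that share a target-id are grouped into one timeline per lesion. This is only an illustration of the concept, not GazooResearch's implementation, and the dates and sizes below are made up.

```python
from collections import defaultdict
from datetime import date

# Made-up measurements: (target-id, scan date, size in mm).
measurements = [
    ("lul-ant", date(2023, 1, 10), 8.0),
    ("rul-ant", date(2023, 1, 10), 5.0),
    ("rul-ant", date(2023, 7, 12), 9.0),
    ("lul-ant", date(2023, 4, 11), 8.0),
]

# Group by target-id so that each lesion gets its own timeline.
timelines = defaultdict(list)
for target_id, scan_date, size_mm in measurements:
    timelines[target_id].append((scan_date, size_mm))

for target_id, points in timelines.items():
    points.sort()  # chronological order
    change = points[-1][1] - points[0][1]
    print(target_id, [size for _, size in points], f"change: {change:+.1f} mm")
```

Without the target-id, the four measurements would be indistinguishable at the patient level, and the growth of a single lesion could not be tracked.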

Annotation

Data annotation is as simple as drawing a box around the data you want to highlight. This may be text, an image, or nothing at all.

Steps:

  1. Draw Bounding Box
  2. Use the "Down Arrow" to see the list of tags.
  3. Confirm or modify the extracted information.

Figure 1. Annotate Data

Timeline

The Timelines tab provides an easy way to visualize your database. This is key to ensuring data quality. If you believe a data point is incorrect, you can simply click on the timeline box, and GazooResearch will navigate to where that data was extracted.

Figure 1. Timelines

Export Data

Data is exportable for analysis and/or AI training.

The exported date format is: YYYY-MM-DD
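
Because exported dates use this format, they can be parsed with Python's standard library. The sketch below is an illustration only; the second date used for the subtraction is made up.

```python
from datetime import date, datetime

exported = "2019-02-22"  # an example date string in YYYY-MM-DD format

# strptime makes the expected format explicit; fromisoformat is a shortcut.
parsed = datetime.strptime(exported, "%Y-%m-%d").date()
same = parsed == date.fromisoformat(exported)

# Real date objects can be subtracted, which plain strings cannot.
days_since = (date(2019, 3, 1) - parsed).days
print(parsed.year, same, days_since)
```

Parsing dates early avoids accidentally comparing them as text later in the analysis.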

Export Data For Analysis

Steps:

  1. Navigate to the Export Tab.
  2. Click "Export Data" Button
  3. Select Save Location and name

The raw exported data may not be in a format that you are used to seeing. Don't worry: the Analysis section of this book shows how to quickly transform it into a typical Excel-style spreadsheet.

Tokenization

Tokenization is the process of encrypting all the Personal Health Information (PHI) in your exported data. This is useful if you intend to share your data with other researchers.

If you examine the raw exported data, PHI is everywhere. During the export process, if you enter a random string of characters, it will be used to encrypt all the tags that have been labeled as PHI. Each of those tags is replaced with an incomprehensible string, preventing anybody from recovering the original value (for example, the MRN).
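
To show the general idea of replacing an identifier with a token derived from a secret string, here is a minimal sketch. It is NOT GazooResearch's algorithm, which is not described here: it swaps in a keyed hash (HMAC-SHA256) from Python's standard library, and the function name `tokenize`, the secret, and the MRN are all hypothetical.

```python
import hashlib
import hmac


def tokenize(value: str, secret: str) -> str:
    """Return a keyed SHA-256 token for a PHI value (illustrative only)."""
    digest = hmac.new(secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


secret = "a-long-random-string"  # placeholder; use your own random string
token = tokenize("111111", secret)

# The same MRN and secret always give the same token, so rows still link up.
print(token == tokenize("111111", secret))
# A different secret gives an unrelated token.
print(token == tokenize("111111", "another-secret"))
```

The property that matters for sharing data is visible here: records belonging to one patient keep a common token, while the original identifier does not appear in the output.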

Analysis

GazooResearch comes complete with a suite of Python-based analysis tools, because what is the point of collecting data if you can't analyze it?

Install and Use GazooResearchUtils Package

# In a Jupyter notebook; from a terminal, drop the leading "!"
!pip install GazooResearchUtils
import GazooResearchUtils as gz
import numpy as np
import pandas as pd

Read Data For Analysis

Load the exported data into a DataFrame

df = pd.read_csv("./data.csv")

Convert To Human Readable Format

We recommend doing all the analysis in the default format and converting to the human readable format only as the last step.

We use the gz.pivot() function to denormalize the data into a more human readable format.

# Convert to human readable format
hr_df = gz.pivot(df)
# Save to csv
hr_df.to_csv("./hr_data.csv")

Note that now the column names are the data fields. For example, for mrn:111111, the date of biochemical progression is 2019-02-22.
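
For readers unfamiliar with this kind of reshaping, the sketch below shows a long-to-wide pivot in plain pandas. It illustrates the general idea only and is not the implementation of gz.pivot(); the column names and values in the toy table are assumptions.

```python
import pandas as pd

# Toy long-format table; column names and values are made up.
long_df = pd.DataFrame(
    {
        "mrn": ["111111", "111111", "111111"],
        "tag": ["psa", "psa", "psa"],
        "field": ["date", "value", "units"],
        "value": ["2020-01-01", "4.1", "ng/ml"],
    }
)

# One row per (mrn, tag); each field becomes its own column.
wide_df = long_df.pivot(index=["mrn", "tag"], columns="field", values="value")
wide_df = wide_df.reset_index()
print(wide_df)
```

In the long format every field is a separate row, which is convenient for filtering. The wide format puts the fields side by side, which is easier to read.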

Plot Data

If you want to plot PSA values, first select the psa tags.

# Select every tag named 'psa' (the name `filter` would shadow a Python built-in)
psa_filter = {'tag':'psa'}
gz.get_tags_where_filter(df, psa_filter)

Applying the gz.pivot() function makes the result more readable.

psa_filter = {'tag':'psa'}
psa = gz.get_tags_where_filter(df, psa_filter)
gz.pivot(psa)

We recommend being more explicit when specifying the filter object.

# Filter Object Structure
{'icd10':str, 'tag': str, 'field': str, 'exact': [str, str,...], 'between': [float, float]}
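
One way to read the filter object is as a set of conditions that a record must satisfy. The sketch below spells out that reading in plain Python. It is an assumption about the semantics, not the GazooResearchUtils implementation: the function `matches` is hypothetical, the bounds of 'between' are assumed to be inclusive, and the records are made up.

```python
def matches(record: dict, flt: dict) -> bool:
    """Return True if one record satisfies a filter object (assumed semantics)."""
    # Layer keys must match exactly when they are present in the filter.
    for key in ("icd10", "tag", "field"):
        if key in flt and record.get(key) != flt[key]:
            return False
    values = record.get("value", [])
    if "exact" in flt:
        return any(v in flt["exact"] for v in values)
    if "between" in flt:
        low, high = flt["between"]
        return any(low <= float(v) <= high for v in values)
    return True


# Made-up records in the four-layer shape.
records = [
    {"icd10": "c61", "tag": "psa", "field": "value", "value": ["4.1"]},
    {"icd10": "c61", "tag": "psa", "field": "value", "value": ["12.0"]},
    {"icd10": "c61", "tag": "pT", "field": "T", "value": ["3a"]},
]
psa_filter = {"icd10": "c61", "tag": "psa", "field": "value", "between": [0.0, 5.0]}
selected = [r for r in records if matches(r, psa_filter)]
print(selected)
```

Specifying all of 'icd10', 'tag', and 'field' narrows the match to one field of one tag, which is why the explicit form is less likely to pick up unintended rows.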

The example below finds all the tags which have a psa value between 0 and 5.

psa_filter = {'icd10':'c61',
              'tag':'psa',
              'field':'value',
              'between':[0.,5.]}
psa = gz.get_tags_where_filter(df, psa_filter)
gz.pivot(psa)

The example below finds all the tags which have a pT:T value of '2c', '3a', or '3b'.

t_filter = {'icd10':'c61',
            'tag':'pT',
            'field':'T',
            'exact':['2c','3a','3b']}
t_df = gz.get_tags_where_filter(df, t_filter)
gz.pivot(t_df)

Security

Data security is of the utmost importance for medical information. GazooResearch was developed with data security as priority number 1!

Security Features

On-Premise Design

GazooResearch does not run in the cloud. It was designed to run locally on your physical hardware.

*Use of GazooResearch's AI model(s) requires data to be sent to Gazoo's servers via encrypted pathways (TLS and other state-of-the-art cryptographic methods). If this is a security issue, use your own OpenAI API compatible LLM servers instead.

Database Encryption

Gazoo uses an encrypted SQLite database. All the files which make up the database are always encrypted on disk. Blocks are decrypted only as they are read from disk.

Since the data is stored on a disk, we naturally base our approach on “Disk Encryption Theory”. For each type of file, we use the 256-bit AES cipher in the appropriate mode of operation. The AES cipher itself encrypts/decrypts individual files in the most efficient way possible. Your data will be safe on disk.

Document Encryption

Documents are stored on disk using a 256-bit AES cipher in CBC mode. 256-bit AES encryption is considered safe against brute-force attacks: a 256-bit key has 2^256 possible values. Even the smaller 128-bit AES key space (2^128 possible keys) is impractical to search. A machine that can crack a DES key in a second would take 149 trillion years to crack a 128-bit AES key.
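
The 149 trillion year figure can be checked with a few lines of arithmetic. The sketch below assumes that "crack a DES key in a second" means trying all 2^56 possible DES keys in one second, and that a year is 365.25 days.

```python
DES_KEY_BITS = 56   # DES uses a 56-bit key
AES_KEY_BITS = 128  # the AES key size used in the comparison
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# A machine that exhausts the DES key space in one second tests 2**56 keys/s.
keys_per_second = 2 ** DES_KEY_BITS

# Time needed to exhaust the 128-bit AES key space at that rate.
seconds = 2 ** AES_KEY_BITS / keys_per_second
years = seconds / SECONDS_PER_YEAR
print(f"{years:.3e} years")
```

The result is roughly 1.5 x 10^14 years, consistent with the quoted figure. A 256-bit key space is larger again by a factor of 2^128.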

Transport Encryption

Communication between the different components of the software is secure and has been reviewed by a third party.

Suggested Security Features

Full Disk Encryption

MacOS: It's suggested that you use FileVault to encrypt all data written to disk.
Debian: It's suggested that you use Linux Unified Key Setup (LUKS) hard drive encryption.

Air-Gapped Environment

For further data protection, Gazoo can run in an air-gapped environment (not connected to the internet). This is the gold standard for data security.

Example Security Paragraph

Medical information is secured using 6 layers of security:

Physical data security begins with the medical data being located 1) on-premise, 2) behind physically locked doors. The computer is 3) air-gapped from the outside network, and only accessed physically, with the 4) correct login credentials. The hard drive containing the data is 5) fully encrypted at rest using the Linux Unified Key Setup (LUKS), which is a trusted hard drive encryption technique. While the computer is turned on, but the medical information is not being accessed (data is 'at rest'), the 6) data is encrypted using a 256-bit AES cipher.

Format Converter

GazooResearch accepts pdf, jpeg, and png files, but does not accept docx or txt files. The gazoo-research-converter Docker image can convert all your documents to pdf files so that they can be uploaded to GazooResearch.

Requirements

The Docker app is required.

Usage

You must first create an output directory, which is where the resulting pdfs will be placed. The command takes two -v options:

-v: mount the input directory to /home/src
-v: mount the output directory to /home/out

docker run -it --rm \
-v <input directory path>:/home/src \
-v <output directory path>:/home/out \
andrewlimmer/gazoo-research-converter

Example

docker run -it --rm \
-v /Users/Desktop/research-data:/home/src \
-v /Users/Desktop/research-data-converted:/home/out \
andrewlimmer/gazoo-research-converter
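
If you prefer to drive the conversion from Python, the command above can be assembled as an argument list. This is a sketch rather than an official tool: the function `build_converter_command` is hypothetical, it only builds and prints the command, and the temporary directory stands in for your real data folders.

```python
import shlex
import tempfile
from pathlib import Path

IMAGE = "andrewlimmer/gazoo-research-converter"  # image name from the commands above


def build_converter_command(src_dir: str, out_dir: str) -> list:
    """Build the docker argument list, creating the output directory first."""
    src = Path(src_dir).expanduser().resolve()
    out = Path(out_dir).expanduser().resolve()
    out.mkdir(parents=True, exist_ok=True)  # the output directory must exist
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{src}:/home/src",
        "-v", f"{out}:/home/out",
        IMAGE,
    ]


# A temporary directory is used here so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    command = build_converter_command(
        f"{tmp}/research-data", f"{tmp}/research-data-converted"
    )
    print(shlex.join(command))
    # To run it for real: subprocess.run(command, check=True)
```

Resolving the paths to absolute form matters because Docker bind mounts given with -v expect absolute host paths.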