Connect with Us


The Center for Innovative Design and Analysis (CIDA) offers comprehensive study design, statistical, and data science collaboration, primarily serving the CU Anschutz research community and the greater Denver region. We collaborate on a wide range of projects, including grant development, short-term analysis, and in-depth partnerships where a CIDA team member becomes embedded in your research unit.

Our team consists of nearly 30 faculty and students with a wide range of expertise, providing us with the knowledge and experience necessary to help you learn what’s in your data. We are available to assist with study design, data management, and analysis, as well as other custom needs. Through the center's own research, we remain informed of current practices and develop novel methodologies so we can apply cutting-edge approaches to get the most out of your data.

Our diverse group of biostatisticians, data scientists, and health economists is composed of faculty members and graduate student apprentices, primarily from the Department of Biostatistics and Informatics in the Colorado School of Public Health at the CU Anschutz Medical Campus.

Learn more about how we can support your data through one-time projects, building a partnership, grant proposal development, and one-hour consultations.

Biostatistics Consulting

CIDA offers comprehensive statistical and data science support for one-time, short-term projects.

Build a partnership

Develop a strong relationship with biostatisticians who are invested in supporting your long-term projects.

Develop a grant

Partner to create a fundable application with an efficient design and state-of-the-art analysis plan.

Attend a one-hour consult

CIDA offers a free, one-hour consultation to answer general study design or analysis questions.

Preparing to work with a biostatistician and data scientist

CIDA Guidelines to Acknowledge Our Biostatisticians

Any biostatistician conducting analysis or otherwise making a significant contribution will:

Be a co-author on the publication to acknowledge their intellectual contribution, using their primary appointment affiliation
Analyze data collected for publication and abstracts only after study completion, to maintain study and statistical integrity
Collaborate with you in structuring the presentation of results and write the statistical methods section of your paper
Review your publication and any revisions prior to submission
Assist with revisions, keeping in mind your revision deadlines

CIDA abides by the International Committee of Medical Journal Editors (ICMJE) guidelines concerning authorship.

Guidelines for Order of Listed Authors

The order in which our biostatistician is listed as author depends on extent of contribution:

First Author: Biostatistician plays major role in project development or hypothesis generation, substantially changes direction of hypothesis and corresponding analysis (when working with existing data), provides a unique contribution, conducts analysis, and writes large portion of your paper.
Second Author: Biostatistician plays major role in project development, offers statistical expertise needed for your research, or conducts substantial and/or complex analysis of data. Typical position for biostatistician involved in life cycle of project or who has unique analytic expertise (e.g., mediation analysis, microbiome, genomics).
Middle-Listed Author: Biostatistician does not necessarily collaborate in design of study, but conducts data analysis, writes statistical methods section, and assists in interpreting results for conclusion.
Senior Author: Biostatistician provides significant mentoring to first author, offering guidance for first author to conduct analysis, or plays major role in helping first author design the study. Generally has been substantial one-on-one meeting time as well as additional analysis and paper writing support beyond typical collaborative role.

Junior or student biostatisticians: When a biostatistician is mentored by a senior statistician, the senior statistician immediately follows the junior statistician in the authorship line (e.g., if the junior statistician is second author, the mentoring statistician would generally be listed as third author).

General Policy

Information pertaining to specific individuals is protected by the Health Insurance Portability and Accountability Act (HIPAA). Data that is protected in this context should be treated carefully. If your data has any of the identifying characteristics listed below, special issues may arise with transferring the data and with the analyst's permission to view and/or store it. Please discuss de-identification options with your biostatistician to mitigate these issues.

PHI Definitions (COMIRB Guidelines)

Protected health information (PHI) is any data which, when combined with one or more data elements or commonly available information, could be used to identify a person. PHI does not include de-identified information which does not identify an individual and for which there is no reasonable basis to believe that information could be used to identify an individual.

Information Which May Be Protected Includes, but is Certainly Not Limited To:

Name
Postal address (to a location smaller than state)
All elements of dates, except year, for dates directly related to an individual (including birth date, admission date, discharge date, and date of death); also ages greater than 89, unless aggregated into a single category of 90 and older
Phone/Fax Number
Email addresses
Social Security Number
Medical Record Number
Health plan number
Account numbers
Certificate/license numbers
URL addresses
IP addresses
Vehicle identifiers
Device ID
Biometric ID
Full face (or other identifying photo)
Any other unique identifying number, characteristic, or code

It should be noted that HIPAA regulations also apply to deceased individuals.
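
As a rough illustration of the de-identification options mentioned above, the Python sketch below drops a few direct identifiers, reduces dates to the year, and groups ages greater than 89 into a single "90+" category. The column names (name, phone, email, ssn, mrn, age) are hypothetical, and this is not a complete or approved de-identification procedure; for example, it does not handle a birth year that would itself reveal an age above 89. Please confirm the approach for your study with your biostatistician.

```python
from datetime import date

# Hypothetical column names for direct identifiers; adapt to your dataset.
DIRECT_IDENTIFIERS = {"name", "phone", "email", "ssn", "mrn"}


def top_code_age(age):
    """Aggregate ages greater than 89 into a single '90+' category."""
    return "90+" if age > 89 else str(age)


def deidentify(record):
    """Return a copy of one record (a dict) with direct identifiers dropped,
    dates reduced to the year, and age top-coded."""
    out = {}
    for key, value in record.items():
        if key in DIRECT_IDENTIFIERS:
            continue  # drop the column entirely
        if isinstance(value, date):
            out[key + "_year"] = value.year  # keep the year only
        elif key == "age":
            out[key] = top_code_age(value)
        else:
            out[key] = value
    return out
```

For example, a record with a name, an admission date of 2020-03-14, and an age of 93 would keep only the patient ID, the admission year (2020), and the age category "90+".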

Dataset Format Guidelines

Best Practice for Improving Readability of Data

We are unable to address data format issues, and may need to ask you to reformat improperly formatted datasets. Please be sure to follow these guidelines when you format your dataset:

Single row for headings/column names. No repeated headings.
Headings not too long—use short (1 or 2 words) column headings, then use a data dictionary to elaborate the short heading. We’ll be sure the long version from the data dictionary makes its way onto figures, etc.
Include a separate document that defines values – a "data dictionary." See below for an example.
We cannot analyze "free form" or "text string" columns (such as "other," "explain," or "notes"), although you can leave them in the dataset for reference.
The computer ignores color, so don’t color-code data or the information that you color-coded will be lost.
Stick to a coding convention. Entering "F" for one woman’s sex, "f" for another’s, and "Female" for another’s results in three types of females. Pick one convention and be consistent throughout a column. Capitalization matters!
No "special" characters, such as text accents.
File types that end with .xls, .xlsx, .csv, and .sas7bdat are good.
Include patient IDs, provider IDs, etc.
Do not include any Protected Health Information (PHI).
Missing data should be left blank, rather than coded as "99," "-99," ".," etc.
No characters in a numeric column/variable. If there are characters anywhere in a column (aside from the column name), the computer will treat the whole column as characters. Putting the word "missing" or "unknown" or the character "-" for missing values in a column will convert any numbers in that column to character expressions, which would be treated as categories, not numbers, in an analysis.
For numeric variables, don’t include units in the cell values, as they are characters. Include the units in the data dictionary instead, and we’ll put them on figures, tables, etc.

NOTE: This list is not exhaustive.
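
Several of the guidelines above can be checked before you send a file. The Python sketch below is one possible pre-check, not a CIDA tool: it reads CSV text and flags columns that use a code for missing data, mix numbers with characters, or use inconsistent capitalization. The list of missing-data codes is an assumption drawn from the examples above, and the check is only a heuristic (a legitimate value of 99 would also be flagged).

```python
import csv
import io

# Codes the guidelines ask you not to use for missing values (heuristic list).
MISSING_CODES = {"99", "-99", ".", "-", "missing", "unknown"}


def is_number(text):
    """True if the cell text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_columns(csv_text):
    """Return a dict mapping column name -> list of warnings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    warnings = {}
    for col in reader.fieldnames or []:
        # Blank cells are the recommended way to record missing data.
        values = [(r[col] or "").strip() for r in rows]
        values = [v for v in values if v != ""]
        notes = []
        if any(v.lower() in MISSING_CODES for v in values):
            notes.append("uses a code for missing data; leave the cell blank")
        numeric = [v for v in values if is_number(v)]
        if numeric and len(numeric) < len(values):
            notes.append("mixes numbers and characters")
        if len({v.lower() for v in values}) < len(set(values)):
            notes.append("inconsistent capitalization")
        if notes:
            warnings[col] = notes
    return warnings
```

Run on a small file where Sex is entered as "F", "f", and "Female" and Weight contains "-99" and "150 lb", this would flag Sex for inconsistent capitalization and Weight for both a missing-data code and mixed numbers and characters.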

Data Dictionary Example

For a Ventricular Tachycardia Study

PtID: patient ID
Inst: institution ID
Gender: gender of patient
- M=Male
- F=Female
AblNum: ablation number:
- Numeric count
Fascic: tachycardia type:
- 1=Fascicular VT
- 0=Other VT
Recur: tachycardia recurrence:
- 1=VT recurrent
- 0=VT not recurrent
Follow_Up: Follow up time after this ablation
- Time started with a successful ablation and ended when VT recurred (1 above) or
  when follow up time ended without recurrent VT (0 above)
Status: Final Patient Status:
- 0=off meds, no VT
- 1=off meds, intermittent VT
- 2=on meds, no VT
- 3=on meds, intermittent VT
- 4=other
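
If it helps to see how an analyst might use such a dictionary, the Python sketch below stores the example above as a nested dict and uses it to translate coded values and to spot values the dictionary does not define. The structure and the function names (describe, undefined_values) are illustrative assumptions, not a required format.

```python
# The example data dictionary above, written as a Python dict.
DATA_DICTIONARY = {
    "PtID": {"label": "patient ID"},
    "Inst": {"label": "institution ID"},
    "Gender": {"label": "gender of patient",
               "values": {"M": "Male", "F": "Female"}},
    "AblNum": {"label": "ablation number"},
    "Fascic": {"label": "tachycardia type",
               "values": {1: "Fascicular VT", 0: "Other VT"}},
    "Recur": {"label": "tachycardia recurrence",
              "values": {1: "VT recurrent", 0: "VT not recurrent"}},
    "Follow_Up": {"label": "follow up time after this ablation"},
    "Status": {"label": "final patient status",
               "values": {0: "off meds, no VT",
                          1: "off meds, intermittent VT",
                          2: "on meds, no VT",
                          3: "on meds, intermittent VT",
                          4: "other"}},
}


def describe(column, value):
    """Return the long label and decoded value for one cell."""
    entry = DATA_DICTIONARY[column]
    meaning = entry.get("values", {}).get(value, value)
    return f"{entry['label']}: {meaning}"


def undefined_values(column, values):
    """List values in a coded column that the dictionary does not define."""
    allowed = DATA_DICTIONARY[column].get("values")
    if allowed is None:
        return []  # not a coded column, nothing to check
    return [v for v in dict.fromkeys(values) if v not in allowed]
```

Here describe("Status", 2) gives "final patient status: on meds, no VT", and undefined_values("Gender", ["M", "F", "f", "Female"]) returns ["f", "Female"], the entries that break the coding convention.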

Organize Your Data for Statistical Analysis
Best Practices for Data Transfer

It is important for you to organize your data in a way that facilitates transfer to our biostatisticians, or other investigators or computers. Well-defined and organized data minimizes confusion and incorrect data.

You are encouraged to use REDCap for data collection to minimize data entry errors or risks to patient confidentiality, and ease data transfer for statistical analysis.

Recommendations for Organizing Data

Our recommendations have proven effective for moving data from point to point in a structured manner. A reasonable data organization scheme should minimize the amount of editing needed at the receiving side of your data transfer.

Table 1 illustrates three types of variables in a structure that lends itself to simple data transfer and minimal data editing.

Identification (PatID) variables: uniquely identify aspects of an individual record (row of data), for instance, subject #, clinic #, or PatID.
Time-stable variables: include characteristics that remain constant for an individual subject if observed over time, for instance, baseline demographics (age, sex, race) or study group (A, B).
Longitudinal variables: potentially change over time, for instance, weight, adolescent height, muscle tone, lab values (cholesterol, blood sugar, etc.).

In this example, the structure has one column for identifying an individual (PatID), two columns for time-stable characteristics (Trt, Sex), and two columns for longitudinal characteristics (Time, Weight). Note that the values of PatID and Time together uniquely identify each row.

Other experimental designs will require different data structures, but each measured response must be uniquely associated with only one subject, visit or test.

Most statistical software packages (e.g., SAS, SPSS, S-PLUS, R, and Stata) require data in a rectangular format where each row is a unique observation and each column is a separate variable. Two rules apply when organizing data into a rectangular format. First, each row contains one (and only one) unique observation; in the example, each row contains a unique combination of subject, time, and treatment. Second, each column contains one (and only one) variable or response.

Table 1: Example of a Rectangular Table

PatID	Trt	Sex	Time	Weight
1	0	1	0	181.6
1	0	1	4	183.2
2	0	0	0	130.4
2	0	0	4
3	1	0	0	150.2
3	1	0	4	145
4	1	1	0	161.2
4	1	1	4	159.4

Codebook (in a separate worksheet):

Trt: Treatment, 0=Placebo, 1=Drug; Sex: 0=Woman, 1=Man; Time: Time in study in weeks; Weight: Body weight in pounds
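
The two structural properties of Table 1 (PatID and Time together identify each row, and Trt and Sex do not change within a subject) can be checked mechanically. The Python sketch below is one way to do so; the function names are illustrative, and the blank Weight cell in Table 1 is represented as None.

```python
from collections import Counter

# Rows of Table 1 as (PatID, Trt, Sex, Time, Weight); None marks the blank cell.
TABLE_1 = [
    (1, 0, 1, 0, 181.6),
    (1, 0, 1, 4, 183.2),
    (2, 0, 0, 0, 130.4),
    (2, 0, 0, 4, None),
    (3, 1, 0, 0, 150.2),
    (3, 1, 0, 4, 145.0),
    (4, 1, 1, 0, 161.2),
    (4, 1, 1, 4, 159.4),
]


def duplicate_keys(rows):
    """Return (PatID, Time) pairs that appear in more than one row."""
    counts = Counter((pat_id, time) for pat_id, _trt, _sex, time, _w in rows)
    return [key for key, n in counts.items() if n > 1]


def unstable_subjects(rows):
    """Return PatIDs whose time-stable values (Trt, Sex) change across rows."""
    first_seen = {}
    changed = set()
    for pat_id, trt, sex, _time, _w in rows:
        if first_seen.setdefault(pat_id, (trt, sex)) != (trt, sex):
            changed.add(pat_id)
    return sorted(changed)
```

Both checks return an empty list for Table 1 as shown. Adding a second row for subject 1 at time 0 would make duplicate_keys report (1, 0).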

Table 2: Identifiable PHI Information

1. Name
2. Fax number
3. Phone number
4. E-mail address
5. Account numbers
6. Social Security number
7. Medical Record number
8. Health Plan number
9. Certificate/license numbers
10. URL
11. IP address
12. Vehicle identifiers
13. Device ID
14. Biometric ID
15. Full face/identifying photo
16. Other unique identifying number, characteristic, or code
17. Postal address (geographic subdivisions smaller than state)
18. Date precision beyond year

Please Note the Following Points, Many of Which are Illustrated in Table 1:

Data table is rectangular, rows represent observations, and columns represent variables. Some columns identify observation and others contain a measured response. All data contained in one rectangular area.
Only Patient ID numbers are used; Protected Health Information (PHI) is not included. Names should not be included in your database for analysis, to avoid unnecessary risks to patient confidentiality (see Table 2).
Unique key to each row consists of two variables (columns) PatID and Time.
Characters (A, AB, O) and numeric values (0, 1, 2) are not mixed within one column. Where possible, a number has been chosen in place of a character. Definition of numbers, units for continuous data, and explanation for abbreviated variable titles should be provided separately in a codebook.
Missing data: Note that none of the variables that uniquely identify the subject and the conditions under which measurements were taken (PatID, Trt, Time) are missing. A character value (e.g., "missing", "dk", "x") or the numeric value zero (i.e., 0) should not be used to indicate missingness for a continuous variable (e.g., the variable "Weight" in Table 1).
Before data collection begins, you should give special attention to how an assay value below the detection limit will be indicated in the data and how it should be treated in the statistical analysis. The same applies to left-censored or right-censored values.
Column headers are variable names, not a description. Variable descriptions can be provided separately in a "codebook" (or a separate worksheet in same workbook). In general, variable names must:
1. Be 8 characters or less in length
2. Consist of one word (i.e. no spaces)
3. Be unique (not duplicated across multiple columns)
4. Begin with a letter, not a number
5. Contain no special characters: commas, quotes, apostrophes, period, underscore.
Avoid using punctuation or spaces (e.g. commas, quotes, <,>).
Avoid using special formatting like colored text, highlighted columns, italics, bolding, super or sub scripting, and the "comment" feature.
Store notes about patients in separate column from data used in analysis (e.g. "scheduled to come in again for repeat lab"). If information in text of notes needs to be analyzed, it should be coded into one (or more) variable column(s).
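
The variable-naming rules above lend themselves to a quick automated check. The following Python sketch is an illustration only: it accepts names of one to eight letters or digits that begin with a letter, which is slightly stricter than the listed rules because it rejects every non-alphanumeric character, and it treats names that differ only in capitalization as duplicates (an assumption, since some packages ignore case in variable names).

```python
import re

# One letter followed by up to seven letters or digits (8 characters total).
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]{0,7}")


def invalid_names(names):
    """Return a dict mapping each problematic column name to the reason."""
    problems = {}
    seen = set()
    for name in names:
        if not NAME_PATTERN.fullmatch(name):
            problems[name] = "use 1-8 letters or digits, starting with a letter"
        elif name.lower() in seen:
            problems[name] = "duplicates another column name"
        seen.add(name.lower())
    return problems
```

The column names of Table 1 (PatID, Trt, Sex, Time, Weight) pass this check, while names such as "Body Weight", "2ndVisit", or "Cholesterol" would be flagged.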

If considered in enough detail before your data collection process begins, organization of the experimental data is relatively simple. Whether or not there are questions or confusion about how to efficiently organize and manage your data, consulting with a statistician before your experiment begins is a good idea. These matters can usually be resolved in a short time with satisfactory results for all concerned. Biostatisticians often oversee the data collection, storage, and retrieval systems for clinical studies. The study biostatistician is able to distinguish between essential and non-essential data, and can therefore limit the data collection systems to relevant information.

Limiting the amount of data collected means it is easier to assure data quality, minimize missing data, and pre-define the analysis data sets so that, upon study completion, data analysis is straightforward. Developing an effective data collection and management system is a key step in assuring ultimate integrity of your study. Dataset planning can be iterative, involving meetings between the Statistician, Investigator, and Informatics Manager.

Specific examples of instances in your planning phase where obtaining a statistician’s input would be beneficial:

Design data collection forms
Outline data collection/management systems (include variable name, specify variable type, e.g. date, numeric, open text)
Design, implement, and conduct a data quality monitoring system for a study
Outline how and when data abstraction should occur for interim analyses
Provide input on parameters that would help to ensure data quality control

Data Security

All data should be securely stored, and access should be restricted to those individuals entering data.
Properly dispose of paper and electronic files, keep paper copies in locked cabinet, and store electronic files on a secure-access central server.
Keep in mind the Health Insurance Portability and Accountability Act (HIPAA)’s Minimum Necessary Principle when listing what variables to include in your database.
Use or disclose only information necessary to the task. It is important to exclude unnecessary items that make information identifiable to ensure privacy, security and patient confidentiality.
Identifiable information includes items listed in Table 2. If identifiable information is necessary for research (e.g. birth date, visit date, physical address), take necessary precautions to protect the database: strong passwords, anti-virus software, data backup, possibly encryption, and being very cautious with email.
Refer to COMIRB and HIPAA for additional stipulations.

The Center for Innovative Design and Analysis (CIDA) has formed a collaboration with the Colorado Center for Personalized Medicine (CCPM) to support campus analytical needs for research using CCPM biobank data. CIDA, known for sustainable and scalable research analytics, helps make CCPM biobank genetic data accessible to researchers who need support in completing analyses. This collaborative team, The Colorado Biobank Informatics Service (CBIS), performs GWAS and other genetic analyses involving biobank data, returns cohort counts to investigators, and delivers line-level genetic data in accordance with approval from the Access to Biobank Committee.

If you are interested in collaborating with CBIS on an analysis or a cohort count, or interested in obtaining line-level genetic data from the CCPM biobank, please fill out the unified ABC Proposal Request Form.

If you have questions about biobank data access and genetic data analysis, please contact [email protected].

CBIS performs low-to-moderate complexity genome-wide association analyses (GWAS) free of charge to investigators with a primary appointment at CU Anschutz, up to two per year per investigator. High-complexity GWASs, or other high-complexity analyses, can be completed under the auspices of a funded collaborative agreement, per discussion with CBIS. A GWAS is likely to be considered high complexity if it involves the elements below (other factors may also make a GWAS high complexity – the decision of complexity is based on the assessment of the CIDA analyst):

Time-varying covariates or outcomes.
Genetic variables that are not contained in a standard genetic data freeze (e.g., CNVs).
Outcome that is defined using observations at multiple time points.
An otherwise nuanced or intensive phenotype. If your phenotype is defined using only phecodes or ICD codes, it is very likely to be considered low- to moderate-complexity.
A large number of outcomes (> 5).
A large number of total requested GWASs (> 10). Consider that, if you are requesting a GWAS of a single outcome stratified by ancestry, this likely entails one GWAS per ancestry group.
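
To make the two count-based criteria concrete, the short Python sketch below tallies the number of GWASs implied by a request, assuming one GWAS per outcome per ancestry group as described above. It covers only the counts; the other criteria, and the final decision on complexity, rest with the CIDA analyst.

```python
MAX_OUTCOMES = 5       # more than this many outcomes suggests high complexity
MAX_TOTAL_GWAS = 10    # more than this many total GWASs suggests high complexity


def total_gwas(n_outcomes, n_ancestry_groups=1):
    """Number of GWASs if each outcome is run once per ancestry group."""
    return n_outcomes * n_ancestry_groups


def exceeds_count_thresholds(n_outcomes, n_ancestry_groups=1):
    """True if either count-based threshold above is exceeded."""
    return (n_outcomes > MAX_OUTCOMES
            or total_gwas(n_outcomes, n_ancestry_groups) > MAX_TOTAL_GWAS)
```

For instance, three outcomes stratified across four ancestry groups amount to 12 GWASs, which exceeds the total of 10 even though the number of outcomes is below 5.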

If your proposed GWAS involves the above or otherwise qualifies as high-complexity, or you request a non-GWAS analysis, we will hold a consultation meeting over Zoom to scope your project. Below are some non-GWAS analyses that are likely to be considered low-complexity and are thus likely to be candidates for free-of-charge analysis.

Phenome-wide association studies with fewer than 500 phenotypes and fewer than 50 genetic instruments (polygenic risk scores, SNPs, gene-level burden tests).
Determining whether a specific SNP is contained in a certain research freeze.
Cohort counts for low-to-moderate complexity phenotypes.

CBIS will make a downloadable copy of the analysis results available to the investigator upon analysis completion, and the download link will remain active for 60 days. After 60 days, the investigator will need to submit a new request.

CBIS can deliver biobank data to investigators with approved CCPM ABC proposals. A ‘standard’ data delivery can be completed at no cost to investigators. A request can be considered standard if it meets the following criteria:

Involves the delivery of SNP or sequencing data only.

Involves the delivery of files in a standard genetic file format, such as binary PLINK files or vcf files.

Involves delivery of genetic data from one of the research freezes.

If your request does not meet the above criteria or is otherwise complex or intensive, we will hold an intake meeting to scope the project individually.

CBIS will deliver data via a Google Cloud Storage (GCS) bucket and will maintain the data within the GCS bucket for 60 days. After 60 days, CBIS will remove the data from the GCS bucket, and investigators will need to submit a new request to regain access to the data.

Note: Data download/egress from GCS is expensive, with a typical cost of ~$0.11 per GiB. This means a single download of a full data delivery (currently 1.5 TiB) can cost upwards of $150 in egress fees. CBIS subsidizes this egress cost as a courtesy to CCPM investigators, so we ask that investigators minimize downloads of the data as much as possible.
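
The arithmetic behind that estimate can be sketched in a few lines of Python, using the approximate rate quoted above (actual GCS pricing may differ and can change):

```python
GIB_PER_TIB = 1024  # 1 TiB = 1024 GiB


def egress_cost_usd(size_tib, usd_per_gib=0.11):
    """Approximate cost of downloading size_tib of data once."""
    return size_tib * GIB_PER_TIB * usd_per_gib


# A full delivery of 1.5 TiB is 1536 GiB, or roughly $169 per download.
full_delivery_cost = egress_cost_usd(1.5)
```

Each additional download of the full delivery adds roughly the same amount again.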

In special cases where a larger-than-typical amount of data is being delivered or multiple downloads of the data are necessary, CBIS may deliver data via a ‘requester pays’ GCS bucket, where the cost for the data egress will be billed to the downloader.

Center for Innovative Design & Analysis

Colorado School of Public Health

CU Anschutz

Fitzsimons Building

13001 East 17th Place

4th Floor West

Mail Stop B119

Aurora, CO 80045
