Organize Your Data for Statistical Analysis
Best Practices for Data Transfer
Organize your data in a way that facilitates transfer to our biostatisticians, to other investigators, or between computer systems. Well-defined, well-organized data minimizes confusion and errors.
You are encouraged to use REDCap for data collection to minimize data entry errors and risks to patient confidentiality, and to ease data transfer for statistical analysis.
Recommendations for Organizing Data
Our recommendations have proven effective for moving data from point to point in a structured manner. A reasonable data organization scheme should minimize the amount of editing needed at the receiving end of your data transfer.
Table 1 illustrates three types of variables in a structure that lends itself to simple data transfer and minimal data editing.
- Identification (PatID) variables: uniquely identify aspects of an individual record (row of data), for instance, subject #, clinic #, or PatID.
- Time-stable variables: include characteristics that remain constant for an individual subject when observed over time, for instance, baseline demographics (age, sex, race) or study group (A, B).
- Longitudinal variables: potentially change over time, for instance, weight, adolescent height, muscle tone, lab values (cholesterol, blood sugar, etc.).
In this example, the structure has one column for identifying an individual (Subject), two columns for time-stable characteristics (Trt, Sex), and two columns for longitudinal characteristics (Time, Weight). Note that the values of Subject and Time together uniquely identify each row.
Other experimental designs will require different data structures, but each measured response must be uniquely associated with only one subject, visit or test.
Most statistical software packages (e.g. SAS, SPSS, S-PLUS, R, and Stata) require data in a rectangular format, where each row is a unique observation and each column is a separate variable. When organizing data into a rectangular format, follow two rules. First, each row contains one (and only one) unique observation. In the example, each row contains a unique combination of subject, time, and treatment. Second, each column contains one (and only one) variable or response.
Table 1: Example of a Rectangular Table
Codebook (in a separate worksheet):
Trt: Treatment, 0=Placebo, 1=Drug; Sex: 0=Woman, 1=Man; Time: Time in study, in weeks; Weight: Body weight, in pounds
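The rectangular layout and separate codebook described above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of Table 1: the data values are invented, and the helper names (`read_rows`, `duplicate_keys`, `decode`) are hypothetical. It checks that Subject and Time together identify each row uniquely and shows how a codebook kept apart from the data can translate codes back into labels.

```python
import csv
import io
from collections import Counter

# Hypothetical long-format data; the values are invented for illustration only.
RAW = """Subject,Trt,Sex,Time,Weight
1,0,0,0,150
1,0,0,4,148
2,1,1,0,180
2,1,1,4,176
"""

# Codebook stored separately from the data (labels follow the codebook above).
CODEBOOK = {
    "Trt": {"0": "Placebo", "1": "Drug"},
    "Sex": {"0": "Woman", "1": "Man"},
}


def read_rows(text):
    """Parse CSV text into a list of dicts, one dict per observation (row)."""
    return list(csv.DictReader(io.StringIO(text)))


def duplicate_keys(rows, key_columns=("Subject", "Time")):
    """Return the key combinations that appear on more than one row."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def decode(row, codebook=CODEBOOK):
    """Return a copy of the row with coded values replaced by their labels."""
    return {col: codebook.get(col, {}).get(val, val) for col, val in row.items()}


rows = read_rows(RAW)
print("duplicate keys:", duplicate_keys(rows))
print("first row, decoded:", decode(rows[0]))
```

An empty list of duplicate keys means every measured response is tied to exactly one subject and time point. Keeping the labels in the codebook rather than in the data lets the columns stay purely numeric.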
Please Note the Following Points, Many of Which are Illustrated in Table 1:
|Table 2: Identifiable PHI Information|
|2. Fax number|
|3. Phone number|
|4. E-mail address|
|5. Account numbers|
|6. Social Security number|
|7. Medical Record number|
|8. Health Plan number|
|9. Certificate/license numbers|
|11. IP address|
|12. Vehicle identifiers|
|13. Device ID|
|14. Biometric ID|
|15. Full face/identifying photo|
|16. Other unique identifying number, characteristic, or code|
|17. Postal address (geographic subdivisions smaller than state)|
|18. Date precision beyond year|
- The data table is rectangular: rows represent observations and columns represent variables. Some columns identify the observation and others contain a measured response. All data are contained in one rectangular area.
- Only patient ID numbers are used; Protected Health Information (PHI) is not included. Names should not be included in your analysis database, to avoid unnecessary risks to patient confidentiality (see Table 2).
- The unique key for each row consists of two variables (columns): PatID and Time.
- Characters (A, AB, O) and numeric values (0, 1, 2) are not mixed within one column. Where possible, a number has been chosen in place of a character. Definition of numbers, units for continuous data, and explanation for abbreviated variable titles should be provided separately in a codebook.
- Missing data: Note that none of the variables that uniquely identify the subject and the conditions under which measurements were taken (ID, Trt, Time) are missing. Neither a character value (e.g. "missing", "dk", "x") nor the numeric value zero (0) should be used to indicate missingness for a continuous variable (e.g. the variable "Weight" in Table 1).
- Before data collection begins, you should give special attention to how an assay value below the detection limit will be indicated in the data, and how it should be treated in the statistical analysis. The same applies to left-censored or right-censored values.
- Column headers are variable names, not descriptions. Variable descriptions can be provided separately in a "codebook" (or a separate worksheet in the same workbook). In general, variable names must:
  - Be 8 characters or fewer in length
  - Consist of one word (i.e. no spaces)
  - Be unique (not duplicated across multiple columns)
  - Begin with a letter, not a number
  - Contain no special characters: commas, quotes, apostrophes, periods, underscores.
- Avoid using punctuation or spaces (e.g. commas, quotes, <,>).
- Avoid using special formatting like colored text, highlighted columns, italics, bolding, super or sub scripting, and the "comment" feature.
- Store notes about patients in a separate column from the data used in analysis (e.g. "scheduled to come in again for repeat lab"). If information in the text of the notes needs to be analyzed, it should be coded into one (or more) variable column(s).
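Some of the points above can be checked automatically before a file is sent. The sketch below is one possible approach under stated assumptions: the function names are hypothetical, a blank cell is assumed to be the agreed way of recording a missing value (the guidance above only says what not to use), and zero is flagged only because it is implausible for a variable such as Weight.

```python
import re

# Up to 8 characters, starting with a letter, letters and digits only
# (no spaces, punctuation, or underscores), following the naming rules above.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]{0,7}")


def check_names(names):
    """Return a list of problems found in a sequence of column headers."""
    problems = []
    seen = set()
    for name in names:
        if not NAME_PATTERN.fullmatch(name):
            problems.append(f"{name!r}: invalid variable name")
        if name in seen:
            problems.append(f"{name!r}: duplicated")
        seen.add(name)
    return problems


def check_continuous(values, flag_zero=True):
    """Return (index, text) pairs for suspicious entries in a continuous column.

    Blank cells are accepted as missing (an assumption of this sketch).
    Non-numeric text such as "missing" or "dk" is always flagged; zero is
    flagged only when flag_zero is True.
    """
    flagged = []
    for i, value in enumerate(values):
        text = value.strip()
        if text == "":
            continue  # blank cell: treated here as a missing value
        try:
            number = float(text)
        except ValueError:
            flagged.append((i, text))
            continue
        if flag_zero and number == 0:
            flagged.append((i, text))
    return flagged


print(check_names(["PatID", "Time", "Body Weight", "2ndVisit", "wt_lbs"]))
print(check_continuous(["150", "", "missing", "0", "148.5"]))
```

For a variable where zero is a legitimate measurement, call `check_continuous(values, flag_zero=False)` so that only text placeholders are reported.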
If considered in enough detail before your data collection process begins, organization of the experimental data is relatively simple. Whether or not there are questions or confusion about how to efficiently organize and manage your data, consulting with a statistician before your experiment begins is a good idea. These matters can usually be resolved in a short time with satisfactory results for all concerned. Biostatisticians often oversee the data collection, storage, and retrieval systems for clinical studies. The study biostatistician is able to distinguish between essential and non-essential data, and can therefore limit the data collection systems to relevant information.
Limiting the amount of data collected makes it easier to assure data quality, minimize missing data, and pre-define the analysis data sets so that, upon study completion, data analysis is straightforward. Developing an effective data collection and management system is a key step in assuring the integrity of your study. Dataset planning can be iterative, involving meetings between the Statistician, Investigator, and Informatics Manager.
Examples of planning-phase tasks where a statistician’s input is beneficial:
- Design data collection forms
- Outline data collection/management systems (include variable name, specify variable type, e.g. date, numeric, open text)
- Design, implement, and conduct a data quality monitoring system for a study
- Outline how and when data abstraction should occur for interim analyses
- Provide input on parameters that would help to ensure data quality control
- All data should be securely stored, and access should be restricted to those individuals entering data.
- Properly dispose of paper and electronic files, keep paper copies in a locked cabinet, and store electronic files on a secure-access central server.
- Keep in mind the Minimum Necessary Principle of the Health Insurance Portability and Accountability Act (HIPAA) when listing which variables to include in your database.
- Use or disclose only information necessary to the task. It is important to exclude unnecessary items that make information identifiable to ensure privacy, security and patient confidentiality.
- Identifiable information includes items listed in Table 2. If identifiable information is necessary for research (e.g. birth date, visit date, physical address), take necessary precautions to protect the database: strong passwords, anti-virus software, data backup, possibly encryption, and being very cautious with email.
- Refer to COMIRB and HIPAA for additional stipulations.
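As a rough illustration of the Minimum Necessary Principle, the sketch below removes direct identifiers from a record and reduces dates to the year before the data are shared. It is only a sketch under stated assumptions: the column names are hypothetical, dates are assumed to be stored as ISO 8601 text (YYYY-MM-DD), and running it does not by itself make a dataset compliant with HIPAA or COMIRB requirements. If full dates are needed for the research, keep them and apply the precautions listed above instead.

```python
from datetime import date

# Hypothetical column names; adapt them to the identifiers present in your data.
IDENTIFIER_COLUMNS = {"Name", "Phone", "Email", "MRN", "Address"}
DATE_COLUMNS = {"BirthDate", "VisitDate"}


def deidentify(row):
    """Return a copy of one record without direct identifiers, with dates cut to the year."""
    clean = {}
    for column, value in row.items():
        if column in IDENTIFIER_COLUMNS:
            continue  # drop the identifying column entirely
        if column in DATE_COLUMNS and value:
            # Assumes ISO 8601 dates; keep only the year.
            value = str(date.fromisoformat(value).year)
        clean[column] = value
    return clean


# Invented example record, for illustration only.
record = {
    "PatID": "17",
    "Name": "Example Person",
    "Phone": "555-0100",
    "BirthDate": "1980-05-17",
    "Weight": "150",
}
print(deidentify(record))
```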