P.Mean: Data management (created 2007-06-20).
Data management is the foundation of every good data analysis. You need to consider issues like how your data are entered, documented, and stored. Careful attention to these issues now will help save you time and frustration during your data analysis. Articles are arranged by date with the most recent entries at the top. You can find outside resources at the bottom of this page.
Most of the new content will be added to my blog (blog.pmean.com).
Review all blog entries related to data management.
33. P.Mean: Those pesky tab characters (created 2012-03-21). I frequently move text from one program to another, and one thing that is almost guaranteed to cause annoyances is the presence of tabs. The tab is a single character, hex 09, that can be entered with the TAB key on a standard computer keyboard or, in some programs, with the Ctrl-I key combination. The problem with the tab character is that it looks just like a bunch of blanks, but it doesn't always behave like a bunch of blanks.
32. The Monthly Mean: Discrepancies in the chisquare test (December 2011)
31. P.Mean: Discrepancies in the chisquare test (created 2011-12-16). I was working with two researchers on a project and they got different results for their chisquare tests. See if you can find out what went wrong.
30. The Monthly Mean: A binary coding trick that you can learn from Car Talk. (March/April 2011) and P.Mean: Using a binary coding trick illustrated by a Car Talk puzzler (created 2011-05-21). I often need to see how often certain variables and combinations of those variables appear in a data set. If the variable is binary, there is a trick for doing this that is illustrated by a Car Talk puzzler.
29. The Monthly Mean: Should I use a spreadsheet or a database to enter my data? (March/April 2011) I often get asked whether you should use a spreadsheet (like Microsoft Excel) to enter your data or a database (like Microsoft Access). The short answer is that for most projects it does not matter all that much. But here are some considerations that you should think about before making this choice. Databases easily allow you to implement quality checks. They also allow you to easily integrate data from multiple sources. Finally, they are more effective in handling very large data sets. On the other hand, spreadsheets are faster to set up and allow easier copying and duplication for data with repetitive patterns.
28. The Monthly Mean: Quality checks for data entry (August 2010)
27. P.Mean: Dealing with a large text file that crashes your computer (created 2010-04-02). At a meeting, a colleague was describing a text file that he had received that had crashed his system. No way, I thought, could a simple text file crash your system. I offered to investigate and he was right. The text file crashed my system too, and repeatedly. Here's what I did to figure out how a simple text file could crash your computer.
26. P.Mean: Finding duplicate records in a 19 million record database (created 2010-03-02). I was asked to help find duplicate records in a large database (19 million records). The number of duplicates was suspected to be small, possibly around 90. My colleague's approach was to run PROC FREQ in SAS on the "unique" id and then look for ids with a frequency greater than 1. That did not work: it took too long, overloaded the system, or both. So I wanted to look at alternatives that would identify duplicate records more efficiently.
25. The Monthly Mean: A false sense of frugality. (January 2009) and P.Mean: A false sense of frugality (created 2008-12-17). A while back I received a data set that was very well documented, but there was one thing that I wish the data entry person had not done. The demographic data were listed as 45f, 52m, 22m, 21f, etc. This was obvious shorthand for a 45 year old female, a 52 year old male, and so forth.
24. P.Mean: Naming conventions for variables (created 2008-07-30). For almost all statistical software programs, you can and should provide variable names for your data. A variable name is a brief description of what resides in each column of data. You should choose a variable name that is short and descriptive.
23. P.Mean: Undeclared missing code leads to bad results (created 2008-07-15). I found this ticket in a computer store many years ago and am just now getting around to showing it. It demonstrates how failure to declare a missing value code can lead to laughably incorrect results.
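The problem in entry 23 can be sketched in a few lines of Python. This is only an illustration, not the example from the article: the ages and the use of 999 as the code for "missing" are invented for this sketch.

```python
from statistics import mean

# Hypothetical ages; 999 is an assumed code for "missing", not taken from the article.
MISSING_CODE = 999
ages = [34, 41, 999, 29, 999, 52]

# Undeclared: the missing code is treated as if it were a real age.
naive_mean = mean(ages)

# Declared: drop the missing code before summarizing.
valid_ages = [a for a in ages if a != MISSING_CODE]
clean_mean = mean(valid_ages)

print(f"naive mean: {naive_mean:.1f}")
print(f"clean mean: {clean_mean:.1f} (n = {len(valid_ages)})")
```

Two undeclared codes out of six values are enough to pull the average age from 39 to 359, which is why a quick look at the minimum, maximum, and mean of each variable is worth doing before any further analysis.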
Interesting quote: Not even the most subtle and skilled analysis can overcome completely the unreliability of basic data. - R. G. D. Allen.
Morris Rivera, Jason Donnelly, Blair Parry, et al. Prospective, randomized evaluation of a personal digital assistant-based research tool in the emergency department. BMC Medical Informatics and Decision Making. 2008;8(1):3. Abstract: "BACKGROUND: Personal digital assistants (PDA) offer putative advantages over paper for collecting research data. However, there are no data prospectively comparing PDA and paper in the emergency department. The aim of this study was to prospectively compare the performance of PDA and paper enrollment instruments with respect to time required and errors generated. METHODS: We randomized consecutive patients enrolled in an ongoing prospective study to having their data recorded either on a PDA or a paper data collection instrument. For each method, we recorded the total time required for enrollment, and the time required for manual transcription (paper) onto a computer database. We compared data error rates by examining missing data, nonsensical data, and errors made during the transcription of paper forms. Statistical comparisons were performed by Kruskal-Wallis and Poisson regression analyses for time and errors, respectively. RESULTS: We enrolled 68 patients (37 PDA, 31 paper). Two of 31 paper forms were not available for analysis. Total data gathering times, inclusive of transcription, were significantly less for PDA (6:13 min per patient) compared to paper (9:12 min per patient; p < 0.001). There were a total of 0.9 missing and nonsense errors per paper form compared to 0.2 errors per PDA form (p < 0.001). An additional 0.7 errors per paper form were generated during transcription. In total, there were 1.6 errors per paper form and 0.2 errors per PDA form (p < 0.001). CONCLUSION: Using a PDA-based data collection instrument for clinical research reduces the time required for data gathering and significantly improves data integrity." [Accessed February 22, 2011]. Available at: http://www.biomedcentral.com/1472-6947/8/3.
David Pogue. Should You Worry About Data Rot? The New York Times. 2009. Excerpt: "Data rot refers mainly to problems with the medium on which information is stored. Over time, things like temperature, humidity, exposure to light, being stored not-very-good locations like moldy basements, make this information very difficult to read. The second aspect of data rot is actually finding the machines to read them. And that is a real problem. If you think of the 8-track tape player, for example, basically the only way you can find 8-track cartridges is in a flea market or a garage sale." [Accessed March 30, 2009]. Available at: http://www.nytimes.com/2009/03/26/technology/personaltech/26pogue-email.html.
Circle Systems. Stat/Transfer Data Conversion Software Utility - Excel, SAS, Databases & Statistical Packages. Excerpt: "Stat/Transfer has provided fast, reliable, and convenient data transfer between popular software packages for thousands of users, worldwide. Stat/Transfer knows about statistical data --- it handles missing data, value and variable labels and all of the other details that are necessary to move as much information as is possible from one file format to another." [Accessed February 22, 2011]. Available at: http://www.stattransfer.com/
This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-11.
22. StATS: Watch out for ambiguous data (February 14, 2007). Someone brought me a data set with some interesting values. It serves as a good example about why you need to carefully review simple descriptive statistics before you plunge into a complex analysis.
21. StATS: Auditing for data entry errors (June 20, 2006). There was an interesting query on the MedStats list about the appropriate sample size for an audit. This person had entered 1,500 records and wanted to check a sample of those records for data entry errors. There was not enough time to perform double entry or to check 100% of the records. So how many records should be checked?
20. StATS: Another regular expression tip (May 23, 2006). I had a large text file and I had to find the first example of a line that did NOT begin with the letter A. That's easier said than done, but you can use some special symbols in regular expressions to do this.
19. StATS: Lost files (May 23, 2006). I advise people all the time on how to set up a foolproof data entry system, and while you can't anticipate all of the possible things that can go wrong, here are some things that can sometimes help avert a disaster.
18. StATS: Using regular expressions to insert line breaks (May 18, 2006). I had to change a file written in XML format. The file was pretty easy to manipulate except that it had no line breaks in it. It was a single line of text with a length of 46,592 characters! That meant that I needed to be constantly scrolling left and right.
17. StATS: More lessons learned the hard way (January 31, 2006). The more I do, the more I realize how little I have thought about how to properly conduct a statistical analysis. One lesson I thought I had learned was that it costs next to nothing to store information electronically, but it can often save you a lot of time. But recently, I have relearned the value of this lesson.
16. StATS: Hard learned lessons (November 25, 2005). It's been a busy month, as noted below, and in a rush to complete all my projects, I ended up doing some things that may have caused a few problems (nothing permanent, of course, but they did end up further delaying some projects that were already behind schedule). I alluded to a bit of this on another webpage, but I have a few more lessons worth mentioning.
15. StATS: Non-destructive data editing (November 2, 2005). I recently worked on a project looking at patients having two different types of operations, with and without collar sutures. The data set that the researchers sent to me had some inconsistencies, though.
14. StATS: Another disaster averted (August 16, 2005). When you are importing a file from one system to another, lots of little things can trip you up. Here's an example, and it shows a very subtle problem.
13. StATS: Moving R objects (July 28, 2005). I regularly work from home on my laptop, and when I need to re-run some analyses in R, I usually just re-create the original data sets. But there are several ways you can transfer objects from one R system to another.
12. StATS: More on regular expressions (July 21, 2005). As I work more and more with microarrays, the more I realize that having a knowledge of regular expressions will help. For example, I had a comma separated file (.CSV) and it had an extra comma at the end of every line. I wanted to remove those commas, but not any of the others.
11. StATS: Dumping data from R to a text file (June 27, 2005). In the prenatal liver study, I needed to give some of the normalized gene expression levels to a researcher in a form he could use. The data he needed were in a data frame with 94 rows and 16 columns (folate.signal). But unfortunately, the names of the rows (gene.symbol) and columns (liver.names) were stored in separate objects. Here's one way to match the values back up.
10. StATS: Importing value labels from Access into SPSS (May 24, 2005). Someone asked about importing data from Access into SPSS. The Access file had value labels (e.g., 1=Male, 2=Female, 3=Missing), and this person wanted to know if there was any way to get this information into SPSS.
9. StATS: A disaster averted (May 16, 2005). I'm working on a microarray experiment of prenatal liver samples. When I was trying to normalize the data, I noticed that three of the arrays had rather unusual properties.
8. StATS: String manipulations in R (May 10, 2005). As part of my efforts to analyze microarray data, I am finding that I need to do simple string manipulations in R. Here is a list of functions that might help.
7. StATS: Digitizing a graph (March 15, 2005). Someone brought me a graph with a trend line relating body surface area (BSA) to various cardiac measurements. This graph showed both the trend line and limits at +/-2 standard deviations and +/-3 standard deviations. She asked if I could write a program based on that graph that would allow her to input a patient's BSA and cardiac measures and get a Z-score in return.
6. StATS: Coding race/ethnicity (February 3, 2003). If you have to collect data on the race and/or ethnicity of your research subjects, you should be aware of the official U.S. government definitions that all federal agencies have to follow. You don't necessarily have to follow these guidelines, but they do offer up a way to code your data that is reasonably standardized.
5. StATS: Data management for survival data (August 27, 2002). Every project is different, of course, but here are some general concepts that may help you manage data for a survival data analysis project.
4. StATS: Longitudinal data (July 26, 2002). Dear Professor Mean, I have longitudinal data on the growth pattern of patients given growth hormone. How should I store the data? --Jittery Jerry
3. StATS: Loading ODBC drivers from the Microsoft Data Access Pack (January 24, 2001). Here are excerpts from some emails posted to the SPSSX-L listserver on September 10-11, 2000. These emails describe how to load special drivers for ODBC, especially the driver for Access 97.
2. StATS: Exporting SPSS graphs and tables (created January 28, 2000). Dear Professor Mean, I need to export the output from SPSS and use some of it in my word processing file. What is the best way to do this? -- Manic Marsha
1. StATS: Spreadsheet or Database? (created January 28, 2000). Dear Professor Mean, I am not sure whether I should use a database or a spreadsheet to enter my data.
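The regular-expression entries above (20, 18, and 12) describe three small text-cleaning tasks: finding the first line that does not begin with the letter A, adding line breaks to an XML file that has none, and removing an extra comma at the end of every line. The sketch below shows one way each might be done with Python's `re` module. The function names and sample strings are hypothetical, and the original articles may well use different tools and patterns.

```python
import re


def first_line_not_starting_with_a(text):
    """Return the first non-empty line that does not begin with 'A', or None."""
    for line in text.splitlines():
        # [^A] matches any single character other than 'A';
        # re.match anchors the test at the start of the line.
        if re.match(r"[^A]", line):
            return line
    return None


def add_line_breaks(xml):
    """Insert a newline between every pair of adjacent tags ('><')."""
    return re.sub(r"><", ">\n<", xml)


def strip_trailing_commas(text):
    """Remove one comma at the end of each line, leaving other commas alone."""
    # With re.MULTILINE, '$' matches at the end of every line,
    # not just at the end of the whole string.
    return re.sub(r",$", "", text, flags=re.MULTILINE)


# Hypothetical sample inputs, invented for illustration.
lines = "Apple\nAvocado\nBanana\nApricot"
xml = "<root><item>1</item><item>2</item></root>"
csv_text = "a,b,c,\n1,2,3,\n"

print(first_line_not_starting_with_a(lines))
print(add_line_breaks(xml))
print(strip_trailing_commas(csv_text))
```

The trailing-comma pattern assumes plain newline line endings. A file with Windows-style line endings would need the carriage return handled as well.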