B-1762. Recommendation: Who Says What to Whom on Twitter (Not found)

Not found

B-1761. Recommendation: Transformation and Weighting in Regression (Not found)

Not found

B-1760. Recommendation: Tests of one-sided versus two-sided hypotheses in placebo-controlled clinical trials (Not found)

Not found

B-1759. Recommendation: Testing for qualitative interactions between treatment effects and patient subsets (Not found)

Not found

B-1758. Recommendation: Tenancy, Marriage, and the Boll Weevil Infestation, 1892-1930 (Not found)

Not found

B-1757. Recommendation: Statistical power of negative randomized controlled trials presented at American Society for Clinical Oncology annual meetings (Not found)

Not found

B-1756. Recommendation: Statistical Issues in Drug Development (Not found)

Google-Books-ID: cmMbCcqnAXcC

B-1755. Recommendation: One hundred statistical tests (Not found)

Not found

B-1754. Recommendation: Statistical consideration of adaptive methods in clinical development (Not found)

Not found

B-1753. Recommendation: Susceptibility of live snails to predation (Not found)

A small dataset with counts. Ten rows and three columns.

B-1752. Recommendation: Simultaneous Statistical Inference (Not found)

Not found

B-1751. Recommendation: Seven items were identified for inclusion when reporting a Bayesian analysis of a clinical study (Not found)

Not found

B-1750. Recommendation: Scientific papers and presentations (Not found)

Not found

B-1749. Recommendation: The scientific dichotomy and the question of evidence (Not found)

Not found

B-1748. Recommendation: Are sample sizes clear and justified in RCTs published in dental journals? (Not found)

Not found

B-1747. Recommendation: Sample size recalculation in internal pilot study designs: a review (Not found)

Not found

B-1746. Recommendation: Sample size recalculation in internal pilot study designs: a review (Not found)

Not found

B-1745. Recommendation: Sample Size Estimation in Research With Dependent Measures and Dichotomous Outcomes (Not found)

Not found

B-1744. Recommendation: Sample size calculations for clinical studies allowing for uncertainty about the variance (Not found)

Not found

B-1743. Recommendation: A Review of Methods for Missing Data (Not found)

Not found

B-1742. Recommendation: The use and reporting of multiple imputation in medical research - a review (Not found)

Not found

B-1741. Recommendation: Regression Shrinkage and Selection Via the Lasso (Not found)

Not found

B-1740. Recommendation: 25 Recipes for Getting Started with R (Not found)

Not found

B-1739. Recommendation: Not found (Not found)

Not found

B-1738. Recommendation: Random-effects linear modeling and sample size tables for two special crossover designs of average bioequivalence studies: the four-period, two-sequence, two-formulation and six-period, three-sequence, three-formulation designs (Not found)

Not found

B-1737. Recommendation: In a Quiet Move, Washington Replaces the Head of AHRQ. Is It Too Late to Save the Agency? (Not found)

Not found

B-1736. Recommendation: Questioning the methodologic superiority of ‘placebo’ over ‘active’ controlled trials (Not found)

Not found

B-1735. Recommendation: Qualitative research articles: information for authors and peer reviewers (Not found)

Not found

B-1734. Recommendation: The prevalence of underpowered randomized clinical trials in rheumatology (Not found)

Not found

B-1733. Recommendation: Preliminary evaluation of factors associated with premature trial closure and feasibility of accrual benchmarks in phase III oncology trials (Not found)

Not found

B-1732. Recommendation: Preliminary evaluation of factors associated with premature trial closure and feasibility of accrual benchmarks in phase III oncology trials (Not found)

Not found

B-1731. Recommendation: Predicting accrual in clinical trials with Bayesian posterior predictive distributions (Not found)

Not found

B-1730. Recommendation: Patient perception of a long-term clinical trial: experience using a close-out questionnaire in the Studies of Left Ventricular Dysfunction (SOLVD) Trial. SOLVD Close-out Working Group (Not found)

Not found

B-1729. Recommendation: Patient characteristics compete with dose as predictors of acute treatment toxicity in early phase clinical trials (Not found)

Not found

B-1728. Recommendation: Optimism bias leads to inconclusive results-an empirical study (Not found)

Not found

B-1727. Recommendation: Opening up OpenURLs with Autodiscovery (Not found)

Not found

B-1726. Recommendation: Opening up OpenURLs with Autodiscovery (Not found)

Not found

B-1725. Recommendation: The use of one-sided tests in drug trials: an FDA advisory committee member’s perspective (Not found)

Not found

B-1724. Recommendation: A novel research design can aid disinvestment from existing health technologies with uncertain effectiveness, cost-effectiveness, and/or safety (Not found)

Not found

B-1723. Recommendation: Nomograms for calculating the number of patients needed for a clinical trial with survival as an endpoint (Not found)

Not found

B-1722. Recommendation: Negative studies published in Indian medical journals are underpowered (Not found)

Not found

B-1721. Recommendation: When was a negative clinical trial big enough? How many patients you needed depends on what you found (Not found)

Not found

B-1720. Recommendation: Naturopathy, Pseudoscience, and Medicine: Myths and Fallacies vs Truth (Not found)

Not found

B-1719. Recommendation: Multiple Comparisons: Theory and Methods (Not found)

Not found

B-1718. Recommendation: Not found (Not found)

Not found

B-1717. Recommendation: Modern Applied Statistics with S (Not found)

Not found

B-1716. Recommendation: Modeling and Validating Bayesian Accrual Models on Clinical Data and Simulations Using Adaptive Priors (Not found)

Not found

B-1715. Recommendation: Modeling the days of our lives: Using survival analysis when designing and analyzing longitudinal studies of duration and the timing of events (Not found)

Not found

B-1714. Recommendation: Mixing Qualitative and Quantitative Methods: Triangulation in Action. (Not found)

Not found

B-1713. Recommendation: Mixed-Effects Models in S and S-PLUS (Not found)

Not found

B-1712. Recommendation: Differences in Availability and Use of Medications for Opioid Use Disorder in Residential Treatment Settings in the United States (Not found)

Not found

B-1711. Recommendation: Mediators and moderators of treatment effects in randomized clinical trials (Not found)

Not found

B-1710. Recommendation: Beyond smartphones and sensors: choosing appropriate statistical methods for the analysis of longitudinal data (Not found)

Not found

B-1709. Recommendation: Inverse association between gastroesophageal reflux and blood pressure: results of a large community based study. (Not found)

Not found

B-1708. Recommendation: The role of internal pilot studies in increasing the efficiency of clinical trials (Not found)

Not found

B-1707. Recommendation: Inadequate statistical power of negative clinical trials in urological literature (Not found)

Not found

B-1706. Recommendation: Impure Science: AIDS, Activism, and the Politics of Knowledge (Not found)

Not found

B-1705. Recommendation: The Immortal Life of Henrietta Lacks (Not found)

2010-01-01

B-1704. Recommendation: Identifying clinical trials in the medical literature with electronic databases: MEDLINE alone is not enough (Not found)

Not found

B-1703. Recommendation: One hundred statistical tests (Not found)

Not found

B-1702. Recommendation: One hundred statistical tests (Not found)

Not found

B-1701. Recommendation: Hierarchical Commensurate and Power Prior Models for Adaptive Incorporation of Historical Information in Clinical Trials (Not found)

Not found

B-1700. Recommendation: The Team Handbook Third Edition (Not found)

Not found

B-1699. Recommendation: Group Sequential Methods (Not found)

Not found

B-1698. Recommendation: Generalizing the OpenURL Framework beyond References to Scholarly Works: The Bison-Fute Model (Not found)

Not found

B-1697. Recommendation: Not found (Not found)

Not found

B-1696. Recommendation: A Smartphone App Combining Global Positioning System Data and Ecological Momentary Assessment to Track Individual Food Environment Exposure, Food Purchases, and Food Consumption: Protocol for the Observational FoodTrack Study (Not found)

Not found

B-1695. Recommendation: First Use of Multiple Imputation with the National Tuberculosis Surveillance System (Not found)

Not found

B-1694. Recommendation: Evidence-Based To Value-based Medicine (Not found)

Not found

B-1693. Recommendation: Evidence-Based Medicine: How to Practice and Teach EBM (Not found)

Not found

B-1692. Recommendation: Evidence Based Medicine (Not found)

Not found

B-1691. Recommendation: Essential Evidence-based Medicine (Not found)

Not found

B-1690. Recommendation: Eliciting and using expert opinions about dropout bias in randomized controlled trials (Not found)

Not found

B-1689. Recommendation: E-cigarette use and associated changes in population smoking cessation: evidence from US current population surveys (Not found)

Not found

B-1688. Recommendation: Design issues of randomized phase II trials and a proposal for phase II screening trials (Not found)

Not found

B-1687. Recommendation: The Design and Analysis of Sequential Clinical Trials, 2.Rev.Ed. (Not found)

Not found

B-1686. Recommendation: Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists (Not found)

Not found

B-1685. Recommendation: Critical Appraisal of Epidemiological Studies and Clinical Trials (Not found)

Not found

B-1684. Recommendation: Correcting for Optimistic Prediction in Small Data Sets (Not found)

Not found

B-1683. Recommendation: Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating (Not found)

Not found

B-1682. Recommendation: Classical and adaptive clinical trial designs with ExpDesign Studio (Not found)

Not found

B-1681. Recommendation: Classical and adaptive clinical trial designs with ExpDesign Studio (Not found)

Not found

B-1680. Recommendation: Novel citation-based search method for scientific literature: a validation study (Not found)

Not found

B-1679. Recommendation: Extinction of chromosomes due to specialization is a universal occurrence (Not found)

Not found

B-1678. Recommendation: Not found (Not found)

A nice comparison of software that you can use to manage a large set of research references.

B-1677. Recommendation: Challenges to accrual predictions to phase III cancer clinical trials: a survey of study chairs and lead statisticians of 248 NCI-sponsored trials (Not found)

Not found

B-1676. Recommendation: Bayesian Statistical Modelling (Not found)

Not found

B-1675. Recommendation: Bayesian methods in health technology assessment: a review (Not found)

Not found

B-1674. Recommendation: What is authorship, and what should it be? A survey of prominent (Not found)

Not found

B-1673. Recommendation: Applied Bayesian Hierarchical Methods (Not found)

Not found

B-1672. Recommendation: Analysis of pretest-posttest designs (Not found)

Not found

B-1671. Recommendation: Analysis of longitudinal data (Not found)

Not found

B-1670. Recommendation: Advantages of a wholly Bayesian approach to assessing efficacy in early drug development: a case study (Not found)

Not found

B-1669. Recommendation: Risk of acute lung injury/acute respiratory distress syndrome in critically ill adult patients with pre-existing diabetes: a meta-analysis (Not found)

Not found

B-1668. Recommendation: Achieving sufficient accrual to address the primary endpoint in phase III clinical trials from U.S. Cooperative Oncology Groups (Not found)

Not found

B-1667. Blog post: An example of a poor color choice (2020-03-11)

I ran across a graph in a journal article. The article itself was good, but the graph had a rookie mistake. I shouldn’t point this out, because I myself have been guilty of far worse mistakes. But this graph illustrated the point far better than anything I could have said.

B-1666. Blog post: UMKC big data and data science initiatives (2020-03-10)

I am trying to keep track of several recent initiatives at UMKC. Here are the details, as best I understand them.

B-1665. Recommendation: What the heck is data science (2020-03-04)

Everyone has a different definition of data science. Dr. Anderson says it is fundamentally about workflow and quotes John Tukey. This is a nice overview and a very brief read.

B-1664. Recommendation: Meet UMKC IDEAS (2020-03-04)

A brief Public Relations article about the new data science initiative at UMKC.

B-1663. Recommendation: Mass shootings in 2020 (2020-03-04)

A mass shooting is defined as four or more shot and/or killed in a single event. There are seven variables in this dataset. The number of rows increases as events are added over time.

B-1662. Recommendation: COVID-19 Complete Dataset (2020-03-04)

A list of confirmed coronavirus cases, including country, province/state, and latitude/longitude, pulled from WHO reports. The data set is updated daily. You need to sign up for a free account before you can download this data.

B-1661. Recommendation: Bob Ross (2020-03-04)

Bob Ross was a painter who taught a particular style of painting in a Public Television series. This dataset shows categorical data on what visual elements were included in the pictures from every one of his shows. It has 404 rows and 69 columns.

B-1660. Recommendation: Barbershop music (2020-03-04)

A simple tab delimited file with three continuous variables and 34 observations.

B-1659. Recommendation: Food inspections (2020-03-04)

The city of Albuquerque, New Mexico, has an open data policy and places much of its data in a publicly available repository. This data set shows information on food inspections. It has 26 variables.

B-1658. Blog post: Smart quotes, em dashes, and en dashes (2020-03-02)

If you work with text data a lot, you will encounter some characters that are sort of close to what you need, but sort of not. These include the smart quotes, em dashes, and en dashes.
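As a rough illustration of the problem described above, here is a minimal Python sketch that swaps smart quotes and dashes for plain ASCII stand-ins. The function name `to_plain_ascii` and the choice of replacements (for example, two hyphens for an em dash) are assumptions made for this example, not something taken from the blog post.

```python
# Map typographic characters to plain ASCII stand-ins.
# The replacement choices below are illustrative assumptions.
REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as an apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
}
TABLE = str.maketrans(REPLACEMENTS)


def to_plain_ascii(text: str) -> str:
    """Return text with smart quotes and dashes replaced (hypothetical helper)."""
    return text.translate(TABLE)


sample = "\u201cIt\u2019s 2019\u20132020\u201d"
print(to_plain_ascii(sample))
```

Whether you want to replace these characters or preserve them depends on the downstream use; the sketch only shows one way to normalize them.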

B-1657. Recommendation: The Facts In The Case Of Dr. Andrew Wakefield (2020-03-01)

A graphic novel depiction of the controversial career of Andrew Wakefield.

B-1656. Recommendation: Tidy Data (2020-02-29)

Not found

B-1655. Recommendation: Stop That Subversive Spreadsheet (2020-02-29)

Not found

B-1654. Recommendation: Biostats4You. Statistical resources for non-statisticians (2020-02-29)

A curated list of resources developed by the Biostatistics, Epidemiology and Research Design Special Interest Group of the Association for Clinical and Translational Science.

B-1653. Recommendation: Nine Simple Ways to Make It Easier to (Re)use Your Data (2020-02-29)

Not found

B-1652. Recommendation: Enhance Your Own Research Productivity Using Spreadsheets (2020-02-29)

Not found

B-1651. Recommendation: Data Intended for Human Consumption, Not Machine Consumption (2020-02-29)

Not found

B-1650. Recommendation: Gene Name Errors Are Widespread in the Scientific Literature (2020-02-29)

Not found

B-1649. Recommendation: Mistaken Identifiers: Gene Name Errors Can Be Introduced Inadvertently When Using Excel in Bioinformatics (2020-02-29)

Not found

B-1648. Recommendation: Data Organization in Spreadsheets (2020-02-29)

Not found

B-1647. Recommendation: Abandon All Hope, Ye Who Enter Dates in Excel (2020-02-29)

Not found

B-1646. Recommendation: Data Organization in Spreadsheets (2020-02-29)

Many people have condemned the use of spreadsheets for entering and organizing your data. The practice, however, continues unabated. If you do plan to use spreadsheets like Microsoft Excel for data entry, this article offers some simple recommendations that can help avoid some of the pitfalls.

B-1645. Blog post: Proposed talk on secondary data analysis (2020-02-27)

I am suggesting a talk on secondary data analysis for a seminar series here at UMKC. Here is the title and abstract.

B-1644. Recommendation: CloudLab (2020-02-26)

An NSF funded site that gives students access to a variety of cloud computing options.

B-1643. Recommendation: NextGen Data Science and Analytics Innovation Center (2020-02-25)

A brief description of the hyperconverged computational hub that UMKC is getting.

B-1642. Recommendation: Ethical riddles in HIV research (2020-02-24)

Not found

B-1641. Recommendation: Machine Learning Prediction of Mortality and Hospitalization in Heart Failure With Preserved Ejection Fraction (2020-02-23)

The authors take various advanced modeling approaches and apply them to the same data set. The random forest model did the best for this data set.

B-1640. Quote: No aphorism is more frequently repeated… (2020-02-21)

“No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or, ideally, one question, at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will best respond to a logical and carefully thought out questionnaire; indeed, if we ask her a single question, she will often refuse to answer until some other topic has been discussed.” ~ Ronald Fisher

B-1639. Blog post: Data visualization as art (2020-02-04)

I came across an interesting article in the New York Times and two books that all have a common theme. They use the tools of data visualization to make artistic statements.

B-1638. Blog post: Data collection in a research methods class (2020-02-04)

One of the discussion boards sponsored by the American Statistical Association had a question about data collection exercises in a research methods class. The initial question was how it might be done and how you could detect if a student was cheating by just making up some numbers. Several people raised the issue of IRB approval of research and I decided to chime in with a response.

B-1637. Blog post: Printing R Markdown output to png files (2020-01-29)

I have been using a tedious process to convert parts of the output of an R Markdown file to png files. I want to use png files because they can be inserted easily into a PowerPoint presentation. There’s a trick to make this work.

B-1636. Blog post: Artist wanted, here are the details (2020-01-26)

I have gotten several nibbles on a posting that I made last week. Here is some more information.

B-1635. Blog post: Celebrating the failures of medical research using a graphic novel format (2020-01-23)

I sent out a flurry of small grants in December and most did not get funded, but one of them did. And I need your help with finding the right collaborator.

B-1634. Blog post: Imbalanced sample sizes in the Fisher Exact test (2020-01-17)

Dear Professor Mean, We are conducting a study in which mice receive a bolus injection of one of four treatments (a hormone, vehicle, hormone plus receptor antagonist, or receptor antagonist alone) and then we document whether an arrhythmia has occurred. Each of these treatment groups has a different number of animals (as few as six or as many as twelve). My understanding is that you cannot use Chi square with a sample size this low and you can’t use Fisher’s Exact Test when the groups have unequal sample sizes. What statistical test do we need to see if there are differences in responses between these groups?
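Fisher's Exact Test conditions on the row and column totals of the table, so it does not require equal group sizes. The sketch below is a minimal Python illustration for just two groups of unequal size (6 and 12 animals). The counts are invented for the example, and the function name `fisher_exact_2x2` is hypothetical. A comparison of all four groups would need the extension to larger tables (often called the Fisher-Freeman-Halton test), which is not shown here.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Rows are groups, columns are outcome (event, no event).
    Sums the probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    r1, r2 = a + b, c + d      # group sizes (may be unequal)
    c1 = a + c                 # total number of events
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability of x events in the first group.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Invented counts: 5 of 6 mice with arrhythmia versus 2 of 12.
p_value = fisher_exact_2x2(5, 1, 2, 10)
print(round(p_value, 4))
```

The calculation runs without complaint even though one group is twice the size of the other; unequal sample sizes affect power, not the validity of the test.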

B-1633. Recommendation: Disappointing Results of Major Study Point to Better Ways to Cut Health Care Waste (2020-01-10)

A very well run randomized trial upsets some of the pre-conceived notions of many (including myself) on how best to reduce health care costs. The results suggest a different way forward.

B-1632. Recommendation: Statistical rethinking : a Bayesian course with examples in R and Stan (2020-01-09)

A book recommended by Frank Harrell that he says makes it easy to introduce Bayesian concepts to non-statisticians.

B-1631. Recommendation: KU Hospital signs deal with IBA for proton beam therapy | The Kansas City Star (2020-01-09)

A local story about a national problem: the race to get the biggest and the best technology, perhaps more for bragging rights than for actual medical necessity. The equipment described in this article is extremely expensive, duplicates what is available in St. Louis and Oklahoma City, and has only been proven effective on a small number of cancers. This newspaper account shows both sides of this controversial purchase.

B-1630. Recommendation: Introducing Jupytext (2020-01-07)

Jupyter notebooks use a rather convoluted JSON format that is difficult to read. The Jupytext package will translate a basic script or markdown formatted file into a Jupyter notebook. This potentially allows a translation from R Markdown.

B-1629. Blog post: Identifying and manipulating non-breaking spaces (2020-01-07)

Non-breaking spaces are one of those weird little things in the computer that you have to keep your eye on. They can cause a lot of problems if they are assumed to be normal spaces. I found a non-breaking space “out in the wild” and decided to use it as an opportunity to explore what it was and better prepare myself for when I encounter more of these beasts.
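To make the issue concrete, here is a small Python sketch that locates non-breaking spaces (Unicode U+00A0) in a string and replaces them with ordinary spaces. The helper names `find_nbsp` and `replace_nbsp` and the sample string are assumptions for illustration; the blog post itself may use different tools.

```python
import unicodedata

NBSP = "\u00a0"  # non-breaking space


def find_nbsp(text):
    """Return the positions of every non-breaking space (hypothetical helper)."""
    return [i for i, ch in enumerate(text) if ch == NBSP]


def replace_nbsp(text):
    """Return text with non-breaking spaces turned into normal spaces."""
    return text.replace(NBSP, " ")


sample = "10\u00a0mg of drug"          # looks like "10 mg of drug" on screen
print(find_nbsp(sample))               # where the non-breaking spaces are
print(unicodedata.name(sample[2]))     # official Unicode name of that character
print(sample == "10 mg of drug")       # comparison fails before cleaning
print(replace_nbsp(sample) == "10 mg of drug")
```

The failed comparison is the kind of silent mismatch that makes these characters troublesome: the two strings look identical but are not equal until the non-breaking space is replaced.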

B-1628. Recommendation: WRITE Statistics RIGHT! Tips on Good Writing Style for R&M Researchers (2020-01-06)

A Powerpoint presentation on writing with lots of good before and after examples.

B-1627. Recommendation: Some style and grammar tips for biostatistics and statistics students. (2020-01-06)

A nice guide on writing that is written directly to statisticians.

B-1626. Recommendation: When Russia and America Cooperated to Avert a Y2K Apocalypse (2020-01-04)

This article highlights the military concerns between the United States and Russia in December 1999 that the Y2K bug would cause an accident that could trigger a war.

B-1625. Recommendation: Who Maps the World? (2019-12-25)

Access to good quality map information is critical to many data analyses. This article talks about OpenStreetMap, an open source mapping effort, as well as efforts to extend mapping to parts of the world that are not yet mapped well.

B-1624. Recommendation: Seven of our favorite podcasts from 2019 (2019-12-25)

This is a list of podcasts that talk about Health Informatics.

B-1623. Recommendation: A Stolen Laptop Contained Data For More Than 114,000 Patients At Truman Medical Centers (2019-12-19)

This story is not quite as bad as it sounds. The laptop that was stolen had password protection. Even so, there is some concern whenever sensitive information is placed on something that is so easily misplaced or stolen.

B-1622. Recommendation: Period-Tracking Apps Say You May Have a Disorder. What if They’re Wrong? (2019-12-19)

Apps on your telephone can track lots of information about you. But, you might find out that these apps turn that information into recommendations that you may have a condition that needs medical attention. This is happening with apps that track menstrual period regularity. They are suggesting that some patients may have polycystic ovary syndrome and should see their doctors. Is this a good thing or a bad thing?

B-1621. Recommendation: Men Call Their Own Research Excellent (2019-12-19)

A text analysis of research abstracts showed that men tend to use stronger words to describe their research.

B-1620. Recommendation: Population-based age adjustment tables for use in occupational hearing conservation programs (2019-12-19)

PMID: 31846396

B-1619. Recommendation: An Analysis of Online Sermons in U.S. Churches | Pew Research Center (2019-12-17)

The Pew Research Center found 49,719 sermons online from 6,431 churches and looked at some traditional text mining tools like differential word counts between various types of churches.

B-1618. Recommendation: Guidelines for the Content of Statistical Analysis Plans in Clinical Trials (2019-12-15)

This article developed consensus guidelines on what belongs in a statistical analysis plan.

B-1617. Recommendation: Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review (2019-12-15)

An article touting the great job market for data scientists.

B-1616. Recommendation: Evidence‐based statistical analysis and methods in biomedical research (SAMBR) checklists according to design features (2019-12-15)

This article offers a checklist of things you need to talk about when you are describing a statistical analysis plan.

B-1615. Recommendation: R packages (2019-12-14)

A nice guide to all the little details you need to know in order to create an R package.

B-1614. Recommendation: A Guide For Time Series Prediction Using Recurrent Neural Networks (LSTMs) (2019-12-12)

Time series models have a special structure that requires a special type of neural network. This article gives an overview of this network.

B-1613. Recommendation: DataBuilder - GPC Informatics (2019-12-12)

Data Builder is a system that allows you to extract individual patient information from an i2b2 system. This wiki page gives an under-the-hood look at how Data Builder works.

B-1612. Recommendation: Inaugural Florence Nightingale Day Inspires Students to Pursue Statistics, Data Science (2019-12-06)

The American Statistical Association promotes careers in Statistics for women on Florence Nightingale Day.

B-1611. Recommendation: Tukey tallying (2019-11-08)

The normal way of keeping a tally on a sheet of paper is with four vertical slashes followed by a diagonal slash through those four. But John Tukey came up with an alternative that is less likely to create errors. This blog post describes this method.

B-1610. Recommendation: Beyond Moran’s I: Testing for spatial dependence based on the spatial autoregressive model (2019-10-31)

This paper offers a review of various measures of spatial correlation.

B-1609. Recommendation: Introduction to Machine Learning Interpretability (2019-10-30)

A good overview of methods to try to understand how various machine learning models work, with applications in H2O.ai.

B-1608. Recommendation: Why Should I Trust You?: Explaining the Predictions of Any Classifier (2019-10-30)

This article introduces Local Interpretable Model-agnostic Explanations (LIME), an approach that can help you understand the important variables in a black box prediction model.

B-1607. Recommendation: A unified approach to interpreting model predictions (2019-10-30)

Many big data models are black boxes. This article describes an approach that helps you to peek a bit inside the black box and understand how it works.

B-1606. Recommendation: Data for the Public Good (2019-10-24)

Not found

B-1605. Recommendation: About the Correlates of War project (2019-07-02)

This website has nicely documented data on wars fought anywhere in the world, from 1816 onward.

B-1604. Recommendation: Effect of rapid diagnosis of influenza virus type A on the emergency department management of febrile infants and toddlers (2019-07-02)

A very nice example of secondary data analysis. I am one of the co-authors.

B-1603. Recommendation: UC Irvine Machine Learning Repository (2019-07-02)

Over 400 data sets, well documented and ready for advanced statistical analyses.

B-1602. Recommendation: Statistics Losing Ground to Computer Science (2019-06-25)

This article is a bit shrill for my tastes, but the message is important. Dr. Matloff claims that Computer Scientists are taking over the field of Statistics, to our personal loss and the loss of the profession of Statistics. He offers some advice on how to work with Computer Scientists so that both they and we can prosper.

B-1601. Recommendation: A theory of organizational readiness for change (2019-06-24)

This paper offers a formal method for assessing the extent to which an organization is ready to adopt new work practices.

B-1600. Recommendation: A checklist for identifying determinants of practice: A systematic review and synthesis of frameworks and taxonomies of factors that prevent or enable improvements in healthcare professional practice (2019-06-23)

This paper presents the TICD (Tailored Implementation for Chronic Diseases) checklist. This checklist includes 57 items in seven domains: guideline factors, individual health professional factors, patient factors, professional interactions, incentives and resources, capacity for organisational change, and social, political, and legal factors.

B-1599. Recommendation: The Utility of Template Analysis in Qualitative Psychology Research (2019-06-23)

This paper describes Template Analysis, an approach in coding qualitative data that saves time by imposing a structure that is determined prior to data collection. This is a sharp contrast to the more open ended approaches used commonly in qualitative research.

B-1598. Recommendation: An Information-Processing Analysis of Graph Perception (2019-06-23)

This article offers a general framework for how you perceive and analyze graphics. It describes a simple but powerful experiment on the speed and accuracy of perceptions.

B-1597. Recommendation: Making sense of implementation theories, models and frameworks (2019-06-23)

There are a variety of psychological models used to implement new research findings in the health care system. This paper outlines the five major categories that these models fall into: process models, determinant frameworks, classic theories, implementation theories, and evaluation frameworks.

B-1596. Recommendation: Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science (2019-06-23)

This paper outlines CFIR (Consolidated Framework for Implementation Research), a popular framework for planning and evaluating interventions in the health care workplace.

B-1595. Recommendation: The development of a theory-based intervention to promote appropriate disclosure of a diagnosis of dementia (2019-06-22)

This article shows how models in Psychology (Theory of Planned Behavior and Social Cognitive Theory) allow you to develop effective interventions to change physician behavior.

B-1594. Recommendation: RStudio Pandoc - HTML To Markdown | R-bloggers (2019-06-12)

I’m revising my website and blog and want to do all my future web development using R Markdown. The first step is to convert all my old files from html format to markdown format. Dr. Wu shows how you can do this using Pandoc and some wrapper functions in the rmarkdown package.

B-1593. Recommendation: An Interview with W. J. Dixon Award Winner Dallas Johnson (2019-06-12)

Dallas Johnson has helped train hundreds if not thousands of statistical consultants. He has a very pragmatic approach to statistics. This interview reviews much of Dr. Johnson’s career.

B-1592. Recommendation: An Interview with Doug Zahn (2019-06-11)

Doug Zahn did some pioneering work on the process of statistical consulting. This interview covers some of the historical roots of his work.

B-1591. Recommendation: Student Computing Labs (2019-06-10)

The student computing labs at UMKC offer an easy way for students to get access to SAS and other programs. You can visit the labs at a variety of locations on campus or you can access them through the Internet using Remote Desktop Connection. This page gives a brief overview of these labs.

B-1590. Recommendation: Technical Knowledge? Check! Experience? Check! What Else Might Help You? Thoughts from the ASA President-Elect (2019-06-10)

This article stresses to statistical consultants the importance of communication skills and continuing education.

B-1589. Recommendation: Site-Licensed Software, SAS for Windows (2019-06-09)

If you are a student, staff member, or faculty member at UMKC, you can get SAS loaded on your desktop computer. The specialists in Information Systems will even do all the work for you. Here’s how you request access to SAS.

B-1588. Recommendation: An Expanded Approach to Educating Statistical Consultants (2019-06-09)

This article describes the process used at Florida State University to train new consultants.

B-1587. Recommendation: Explore the SAS workspace (1 of 8) (2019-06-08)

If you’ve never used SAS before, the interface can be a bit intimidating. It’s not that bad once you get used to it, and this introductory guide will help you get started.

B-1586. Recommendation: Recent Developments and Their Implications for the Future of Academic Statistical Consulting Centers (2019-06-08)

Over the years, some academic consulting centers have prospered and others have sputtered and faltered. Dr. Vance offers some thoughts on what consulting centers need to do in order to thrive in a changing world.

B-1585. Recommendation: A Model for an Interdisciplinary Undergraduate Research Program (2019-06-07)

A nice description of the efforts at St. Olaf College to build an undergraduate research curriculum.

B-1584. Recommendation: SAS University Edition (2019-06-07)

SAS University Edition is software that is freely available for academic uses. This page offers links on how to install and run this software.

B-1583. Recommendation: Statistical Consulting in a University: Dealing with People and Other Challenges (2019-06-06)

An old article, but the lessons are still relevant today. Dr. Kirk tackles issues of the diversity of problems and clients, inappropriate expectations, and the reward system in academia.

B-1582. Recommendation: My Life as a Statistical Consultant: JSM 2016 Invited Panel Discussion (2019-06-06)

A recap of a panel discussion of the experiences of three statistical consultants.

B-1581. Recommendation: SAS Using R Markdown (Windows) (2019-06-05)

You can run SAS code in an R Markdown program. You need a local license for SAS, and the code is a bit obscure, but this is a nice option if you already know R and just want to include a bit of SAS code here and there.

B-1580. Recommendation: How to run SAS programs in Jupyter Notebook, The SAS Dummy blog, 2016-04-24 (2019-06-04)

A bit dated, but a nice guide to running SAS in a Jupyter notebook.

B-1579. Recommendation: Internet Movie Script Database (2019-06-03)

An amazing resource with scripts to a whole bunch of movies. A great resource for text mining.

B-1578. Recommendation: Thoughts on statistical consulting, The stupidest thing blog, 2013-04-02 (2019-06-03)

Dr. Broman offers practical advice on listening, admitting when you don’t know the answer, and how to say no.

B-1577. Recommendation: Running a Meeting: 10 Rookie Mistakes and How to Avoid Them (2019-06-02)

If you do a lot of statistical consulting, you will end up in a lot of meetings. Here is some guidance on how to make those meetings go well.

B-1576. Recommendation: SSD: Single Shot MultiBox Detector (2019-06-01)

This article shows how to use a neural network to identify where specific objects in an image appear.

B-1575. Recommendation: ImageNet Classification with Deep Convolutional Neural Networks (2019-05-31)

This article provides a nice overview of how to use neural networks in image classification.

B-1574. Recommendation: How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data (2019-05-31)

It is hard to measure how often misconduct in research occurs, but this article provides a pretty good answer.

B-1573. Recommendation: The visibility of scientific misconduct: A review of the literature on retracted journal articles (2019-05-30)

This article presents a Sociological view of the retraction process.

B-1572. PMean: Finding me on the Internet (2019-05-15)

It is worth tracking the limited number of websites that mention me (and not people who share my first and last name). Here are a few places where I get mentioned, albeit very briefly in some cases.

B-1571. Recommendation: PROC-X.com. An online (unofficial) SAS journal – written by bloggers (2019-05-10)

A nice site that aggregates blog posts about SAS from a variety of authors.

B-1570. Recommendation: Scientists’ grant writing styles vary by gender (2019-05-09)

This is a brief summary of a research paper (apparently behind a pay wall, boo!) that looked at the language used in research grants submitted to the Gates Foundation. It found differences by gender in word choices. Men are more likely to choose “broad” words, while women are more likely to choose “narrow” words. These two terms are put in quotes because what you and I think they might mean is different from how the researchers defined them. Read the paper to find out more about this. It is definitely worth reading, even if you might disagree with the authors’ definitions of broad and narrow.

B-1569. PMean: Several Python resources (2019-05-02)

I have not had the time to learn Python yet, but it is on my short term list of research goals. I attended a very nice talk about Python and data science and tried to get a list of interesting resources in Python from that talk. Here is my incomplete and imperfect list.

B-1568. Recommendation: Assessment Resource Tools for Improving Statistical Thinking (2019-05-02)

A great teaching resource with a test bank of items to assess knowledge and attitudes about Statistics. There are also links to other helpful resources.

B-1567. PMean: Writing the methods section of a research paper (2019-04-25)

I’m teaching a class on Clinical Research Methodology and at least a few of the students are confused about what to put in the methods section of a research paper or a thesis. They’re confused? I’m even more confused than they are. Every paper and every thesis is different, so it is impossible to offer any coherent guidance. But let me try anyway.

B-1566. Recommendation: A tribute to Douglas C. Altman, DSc (2019-04-25)

This is a brief and personal recollection of Doug Altman, a statistician who has had great influence on the practice of Statistics and the reporting of research results. There is also a nice Wikipedia write-up on Dr. Altman.

B-1565. Recommendation: Data science curriculum roadmap (2019-04-25)

How do you teach data science? That’s not an easy question, because data science means different things to different people. This site shows different curricula depending on what you want your program to emphasize.

B-1564. Recommendation: Separating Unique and Duplicated Observations Using PROC SORT in SAS 9.3 and Newer Versions (2019-04-18)

SAS has some very powerful ways to find duplicate values and to store the duplicates separate from the unique values. Many of these use the sort procedure. Here is a nice guideline for what would otherwise be very difficult to figure out on your own.

B-1563. Recommendation: Bibliographies and citations (in R Markdown) (2019-04-18)

If you want to use R Markdown to prepare research papers and presentations, you need to learn how to cite references and include a bibliography. This is a nice introduction and shows the variety of options at your fingertips.

B-1562. Recommendation: What would Florence Nightingale make of big data? (2019-04-17)

This is a nice video, professionally produced and very short (4 minutes) that shows the importance Florence Nightingale attached to Statistics. It reviews how she used Statistics aggressively to lobby for improvements to health care, and speculates on what she would think about the efforts today to use big data for decision making. The narrator is David Spiegelhalter, a famous statistician.

B-1561. Recommendation: Writing a discussion section: how to integrate substantive and statistical expertise (2019-04-16)

The paper in BMC Medical Research Methodology gives practical step by step guidance on writing your discussion section.

B-1560. Recommendation: When Big Data goes to school (2019-04-16)

File this under the “dark side” of data science. Alfie Kohn is a critic of many of the motivational methods used in business and education, and he makes many good points in this blog post about relying on readily available data without questioning its quality.

B-1558. Recommendation: Case for omitting tied observations in the two-sample t-test and the Wilcoxon-Mann-Whitney Test (2019-04-04)

When you are running a non-parametric test, like the Wilcoxon-Mann-Whitney test, you can only be 100% sure of the properties of that test (including Type I and Type II error rates) if the data are continuous. If there are ties in the data, the properties of the test are unknown. This paper shows four commonly used approaches for settings where values might be tied and runs simulations to measure Type I and Type II error rates for both the two-sample t-test and the Wilcoxon-Mann-Whitney test under a range of tied values and a range of distributions. The results are, at least to me, quite surprising.

B-1557. Recommendation: When life gives you coloured cells, make categories (2019-04-03)

A lot of people use formatting to denote important information in an Excel spreadsheet. In particular, they will use the color of a cell to designate a particular category. Pretty much all formatting information is lost when you import from Excel to R or any other statistical package. But rather than ask people to go back and fix things, there are a few tricks that you can use to recover this information, as is shown in this blog entry.

B-1556. Recommendation: Ethical Practice in Data Mining (2019-04-01)

This is a very nice summary of six major areas where data mining has led to serious ethical concerns.

B-1555. PMean: Writing the introduction section of a research thesis or dissertation (2019-03-29)

The introduction section of your research thesis or dissertation is the first thing that most people will read after reading the abstract. Some people use the introduction section to provide a literature review, and I won’t talk about that here. I did offer a nice recommendation on how to write a literature review in an earlier post. The introduction should present your research problem (research question, research hypothesis), but first you have to offer some context.

B-1554. Recommendation: Data Science Has Become About Lending False Credibility To Decisions We’ve Already Made (2019-03-25)

A rather harsh and cynical take on data science, but still worth reading. Let me share a story about this. Back in my college days (that would be the 1970s), someone found a New Yorker cartoon and shared it with me. It showed a politician, obviously a very powerful politician because his office had a view of the Washington Monument. He was saying to his aide, “That’s the gist of what I want to say. Now go and find me some statistics to base it on.” So the issues that this person brings up are no different than those from four decades ago. There’s no easy solution to the problem. You can’t say, “I’ll only work with people who have a commitment to the truth, no matter where it might lead” because even people without strong overt biases still have subtle biases that can profoundly skew the results. Requiring a priori specifications and reserving a hold out sample for a final quality check can help, but mostly it comes down to being careful, detail oriented, and transparent in all your work.

B-1553. Recommendation: Census Geocoder (2019-03-25)

This website will take an address, either on a form or as a batch of up to 10,000 addresses (CSV format) and provide latitude, longitude pairs as well as U.S. Census tract information.

B-1552. Recommendation: Harvard University Program on Survey Research (2019-03-23)

This is a series of guides on survey research, written for the beginning student. It is written from the perspective of Political Science, but the advice works for other areas as well.

B-1551. Recommendation: A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models (2019-03-15)

This is one of those articles where you have to restrain yourself. Its message, that good old statistical tools like logistic regression can perform as well as these newfangled machine learning approaches that you haven’t taken the time to learn, is quite tempting. But I’d be cautious here. Maybe logistic regression is still competitive, but maybe the systematic review got a bunch of biased studies. It’s worthwhile to cite this whenever someone makes an overly strong claim about machine learning models, but don’t use this as an excuse to keep from learning the new stuff yourself. This article is stuck behind a paywall. Sorry!

B-1550. Recommendation: Webinar Series, Congressionally Directed Medical Research Programs (2019-03-14)

We live in a golden age of learning, where you can find just about anything you’d ever need to learn on the Internet. One example of this is a series of webinars about how to get research funding through the Congressionally Directed Medical Research Programs (CDMRP). I have not listened yet to any of these webinars, but they look like they would be very helpful for anyone seeking funding through this program.

B-1549. Recommendation: R Markdown Basics (2019-03-08)

This is actually a nice “peek under the hood” approach with lots of practical advice about getting that last tweak in to make your results go from good to great.

B-1548. Quote: Did you hear about the mathematician… (2019-03-08)

“Did you hear about the mathematician who was afraid of negative numbers? He would stop at nothing to avoid them.” (This joke is all over the Internet, and I’m not sure where the original source would be).

B-1547. Recommendation: LaTeX/Mathematics (2019-03-08)

You can incorporate very nice looking mathematical formulas in R Markdown fairly easily. The system relies on LaTeX for displaying formulas and is surprisingly easy to learn. But every once in a while you want to do something a bit exotic, like placing a “hat” in your equation. I’ve typically just done a quick Google search on something like “LaTeX hat symbol” and each different search yields a different website. Recently, I stumbled upon a fairly comprehensive guide to displaying mathematical formulas in LaTeX. It is published as an eBook.

Note: Some of the examples require additional libraries like amsmath and I haven’t figured out yet how to take advantage of these libraries in R Markdown.
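As a small illustration of the “hat” case, here is a minimal sketch using only commands from base LaTeX math mode, so it should not need amsmath. The symbols are chosen purely for illustration and are not taken from the eBook.

```latex
% A narrow hat over a single symbol, and a wide hat over a longer expression
$$
\hat{\beta}_1 \qquad \widehat{\beta_0 + \beta_1 x}
$$
```

In an R Markdown document, the same commands can be placed between single dollar signs, as in `$\hat{\beta}_1$`, to display the formula inline rather than on its own line.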

B-1546. Recommendation: 12 things I wish I’d known before starting as a Data Scientist (2019-03-06)

This article was recommended to me at a webinar I attended. The author offers very personal and practical advice. The author’s third point “You’ll never have to know all the tools” is quite reassuring.

B-1545. Recommendation: Stop Saying ‘Exponential.’ Sincerely, a Math Nerd. (2019-03-06)

This is a brief plea to avoid using the word “exponential” when you really mean “a lot.”

B-1544. Recommendation: What is a Stepped-Wedge Trial? (2019-03-04)

A 32 minute video providing a non-technical introduction to stepped-wedge designs.

B-1543. PMean: Slapping the word pilot on a failed study (2019-03-03)

Someone was asking on the MedStats listserv about a study that had gone off the rails. They had recruited only about a third of the patients that they had wanted. Things were going pretty well in the first arm of the study, but the second arm had a dropout rate of 50%.

Anyway, they decided to end the study (good call!) and wanted to know what they should do with the data that they had already collected. There were three options that they were considering (I’m paraphrasing a bit here).

  1. Analyze the study as originally planned, including a classic test of hypothesis for the primary outcome.
  2. Call this a pilot study and provide descriptive analyses only.
  3. Recognize that the data is so fatally flawed that any analysis of the data would be inappropriate.

This is what I suggested.

B-1542. Recommendation: Fostering Integrity in Research (2019-03-01)

Five important case studies in research ethics.

B-1541. Quote: All scientific work is incomplete… (2019-02-26)

“All scientific work is incomplete, whether it be observational or experimental. All scientific work is liable to be upset or modified by advancing knowledge. That does not confer upon us a freedom to ignore the knowledge we already have, or to postpone the action that it appears to demand at a given time. Who knows, asked Robert Browning, but the world may end tonight? True, but on available evidence most of us make ready to commute on 8:30 the next day.” Sir Austin Bradford Hill, as quoted in his landmark 1965 paper on causation.

B-1540. PMean: Make a loud mistake (2019-02-15)

I dated a piano major in college and I tried, with very limited success, to learn how to play the piano myself. She told me, “If you’re going to make a mistake, make a loud mistake.” You don’t want to play the piano nervously and hesitantly. The same is true in research.

B-1539. Recommendation: 8 Waste Types of DOWNTIME (2019-02-14)

This page reviews the commonly used acronym DOWNTIME that classifies the eight ways that waste can occur in a manufacturing process.

B-1538. Recommendation: SMART Objectives (2019-02-14)

The acronym SMART is a nice way to define an objective in a quality improvement project. It has to be Specific, Measurable, Achievable, Relevant, and Time-Bound. This page outlines what each of these terms means and why they are important. It also provides examples of objectives that meet the SMART criteria.

B-1537. Recommendation: Checklist to Evaluate the Quality of Questions (2019-02-14)

A detailed and comprehensive list of things to look for when you are reviewing a new questionnaire. It is based on a document, QAS-99, that was originally published on cancer.gov, but the original link is no longer active.

B-1536. Recommendation: Quality Improvement in Healthcare (2019-02-14)

This is a slick video that outlines the quality improvement process using clever drawings. It is only 11 minutes long but provides a very nice overview.

B-1535. Recommendation: A Framework for Program Evaluation (2019-02-14)

This page offers a broad and comprehensive overview of how to plan and conduct a program evaluation. There are tons of resources (a mix of html pages and PDF documents) behind various hyperlinks on this page.

B-1534. Cartoon: Some good predictive analytics software… (2019-02-14)

B-1533. Recommendation: How to Use Deming’s 14 Points to Improve Quality (2019-02-14)

This blog post provides a few short paragraphs elaborating on each of W. Edwards Deming’s Fourteen Points. There’s a bit of self-promotion on this page, but the overall content is still quite good.

B-1532. Recommendation: NOVA, A Hole in the Sky (2019-02-13)

This 4 minute video talks about the discovery of the ozone hole over Antarctica and how some early indications of this hole were dismissed as outliers.

B-1531. Recommendation: Dr. Russell Waitman. Innovation: Breaking the Barriers on Medical Information (2019-02-12)

This is a short article and video of Russ Waitman. Russ is my boss at one of my three jobs, and this gives a nice introduction to the sort of work I help out with.

B-1530. Another short biography (2019-02-12)

I am contributing a chapter to a book (proposed title: Randomized controlled trials in medical research – gold standard or unhealthy fixation) and the book editor wanted a brief biography that emphasized “any relevant teaching experience within Medicine or allied health sciences.” So I adapted an earlier short biography to put in some of my teaching experience. Here it is.

B-1529. PMean: Finding those weird characters (2019-02-11)

When you take a text file from one system and use it in a different system, some of the more “exotic” characters can change on you. One example is the “smart quotes” in Microsoft Word. Here’s a brief explanation of why they occur and what you can do about it.
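One way to locate such characters is to scan the text for anything outside the plain ASCII range. The following is a minimal sketch in Python, not the approach described in the post itself; the function name `find_non_ascii` and the sample string are hypothetical and used only for illustration.

```python
import unicodedata


def find_non_ascii(text):
    """Return (position, character, Unicode name) for each non-ASCII character."""
    hits = []
    for pos, ch in enumerate(text):
        if ord(ch) > 127:  # anything beyond the 7-bit ASCII range
            hits.append((pos, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Hypothetical sample: smart quotes written as escapes so they are easy to see
sample = "He said \u201chello\u201d and left."
for pos, ch, name in find_non_ascii(sample):
    print(pos, repr(ch), name)
```

Printing the Unicode name alongside the position makes it easier to decide whether a character is a harmless accent or a smart quote that should be replaced with a plain one.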

B-1528. Recommendation: Create Awesome HTML Table with knitr::kable and kableExtra (2019-02-09)

I use R Markdown for a lot of things, but the one thing that never seems to come out the way I like is the tables. This vignette highlights some of the ways you can customize the appearance of your tables with a new package, kableExtra.

B-1527. Recommendation: An overview of randomization techniques (2019-02-05)

This is a nice overview of why you want to randomize. It also talks about block randomization and covariate adaptive randomization.
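To make the idea of block randomization concrete, here is a minimal sketch in Python. It is not taken from the article; the function name `block_randomize`, the arm labels, the block size, and the seed are all assumptions chosen for illustration. Within each block, every arm appears equally often, so the arms stay balanced throughout enrollment.

```python
import random


def block_randomize(n_blocks, block_size=4, arms=("A", "B"), seed=None):
    """Build an allocation list in which each block is balanced across arms."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # a local generator keeps the list reproducible
    per_arm = block_size // len(arms)
    schedule = []
    for _ in range(n_blocks):
        block = list(arms) * per_arm  # equal numbers of each arm
        rng.shuffle(block)            # random order within the block
        schedule.extend(block)
    return schedule


# Hypothetical usage: 3 blocks of 4 patients each, two arms
schedule = block_randomize(n_blocks=3, block_size=4, seed=2019)
print(schedule)
```

A fixed block size can make the last assignment in each block predictable, which is one reason some trials vary the block size at random.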

B-1526. Recommendation: The Learning Health Care System in America (2019-02-04)

This document was recommended at a talk I attended. I have not read it yet, but the stated goal of the document, to “drive the process of discovery as a natural outgrowth of patient care,” is something I support wholeheartedly.

B-1525. Recommendation: HEDIS and Performance Measurement (2019-02-04)

This is another resource that came up during a talk. HEDIS is a series of performance measures in six clinical areas that is coordinated by the NCQA.

B-1524. Quote: The most challenging thing in the world… (2019-02-01)

“The most challenging thing in the world is not to learn fancy technologies, but control your own wild heart.” Yihui Xie, as quoted in Appendix C of Authoring Books with R Markdown.

B-1523. Recommendation: A Systematic Examination of the Citation of Prior Research (2019-02-01)

This was a nice study, and it shows a very easy model to adapt to other research problems. The authors were concerned that many reports of randomized clinical trials (RCTs) seemed to miss out on citing previous RCTs on the same topic. So they dug out a bunch of meta-analyses and looked at the bibliographies of the individual RCTs in those meta-analyses. A meta-analysis gives you a reasonably comprehensive history of the RCTs done on a particular topic, and you would think that any RCT in the meta-analysis should have cited any other RCT in that same meta-analysis that preceded it by at least one year. But that happened very rarely.

B-1522. Recommendation: Sins against science (2019-01-25)

This article discusses some of the recent research showing the prevalence of scientific misconduct, offers some explanations on why this misconduct occurs, and provides some resources.

B-1521. Recommendation: Sample applications and more (2019-01-25)

It’s hard to find good examples of well-written research grants, so this website is wonderful. It shows examples of all sorts of NIH grants (R01, R03, R15, R21, R33, R41, R42, R43, R44, K01, K08, F31) as well as sample data sharing plans, biosketches, and reference letters.

B-1520. Recommendation: Observational studies often make clinical practice recommendations (2019-01-25)

This is a nice example of mining the existing medical literature to discover trends and attitudes about research. The authors take too dim a view, in my opinion, of what observational studies can show, but the statistics are still interesting to follow.

B-1519. Recommendation: Competing commitments in clinical trials (2019-01-25)

The title is a bit misleading. It does not involve financial or non-financial conflicts of interest, but rather when clinical researchers violate the rules of a clinical trial for the perceived benefits of an individual patient. An anonymous survey reveals that this practice is quite common.

B-1518. Recommendation: Should you blow the whistle? (2019-01-25)

This article offers some practical advice about when, how, and whether to become a whistle blower.

B-1517. Recommendation: Bending the rules of clinical trials (2019-01-25)

Many clinical trials have restrictive entry criteria to ensure homogeneity of the subject pool, which increases power and allows the study to proceed with a not too outrageous sample size. But what if you (or one of your patients) would really benefit from being part of the trial, but do not meet the entry criteria? This study discusses the ethical problems that ensue when a doctor is faced with such a choice.

B-1516. PMean: Think positively, what has research done for us (2019-01-23)

Several years ago, I was part of a panel presentation at the Joint Statistical Meetings. My talk was on how to teach Statistics from an evidence-based perspective. A question came up from the audience about the quality of medical research, and there’s a lot of cynicism in the Statistics community about this. Each comment from the audience seemed to get more negative and I stepped in to offer a counterargument. The research process has a lot of flaws, but we have made a ton of progress in how we provide medical care thanks to the careful and rigorously designed studies that have been done. I didn’t convince anyone, but it felt good to stand up for something I strongly believe in. Recently, I had to look for examples of research that has changed clinical practice for the better, and found several interesting articles.

B-1515. Recommendation: Framingham Contribution to Cardiovascular Disease (2019-01-23)

The Framingham Heart Study, a massive longitudinal observational study, has helped identify most of the risk factors for heart disease that we commonly accept today. This paper provides a historical overview of the study and the many valuable insights that it provided.

B-1514. Quote: In a world where the price of calculation continues to decrease rapidly… (2019-01-20)

“In a world where the price of calculation continues to decrease rapidly, but the price of theorem proving continues to hold steady or increase, elementary economics indicates that we ought to spend a larger and larger fraction of our time on calculation.” John Tukey, as quoted in “Sunset Salvo”, The American Statistician 1986; 40(1): 72-76.

B-1513. Quote: Exploratory data analysis is an attitude… (2019-01-20)

“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as the things we believe might be there.” John Tukey, as quoted in “Nonparametric statistical data modeling: comment.” J Am Stat Assoc 1979, 74, 121-122.

B-1512. Recommendation: How to write a literature review (2019-01-16)

This is a nice checklist of things that you should do when you are creating a literature review.

B-1511. PMean: Citing one of my web pages (2019-01-15)

I got an inquiry by email asking if it was okay to cite one of my web pages. Here’s what I said, more or less.

B-1510. Recommendation: Leading Questions – Yes Prime Minister (2019-01-14)

This short video clip is an excellent illustration of how the questions leading up to a particular question on a survey can bias the response to that survey. It comes from a British comedy, Yes Prime Minister, that ran in the 1980s.

B-1509. Recommendation: Hints for designing effective questionnaires (2019-01-14)

I had originally cited this resource on my survey design category page, but the link was broken, so here is the correct link. It’s a nice guide. A bit too firm in its opinions, perhaps, but still well worth reading.

B-1508. Recommendation: Making women in science visible (2019-01-09)

This video was recommended by my niece, and it caught my eye for a more subtle theme, perhaps. Rachel Ignotofsky is a great believer that illustration makes difficult material more accessible. This supports an idea I’ve had for a while to develop a book of case studies in research ethics using a graphic novel format. Anyway, the video also emphasizes the importance of recognizing barriers of sexism, racism, and classism that many great women scientists have faced and overcome. This video is a fairly easy 15 minutes to listen to.

B-1507. Recommendation: How a Feel-Good AI Story Went Wrong in Flint (2019-01-09)

Building a great statistical model does no one any good if it doesn’t pay attention to non-statistical issues. This story talks about a machine learning model to identify which houses in Flint, Michigan were the best candidates for removal of lead pipes. The model worked fairly well, but came up against problems like individual city council members wanting to assure their constituents that enough was being done in their district. I’m not sure what the actual moral of this story is, but it does serve as a warning to be careful when you are modeling data in a contentious area.

B-1506. Recommendation: Congratulations on the Promotion. But Did Science Get a Demotion? (2018-12-31)

I am recommending this article, not because I agree with it, but because it reinforces a common theme: the struggle to get and keep funding is skewing research as much as or more than conflicts caused by direct financial support. As with many of the previous articles on the topic, I find this article to be rife with speculation and lacking any empirical data to support its argument. I outlined similar concerns on my website back in 2005. Recently, the belief that obtaining a government grant somehow taints your credibility has led to a purge of good scientists from many EPA advisory panels.

I think this article offers bad advice and bad conclusions. But please read this article and decide for yourself.

B-1505. Recommendation: Convert PowerPoint Slides to xaringan (remark.js) Slides (2018-12-30)

I have been constructing most of my recent presentations in R Markdown. This includes presentations that have little or no R code in them. I like using R Markdown because you are manipulating simple text files. This makes it easy to use version control, among other things.

There’s a new package, which I have not tried yet, that will do a direct translation of a PowerPoint file into R Markdown. It uses a presentation format (xaringan) that I personally do not like, but it should be pretty easy to switch from xaringan to a different format like ioslides. The package owner warns that you will probably have to tweak the resulting R Markdown code to get it perfect, but the package should “get you about 90% of the way there for about 80% of use cases.” That’s still a huge time savings.

B-1504. Recommendation: We Have Ways To Stop Rogue Scientists. They Don’t Always Work. (2018-12-29)

This is a nice review inspired by recent controversial work of a Chinese scientist who claims to have created genetically engineered babies. It outlines approaches that we use to regulate unethical science and explains why these approaches can fail.

B-1503. Recommendation: 9 Reasons Excel Users Should Consider Learning Programming (2018-12-29)

Microsoft Excel is very popular, but it has many serious limitations. This article explains what you lose out on if you rely just on Excel and what additional capabilities R and Python offer that allow you to do better work and do it more efficiently.

B-1502. Recommendation: Standardized Mortality Ratio (2018-12-11)

I was at a talk where mortality rates were presented in one column and the standardized mortality ratio was presented in a different column. I was a bit confused; I could not remember how or why you calculate an SMR. It’s not because SMR calculations are complicated; it’s because my brain can’t remember things as well as it used to. So when I got back to my office, I searched for a web site with a simple tutorial on SMRs with a worked out example. This page popped up right away and I was impressed with the clarity of the writing style.
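For quick reference, the usual calculation can be sketched as follows. This is the standard indirect standardization formula rather than anything copied from the tutorial, and the numbers below are hypothetical, chosen only to make the arithmetic easy to follow. Here O is the number of deaths observed in the study group, r_k is the death rate in stratum k (for example, an age group) of the reference population, and n_k is the number of people (or person-years) in stratum k of the study group.

```latex
\[
\mathrm{SMR} = \frac{O}{E}, \qquad E = \sum_{k} r_k \, n_k
\]

% Hypothetical example with two age strata
\[
E = (0.01)(1000) + (0.05)(200) = 10 + 10 = 20, \qquad
\mathrm{SMR} = \frac{25}{20} = 1.25
\]
```

An SMR of 1.25 in this made-up example would mean 25% more deaths than expected if the study group had experienced the reference population’s stratum-specific rates.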

B-1501. Recommendation: In UC’s battle with the world’s largest scientific publisher, the future of information is at stake (2018-12-11)

The University of California (UC) is in the midst of a difficult negotiation with Reed Elsevier, a major publisher of research journals. The dispute relates to the traditional model of publishing where the author writes for a journal for free and the journal sells subscriptions to individuals and libraries. A newer publication model is Open Access, where the author pays a fee to get the article published, and then the article is made available for free to any and all readers. The UC library wants a large reduction in subscription fees and is threatening to cancel the Elsevier subscription and rely solely on open access journals. The issues are complicated and this article lays out both sides carefully.

B-1500. Recommendation: Make PowerPoint Presentations with R Markdown (2018-12-05)

This is a 42 minute presentation that covers the basics of using R Markdown to produce PowerPoint files. It touches on another couple of RStudio products: RStudio Connect and Shiny. This covers a lot of customization issues. Also see Rendering PowerPoint Presentations with RStudio.

B-1499. Recommendation: Welcome to DASL – The Data And Story Library (2018-11-27)

The Data and Story Library (DASL) is a collection of small and simple data sets useful for teaching basic statistical concepts. It was originally housed at the Carnegie-Mellon website, but (like many classic websites) it disappeared one day. The nice folks at Data Description, Inc. (makers of Data Desk software) have revived and updated this resource.

B-1498. PMean: Getting R to shut the heck up (2018-11-11)

When you are using R Markdown to create various documents, you are often interested in displaying any informative messages that appear along the way. This is especially true for documents you plan to use yourself. But when you are preparing a report or a presentation for someone else, you may want to suppress these messages. That’s not always easy because different functions in R use different means to display messages, especially warning messages. So the option that might suppress a warning message from one function might not work for another function. Warnings when loading packages are notoriously difficult to suppress. I want to list, for my own benefit, all of the options that are available for getting R to shut the heck up.

B-1497. Recommendation: Installing R and Python in Anaconda for Biologists (2018-11-08)

Anaconda provides an easy way to manage the installation of both R and Python. The R installation includes RStudio, if you want it. While the title of this blog entry mentions Biologists, the advice is useful for anyone.

B-1496. PMean: Fighting SASism (2018-10-30)

I told a story today in a webinar workshop that I thought I should get down in writing for my blog. It involves a prejudice unique among statisticians called SASism.

B-1495. Recommendation: 1.1 Billion Taxi Rides with Spark 2.2 & 3 Raspberry Pi 3 Model Bs (2018-10-23)

Mark Litwintschik has taken a large open source data set (1.1 billion taxi rides with data storage on the order of hundreds of gigabytes) and run some benchmark queries on a variety of different systems. Perhaps the most humble of these systems is a cluster of three Raspberry Pi computers. This webpage talks about how he set up the software on this cluster.

B-1494. Recommendation: Accessible R Markdown Documents (2018-10-23)

A class covering on-line teaching has reminded me about accessibility issues. This includes accessibility for blind students who rely on screen readers. This webpage covers some of the very simple things you can do that would make life a lot easier for students with impaired vision.

B-1493. Recommendation: A top Cornell food researcher has had 13 studies retracted. That’s a lot. (2018-09-24)

This is a non-technical account of Brian Wansink, a food researcher, who has been accused of p-hacking,

B-1492. Recommendation: Call R from SPSS (2018-09-23)

This is a nice, easy to follow overview of how to use bits of R code within SPSS.

B-1491. PMean: What to do about claims of borderline statistical significance (2018-09-21)

A comment about the phrase “trend towards efficiency” on the Statistical Consulting Section discussion board raised a lot of interesting commentary. The phrase refers to a setting where the p-value is not small enough to allow you to claim statistical significance, but is still close enough to 0.05 to be worth commenting on. Most of the responses were fairly negative and stressed that we need to refuse to sign off on any report or publication using that phrase. I posted a response that differed from the others. Here’s the gist of what I said.

B-1490. PMean: What I did in the last twelve months (2018-09-19)

For the yearly evaluation for my part-time position in the Department of Biomedical and Health Informatics at the University of Missouri-Kansas City, I have had to prepare a list of accomplishments and goals for the upcoming year. I thought that I’d share this year’s list on my blog.

B-1489. Recommendation: False-Positive Psychology (2018-09-14)

This article was recommended at a talk I attended. The full title says it all.

B-1488. Recommendation: Making it easier to discover data sets (2018-09-14)

I heard about this from the UMKC Bioinformatics twitter feed. Google has a blog entry highlighting a new search feature they’ve developed, Dataset Search. It lets you find interesting data sets using standard Google search criteria. The system only works if people on the web provide reasonable documentation of their data sets. I’ve not had a chance to work with this yet, but it looks interesting.

B-1487. Recommendation: Next-generation phenotyping of electronic health records (2018-09-13)

This article was recommended at a talk I attended. The thesis is that we need to look at the electronic health record from the bottom up.

B-1486. Recommendation: Use of Electronic Health Record Data in Clinical Investigations. Guidance for Industry (2018-09-13)

The U.S. Food and Drug Administration (FDA) is encouraging greater use of electronic health record data to supplement the traditional randomized clinical trials. But you need to use care. Here is some guidance on what the FDA is recommending to industry.

B-1485. Recommendation: Complex Innovative Trial Designs Pilot Program (2018-09-13)

The FDA is encouraging new complex trial designs that are adaptive and/or Bayesian. This page describes a program where drug companies are encouraged to propose these designs and they publicly disclose the details so that others can learn from their successes (and failures).

B-1484. PMean: Super Pi, a group to teach cluster computing using the Raspberry Pi (2018-09-10)

If you wanted to learn cluster computing and you didn’t have easy access to a cluster, you used to have two choices. You could simulate a cluster computer on your laptop, or you could buy time in the cloud. There’s a third approach: build your own cluster computer system using several Raspberry Pi computers.

B-1483. Recommendation: Leaflet, an open source Javascript library for mobile-friendly interactive maps (2018-09-07)

I have not tried this package but it was independently recommended by two separate sources that I have a lot of respect for. It uses a minimal computer interface to produce high quality interactive maps.

B-1482. Recommendation: lavaan tutorial (2018-09-07)

I have not run too many Structural Equation Modeling (SEM) analyses, so I can’t comment too much about this resource, but anytime I see something like SEM features added to R, I have to rejoice. This page gives a nice tutorial introduction to lavaan, the R package for SEM, and the material is pitched at a level easy to follow even for people like me who have a limited appreciation for SEM terminology.

B-1481. Recommendation: JupyterLab is Ready for Users (2018-09-07)

Jupyter is an integrated development environment that uses a notebook interface. It was originally developed for Python, but is now available for a variety of other languages, including my favorite, R. I attended a talk about Jupyter from one of the main developers, and after explaining what Jupyter is, he demonstrated JupyterLab. JupyterLab is a new IDE that uses the same structure and files as Jupyter. It looks to be very simple but also very powerful.

B-1480. Recommendation: The five safes, designing data access for research (2018-09-07)

This is a nice overview of the various dimensions of security that you need to consider when making research data available on the Internet.

B-1479. PMean: Python, Raspberry Pi, and cluster computing (2018-09-04)

I’ve been experimenting with connecting a small number of Raspberry Pi computers in a cluster, and a good place to start is MPI (Message Passing Interface). Unfortunately, many of the books and websites that I have looked at use examples in C and FORTRAN. These are fine languages, but ones that I am unlikely to need in the future. I want to explore MPI from within a newer programming language, Python. Here are some resources I have leaned on in getting this started.

B-1478. Recommendation: Python bindings for MPI (2018-09-04)

The Python Software Foundation supports PyPA, the Python Packaging Authority, a group that maintains an easy to use infrastructure for distributing Python software packages. I’ve used one of their sites, PyPI, to download mpi4py, a way to run Message Passing Interface applications within Python.

B-1477. Recommendation: Cluster Hat 2.0 (2018-09-01)

I am working on creating a cluster computer using three Raspberry Pis. Once I get that working, I might see what I can do with the smaller model, the Raspberry Pi Zero. If I did, this interface would be a good start. It sits on top of a regular Pi and connects up to four Pi Zeros to it.

B-1476. Recommendation: MPI Tutorial Introduction (2018-08-31)

MPI (Message Passing Interface) is the grandfather of all parallel programming systems. It is a series of routines that you can call from FORTRAN or C++ programs, and was developed in the early 1990s. This tutorial is a slow and easy introduction to MPI.

B-1475. Recommendation: r2d3: R Interface to D3 Visualizations (2018-08-28)

There’s an interesting visualization system called d3 that I only became aware of a few months ago. It uses a fairly minimal system, JavaScript and scalable vector graphics (SVG), but is capable of producing tremendously rich graphics. I’ve tried, when I have a few spare moments, to learn d3. It’s not that complicated, but I am hindered by a lack of knowledge about JavaScript. It’s also difficult to debug a d3 program. There’s a new library in R that should make things a bit easier for someone like me. It’s called r2d3 and is being promoted with the new preview release of RStudio.

B-1474. PMean: What day is it in my R program? (2018-08-27)

There are three different ways to find out inside your R program what day it is. Let’s quickly look at each.

B-1473. PMean: Business essentials for starting an independent consulting practice (2018-08-17)

I’m giving a talk on Wednesday, August 22, titled “Business essentials for starting an independent consulting practice.” Here are a few details about this talk.

B-1472. Recommendation: NOOBS – Raspberry Pi documentation (2018-08-15)

You have several ways of installing the Raspbian operating system for the Raspberry Pi computers, but the simplest is to use NOOBS (New Out Of Box Software). This page shows you how to download and install the Raspbian operating system using NOOBS.

B-1471. Recommendation: How to Clone Your Raspberry Pi SD Cards With Windows (2018-08-15)

I’m building a cluster of Raspberry Pi computers to run Hadoop and one of the tasks that I need is to make working backups of the micro SD cards that these computers use to store their operating system. This page takes you carefully through all the steps you need to make a backup and then to restore it to a different micro SD card.

B-1470. PMean: The Dark Side of Data Science (2018-08-13)

I’m planning to give a talk on “The Dark Side of Data Science” and I’m hoping to get some interesting references and articles from my colleagues. Here is a first draft of my abstract, with a few references that I am already familiar with.

B-1469. Recommendation: GloVe word vector embeddings (2018-08-01)

When you are working with text mining, you might want to reduce the dimensionality of your problem. The word2vec algorithm, developed by Tomas Mikolov and others at Google, offers a nice approach. This page shows how to apply this algorithm within R.

B-1468. Recommendation: Practical deep learning for coders (2018-08-01)

This is a MOOC (Massive Open Online Course) covering deep learning models. I have not taken it, but it comes highly recommended by others. It uses Python as the underlying language.

B-1467. PMean: A simple structure for documentation (2018-07-24)

Everybody has different standards for documentation, and if you are already using a standard you like, don’t let me stop you. But if you’ve never used much documentation and decide that you need to do better, here’s a guideline that I developed.

B-1466. Recommendation: Use of Electronic Health Record Data in Clinical Investigations (2018-07-23)

This press release announces a “Guidance for Industry” document, a type of document that the U.S. Food and Drug Administration provides from time to time on technical issues. This document discusses the use of the Electronic Health Record as an additional source of information for prospective clinical trials.

B-1465. PMean: Grading rubric for computer assignments (2018-07-20)

I’ve been teaching a variety of classes that require students to run a statistical analysis in a package like SAS or R and report the results. There is a tremendous variety of formats that students use, and I thought it would be helpful to offer some guidance. It would save me time in grading, but more importantly it would emphasize that students need to think about what they produce rather than just tossing together whatever comes out of the computer. The five requirements for homework assignments are that they be complete, concise, clear, error-free, and interpretable.

B-1464. Recommendation: A Review of Published Analyses of Case-Cohort Studies and Recommendations for Future Reporting (2018-07-20)

I got a question about how to analyze a case cohort study in Stata. The person was following the code in a Stata conference presentation but was unsure about some of the details. Always looking for simple explanations of topics that I myself don’t understand well, I found this nice article on how these case cohort studies are written up in the literature. Naturally, it provides a brief explanation of how you analyze data from a case cohort design along with several helpful references.

B-1463. Recommendation: How to be more effective in your professional life (2018-07-10)

This article starts with a nice anecdote about how being dismissive of what someone else is saying ends up hurting you. It also provides a nice structure, POWER, for organizing consulting meetings. POWER stands for Prepare, Open, Work, End, and Reflect. This article was a basis for some of the content in an interesting webinar on consulting.

B-1462. PMean: How much missingness can you tolerate? (2018-07-07)

I got a question about how much missing data you could have in a study and still feel comfortable with your data analysis. It’s a question with no hard and fast answer, but I get the question so often that I have developed some general guidance.

B-1461. Recommendation: Bayesian meta-analysis of two proportions in random control trials (2018-07-05)

I got a question about Bayesian meta-analysis and found this nice teaching example. I’m not sure if the graphs are from the R package bayesmeta, but it looks like it.

B-1460. Recommendation: Section 508 CoP: PDF Accessibility – Part One (2018-07-03)

I have been somewhat lax in making my work accessible for people with disabilities. This video covers some of the basic things you can do with a PDF file to ensure that it can be easily read by screen reading software. There are similar videos for Microsoft Word and Microsoft PowerPoint files.

B-1459. Recommendation: Microsoft is creating an oracle for catching biased AI algorithms (2018-06-28)

Artificial Intelligence (AI) algorithms that are used for crime detection, loan approvals, and employee evaluations are considered by many to be objective, but they can sometimes have many of the same prejudices and biases that human evaluators have. Given the opacity of many black box approaches to AI, this could lead to serious problems with fairness and equity. This article discusses an admittedly imperfect approach by Microsoft to evaluate these AI algorithms using (surprise!) an AI algorithm. It flags situations where an algorithm appears to have problems with unfair differential treatments based on race, gender, or age.

B-1458. PMean: What goes into a contract for a consultation (2018-06-24)

Someone asked me about what sort of contract to use with a new client. This person did not need a very detailed contract, but said that a handshake would not suffice. Here’s what I suggested.

B-1457. PMean: Big data groups at UMKC and UM/Columbia (2018-06-09)

I was at a meeting where I learned about some recent efforts with big data at the University of Missouri-Kansas City and the University of Missouri/Columbia. Here’s a brief description along with links.

B-1456. Recommendation: Analysis of Pretest-Posttest Designs (2018-05-22)

A very nice resource that talks about difference scores, relative change models, analysis of covariance, and repeated measures models.

B-1455. Recommendation: Applied Survival Analysis (2018-05-22)

Although the title says “Applied”, this book has a fair amount of mathematics in it, which helps you understand why certain approaches work well. The sample outputs for this book are reproduced in several different software programs at the UCLA Institute for Digital Research and Education.

B-1454. Recommendation: Interval regression, R data analysis examples (2018-05-21)

This page shows R code to handle the tricky data sets where the response is known to be inside some interval.

B-1453. Recommendation: Interval censored data analysis (2018-05-21)

Michael Fay gave a short course on interval censored data at the 2010 useR! meeting. The slides from this short course provide a nice overview of the complexities of this type of data.

B-1452. Recommendation: Adjusted survival curves (2018-05-20)

This is a nice overview of how to use R to adjust survival curves in an observational study. It covers weighting and modeling with covariates and criticizes several approaches.

B-1451. PMean: What are we doing to justify all that time we’re budgeting? (2018-05-17)

An email discussion about the appropriate percentage effort on research grants has produced a lot of interesting discussions. One person raised an interesting question. The typical data analysis, he claimed, might involve a few hours reviewing the input data set, a few hours conducting the analysis and a few hours preparing a statistical summary, but even after a generous estimate of the work at each of the time points, he could only come up with 22 hours of effort, which corresponds roughly to a 1% FTE. I wrote back describing some of the things that might occur before the data analysis that might add time to this effort.
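
The 1% figure is easy to check. The short Python sketch below converts hours to percentage effort, assuming a full-time year of 2,080 hours (40 hours a week for 52 weeks); that denominator is my assumption, not a number from the email discussion.

```python
# Assumption: one full-time equivalent (FTE) year is 40 hours x 52 weeks.
HOURS_PER_FTE_YEAR = 40 * 52


def percent_effort(hours, hours_per_year=HOURS_PER_FTE_YEAR):
    """Convert hours of work into percentage effort over one year."""
    return 100 * hours / hours_per_year


effort = percent_effort(22)
print(f"22 hours is {effort:.2f}% of an FTE year")
```

A different definition of the working year (for example, one that excludes holidays and leave) would shift the result slightly, but not enough to change the "roughly 1%" conclusion.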

B-1450. Recommendation: Reproducible research with Stata (2018-05-13)

This outlines a way to produce “beautiful Beamer presentations” using Stata. This is a step towards the goal of reproducible research.

B-1449. PMean: Some interesting quality improvement resources (2018-05-13)

I attended a conference on quality and patient safety, and some of the speakers mentioned some interesting resources. I googled them and saved the links here.

B-1448. Recommendation: beepr: easily play notification sounds on any platform (2018-05-13)

This package was mentioned at the most recent meeting of the Kansas City R Users Group and was too cute not to mention.

B-1447. PMean: Draft policy on statistical support for research (2018-05-10)

I am drafting up a policy on statistical support for research at my part-time job at UMKC. It is loosely based on standards at the University of California, Davis and Kansas University Medical Center. An early draft appears below. I’ve gotten some suggestions that setting a minimum percentage effort is a bad idea. What do you think?

B-1446. PMean: Learning more about SAS (2018-05-01)

I had three students who successfully completed the Introduction to SAS class that I am teaching at UMKC. Here is the advice that I offered about how to continue to learn more about SAS.

B-1445. Recommendation: SMS Spam Collection Data Set (2018-04-19)

If you are interested in text mining, this is a good data set to start with. It is a bunch of text messages, each one line long, that have been classified by a human as either spam or ham (ham is a legitimate message).

B-1444. Recommendation: A crummy drop-down menu appeared to kill dozens of mothers in Texas. (2018-04-19)

This article talks about how bad the maternal mortality rates are in the United States and how bad our effort to try to quantify the rate is.

B-1443. Recommendation: Dogs vs cats. Create an algorithm to distinguish dogs from cats (2018-04-19)

This is a classic data set for testing out image analysis. You have a data set of 25,000 images which are labelled dogs or cats. This is easy for a human to do, but can you develop an algorithm that can tell the difference?

B-1442. Recommendation: Definitions of Criteria and Considerations for Research Project Grant Critiques (2018-04-11)

I have to help write NIH grants from time to time, and I need to always keep front and center the criteria that NIH peer reviewers use when they evaluate grants. They look at five broad areas: significance, investigators, innovation, approach, and environment. This document explains what each of these five broad areas means.

B-1441. Recommendation: Data Sharing Network (SHRINE) (2018-04-10)

I’m giving a talk about i2b2 (among other things) and when browsing through their website, I came across an interesting project, SHRINE. This is an acronym for Shared Health Research Information Network, and represents a way of allowing users to review information across multiple i2b2 sites. It requires the individual institutions who have i2b2 systems to cooperate with one another, which is not always easy. But this has tremendous potential.

B-1440. Recommendation: Exploits of a Mom (2018-04-10)

xkcd cartoon showing a mother talking about her son, Robert'); DROP TABLE Students;--

This xkcd cartoon by Randall Munroe is open source, so I can hotlink the image directly. But if you go to the source, https://xkcd.com/327/, be sure to hover over the image for a second punch line.
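
For anyone who wants to see the joke in action, here is a minimal Python sketch using the standard sqlite3 module and a throwaway in-memory database. The table layout and the helper functions make_db and table_exists are hypothetical; only the student's name comes from the cartoon. The first insert pastes the name into the SQL text, and the second passes it as a parameter.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with an empty Students table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Students (name TEXT)")
    return conn


def table_exists(conn, table):
    """Return True if a table with this name is present in the database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


name = "Robert'); DROP TABLE Students;--"

# Unsafe: the input is pasted into the SQL text, so the quote and parenthesis
# in the name end the INSERT early and a second statement sneaks in.
unsafe_db = make_db()
unsafe_db.executescript(f"INSERT INTO Students (name) VALUES ('{name}')")
unsafe_table_survived = table_exists(unsafe_db, "Students")

# Safe: the input is passed as a parameter, so it is stored as plain text.
safe_db = make_db()
safe_db.execute("INSERT INTO Students (name) VALUES (?)", (name,))
safe_rows = safe_db.execute("SELECT name FROM Students").fetchall()
safe_table_survived = table_exists(safe_db, "Students")

print("Unsafe insert, table still there?", unsafe_table_survived)
print("Safe insert, table still there?", safe_table_survived)
print("Stored name:", safe_rows[0][0])
```

The unsafe version uses executescript because it runs every statement in the string, which is what lets the injected DROP TABLE run. With the parameterized version, the whole name, punctuation and all, ends up as a single value in the table.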

B-1439. Recommendation: ISO 8601 (2018-04-09)

xkcd cartoon showing the standard date format and some silly alternatives

This xkcd cartoon by Randall Munroe is open source, so I can hotlink the image directly. But if you go to the source, https://xkcd.com/1179/, be sure to hover over the image for a second punch line.
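
If you want the standard format in your own work, here is a small Python sketch using only the standard datetime module. The dates are just examples taken from this blog's own posting dates.

```python
from datetime import date, datetime, timezone

# Format a date the ISO 8601 way: four-digit year, then month, then day.
posted = date(2018, 4, 9)
iso_date = posted.isoformat()

# Read an ISO 8601 date string back into a date object (Python 3.7 or later).
parsed = date.fromisoformat("2018-04-09")

# A full timestamp puts a "T" between the date and the time.
stamp = datetime(2018, 4, 9, 14, 30, tzinfo=timezone.utc).isoformat()

# A side benefit: ISO 8601 dates sort chronologically even as plain text.
as_text = ["2018-12-05", "2018-04-09", "2018-11-27"]
in_order = sorted(as_text)

print(iso_date)
print(stamp)
print(in_order)
```

The sorting trick is a big part of why this format works well in file names and spreadsheet columns: alphabetical order and chronological order are the same thing.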

B-1438. Recommendation: TinyTeX: A lightweight, cross-platform, portable, and easy-to-maintain LaTeX distribution based on TeX Live (2018-04-05)

I’ve been using a version of LaTeX (MiKTeX) for a couple of years, and it’s not bad. But when I heard about Yihui Xie’s R package, tinytex, I jumped at the opportunity to try it. Dr. Xie is the author of knitr, a package that makes it easy to create well documented R programs where the code and the output are gracefully merged. He created this new package, tinytex, because he felt that the current versions of LaTeX had complex installation processes and forced you to choose between a minimal installation that couldn’t do anything useful and a full installation that was bloated with features you’d never use. I can’t say too much about the package yet except that he is right in that it is very easy to install. If I find out more, I’ll let you know.

B-1437. Recommendation: EuSpRIG horror stories. (2018-04-03)

There has been a lot written about data management problems with using spreadsheets, and there is a group, the European Spreadsheet Risks Interest Group, that has documented the problem carefully and thoroughly. This page on their website lists the big, big, big problems that have occurred because of spreadsheet mistakes. Any program is capable of producing mistakes, of course, but spreadsheets are particularly prone to errors for a variety of reasons that this group documents.

B-1436. Recommendation: The Reinhart-Rogoff error – or how not to Excel at economics (2018-04-03)

There has been a lot written about how lousy Microsoft Excel (and other spreadsheet products) are at data management, but the warning sinks in so much more effectively when you can cite an example where the use of Excel leads to an embarrassing retraction. Perhaps the best example is the paper by Carmen Reinhart and Kenneth Rogoff where a major conclusion was invalidated when a formula inside their Excel spreadsheet accidentally included only 15 of the relevant 20 countries. Here’s a nice description of that event and some suggestions on how to avoid this in the future.

B-1435. Recommendation: Statistical and Machine Learning forecasting methods: Concerns and ways forward (2018-04-03)

At first glance, you might think that this article looks like a vindication of traditional statistics. Classical time series models (methods that were available in the 1960s) outperform newer machine learning forecasting models. Then, you might worry that the comparisons were unfair. But neither viewpoint is accurate. The classical time series models have certain structural advantages for certain types of problems, but you might be better off with machine learning if you use classical time series as a preprocessing step, such as de-seasonalizing your data. If nothing else, this article provides a nice overview of some of the major machine learning methods.

B-1434. Recommendation: Guidelines for estimating biostatistician effort and resources on grants (2018-04-03)

What percentage effort is reasonable for Biostatistics support on a research grant? The UC Davis Biostatistics Group says 10% as a bare minimum, 35-60% for straightforward projects with uncomplicated analyses, and 50-100%+ for large or complex projects. They give examples of large and complex projects: interim analysis, multi-site projects, development of novel statistical methods, and assembly of data from large, complex, or poorly documented administrative or survey data sets.

They also describe how to split the effort between a PhD Biostatistician, who supervises the overall effort, and a MS Biostatistician, who does most of the data management and statistical analysis.

Another point worth noting is that any grant listing less than 10% effort for a Biostatistician requires a special sign off.

B-1433. PMean: My teaching and research statement (2018-04-02)

I am applying to a variety of jobs and some of them ask for a statement about teaching or research. Here’s something I wrote where the employer asked for a combination of the two.

B-1432. PMean: My teaching interests, one page limit (2018-04-02)

I have been applying to a variety of jobs, and some of them, mostly universities, want a statement of teaching philosophy, research interests, or some combination. I enjoy writing these, except for the ones that have page limits. In this and the next few blog posts, I will share what I wrote. If you read these, it might give you a better idea of what I do at my current and previous jobs and what I would like to do in a future job. Here’s a one page limit statement on my teaching interests and experience. It won’t be one page on my blog because of formatting differences, of course, but it will be briefer than I like.

B-1431. PMean: My teaching approach (2018-04-02)

I am applying for a variety of jobs and some of the universities that I am applying to want to know my approach to teaching. It’s an interesting thing to write about, because most of my teaching experience is in a non-traditional format. Here’s what I wrote for one job I applied for.

B-1430. PMean: My research interests, one page limit (2018-04-02)

Another place asked for my research interests and asked me to keep it to a single page. Gack! I do not have an easy time keeping within page limits. Here’s what I wrote. It won’t be one page on my blog, because of formatting changes, of course.

B-1429. PMean: My research interests (2018-04-02)

I’ve been applying for a variety of jobs, and one of them asked for a statement on my research interests. I tried to emphasize the collaborative nature of my research. Here’s what I wrote.

B-1428. PMean: Developing an interdisciplinary research program (2018-04-02)

I’ve been applying for a variety of jobs, and one of them asked for a statement on how I would develop an interdisciplinary research program. It’s fun to write these, and I thought I’d share what I wrote with you all.

B-1427. Recommendation: To combat physician burnout and improve care, fix the electronic health record (2018-04-01)

This article is a nice counterbalance to all the glowing reports about how moving to the electronic health record is going to revolutionize health care. This effort certainly has value, but it comes at a cost. The article talks about the improvements needed to the crude 1990s interface and how to avoid overburdening the medical record with extraneous data.

B-1426. PMean: Starting a heron-i2b2-analytics repository (2018-03-29)

I am working on a CTSA grant to develop repeatable downstream pipelines that directly access i2b2 and CDM. In order to promote this work and encourage others to participate, I was given a repository site on github, kumc-bmi/heron-i2b2-analytics. Right now, it is just a shell, but here’s what I want to do with it, short term and long term.

B-1425. Recommendation: Evaluation Methods in Biomedical Informatics (2018-03-29)

When I was talking about using the electronic health record as a measurement tool for quality improvement studies, a colleague recommended that I look at a book, Evaluation Methods in Biomedical Informatics by Friedman and Wyatt. I don’t have a copy yet, but the preview offered by Amazon is intriguing.

B-1424. PMean: Using the transpose procedure in SAS (2018-03-28)

A couple of my students are having difficulty with restructuring data sets in SAS. This is not surprising. Restructuring is very important, but not so easy. I decided to run a few simple examples of PROC TRANSPOSE to help clarify things. Here is the code and output.

B-1423. PMean: And the least important variable is… (2018-03-28)

I heard a story a long time ago, and I don’t remember who told it to me and I’m probably getting all the details wrong, but I wanted to try to recreate the story from memory because it illustrates one of the perils of blind reliance on statistical models to identify “important” variables.

B-1421. PMean: Next stop, BMC Medical Informatics and Decision Making (2018-03-27)

I’m working part-time on a research grant and I want to publish some of the work I’ve done on this grant. The title of the paper tentatively is “Validating elastic net generated electronic health record breast cancer phenotypes against hospital tumor registries: a case control study.” My co-authors are Dan Connolly and Russ Waitman. I want to summarize the history of the effort so far and why I am considering the BMC Medical Informatics and Decision Making as the next place to submit the article.

B-1420. Recommendation: Good Publication Practice for Communicating Company-Sponsored Medical Research: GPP3 (2018-03-27)

Very little of my research fits into the category of company-sponsored medical research, but it is important to be aware of the special concerns and the extra oversight that this research requires. This article covers a consensus standard of guidelines that make a lot of sense, in my opinion, to avoid some of the recent controversies about research abuses. It is also a pretty good guideline, for the most part, for other medical research beyond company-sponsored research.

B-1419. PMean: Two articles published in the Encyclopedia of Big Data (2018-03-25)

I don’t call myself a “big data” analyst, but when a call went out seeking authors for various topics for the Encyclopedia of Big Data, I volunteered to write two articles. Here are the details.

B-1418. Recommendation: Textbook Examples Applied Survival Analysis (2018-03-24)

I’m teaching an online workshop for The Analysis Factor on survival analysis. It’s not announced yet, and I have a LOT of work to do before it is ready. One thing that will save me time is that I am taking many of my examples from the excellent textbook, Applied Survival Analysis Second Edition. One nice perk of this book is that the helpful folks at UCLA have taken every textbook example, and written up code (with comments!) to reproduce the book’s results. With the exception of a few advanced methods in later chapters, where only one or two software packages have the right capability, the code is written in parallel in R, SAS, SPSS, and Stata. They also have links to the raw data at the publisher’s website, and datasets stored in SAS format and SPSS format. How nice! Browse around and you’ll find software code for all the examples in other popular statistics textbooks as well.

Warning! The R examples look like they are from the first edition, not the second edition. A small nitpick for an otherwise very nice resource.

B-1417. PMean: They want a short biography from me (2018-03-23)

I should have titled this page “I’m a Star!” because the School of Medicine’s Marketing and Communications Office is asking me questions to prepare a short biography to highlight the research I’m doing. Actually, that office is talking to over a hundred researchers in the School of Medicine, so I’m not really a star after all. But here are the questions that they started with. I’ll reply by email and they may get more information by email or a face-to-face interview. Makeup!

B-1416. Recommendation: Getting Started with the SAS 9.4 Output Delivery System (2018-03-23)

I don’t use SAS that much anymore. Not because it’s a bad program. Mostly it’s because it’s hard to keep on top of too many statistical packages all at once. But I’m teaching an Introduction to SAS class this semester, and I need to keep up with recent innovations. One of the more important of these is ODS, which is short for Output Delivery System. ODS allows you to customize the output using formats like HTML, RTF, PDF, or PostScript. ODS also produces PowerPoint and Excel files.

ODS also allows you to customize how your output appears. Finally, ODS makes some big changes to procedures that used to only produce printed output. With ODS enabled, these procedures will add in extra high resolution plots, which you can also customize.

I do not know if the Introduction to SAS class should incorporate ODS or not. It’s similar to asking if the Introduction to R class should incorporate markdown documents or not. In general, I tend to think that we should teach plain vanilla versions of SAS and R, but I do worry that we may be missing something important if we don’t teach ODS or markdown.

B-1415. PMean: Exporting a graph in SAS (2018-03-23)

I got a question about how to export a graph in SAS to a program like PowerPoint. There are several ways to do this, and I explained that you can right click on any graph that appears on your screen and copy it to the clipboard and then open up PowerPoint and right click on a slide and paste it in. That’s fairly standard on any Windows system. I presume that SAS supports similar approaches on the Macintosh and Linux, but I have no easy way of testing this.

But there are other ways to export a graph. You can tell SAS to save a particular graph to a file and then you can import that file into PowerPoint. It works, but there is a twist.

B-1414. PMean: Tests of equivalence and non-inferiority (2018-03-23)

I’m making a webinar presentation in April for The Analysis Factor. I’ve done this several times in the past. The talk in April will be on tests of equivalence and non-inferiority, a topic which I have covered briefly in my newsletter. I thought I’d share a first draft of the abstract here on my blog.

B-1413. Recommendation: Welcome to developerWorks (2018-03-23)

I got this recommendation from a friend. IBM has a large number of free resources explaining things like cloud computing and blockchain. I’m most interested in their section on analytics. There’s a nice introduction, for example, to natural language processing.

B-1412. PMean: Peer grading in Introduction to R, SPSS, SAS (2018-03-22)

I’ve gotten some helpful feedback that I need to encourage more interactions among students in the on-line classes, Introduction to R, Introduction to SPSS, and Introduction to SAS. Not just interactions of the students with the teacher, but interactions between the students.

In many online classes this is done by encouraging online discussion of the material in the class. This is not so easy, however, for these three classes. I can just imagine myself posting the following on Blackboard. “Tell me what you think about the read.csv function in R.”

There are a couple of ways, however, that make sense for technical classes like these.

B-1411. PMean: Mining the electronic health record, why and how (2018-03-22)

I’m submitting a talk to an upcoming research conference. I’m not sure if they’ll accept it, but wish me luck. Here’s the title and abstract.

B-1410. PMean: Changes to the Introduction to R, SAS, and SPSS classes (2018-03-21)

I have helped develop and have taught (along with other faculty in our department) three one credit hour pass/fail classes: Introduction to R, Introduction to SPSS, and Introduction to SAS. These classes were developed back in 2014-2015 and they are in need of some serious updates. I will try to outline some of the updates that I think these classes need in this blog post.

B-1409. Recommendation: Is vaccine effectiveness (VE) different from vaccine efficacy (2018-03-01)

This is a non-technical discussion of the difference between effectiveness and efficacy (two easily confused terms) in the context of vaccination. Short answer: efficacy is a measurement under ideal circumstances while effectiveness is a measurement in a “real-world” setting.

B-1408. Recommendation: A sampling of outstanding women in analytics (2018-02-28)

This is a list (with single paragraph descriptions) of 186 women who have accomplished great things in the area of Analytics. There is a brief accompanying article at the Forbes magazine website, but it is very brief. The author of this list, Meta S. Brown, defines Analytics quite broadly, so the women have very diverse backgrounds and interests. I only recognized one name off the bat, Grace Wahba, an excellent researcher, but someone, unfortunately, that I haven’t met. If I get a chance, I’ll include in a separate blog post a list of outstanding women in Analytics that I HAVE met. Meta Brown’s list includes links so you can find out more about these talented women.

B-1407. Recommendation: The history of Hadoop (2018-02-24)

If you want to understand big data, you need to understand Hadoop. Hadoop is the technology underlying many big data efforts. But most of the descriptions of Hadoop are jargon laden and impenetrable to newcomers. Well, maybe just impenetrable to this newcomer. But one great revelation to me was a historical note as to WHY there was a need to develop Hadoop. It was all those pages that had to be indexed by search engines at Google and Yahoo. So I went out to try to find more details. This article, with a ton of references throughout, is an excellent introduction to the precursors to Hadoop, the development of Hadoop itself, and the explosion of systems that used Hadoop as their foundation.

B-1406. Recommendation: Data dictionaries (2018-02-24)

This is a nice explanation of what goes into a data dictionary, written from the perspective of research data management.

B-1405. Recommendation: Adherence to Methodological Standards in Research Using the National Inpatient Sample (2018-02-20)

I normally don’t recommend articles that are stuck behind pay walls, but this is an important article. It shows how 85% of a sample of research studies using the National Inpatient Sample database failed to follow at least one of seven well documented practice recommendations of the Agency for Healthcare Research and Quality.

B-1404. Recommendation: An introduction to implementation science for the non-specialist (2018-02-06)

I’ve done a lot of work with Evidence-Based Health, but one big and largely unsolved problem is how to get health care professionals to change their practices once the evidence for these changes becomes obvious. If no one changes in the face of evidence, then all the effort to produce and critically appraise the evidence becomes worthless. A new field, implementation science, has been developed to get at methods to encourage the adoption of new evidence-based practices. This paper outlines how implementation science is supposed to work and offers two real world examples of implementation science studies.

B-1403. Recommendation: The Survey Statistician newsletter (2018-01-30)

The International Association of Survey Statisticians publishes a newsletter every six months that covers general information about surveys, announcements about meetings, and other activities of the association.

B-1402. Recommendation: Hi, I’m Mike Bostock. (2018-01-23)

This is an AMA (Ask Me Anything) session with Mike Bostock, a former graphics editor for the New York Times and creator of the d3.js data visualization package. I’ll be writing a few things about d3.js once I figure things out. Mike is someone worth watching, because he is working on high visibility, high impact stuff.

B-1401. Recommendation: How to be more effective in your professional life (2018-01-22)

Doug Zahn has done a tremendous amount of work on what I like to call the human factors in statistical consulting. He summarizes some key ideas in this article. His humorous anecdote about his prized Mustang car illustrates the tendency of all of us to be poor listeners. Pay special attention to Table 1 where he outlines the five steps you should always follow in any consulting interaction.

B-1400. Recommendation: Philosophy News Network: Postmodernism Special Report (2018-01-02)

I generally shy away from Philosophical debates, but I did discuss a Postmodern critique of Evidence Based Medicine a while back. When one of my more intellectual friends posted a link to a commentary on Postmodernism on the Existential Comics web site, I had to take a look. I think I did a pretty good job of summarizing Postmodernism without stereotyping it, but maybe I’m setting my standards too low if I try to compete with a comic strip. You can judge for yourself.

B-1399. Recommendation: The Origins of ‘Big Data’ (2018-01-02)

I’m not a big fan of the term “big data” but I’ve been applying for a couple of jobs that ask for expertise in big data instead of expertise in Statistics. So in one of the cover letters, I wrote that I was doing big data analysis before the term was even coined. That forced me to do a quick fact check, and it looks like the term first came into wide use in the late 1990s. Here’s an article on the person who first coined the term “big data.”

B-1398. Recommendation: Designing and conducting semi-structured interviews for research (2017-12-27)

This is a very helpful guide on collecting qualitative data through a semi-structured interview. It emphasizes the need for probe questions and the behaviors that you should adopt to put your subject at ease and get the best information possible. This handout was developed for a college course on Organizational Communication, and the syllabus for this class has other valuable resources.

B-1397. Recommendation: No more rainbows (2017-12-15)

This is a nice article explaining why using a rainbow (red-orange-yellow-green-blue-indigo-violet) is a bad idea. The colors produce an artefactual banding pattern, they do not follow a consistent trend from light to dark, they cause trouble for people with color blindness, and they translate poorly to black-and-white reproductions. The article also shows some nice alternatives. Thanks to @EarlGlynn for sharing this.

B-1396. Recommendation: Network analysis in cross-sectional data using R (2017-12-12)

These are the slides for a very nice webinar presented by Eiko Fried. Dr. Fried provided a wealth of resources during his webinar (some of these are behind pay walls).

He offered examples of network analysis in the study of bereavement and depression and of post-traumatic stress disorder. He also provided tutorial papers on network models with binary data and regularized partial correlation networks, as well as a nice general overview of network models in mental health. He shared a blog posting on the relationship between a latent variable model and a network model and a Facebook page on psychological dynamics. He also showed analyses from several R packages: qgraph, IsingFit, and bootnet. I’m putting those links here so I don’t lose track of them when I revisit this stuff six months from now.

B-1395. Recommendation: How to use social media in your career (2017-12-04)

This is a short overview of five major social media sites: LinkedIn, Twitter, Facebook, Instagram, and Snapchat and how you might use them to promote your career. The article ends with a few good overall suggestions.

B-1394. Recommendation: Can A.I. be taught to explain itself (2017-11-27)

This is a nice article in the popular press that talks about some of the problems with “black box” models (in particular deep neural nets) used extensively in many big data projects. It is a bit shy on technical details, which is understandable for a paper like the New York Times. Even so, the stories are quite intriguing. This is a wake up call for those people who fail to recognize the serious problems with many big data models.

B-1393. Recommendation: Intro to SQL for Data Science (2017-11-15)

This is a series of videos and homework exercises that you can work on at your own pace. I have only viewed the outline for this, but anything from DataCamp comes highly recommended.

B-1392. Recommendation: Teaching precursors to data science in introductory and second courses in statistics (2017-11-15)

This paper talks about how to get students to think about large databases in an introductory class that normally uses “toy” problems with a few dozen rows of data.

B-1391. Recommendation: Databases using R (2017-11-15)

This is a page outlining several related efforts at RStudio to make it easier for you to work with data stored in various relational databases.

B-1390. Recommendation: beanumber repository (2017-11-15)

This is the GitHub repository of Ben Baumer. He is one of the co-authors of “Modern Data Science with R” and the data and code from that book are available here. He also provides code and data for OpenWAR, an open source method for calculating a baseball statistic, Wins Above Replacement. Finally, there is an R library for extracting, transforming, and loading “medium” sized datasets into SQL. Medium here means multi-gigabyte sized files. Related to this are a couple of “medium” sized data sets from the Internet Movie Database and from the NYC CitiBike dataset.

B-1389. Recommendation: Writing about numbers (2017-11-08)

This is a chapter in a classic book, Medical Uses of Statistics. The writer of this particular chapter was a giant in Statistics, Frederick Mosteller. This chapter talks about some of the style issues associated with the data that you would normally present in your results section of your research paper. The advice is a bit dated, perhaps, but still well worth reading.

B-1388. Recommendation: When the revolution came for Amy Cuddy (2017-10-19)

This is one of the best articles I have ever read in the popular press about the complexities of the research process.

This article by Susan Dominus covers some high profile research by Amy Cuddy. She and two co-authors found that your body language not only influences how others view you, but also influences how you view yourself. Striking a “power pose,” meaning something like “legs astride or feet up on a desk,” can improve your sense of power and control, and these subjective feelings are matched by physiological changes: your testosterone goes up and your cortisol goes down. Both of these, apparently, are good things.

The research team publishes these findings in Psychological Science, a prominent journal in this field. The article receives a lot of press coverage. Dr. Cuddy becomes the public face of this research, most notably by garnering an invitation to give a TED talk and does a bang-up job. Her talk becomes the second most viewed TED talk of all time.

But there’s a problem. The results of the Psychological Science publication do not get replicated. One of the other two authors expresses doubt about the original research findings. Another research team reviews the data analysis and labels the work “p-hacking”.

The term “p-hacking” is fairly new, but other terms, like “data dredging” and “fishing expedition” have been around for a lot longer. There’s a quote attributed to the economist Ronald Coase that is commonly cited in this context, “If you torture the data long enough, it will confess to anything.” I have described it as “running ten tests and then picking the one with the smallest p-value.” Also relevant is this XKCD cartoon.
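To put a rough number on the “ten tests” description, here is a minimal sketch in Python. It assumes ten independent tests of hypotheses that are all truly null, judged at the usual 0.05 level. The independence, the 0.05 threshold, and the function names are illustrative assumptions of mine and have nothing to do with the actual Cuddy data. Under those assumptions each p-value is uniform between 0 and 1, so the chance that the smallest one looks “significant” can be computed exactly and checked by simulation.

```python
import random

def prob_at_least_one_significant(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests

def simulate_smallest_p(n_tests, n_sims=100_000, alpha=0.05, seed=1):
    """Simulate picking the smallest p-value out of n_tests null tests.

    Under a true null hypothesis a p-value is uniform on (0, 1), so each
    test is simulated with a single uniform random draw.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        smallest = min(rng.random() for _ in range(n_tests))
        if smallest < alpha:
            hits += 1
    return hits / n_sims

exact = prob_at_least_one_significant(10)
simulated = simulate_smallest_p(10)
print(f"exact: {exact:.3f}, simulated: {simulated:.3f}")
```

The exact answer is about 0.40, so under these simplified assumptions you would get a “significant” result roughly four times in ten even when nothing at all is going on.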

If p-hacking is a real thing (and there’s some debate about that), then it is a lot more subtle than the quotes and cartoon mentioned above. You can find serious and detailed explanations at a FiveThirtyEight web article by Christie Aschwanden and this 2015 PLOS article by Megan Head et al.

If p-hacking is a problem, then how do you fix it? It turns out that there is a movement in the research world to critically examine existing research findings and to see if the data truly supports the conclusions that have been made. Are the people leading this movement noble warriors for truth or are they shameless bullies who tear down peer-reviewed research in non-peer-reviewed blogs?

I vote for “noble warriors” but read the article and decide for yourself what you think. It’s a complicated area and every perspective has more than one side to it.

One of the noble warriors/shameless bullies is Andrew Gelman, a popular statistician and social scientist. He comments extensively about the New York Times article on his blog. His post is also worth reading, as are the many comments that others have made on it. It’s also worth digging up some of his earlier commentary about Dr. Cuddy.

B-1387. Recommendation: Search for unpublished data by systematic reviewers: an audit (2017-10-17)

The authors looked at all systematic reviews (excluding methodological reviews) published in a few key journals as well as a random sample of Cochrane reviews to see how often the authors tried to search for unpublished data. The answer is not often enough (64% or 130/203). The article also describes the success rate in getting unpublished data when the attempt was made (89% or 116/130) and how often authors found evidence of publication bias when they did such an assessment (40% or 27/68). Although some people have argued that it is not that important to search for unpublished data, this is still a big concern. A closely related article is Searching for unpublished data for Cochrane reviews: cross sectional study.

B-1386. Recommendation: Get credit for your data – BMC Research Notes launches data notes (2017-10-17)

This is a new effort to get data out into the open for others to use. A data note can be on data that was not published or it could be an addendum describing data used in another publication. This is just getting started, but could end up being a great teaching resource.

B-1385. Recommendation: OpenRefine: A free, open source, powerful tool for working with messy data (2017-09-28)

I have not had a chance to use this, but it comes highly recommended. OpenRefine is a program that uses a graphical user interface to clean up messy data, but it saves all the clean up steps to ensure that your work is well documented and reproducible. I listed Martin Magdinier as the “author” in the citation below because he has posted most of the blog entries about OpenRefine, but there are many contributors to this package and website.

B-1384. Recommendation: How to increase value and reduce waste when research priorities are set (2017-09-25)

This is the first in a series of articles on reducing waste in research. It focuses on funding agencies and recommends that funders should support more work on making research replicable, be more transparent on how they set priorities, make sure that research proposals are justified through a systematic review of previous research, and encourage greater openness of research in progress to encourage collaboration. Other articles in this series cover research design, conduct, and analysis, regulation and management, inaccessible research, and incomplete reports of research.

B-1383. Recommendation: Randomized Controlled Trials in Health Insurance Systems (2017-09-20)

While researchers often use data from health insurance systems to conduct observational studies, the authors of this research paper point out that you can conduct randomized trials as well. You can randomly assign different levels of insurance coverage and then get claims data to evaluate how much difference there is, if any, in the levels of coverage. This approach is attractive because you do not need a lot of resources, and you can very quickly get a very large sample size. Since insurance data is collected for administrative needs rather than research needs, you have to contend with inaccurate or incomplete data, potentially causing loss of statistical efficiency or producing biased results. The authors offer some interesting examples of actual studies, propose new potential studies, and offer general guidance on how to conduct a randomized trial from health insurance systems.

B-1382. Recommendation: Announcing a new monthly feature: What’s going on in this graph (2017-09-14)

Through the effort of a team of statisticians with the American Statistical Association, the New York Times is producing a new resource for educators called “What’s Going On in This Graph?”. This is similar to another New York Times effort called “What’s Going On in This Picture?”

Every month the New York Times will publish a graph stripped of some key information and ask three questions: What do you notice? What do you wonder? and What do you think is going on in this graph?

The content will be suitable for middle school and high school students, but I suspect that even college students will find the exercise interesting.

The first graph will appear on September 19 and on the second Tuesday of every month afterwards.

B-1381. Recommendation: Trump’s Android and iPhone tweets, one year later (2017-09-13)

This is a nice example of using R for text mining of Twitter feeds, and the author gives lots of links and hints on how you could do something similar.

B-1380. Recommendation: Notice of Revised NIH Definition of Clinical Trial (2017-09-13)

The NIH recently updated their definition of what a clinical trial is. Here is the link to the new definition. Also worth reviewing are an FAQ list, some case studies, some additional training resources, and blog posts on September 8, 2017 and September 11, 2017.

B-1379. Recommendation: Good enough practices in scientific computing (2017-09-05)

There is more than one way to approach a data analysis and some of the ways lead to easier modifications and updates and help make your work more reproducible. This paper talks about steps that they recommend based on years of teaching software carpentry and data carpentry classes. One of the software products mentioned in this article, OpenRefine, looks like a very interesting way to clean up messy data in a way that leaves a well documented trail.

B-1378. Recommendation: Dryad Digital Repository (2017-08-16)

I’ve been looking for something like this for a while. It is a repository for data sets associated with peer-reviewed publications. I have only glanced at it briefly, but it looks fairly easy to use with a fair number of interesting data sets/publications.

B-1377. Recommendation: Union Army Veterans, All Grown Up (2017-08-06)

Not found

B-1376. Recommendation: The Relationship of Housing and Population Health: A 30-Year Retrospective Analysis (2017-08-06)

Not found

B-1375. Recommendation: Mixed land-uses and commuting: Evidence from the American Housing Survey (2017-08-06)

Not found

B-1374. Recommendation: A Micro-Temporal Geospatial Analysis of Medical Marijuana Dispensaries and Crime in Long Beach California (2017-08-06)

Not found

B-1373. Recommendation: E-cigarette use and associated changes in population smoking cessation: evidence from US current population surveys (2017-08-06)


B-1372. Recommendation: WinBUGS - A Bayesian modelling framework: Concepts, structure, and extensibility (2017-07-19)

Not found

B-1371. Recommendation: Stan: A Probabilistic Programming Language Carpenter Journal of Statistical Software (2017-07-19)

Not found

B-1370. Recommendation: Analyze Survey Data for Free (2017-07-12)

I’ve not had a chance to test this code, but it looks pretty good for anyone who might want to analyze one of the dozens of large databases produced by the U.S. Government.

B-1369. Recommendation: Greece’s troubling prosecution of its former chief statistician (2017-07-12)

This is a nice summary about the prosecution of a statistician, Andreas Georgiou, who was only doing his job.

B-1368. Recommendation: Is the staggeringly profitable business of scientific publishing bad for science? (2017-07-10)

I attended a talk about a decade ago on the problems with for-profit publishing of scientific research and the need to aggressively adopt the open source publication model. It was a message I was ready for, because I had benefited greatly from citing open source resources on my website. I knew that if I cited an open source resource, anyone anywhere could look up that resource. They didn’t need access to a University Library.

B-1367. Recommendation: The new Enigma Public (2017-06-26)

This is yet another interesting source of data. This site specializes in databases prepared by the United States government.

B-1366. Recommendation: Zotero Quick Start Guide (2017-06-20)

Not found

B-1365. Recommendation: Kaggle data (2017-06-19)

In contrast to the five thirty eight databases which are mostly smallish, the Kaggle data sets are, as a rule, very large. They also include a lot of text data, for natural language processing, sentiment analysis, etc.

B-1364. Recommendation: Five Thirty Eight Data (2017-06-16)

This is a github repository of a lot of interesting data sets created by the Five Thirty Eight website. I presume there is a story associated with most of these data sets. The data sets look to be moderate in size for the most part and would make interesting teaching examples.

B-1363. Recommendation: ProbOnto (2017-06-12)

If you work with probability distributions a lot, you find there are multiple parameterizations (e.g., the two different forms of the exponential distribution), as well as interesting relationships (the geometric distribution is a discrete version of the exponential distribution). I have found Wikipedia to be a nice guide for some of this, but the coverage is uneven in quality. One of the Wikipedia links mentioned a new website, ProbOnto, that offers a systematic and standardized attempt to catalog every important probability distribution and the relationships among these distributions.
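Both examples in the parentheses can be checked with a few lines of Python. This is only a sketch using the standard textbook formulas, not anything taken from ProbOnto itself, and the function names and the rate of 0.5 are arbitrary choices of mine. The first two functions show the rate and scale forms of the exponential density. The loop then shows that rounding an exponential value down to a whole number gives a geometric distribution, here counted as the number of failures before the first success, with success probability 1 - exp(-rate).

```python
import math

def exp_pdf_rate(x, rate):
    """Exponential density, rate parameterization: mean is 1 / rate."""
    return rate * math.exp(-rate * x)

def exp_pdf_scale(x, scale):
    """Exponential density, scale parameterization: mean is scale."""
    return math.exp(-x / scale) / scale

def exp_cdf_rate(x, rate):
    """P(X <= x) for an exponential with the given rate."""
    return 1 - math.exp(-rate * x)

def geometric_pmf(k, p):
    """P(K = k) for the number of failures before the first success."""
    return (1 - p) ** k * p

rate = 0.5                 # arbitrary illustrative value
scale = 1 / rate           # the same distribution in the other form
p = 1 - math.exp(-rate)    # success probability of the matching geometric

for k in range(5):
    # Chance that the exponential lands in the interval [k, k + 1)
    interval_prob = exp_cdf_rate(k + 1, rate) - exp_cdf_rate(k, rate)
    print(k, round(interval_prob, 6), round(geometric_pmf(k, p), 6))
```

The two printed columns agree, which is one concrete sense in which the geometric distribution is a discrete version of the exponential.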

B-1362. Recommendation: Why R is Bad for You (2017-05-18)

Arguing about R versus SAS often takes on a religious fervor, so I normally hesitate to recommend articles that trash one package or the other. But this one raises an interesting point which makes it worth reading. Note that “recommended” does not mean that I endorse these conclusions. But rather than bias you with my perception of the issue, just read this on your own.

B-1361. Recommendation: This is your machine learning system? (2017-05-18)

machine_learning.png not found.

This xkcd cartoon by Randall Munroe is open source, so I can hotlink the image directly. But if you go to the source, https://xkcd.com/1838/, be sure to hover over the image for a second punch line.

B-1360. Recommendation: One in Five Clinical Trials for Adults with Cancer Never Finish (2017-05-05)

This is a research summary of a study that found that one out of every five cancer trials “did not finish,” which actually means that they stopped early for futility, if I am reading between the lines properly. Of those studies, 40% stopped early because of poor accrual.

B-1359. Recommendation: The numbers for the Science March… (2017-05-04)

I normally don’t recommend other people’s tweets, but these two, from Siobhan Tompson, were too funny to pass up.

B-1358. Recommendation: ROSE: A package for binary imbalanced learning (2017-05-02)

Logistic regression and other statistical methods for predicting a binary outcome run into problems when the outcome being tested is very rare, even in data sets big enough to ensure that the rare outcome occurs hundreds or thousands of times. The problem is that attempts to optimize the model across all of the data will end up looking predominantly at optimizing the negative cases, and could easily ignore and misclassify all or almost all of the positive cases since they constitute such a small percentage of the data. The ROSE package generates artificial balanced samples to allow for better estimation and better evaluation of the accuracy of the model.
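Here is a small Python sketch of the problem, using made-up data with a 1% event rate. A “model” that calls every case negative scores 99% accuracy while catching none of the positive cases. The second function is plain random oversampling, which simply copies positive rows until the classes are balanced. That is a deliberately simpler stand-in and not what ROSE does, since ROSE generates artificial samples rather than duplicating rows. The data and the function names are hypothetical.

```python
import random

def accuracy_and_sensitivity(labels, predictions):
    """Return (accuracy, sensitivity) for binary 0/1 labels."""
    pairs = list(zip(labels, predictions))
    correct = sum(y == yhat for y, yhat in pairs)
    true_positives = sum(y == 1 and yhat == 1 for y, yhat in pairs)
    return correct / len(labels), true_positives / sum(labels)

def oversample_minority(labels, seed=1):
    """Plain random oversampling: resample positive rows with replacement
    until there are as many positives as negatives. Returns row indices."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return neg + pos + extra

# Hypothetical data: 100,000 rows with a 1% event rate (1,000 positives).
labels = [1] * 1_000 + [0] * 99_000
always_negative = [0] * len(labels)

acc, sens = accuracy_and_sensitivity(labels, always_negative)
print(f"accuracy {acc:.2f}, sensitivity {sens:.2f}")

balanced = oversample_minority(labels)
share_positive = sum(labels[i] for i in balanced) / len(balanced)
print(f"positive share after oversampling: {share_positive:.2f}")
```

A model fit to the balanced rows can no longer score well by ignoring the positive cases, which is the general motivation behind balancing methods.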

B-1357. Recommendation: Proving the null hypothesis in clinical trials (2017-04-27)

I’m attending a great short course on non-inferiority trials and the speaker provided a key reference of historical interest. This reference is the one that got the Statistics community interested in the concept of non-inferiority. The full text is behind a paywall, but you can look at the abstract. As a footnote, an earlier paper, Dunnett and Gent 1977 (also trapped behind a paywall), addressed this problem.

B-1356. Recommendation: Standards and guidelines for the interpretation of sequence variants (2017-04-13)

This article outlines a standardized way to describe genetic variants.

B-1355. Recommendation: Genetics: Clues in the code (2017-04-13)

If you want to understand the value of genomic medicine, you can learn a lot by reviewing the case of Nicholas Volker, one of the first success stories in this area. Here’s a nice review.

B-1354. Recommendation: How Bright Promise in Cancer Testing Fell Apart (2017-04-13)

A nice overview of the problems with shoddy research in genetics testing. It highlights the “forensic statistics” work of Keith Baggerly and Kevin Coombes.

B-1353. Recommendation: Blind analysis: Hide results to seek the truth (2017-04-11)

This paper advocates something I would call a triple blind, keeping the doctor, the patient, and the statistician who analyzes the data in the dark as to which treatment group is which. This avoids problems where the people analyzing the data will either consciously or subconsciously manipulate the data to get a preferred result. Interesting idea, though it represents an awful amount of work to pull it off.

B-1352. Recommendation: Launch a Shiny App on Your Own Server in 4 Steps (2017-04-10)

It’s very easy, apparently, to set up your own server to run Shiny apps (Shiny is a web based method for interacting with R code). If you have set up Amazon Web Services, then it is even easier. Here is a detailed account of how to do this. Once I get my own Shiny server going, I will let you know.

B-1351. PMean: Obnoxious use of red text in RStudio (2017-04-07)

I really enjoy using RStudio, but one thing about it drives me bats. It seems to use red text for some very innocuous error messages.

B-1350. Recommendation: Why be an independent consultant? (2017-04-04)

I might as well recommend something that I wrote. This is a short article in the Amstat News, a monthly newsletter of the American Statistical Association. I talk about all the reasons you wouldn’t want to be an independent consultant and the one big reason why you would–being in control.

B-1349. Recommendation: PheKB. A knowledge base for discovering phenotypes from electronic health records (2017-03-24)

Some of the work I am doing right now could be characterized as discovering phenotypes from electronic health records. So when one of my co-workers mentioned this database, I thought “Oh boy! Oh boy! Oh boy!” This is a list of validated algorithms for various systems, and typically refers to a peer-reviewed publication. So once I get my stuff published, I’m heading here next.

B-1348. Recommendation: Medicare Claims Synthetic Public Use Files (SynPUFs) (2017-03-23)

The Centers for Medicare & Medicaid Services (CMS) provides researchers with access to Medicare claims data, which is a wonderful resource. But you have to sign a restrictive agreement before they will give you this data and you have to pay a non-trivial amount of money to get the data. Fair enough, because CMS has to guarantee patient confidentiality among other things. But what if you want to “play” with the data before taking the plunge? Thankfully, CMS has provided to the general public a synthetic (read fake) data set that has the same data structure. This allows you to prototype your programs on the synthetic data and then transition easily to the real data.

B-1347. Recommendation: Conducting Clinical Research (2017-03-12)

This is a website associated with a very nice book on the pragmatic aspects of running a clinical trial. I came across this site because I was looking for a simple example of a letter to doctors asking them to help recruit patients for a clinical trial. This was in an appendix along with other nice examples of things like case report forms, serious adverse event forms, HIPAA consent template, etc. You can download a free PDF version of this book or you can buy a paper copy.

B-1346. Recommendation: Text Mining with R (2017-02-23)

This is an O’Reilly book (cute animal on the cover is a rabbit) that is available online for free. It’s a great resource for someone just getting started with text mining.

B-1345. Recommendation: R and SQL Server 2016 (2017-02-02)

I have not viewed this video yet, but it comes from a good friend. There is a substantial effort at Microsoft to better integrate the R programming language and their flagship database product, SQL Server.

B-1344. Recommendation: TSHS Resource Portal (2017-01-30)

The Teaching of Statistics in the Health Sciences (TSHS) section of the American Statistical Association has put together a set of resources for teachers including several very interesting datasets. Some of the resources are open to anyone, but others require a registration.

B-1343. Recommendation: i2b2 Design Document. Ontology Management (ONT) Cell (2017-01-30)

I’m digging into some of the complexities of i2b2, especially in the concept path that shows how a particular piece of information in the electronic health record fits into the hierarchy. A colleague pointed me to this nice online document that explains some of this hierarchy.

B-1342. Recommendation: Guidelines for Assessment and Instruction in Statistics Education College Report 2016 (2017-01-24)

As a community, we statisticians have known for a long time that we do not teach that introductory level class in Statistics as well as we should. This guideline lists the things and ways we SHOULD teach as well as things that we might think about leaving out.

B-1341. Recommendation: Statistical Issues Seen in Non-Statistics Proposals (2017-01-24)

If you are writing a research grant, there are a lot of statistical issues that you need to consider. This guide, prepared by the American Statistical Association, highlights three areas: framing the problem, designing the study, and specifying the data analysis plan. It doesn’t talk enough about data management, but otherwise it is an excellent resource.

B-1340. Recommendation: Defining Urban and Rural Areas in U.S. Epidemiologic Studies (2016-12-02)

I’m somewhat new to geocoding. One of the first things you might be interested in, if you have geographic data, is an indicator as to whether a certain address, zip code, or county is urban or rural. This is actually quite a complex topic. This paper outlines some of the basic systems for classifying a location as urban, rural, or something in between (e.g., suburban).

B-1339. Recommendation: Practical advice for analysis of large, complex data sets (2016-11-10)

This is a nice compilation of issues that you should be concerned about. The examples are mostly from things that interest Google, but you will find that the advice itself is useful no matter what type of data you work with. The advice is split into three broad categories: technical (e.g., look at your distributions), process (e.g., separate validation, description, and evaluation), and communication (e.g., data analysis starts with questions, not data or a technique).

B-1338. Recommendation: Where Do You Run Your R Scripts? (2016-10-20)

I’m an experienced R programmer trying to learn a little about SQL. One of my good friends who lives totally in the database world (I call her the Teradata Queen), shared a link to a blog post at SQLServerCentral about using R. Microsoft is including R in its SQL Server distribution, so this is an opportunity for a lot of interesting work combining the data manipulation power of SQL Server with the data analysis power of R. Anyway, the blog post explains some of the cost and performance issues associated with R scripts running on a SQL Server CPU.

B-1337. Recommendation: A Tutorial on Loops in R (2016-10-20)

This is a very clear, but also very detailed explanation of the for, while, and repeat loops along with the concept of vectorization. A great resource for beginners.

B-1336. Recommendation: Oracle Dates and Times (2016-10-18)

I’m working with R and SQL, and some of the work uses SQLite, and some of the work uses Oracle. There are subtle differences between the two, and for that matter between any two database programs. While there are SQL standards, most packages have minor deviations, or enhancements. Dates in Oracle represent one deviation. In particular, Oracle does not use the ISO 8601 standard date format (yyyy-mm-dd) by default. Here’s a nice overview of how to work with Oracle dates.
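As a rough illustration of the date format issue, here is a minimal Python sketch that converts a date string in the DD-MON-RR style (a commonly seen Oracle display format, assumed here rather than taken from the linked overview) into ISO 8601. The function name `oracle_to_iso` and the sample value are hypothetical.

```python
from datetime import datetime

# Assumption: the Oracle session displays dates as DD-MON-RR, e.g. '18-OCT-16'.
# Your session's NLS_DATE_FORMAT may differ, so check it before relying on this.
ORACLE_STYLE = "%d-%b-%y"

def oracle_to_iso(text: str) -> str:
    """Convert a DD-MON-RR style string to ISO 8601 (yyyy-mm-dd)."""
    # Caveat: Python's %y and Oracle's RR use different rules for picking
    # the century, so two-digit years far from the present need extra care.
    return datetime.strptime(text, ORACLE_STYLE).date().isoformat()

print(oracle_to_iso("18-OCT-16"))  # 2016-10-18
```

Month abbreviations are parsed according to the current locale, so this sketch assumes English month names.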

B-1335. Recommendation: Published methodological quality of randomized controlled trials does not reflect the actual quality assessed in protocols (2016-09-28)

When evaluating a series of research articles, you often have to assess the quality of the individual papers based on the type of blinding, for example. What do you do if the paper does not discuss these items? I have usually advocated a “no news is bad news policy.” If a paper does not mention blinding, assume that no blinding was done. It seems reasonable, but the paper by Mhaskar et al provides empirical evidence that sometimes authors leave out information that would strengthen the credibility of their study. A similar paper is at https://www.ncbi.nlm.nih.gov/pubmed/22424985

B-1334. PMean: By the skin of my teeth (2016-09-23)

I have to brag a bit. I’m working part-time at Kansas University Medical Center (along with a couple other part-time jobs) and my boss asked me two weeks ago if I was interested in writing a paper on the data analyses I had been working on. It would be submitted to the AMIA 2017 Joint Summit on Translational Research and I’d be the first author.

B-1333. Recommendation: Diverse Perspectives on a Flipped Biostatistics Classroom (2016-09-06)

This article is a synthesis of a panel discussion at the 2014 Joint Statistical Meetings on the flipped classroom. The article discusses it solely from the perspective of Biostatistics classes, though they offer some references for the flipped classroom in a more general setting. A flipped classroom is a course where the traditional didactic lectures are recorded and watched at home and the homework that would normally be done at home is done instead in the classroom. This homework in a Biostatistics class often takes the form of active learning in small groups, such as critiquing published research studies or conducting analyses on real world data sets. The key component, according to the authors, is the in class interactions during these assignments. Students learn from each other as they work in groups.

Now you could do active learning in a traditional course format. What a flipped classroom does is increase the emphasis and the amount of time spent in active learning.

The common theme of the paper is that the flipped classroom has been successfully applied in a variety of settings. It is not a “one size fits all” approach, but rather can be adapted to the needs of the particular class. Some students may not like the flipped classroom format, and you shouldn’t underestimate the amount of time needed to prepare the videotaped lectures (one rule of thumb is ten hours of work for every hour of video). Still, the student reactions and the instructors’ perceptions of the flipped classroom are generally positive.

B-1332. Recommendation: What I need from statisticians (2016-09-01)

This interview with Nate Silver was conducted shortly after his keynote address at the 2013 Joint Statistical Meetings. I was at those meetings, but was stuck in a class (a very good class by the way, but I still felt stuck) on software engineering for statisticians. This article summarizes the main points of Mr. Silver’s keynote address and adds some extra insights through an interview after the speech. The best part was the quote at the end.

When asked, “Data science is the term of the day. Do you think there is a difference between data science and statistics?” Silver replied, “I think data-scientist is a sexed up term for a statistician.” The reaction from the audience was, for most, one of instantaneous laughter and applause. “Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician.”

If Nate Silver can say something this controversial, then maybe I shouldn’t be so bashful.

B-1331. Recommendation: Organizing data in spreadsheets (2016-09-01)

I have some guidance on how to organize data (written back in 1999), but these guidelines are far superior. To be honest, you should use a database for anything more than half complicated, but for those simple data sets where you can use a text file or a spreadsheet, Dr. Broman’s comments are very helpful.

B-1330. Recommendation: The FAIR Guiding Principles for scientific data management and stewardship (2016-08-25)

I’ve always been supportive of efforts to share data. For me, it’s a bit selfish, because I want to find interesting real world examples to use in teaching and on my web pages. But the issue goes way beyond this, of course. Sharing data is an ethical imperative, especially for federally funded research or research that relies on volunteer subjects. It has led to many important discoveries beyond the realm of the original context in which the data was collected. In order for data sharing to be effective, you need to embrace four guiding principles: your data needs to be findable, accessible, interoperable, and re-usable. This paper highlights those principles and offers some current examples of data sharing systems.

B-1329. Recommendation: 10 Easy Steps to a Complete Understanding of SQL (2016-08-15)

This page outlines some of the fundamental properties of SQL programming that you need to know as you start learning SQL. For example, SQL is a declarative language, meaning that you tell it what you want and not how to compute it. Also SQL syntax is not well-ordered, meaning that the order in which SQL statements are evaluated is not the same as the order that they appear.
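The point about evaluation order can be seen in a small sketch using Python's built-in sqlite3 module. The table `visits` and its rows are made up for illustration. Because WHERE is evaluated before rows are grouped, an aggregate such as COUNT(*) cannot be used there, while HAVING (evaluated after grouping) and ORDER BY (evaluated after SELECT, so it can see the alias `n`) both work.

```python
import sqlite3

# Hypothetical data: one row per patient visit at a clinic.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (clinic TEXT, patient TEXT)")
con.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("A", "p1"), ("A", "p2"), ("B", "p3")],
)

# WHERE runs before GROUP BY, so an aggregate is not allowed there.
try:
    con.execute(
        "SELECT clinic, COUNT(*) FROM visits "
        "WHERE COUNT(*) > 1 GROUP BY clinic"
    ).fetchall()
    where_failed = False
except sqlite3.Error:
    where_failed = True

# HAVING runs after GROUP BY, and ORDER BY runs after SELECT.
rows = con.execute(
    "SELECT clinic, COUNT(*) AS n FROM visits "
    "GROUP BY clinic HAVING COUNT(*) > 1 ORDER BY n DESC"
).fetchall()

print(where_failed)
print(rows)
```

Note that the query never says how to count or sort; it only declares the result wanted, which is the declarative style described above.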

B-1328. Recommendation: Tibbles (Tibbles are a modern take on data frames) (2016-08-15)

I’m an old dog R programmer who tends to rely on features of R that were available 10 years ago (an eternity for computers). But it’s time for this old dog to learn new tricks. One thing I need to use in my R programs is called a “tibble” (sometimes called a “tidy tibble”). It’s a minor but important improvement on data frames and many of the newer packages are using tibbles instead of data frames. Tibbles are available in the package, tibble. This web page offers a nice description of the improvements that tibbles offer.

B-1327. PMean: Changing the font size in RStudio (2016-08-08)

Suppose you’re giving a talk and using RStudio. You want to make the fonts a bit larger so your audience can read them. It’s easy to do, once you know where to look.

B-1326. PMean: Changing the font size in R (2016-08-08)

This is one of those obvious things that’s not obvious when you need it most. Suppose I’m doing a demo of R for a group like our wonderful Kansas City R Users Group. I want to have a readable sized font. Here’s how you do it.

B-1325. Recommendation: Hadley Wickham, the Man Who Revolutionized R (2016-08-07)

Hadley Wickham has written many popular R packages, so many that they are sometimes referred to as the “Hadleyverse.” This is a nice biography that emphasizes the impact that Dr. Wickham has had on R.

B-1324. Recommendation: dplyr and pipes: the basics (2016-08-07)

One of the recent developments in R that I was unaware of until I attended some talks at the Joint Statistical Meetings was the use of dplyr and pipes. It’s an approach to data management that isn’t different from earlier approaches, but the code is much easier to read and maintain. This blog post explains in simple terms how these work and why you would use them.

B-1323. Recommendation: The Importance of Reproducible Research in High-Throughput Biology (2016-08-03)

I have not viewed this video yet, but have attended a similar talk and read a similar research paper by Keith Baggerly. His general message is that large biological and genetic experiments are sometimes designed so poorly as to invalidate the results. You can often discover these design flaws through a careful examination of the data sets themselves and their metadata. This process of uncovering design flaws is sometimes called “Forensic Statistics.”

B-1322. Recommendation: Enrichment design studies should enhance signals of effectiveness. (2016-08-03)

I noticed several talks at JSM 2016 on enrichment designs. I was only very vaguely familiar with what this meant, so I did a quick Google search. I found this very nice non-technical overview.

B-1321. Recommendation: 100+ Interesting Data Sets for Statistics (2016-06-26)

This list starts out with a data set of 216,930 previous Jeopardy questions and goes from there. Not everything suggested is easily amenable to statistical analysis, but the list is extremely interesting and diverse. In particular, this list is very helpful for anyone interested in text data.

B-1320. Recommendation: Institute for Digital Research and Education - Statistical Computing (2016-06-24)

This is a wonderful site, but for some reason, it is difficult to find. The Institute for Digital Research and Education (IDRE) at UCLA has put together some wonderful resources on how to do simple data analyses in R, SAS, SPSS, and Stata. The examples cover just about everything you’d ever want to do in any of these statistical packages. If you are making a transition from one statistical package to another, this site offers you the opportunity to see how things are done in the package you know well and compare it to how things are done in the package you are learning. Of special note are the worked textbook examples from many classic statistics textbooks.

B-1319. Recommendation: Handling date-times in R (2016-06-15)

Dates in R, like dates in any other software package, are tricky to work with. Here’s a nice guide that will help you get started.

B-1318. Recommendation: Bayesian computing with INLA (2016-04-28)

This page promotes a new approach to a broad class of models (spatio-temporal models, latent variable models, mixed models) using a fast approximation to the Bayesian solution. It runs under R and appears to handle very large datasets. I have not had a chance to try this, but it looks very interesting.

B-1316. PMean: Some open source Kaplan Meier curves (2016-04-10)

I’m giving a talk on the Kaplan-Meier survival curve and wanted to show and interpret a few real examples from the open source literature.

B-1315. Recommendation: The number of subjects per variable required in linear regression analyses (2016-03-30)

There are several rules of thumb out there about how many subjects that you need for a multiple linear regression model. Most of these rules look at the ratio of subjects per variable (SPV). If you have 100 subjects and 20 independent variables in your regression model, then the SPV is 5. This article comes to the surprising conclusion that an SPV of 2 is just fine. In other words, you could have 40 subjects and 20 independent variables and still be okay. This is independent of power considerations, by the way, but it still seems rather small to me. Read the paper yourself and let me know what you think.
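The arithmetic behind the SPV examples is simple enough to sketch in a few lines of Python. The function names are hypothetical, and the numbers are the ones quoted above.

```python
def subjects_per_variable(n_subjects: int, n_variables: int) -> float:
    """Ratio of subjects to independent variables (SPV)."""
    return n_subjects / n_variables

def subjects_needed(spv: float, n_variables: int) -> float:
    """Sample size implied by a target SPV."""
    return spv * n_variables

print(subjects_per_variable(100, 20))  # 5.0
print(subjects_needed(2, 20))          # 40
```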

B-1314. Recommendation: Reporting and methodological quality of sample size calculations in cluster randomized trials could be improved: a review (2016-03-30)

The sample size justification for a cluster randomized trial is messy. It requires the use of an intra-class correlation or something similar (the authors use the term within-cluster correlation). In a review of 300 cluster randomized trials, the authors found that in only about a third of the trials did the authors specify the within-cluster correlation. Even fewer compared this to the within-cluster correlation observed in the data. We need to do better.
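To see why the within-cluster correlation matters so much, here is a minimal sketch of the standard design effect calculation for equal-sized clusters, 1 + (m - 1) × ICC. This is the textbook formula, not necessarily the one used in the trials reviewed, and the inputs (128 subjects under individual randomization, 20 subjects per cluster, ICC of 0.05) are made-up values for illustration.

```python
from math import ceil

def design_effect(cluster_size: int, icc: float) -> float:
    """Variance inflation for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

def clusters_needed(n_individual: int, cluster_size: int, icc: float) -> int:
    """Total clusters (across all arms) after inflating an
    individually randomized sample size by the design effect."""
    inflated_n = n_individual * design_effect(cluster_size, icc)
    return ceil(inflated_n / cluster_size)

# Hypothetical inputs, chosen only to show the size of the inflation.
deff = design_effect(cluster_size=20, icc=0.05)
print(round(deff, 2))
print(clusters_needed(n_individual=128, cluster_size=20, icc=0.05))
```

Even a small correlation nearly doubles the required sample size here, which is why a missing or unjustified value leaves the whole calculation unverifiable.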

B-1313. Recommendation: PLOS ONE 2015 Reviewer Thank You (2016-03-08)

I reviewed a paper for PLOS One in 2014 and got a nice acknowledgment, but I also reviewed a paper for the same journal in 2015. Here’s the acknowledgment for that contribution. They’re still having a bit of trouble with alphabetization (Steve Simon should be the last “Simon” on the list, but it’s not). Still, it’s nice to have a public record of my small contribution.

B-1312. Recommendation: Selection of controls in case-control studies (2016-02-29)

I’m working on a project where the researchers need a case-control study, though they may not know that yet. I want to show them what a case-control study gives them that would not be available with other methods. But I need to come up with a reasonable control group for the case-control design. It doesn’t have to be perfect, but it can’t be a totally stupid control group either. This article is the classic reference on the theoretical principles that underlie the selection of controls in a case-control study.

B-1311. PMean: Simulating power for a test of association in a two by two table (2016-02-05)

In an earlier blog post, I slogged through the calculation of power for a test of association in a two by two table. You can also approximate power using a simulation. It is done quite easily in R, but I want to show it in SPSS. Why? Just because.
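For readers who want to see the idea without SPSS or R, here is a minimal sketch in Python using only the standard library. It is not the code from the blog post. The proportions (0.3 versus 0.5), the group size of 100, and the number of simulations are assumptions chosen for illustration, and the test is the Pearson Chi-squared test without a continuity correction.

```python
import random

# 95th percentile of the chi-squared distribution with 1 degree of freedom.
CRITICAL_CHI2_1DF = 3.841458820694124

def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic (no continuity correction)
    for the two by two table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table, counted as a non-rejection
    return n * (a * d - b * c) ** 2 / denom

def simulate_power(p1, p2, n_per_group, n_sims=2000, seed=1):
    """Fraction of simulated tables in which the test rejects at alpha = 0.05."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Draw the number of events in each group as a sum of Bernoulli trials.
        x1 = sum(rng.random() < p1 for _ in range(n_per_group))
        x2 = sum(rng.random() < p2 for _ in range(n_per_group))
        stat = chi2_2x2(x1, n_per_group - x1, x2, n_per_group - x2)
        if stat > CRITICAL_CHI2_1DF:
            rejections += 1
    return rejections / n_sims

# Hypothetical scenario: 30% versus 50% event rates, 100 subjects per group.
print(simulate_power(0.3, 0.5, 100))
```

Setting both proportions to the same value gives a quick sanity check, since the rejection rate should then land near 0.05.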

B-1310. PMean: Calculating power for a test of association in a two by two table (2016-02-05)

A colleague was curious to see the formulas behind the power calculations done by many statistical software programs and online calculators. In particular, she wanted to see the formula used for power of the Chi-squared test of association in a two dimensional contingency table. It gets pretty messy for anything larger than a two by two table, but even a two by two table is a bit tricky. Here is one mathematical approach that you can choose for a power calculation.
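As one possible illustration (not necessarily the approach taken in the blog post), here is a sketch of a common normal approximation for comparing two independent proportions with equal group sizes. In a two by two table, the square of this z statistic equals the Pearson Chi-squared statistic, so the two tests agree. The function name and the example inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p1: float, p2: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test comparing two proportions.

    Uses the pooled standard error under the null and the unpooled
    standard error under the alternative. The small chance of rejecting
    in the wrong direction is ignored.
    """
    z = NormalDist()
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)

# Hypothetical scenario: 30% versus 50% event rates, 100 subjects per group.
print(round(approx_power(0.3, 0.5, 100), 2))
```

The approximation breaks down when either proportion is exactly 0 or 1, since the standard error under the alternative can then be zero.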

B-1309. Recommendation: The Survey Statistician (2016-02-03)

The International Association of Survey Statisticians (IASS) has a twice-yearly newsletter that talks about meetings and events sponsored by the association, informal overview articles about new methodologies, and book reviews. This is the archive page for the current and all previous issues of this newsletter.

B-1308. Recommendation: The Empirical Evidence of Bias in Trials Measuring Treatment Differences (2016-01-25)

When I wrote a book about Evidence Based Medicine back in 2006, I talked about empirical evidence to support the use of certain research methodologies like blinding and allocation concealment. Since that time, many more studies have appeared, more than you or I could easily keep track of. Thankfully, the folks at the Agency for Healthcare Research and Quality commissioned a report to look at studies that empirically evaluate the bias reduction of several popular approaches used in randomized trials. These include

selection bias through randomization (sequence generation and allocation concealment); confounding through design or analysis; performance bias through fidelity to the protocol, avoidance of unintended interventions, patient or caregiver blinding and clinician or provider blinding; detection bias through outcome assessor and data analyst blinding and appropriate statistical methods; detection/performance bias through double blinding; attrition bias through intention-to-treat analysis or other approaches to accounting for dropouts; and reporting bias through complete reporting of all prespecified outcomes.

The general finding was that failure to use these bias reduction approaches tended to exaggerate treatment effects, but the magnitude and precision of these exaggerated effects was inconsistent.

B-1307. Recommendation: The Coin Flip: A Fundamentally Unfair Proposition (2016-01-25)

A flip of a coin does not result in an exact 50-50 chance of heads or tails. It depends a lot on how the coin is flipped, of course, but there is a bias. This article explains when, why, and how much bias there is.

B-1306. Recommendation: ENCODE: Encyclopedia of DNA Elements (2016-01-19)

The genetics research community should be lauded for the openness with which they share research data. You can find numerous data sources that are free and without ANY restrictions. One very good example is ENCODE, the Encyclopedia of DNA Elements. This repository holds mostly human data, but some mouse, fruit fly, and round worm data as well. It has data from many different assays including ChIP-seq, RNA-seq, and DNase-seq. It looks like a great teaching resource, though it does require a fairly hefty understanding of genetics to browse through the data.

B-1305. Recommendation: Lyndon B. Johnson Space Center job opening for a GS-14 Statistician (2016-01-13)

I get a lot of emails mentioning job openings and I normally delete them unread. This one caught my eye, not because I wanted to apply for it, but because it illustrates how statisticians get to work on very interesting jobs. This is a Senior Statistician job at the Lyndon B. Johnson Space Center in Houston. If you got this job, you’d be providing assistance on “a wide range of biomedical and technical areas in support of space exploration.” How cool is that!

The other interesting thing is that they say that “accreditation by the American Statistical Association is highly desired.” I’m not accredited by the ASA and don’t plan on it anytime soon, but if you want to be the Buck Rogers of Statistics, maybe you should.

B-1304. PMean: My book chapter on R (2016-01-13)

I was asked by a colleague to write a chapter for a book he was editing, Big Data Analysis for Bioinformatics and Biomedical Discoveries. My chapter was “R for Big Data Analysis.” It just about killed me to write that chapter, but I got it done about nine months ago, and now the book is out officially.

B-1303. Recommendation: A Grant Submission New Year’s Resolution (2016-01-06)

Michael Lauer, the Deputy Director for Extramural Research at the United States National Institutes of Health, shows some interesting statistics on when people submit grants and shows that grants submitted earlier than the day of the deadline tend to fare slightly better in the review process. There’s one gross miscalculation on this page, but the message is still interesting.

B-1302. Recommendation: Points to consider on switching between superiority and non-inferiority (2015-12-20)

One of the most confusing aspects of medical research is the difference between non-inferiority and superiority trials. This article explains in simple terms what the two types of trials are. Then it covers the desire of many researchers to switch from a non-inferiority trial to a superiority trial or vice versa. In general, if you would like to make the claim of superiority if the data justifies it, or to fall back on a claim of non-inferiority if you must, you are best off designing a high quality non-inferiority trial. The extra methodological rigor and the typically larger sample sizes that come with a non-inferiority trial make the transition from a non-inferiority hypothesis to a superiority hypothesis much smoother than the reverse. A high quality non-inferiority trial includes pre-specifying the margin of non-inferiority, demonstrating adequate power for the non-inferiority hypothesis, and justifying that the control group has demonstrated efficacy in previous trials. You need to show sufficient methodological rigor in your research design to establish that a non-inferiority finding is not just caused by an insensitive research design. Finally, you need to consider a “per protocol” analysis for the non-inferiority hypothesis, but switch to an “intention to treat” analysis for the superiority hypothesis.

B-1301. Recommendation: Differences between information in registries and articles did not influence publication acceptance (2015-12-07)

Here’s a research article tackling the same problem of changing outcome measures after the data is collected. Apparently, this occurs in 66 of the 226 papers reviewed here or almost 30% of the time. The interesting thing is that whether this occurred or not was independent of whether the paper was accepted. So journal editors are missing an opportunity here to improve the quality of the published literature by demanding that researchers abide by the choices that they made during trial registration.

B-1300. Recommendation: The COMPare Project (2015-12-07)

One of the many problems with medical publications is that researchers will choose which outcomes to report based on their statistical significance rather than their clinical importance. This can seriously bias your results. You can easily avoid this potential bias by specifying your primary and secondary outcome measures prior to data collection. Apparently, though, some researchers will change their minds after designating these outcome variables and fail to report on some of the outcomes and/or add new outcomes that were not specified prior to data analysis. How often does this occur? A group of scientists at the Centre for Evidence-Based Medicine at the University of Oxford are trying to find out.

B-1299. Recommendation: Safeguarding Patients in Clinical Trials with High Mortality Rates (2015-12-02)

This is an article that I would trot out if anyone tried to argue that a Data Safety and Monitoring Board should, like the investigators, be blinded as to treatment status during their deliberations.

B-1298. PMean: A book review of my first book (2015-12-02)

I wrote a book about nine years ago and interest in it has largely died down. Perhaps I should write a second edition. Anyway, I ran across a book review that I had not seen before. It was published in 2006, but I never noticed it until now. Sarah Boslaugh wrote the review and it got published in MAA Reviews (MAA stands for Mathematical Association of America). It says some nice things like my approach was “fresh.” Dr. Boslaugh also likes my web site, according to the review.

B-1297. PMean: My Google Scholar profile (2015-11-16)

A while back, I set up a publication list in Google Scholar. It tracks the number of citations received by each article that I published. One of my articles has a massive 287 citations. I’m not the first author on this or on any of the other articles that received 100+ citations. So I’m mostly riding on the coattails of some very good researchers.

B-1296. PMean: How many research subjects… (2015-10-13)

Here’s a quote from yours truly. I’ve added a cute graphic from the public domain (downloaded from www.pixabay.com) to liven it up a bit.

B-1295. Recommendation: PS: Power and Sample Size Calculation (2015-09-24)

Someone stopped by today with a power calculation and I asked what software they used. They showed me something I had not seen before, a program developed by the Department of Biostatistics at Vanderbilt University (more specifically, William Dupont and Walton Plummer). The Vanderbilt Biostatistics Department is run by Frank Harrell, so you can be pretty sure that anything that they develop will be high quality.

B-1294. Recommendation: Research vs Quality Improvement (2015-09-16)

I ran across a one page handout in PDF format that discussed the difference between research and quality improvement. It was written from the perspective of the IRB (Institutional Review Board) at UMKC. It’s a nice summary, although the topic is a bit more complex than a single page handout might imply. This is a good starting point for deciding what type of study you want to do.

B-1293. Recommendation: R number 6 in IEEE 2015 Top Programming Languages, Rising 3 Places (2015-08-20)

This Revolutions blog talks about a fairly rigorous evaluation of popular programming languages done by the Institute of Electrical and Electronics Engineers (known by most people by its acronym, IEEE). The list shows all programming languages, including general purpose programming languages. Java, C, and C++ are at the top of the list, but R, a language pretty much dedicated to data analysis, is number 6 on the list (up three places from the previous year). Quite an impressive showing. I have mentioned another webpage, http://r4stats.com/articles/popularity/, that compares R and other statistical software packages, and that is worth reading as well.

B-1292. Recommendation: This is Statistics promotional toolkit (2015-06-30)

The American Statistical Association is promoting careers in statistics through a new campaign titled “This is Statistics”. They just added some very nice promotional material: one and two page handouts, a PowerPoint presentation (a bit too glitzy, but still very informative), and a set of talking points. The materials emphasize the broad range of areas that statisticians work in, the very strong pay and high demand for statisticians, and the diversity of people who go into Statistics.

B-1291. PMean: My research contributions to patient accrual models (2015-06-03)

The U.S. National Institutes of Health (NIH) has a new biosketch format where they ask you to summarize “up to five of your most significant contributions to science.” Here’s a first draft of my research contributions to patient accrual models.

B-1290. PMean: My research contributions to Evidence Based Medicine (2015-06-03)

The U.S. National Institutes of Health (NIH) has a new biosketch format where they ask you to summarize “up to five of your most significant contributions to science.” Here’s a first draft of my research contributions to Evidence Based Medicine.

B-1289. PMean: My research contributions to reproductive toxicology (2015-05-30)

The U.S. National Institutes of Health (NIH) has a new biosketch format where they ask you to summarize “up to five of your most significant contributions to science.” Here’s a first draft of my research contributions to reproductive toxicology.

B-1288. PMean: My research contributions to numerical accuracy (2015-05-30)

The U.S. National Institutes of Health (NIH) has a new biosketch format where they ask you to summarize “up to five of your most significant contributions to science.” Here’s a first draft of my research contributions to numerical accuracy.

B-1287. PMean: I am now a number in the ORCID database (2015-05-27)

Life in the world of research is complicated, but it gets worse when you have a relatively common last name like “Simon.” There are thousands of research publications written by a Simon, and narrowing it down to “Simon S” or even “Simon SD” doesn’t seem to help. So how do you quickly identify all the publications that you have written? One way is to apply for a unique identifier from ORCID, an “open, non-profit, community-based effort to provide a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers.”

B-1286. Recommendation: Developing Grant Proposals: Guidelines for Statisticians Collaborating Under Limited Resources (2015-05-26)

This article provides guidance for developing the “statistical considerations” section of a research grant. I normally do not use that term, and suggest separate sections on statistical methods, sample size justification, data management plan, etc. But that’s a quibble. This is very good practical advice, such as reminding you that you need to write both for the statistical reviewer and the non-statistician who is also reviewing the proposal.

B-1285. Recommendation: PLOS ONE 2014 Reviewer Thank You (2015-05-20)

I don’t do nearly enough peer reviewing, in part because it is a thankless, anonymous task. But one journal editor sent me a nice email pointing out that my name was listed along with 80,000 other reviewer names for helping out with peer review of an article in 2014 for PLOS ONE. If you click on the link on the article and go down about 61,000 lines, you’ll find my name. Caution, the list is not quite perfectly in alphabetical order (Simons and Simonton should come AFTER Simon).

B-1284. Recommendation: Whole Animal Experiments Should Be More Like Human Randomized Controlled Trials (2015-04-30)

Not found

B-1283. Recommendation: An Introduction to Social Media for Scientists (2015-04-30)

It’s easy to mock social media, but these are important tools not just for sharing pictures of the food you’re eating but for informing your colleagues about your research. This article gives a nice overview of how to effectively use tools like Twitter, Facebook, Tumblr, and Pinterest.

B-1282. Recommendation: An Introduction to Social Media for Scientists (2015-04-30)

Not found

B-1281. Recommendation: Improving Bioscience Research Reporting: The ARRIVE Guidelines for Reporting Animal Research (2015-04-30)

Not found

B-1280. Recommendation: Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm (2015-04-30)

Many scientists rely on bar graphs and line graphs that effectively reduce your data to a single mean per group. Even with the addition of error bars, the whole process tends to hide important information. These authors suggest that scatterplots that show every data point would be a better way to present your research data.
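
To make the point about hidden information concrete, here is a minimal sketch in Python. The two groups are made-up numbers chosen for illustration, not data from the article. Both groups have the same mean, so a bar graph of the means would show two identical bars, yet the individual points look nothing alike.

```python
import statistics

# Hypothetical data, invented for this illustration only.
group_a = [4.8, 4.9, 5.0, 5.0, 5.1, 5.2]  # tightly clustered around 5
group_b = [0.9, 1.0, 1.1, 8.9, 9.0, 9.1]  # two separate clusters, none near 5

mean_a = statistics.fmean(group_a)
mean_b = statistics.fmean(group_b)

# A bar graph would plot only these two (essentially equal) numbers.
print(f"mean of A: {mean_a:.2f}, mean of B: {mean_b:.2f}")

# Listing (or plotting) every point reveals what the means hide.
print("A:", sorted(group_a))
print("B:", sorted(group_b))
```

Error bars would show that group B is more spread out, but they would still not reveal that it splits into two clusters. A scatterplot of the raw points would.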

B-1279. Recommendation: Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm (2015-04-30)

Not found

B-1278. Recommendation: Improving Bioscience Research Reporting: The ARRIVE Guidelines for Reporting Animal Research (2015-04-30)

A lot of people have adapted and updated the CONSORT Guidelines for reporting clinical trials to handle other types of research. One of these adaptations is the ARRIVE guidelines for reporting animal research. Many of these guidelines follow CONSORT quite closely, but there are details, such as documenting the species and strain of the experimental animals and describing the housing conditions, that are specific to animal experiments.

B-1277. Recommendation: Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm (2015-04-30)

Not found

B-1276. Recommendation: Restoring invisible and abandoned trials (2015-04-16)

Too much research data goes unreported, leading to a serious distortion of the evidence base that clinicians need to make intelligent medical decisions. The authors of this paper in BMJ argue that if you can document that a study has been abandoned before publication, and if you formally request the researchers to publish the data, and if they fail to act within a certain amount of time, then the data should be considered public access so that you or anyone else could publish those results. It’s an interesting proposal and one that will generate a lot of controversy.

B-1275. Recommendation: Restoring invisible and abandoned trials: a call for people to publish the findings (2015-04-16)

Not found

B-1274. PMean: Equations using MathType (2015-04-12)

I’m ordinarily not a big fan of commercial software, but one product that I would have a hard time living without is MathType. It produces mathematical equations with ease and the appearance is almost always perfect. It’s hard to do this, especially with equations that have lots of superscripts and subscripts. You get the size or spacing wrong and all of a sudden things look really ugly and it is hard to fix. TeX is a very good product, too, but I have grown so used to MathType that it is really hard to make the switch. I had to upgrade MathType recently to version 6.9 and I wanted to experiment with MathType equations on my blog. Here are some examples.

B-1273. PMean: Examining relationships in R (2015-04-03)

I’m giving a talk for the Kansas City R Users Group on how to get a preliminary impression of relationships between pairs of variables. Here is the R code and output that I will use.

B-1272. Recommendation: Rich Data, Poor Data (2015-02-24)

Nate Silver emphasizes an important point about when statistical models can really shine: when there is a rich source of data and lots of opportunities to test the predictive power of your models. This is why baseball statistics provide such a great platform for teaching modelling techniques.

B-1271. Recommendation: Editorial (Basic and Applied Social Psychology) (2015-02-24)

Recommended does not always mean that I agree with what’s written. In this case, it means that this is something that is important to read because it offers an important perspective. And this editorial offers the perspective that all p-values and all confidence intervals are so fatally flawed that they are banned from all future publications in this journal. The editorial goes further to criticize most Bayesian methods because of the problems with the “Laplacian assumption.” The editorial authors have trouble with some of the ambiguities associated with creating a non-informative prior distribution, that is, a prior distribution that represents a “state of ignorance.” They will accept Bayesian analyses on a case-by-case basis. Throwing out most Bayesian analyses, all p-values, and all confidence intervals makes you wonder what they will accept. They suggest larger than typical sample sizes, strong descriptive statistics (which they fail to define), and effect sizes. They believe that “banning the NHSTP will have the effect of increasing the quality of submitted manuscripts by liberating authors from the stultified structure of NHSTP thinking thereby eliminating an important obstacle to creative thinking.” It’s worth debating this issue, though I think that these recommendations are far too extreme.

B-1270. Recommendation: P-Values (2015-02-03)

Randall Munroe, author of the xkcd comic strip, will often comment on Statistics. This cartoon shows how p-values are typically interpreted.

B-1269. Recommendation: New R Package: cdcfluview (2015-02-03)

I work a lot with secondary datasets and I’m always looking for new and interesting resources. There is a CDC site that tracks flu reports and with a bit of effort, you can get the raw data behind these reports. A blogger, hrbrmstr (Bob Rudis, if you dig long enough to find his real name), developed an R package that makes it easy to import this data into R. He illustrates the use of this package with a graph that shows some interesting trend lines across several major cities.

B-1268. Recommendation: Avoidable waste in the production and reporting of research evidence (2015-01-15)

Not found

B-1267. Recommendation: The answer is 17 years, what is the question: understanding time lags in translational research (2015-01-15)

Not found

B-1266. Recommendation: The answer is 17 years, what is the question. Understanding time lags in translational research (2015-01-15)

A widely quoted statistic is that it takes 17 years for research to find its way from the initial discovery to clinical practice. That statistic has always bothered me. How do you know that it takes this long? How could you measure such a thing? Wouldn’t it depend on the type of discovery? Apparently, I’m not the only one bothered by this statistic. The authors of this research paper looked at all the publications that purported to estimate the time lag between discovery and clinical adoption. They found that different authors used different markers for the date of discovery and the date of clinical adoption. Furthermore, reporting is poor, with little discussion of the variation in the estimated time lag.

B-1265. Recommendation: Tessera. Open source environment for deep analysis of large complex data (2015-01-12)

I have not had time to preview this software, but it looks very interesting. It takes large problems and converts them to a form for parallel processing, not by changing the underlying algorithm, which would be very messy, but by splitting the data into subsets, analyzing each subset, and recombining these results. Such a method, “Divide and Recombine,” should work well for some analyses, but perhaps not so well for others. It is based on the R programming language. If I get a chance to work with this software, I’ll let you know what I think.
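
Here is a minimal sketch of the split, analyze, and recombine idea in plain Python. It is not Tessera's own interface (which is R based); the function names and the simulated data are assumptions made up for illustration. It also hints at why the approach suits some analyses better than others: recombining subset means with weights reproduces the overall mean exactly, while averaging subset medians only approximates the overall median.

```python
import random
import statistics

def divide(data, n_subsets):
    """Split data into n_subsets chunks by taking every n-th value."""
    return [data[i::n_subsets] for i in range(n_subsets)]

def recombine_mean(subsets):
    """Weight each subset mean by its size; this recovers the overall mean."""
    total = sum(len(s) for s in subsets)
    return sum(statistics.fmean(s) * len(s) for s in subsets) / total

def recombine_median(subsets):
    """Average the subset medians; this only approximates the overall median."""
    return statistics.fmean([statistics.median(s) for s in subsets])

# Simulated, skewed toy data (an assumption for this sketch, not real data).
random.seed(1)
data = [random.expovariate(1.0) for _ in range(10_000)]
subsets = divide(data, 20)

full_mean, dr_mean = statistics.fmean(data), recombine_mean(subsets)
full_median, dr_median = statistics.median(data), recombine_median(subsets)

print(f"mean:   full data {full_mean:.6f}, divide and recombine {dr_mean:.6f}")
print(f"median: full data {full_median:.6f}, divide and recombine {dr_median:.6f}")
```

In a real setting each subset would be analyzed on a separate processor or machine. The loop over subsets here just stands in for that parallel step.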

B-1264. Recommendation: Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement (2015-01-08)

If you are writing up a paper that uses a complex regression model (complex meaning multiple independent variables), you need to document information that allows the reader to assess the quality of the predictions that your model would produce. This paper provides a checklist of things that you need to document in such a paper, and is an extension of the CONSORT guidelines to this particular type of research.

B-1263. Recommendation: Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement (2015-01-08)

Not found

B-1262. Recommendation: Report on Survey Participation Refusals (2014-12-23)

The American Association for Public Opinion Research (AAPOR) convened a task force to address the increasing tendency of people to refuse to respond (as is their right) to a survey. This group prepared a report, published in September 2014, documenting what a refusal is and characterizing who refuses to participate in surveys. The report also discusses efforts to persuade initially reluctant individuals to participate (refusal conversion) and how that effort might infringe on someone’s privacy. If you are conducting almost any type of survey, you will have to confront participation refusals, and this document can serve as a starting point for handling the conflicting demands of scientific integrity and an individual’s right to be left alone.

B-1261. Recommendation: In search of justification for the unpredictability paradox (2014-12-11)

This is a commentary on a 2011 Cochrane Review that found substantial differences between studies that were adequately randomized and those that were not adequately randomized. The direction of the difference was not predictable, however, meaning that there was not a consistent bias on average towards overstating the treatment effect or a consistent bias on average towards understating the treatment effect. This leads the authors of the Cochrane review to conclude that “the unpredictability of random allocation is the best protection against the unpredictability of the extent to which non-randomised studies may be biased.” The authors of the commentary provide a critique of this conclusion on several grounds.

B-1260. Recommendation: In search of justification for the unpredictability paradox (2014-12-11)

Not found