Bug 16513 - minimal reproducible read.fwf() example that crashes the console on windows 8 with 32-bit R
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: Windows GUI / Window specific
Version: R 3.2.1
Hardware: Other Windows 32-bit
Importance: P5 normal
Assignee: R-core
Reported: 2015-08-18 18:44 UTC by Anthony Damico
Modified: 2017-02-02 15:07 UTC
Attachments
Proposed patch (831 bytes, patch)
2015-08-25 23:23 UTC, Mikko Korpela
Function for splitting a long line of text (6.59 KB, text/plain)
2015-08-25 23:51 UTC, Mikko Korpela

Description Anthony Damico 2015-08-18 18:44:49 UTC
in addition to the sessionInfo() below, this bug was reproduced on R version 3.2.1 and also on windows 7.  copying and pasting this code into the 32-bit R console on windows should prompt a crash.  thanks!!



sessionInfo()
# R version 3.2.2 (2015-08-14)
# Platform: i386-w64-mingw32/i386 (32-bit)
# Running under: Windows 8 x64 (build 9200)

# locale:
# [1] LC_COLLATE=English_United States.1252
# [2] LC_CTYPE=English_United States.1252  
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C                         
# [5] LC_TIME=English_United States.1252   

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base   

setInternet2( FALSE )

widths <- c(5, 2, -3, 2, 2, 1, 1, 1, 1, 1, 1, 5, -2, 2, 1, 1, 1, 2, 2,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, -1, 1, 2, 1, 2, 1, 2,
2, 1, 1, 1, 5, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1,
2, 1, 2, 2, 1, 1, 1, 5, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
2, 2, 1, 2, 1, 2, 2, 1, 1, 1, 5, 1, 1, 2, 2, 1, 1, 1, 5, 5, 5,
5, 5, 5, 1, 3, 5, 5, 3, 5, 5, 3, 5, 5, 3, 5, 5, 5, 1, 1, 1, 1,
1, 1, 1, 3, 4, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1,
2, 2, 2, 2, 2, 7, 7, 7, 7, 7, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, -2369, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8)

varnames <- c("SEQNUM", "RECTYPE", "PREG_NUM", "PREGTYPE", "NUMBIRTH", "OUTCOME1",
"OUTCOME2", "OUTCOME3", "DELIVERY", "NEWFLAG", "B14MO", "B_15",
"B_16", "BOX7", "B_17", "B_18", "B_19", "B_20", "B_21", "B_22",
"B_23", "B_24", "B25A", "B25B", "B25C", "B25D", "B25E", "B25F",
"B_26", "B_27", "B_28", "B29A", "B29B", "B29C", "B29D", "B29E",
"B29F", "B29G", "B_30", "BOX8", "BLIVEBIR", "LASTPREG", "B12_1",
"B31LB_1", "B31OZ_1", "B32_1", "BOX10_1", "B33A_1", "B33B_1",
"B33C_1", "B33D_1", "B33E_1", "B33F_1", "B34_1", "B35_1", "B36_1",
"B37_1", "B38_1", "BOX11_1", "B39_1", "B40_1", "B41MO_1", "BOX12_1",
"B42_1", "B43_1", "B44_1", "B12_2", "B31LB_2", "B31OZ_2", "B32_2",
"BOX10_2", "B33A_2", "B33B_2", "B33C_2", "B33D_2", "B33E_2",
"B33F_2", "B34_2", "B35_2", "B36_2", "B37_2", "B38_2", "BOX11_2",
"B39_2", "B40_2", "B41MO_2", "BOX12_2", "B42_2", "B43_2", "B44_2",
"B12_3", "B31LB_3", "B31OZ_3", "B32_3", "BOX10_3", "B33A_3",
"B33B_3", "B33C_3", "B33D_3", "B33E_3", "B33F_3", "B34_3", "B35_3",
"B36_3", "B37_3", "B38_3", "BOX11_3", "B39_3", "B40_3", "B41MO_3",
"BOX12_3", "B42_3", "B43_3", "B44_3", "B_45", "B_46", "C12A",
"C13F1MO", "C13T1MO", "C13F2MO", "C13T2MO", "C13F3MO", "C13T3MO",
"C_14", "C15M1", "C16M1MO", "C17M1MO", "C15M2", "C16M2MO", "C17M2MO",
"C15M3", "C16M3MO", "C17M3MO", "C15M4", "C16M4MO", "C17M4MO",
"C18MO", "C_19", "C_20", "C_21", "C_22", "C_23", "C_24", "C_25",
"PRGLNGTH", "AGEPREG", "WANTWIFE", "WANTMAN", "OUTCOME", "YRPREG",
"FMAROUT", "LIVBABY1", "LIVBABY2", "LIVBABY3", "LOW1", "LOW2",
"LOW3", "PREGTEST", "PNCAREWK", "PNCARENO", "RACE", "CEND84",
"BIRTH071", "BIRTH072", "BIRTH073", "PREGNUM7", "PREGNUM8", "W_1",
"W_2", "W_3", "W_4", "W_5", "FLAG341", "FLAG372", "FLAG373",
"FLAG374", "FLAG375", "FLAG376", "FLAG426", "FLAG427", "FLAG614",
"FLAG621", "FLAG991", "FLAG992", "REPWGT1", "REPWGT2", "REPWGT3",
"REPWGT4", "REPWGT5", "REPWGT6", "REPWGT7", "REPWGT8", "REPWGT9",
"REPWGT10", "REPWGT11", "REPWGT12", "REPWGT13", "REPWGT14", "REPWGT15",
"REPWGT16", "REPWGT17", "REPWGT18", "REPWGT19", "REPWGT20", "REPWGT21",
"REPWGT22", "REPWGT23", "REPWGT24", "REPWGT25", "REPWGT26", "REPWGT27",
"REPWGT28", "REPWGT29", "REPWGT30", "REPWGT31", "REPWGT32", "REPWGT33",
"REPWGT34", "REPWGT35", "REPWGT36", "REPWGT37", "REPWGT38", "REPWGT39",
"REPWGT40", "REPWGT41", "REPWGT42", "REPWGT43", "REPWGT44", "REPWGT45",
"REPWGT46", "REPWGT47", "REPWGT48", "REPWGT49", "REPWGT50", "REPWGT51",
"REPWGT52", "REPWGT53", "REPWGT54", "REPWGT55", "REPWGT56", "REPWGT57",
"REPWGT58", "REPWGT59", "REPWGT60", "REPWGT61", "REPWGT62", "REPWGT63",
"REPWGT64", "REPWGT65", "REPWGT66", "REPWGT67", "REPWGT68", "REPWGT69",
"REPWGT70", "REPWGT71", "REPWGT72", "REPWGT73", "REPWGT74", "REPWGT75",
"REPWGT76", "REPWGT77", "REPWGT78", "REPWGT79", "REPWGT80", "REPWGT81",
"REPWGT82", "REPWGT83", "REPWGT84", "REPWGT85", "REPWGT86", "REPWGT87",
"REPWGT88", "REPWGT89", "REPWGT90", "REPWGT91", "REPWGT92", "REPWGT93",
"REPWGT94", "REPWGT95", "REPWGT96", "REPWGT97", "REPWGT98", "REPWGT99",
"REPWGT100")

x <-
    read.fwf(
        file = "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/1988PregData.dat" ,
        widths = widths ,
        col.names = varnames ,
        comment.char = "" ,
        colClasses = "character" ,
        buffersize = 50 ,
        n = 1000 ,
        skip = 0
    )
Comment 1 Mikko Korpela 2015-08-25 23:23:03 UTC
Created attachment 1898
Proposed patch

Patch for "R Under development (unstable) (2015-08-25 r69177)". Adds USE.NAMES=FALSE to an sapply() call inside read.fwf(). This change seems to solve the excessive memory use that occurs when executing the original poster's example code with an unpatched read.fwf(). It is possible that a lower level fix somewhere else would also solve this read.fwf() issue, but I don't see how removing / not adding names would be harmful here.
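The effect of USE.NAMES can be seen in isolation with a small sketch. The names `lines` and `split_fields` below are hypothetical stand-ins, not code copied from read.fwf(); the sketch only assumes that the affected call has the general shape sapply(character_vector, splitter).

```r
# Hypothetical stand-ins: two short "lines" and a splitter that cuts each
# line into two fields. Neither is copied from read.fwf() itself.
lines <- c("abcdef", "ghijkl")
split_fields <- function(s) substring(s, c(1L, 4L), c(3L, 6L))

# Default behaviour: USE.NAMES = TRUE
with_names <- sapply(lines, split_fields)
# Patched behaviour: names are never attached
without_names <- sapply(lines, split_fields, USE.NAMES = FALSE)

# With USE.NAMES = TRUE, each input string is reused as a column name
print(colnames(with_names))
# With USE.NAMES = FALSE, the result carries no dimnames
print(dimnames(without_names))
# The extracted field values are the same either way
print(identical(unname(with_names), without_names))
```

When an input element is a single very long line, as in the example data file, that whole line is what gets reused as a name, which is presumably where the extra memory goes.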

The patched read.fwf() was tested and compared with the original on two platforms / R versions: R-devel r69177 on Ubuntu 14.04 and R 3.2.2 patched r69078 on OS X 10.7.5.

When running the example code on the first platform with the original read.fwf(), the memory used by R grew to about 15 GB on a machine with 16 GB of RAM. The time used was 49 / 7 / 84 seconds (user / system / elapsed). The other platform had insufficient RAM (4 GB) to finish the example in a reasonably short time.

The patched read.fwf() ran the example case in 11 and 19 seconds (elapsed) on the first and second platform, respectively. The amount of memory used by R didn't grow noticeably from the level before the read.fwf() call. The patched version was source()d into R and not byte-compiled.
Comment 2 Mikko Korpela 2015-08-25 23:51:31 UTC
Created attachment 1899
Function for splitting a long line of text

The original poster's code only extracts one record from the example data file which has no newline characters. This is probably not the intended result. Adding newlines to separate the fixed width records enables read.fwf() to extract all the records in the file.

I wrote a function which splits a character string into fixed width pieces. It is attached here and may be freely used, copied, modified, distributed, whatever. The function accepts input from a character string or connection. The output goes to a character vector or connection.
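For readers without access to the attachment, the idea can be sketched in a few lines. This is not the attached function: `split_fixed`, the toy input string, and the record width of 5 are all assumptions made up for illustration, and the sketch handles only a single in-memory string, not connections.

```r
# Hypothetical helper (not the attached function): cut one long string
# into consecutive pieces of `width` characters; the last piece may be shorter.
split_fixed <- function(x, width) {
  stopifnot(is.character(x), length(x) == 1L, width >= 1L)
  n <- nchar(x)
  if (n == 0L) return(character(0))
  starts <- seq(1L, n, by = width)
  substring(x, starts, pmin(starts + width - 1L, n))
}

# Toy input: two 5-character records run together without a newline
one_long_line <- "AB123CD456"
records <- split_fixed(one_long_line, 5)

# Once the records are separate elements, read.fwf() sees one row per record
con <- textConnection(records)
df <- read.fwf(con, widths = c(2, 3), col.names = c("id", "val"),
               colClasses = "character")
close(con)
print(df)
```

For the real data file the record width would have to come from the file's documentation; it is not stated in this report.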
Comment 3 Mikko Korpela 2017-02-02 15:07:40 UTC
It appears that the issue (excessive memory use with the original poster's example code) has been solved in the development branch of R since revision 70909. The corresponding NEWS item is:

    Speedup in simplify2array() and hence sapply() and
    mapply() (for the case of names and common length > 1),
    thanks to Suharto Anggono's PR 17118.

I tested this on Linux, R-devel revisions 70908 (problem exists) and 70909 (problem is gone), as well as on a recent Windows R-devel build running on Wine (problem is gone).