Bug 17311 - huge single-line file with embedded nul causes segfault
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: I/O
Version: R 3.3.*
Hardware: All Windows 64-bit
Importance: P5 trivial
Assignee: R-core
Reported: 2017-07-15 15:25 UTC by Anthony Damico
Modified: 2017-07-18 08:34 UTC

Attachments
Patch for readLines (2.14 KB, patch)
2017-07-17 14:27 UTC, Hannes Mühleisen
Description Anthony Damico 2017-07-15 15:25:48 UTC
hi, this segfault occurs on a text file in an R session without contributed packages.  i could not figure out how to *create* the text file without contributed packages, but from the example below, you can see that the segfault can be reproduced in a session without loading any external libraries.  hope this write-up makes sense.  here's the related thread on r-help:  http://r.789695.n4.nabble.com/readLines-without-skipNul-TRUE-causes-crash-tt4741892.html



# # # uses a contributed package just to *create* the problem file
# # # the segfault can be triggered in a separate session

	# change this line to save the problem file on your local disk
	path_to_problem_file <- "C:/My Directory/problem file.txt"

	install.packages( "devtools" )
	devtools::install_github("jimhester/archive")

	file_folder <- file.path( tempdir() , "file_folder" )

	tf <- tempfile()

	# warning: huge download
	download.file( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )

	archive::archive_extract( tf , dir = normalizePath( file_folder ) )

	unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE  )

	infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )

	# copy the problem file somewhere that it can be accessed in a different session
	file.copy( infile , path_to_problem_file )
	
# # # # close and re-open R # # # #
# # # # do not use any contributed packages in the new session

	# reading in this file alone causes a crash
	x <- readLines( "C:/My Directory/problem file.txt" )
Comment 1 Anthony Damico 2017-07-17 11:59:12 UTC
hi, these four lines of code crash for me on windows R 3.4.1 and also on linux R version 3.3.3.  i apologize for the large download, i believe this is overloading some buffer and needs to be close to as big as it is to trigger the segfault.  watching on windows task manager, the segfault seems to happen when rterm.exe hits 2GB of RAM


	# consider changing `tempfile()` to a permanent location
	# so you don't lose the large downloaded file after the crash
	tf <- tempfile()
	download.file( "https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt" , tf , mode = 'wb' )
	sessionInfo()
	x <- readLines( tf )


==================================


	Microsoft Windows [Version 6.1.7601]
	Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

	C:\Users\AnthonyD>"c:\Program Files\r\R-3.4.1\bin\x64\Rterm.exe"

	R version 3.4.1 (2017-06-30) -- "Single Candle"
	Copyright (C) 2017 The R Foundation for Statistical Computing
	Platform: x86_64-w64-mingw32/x64 (64-bit)

	R is free software and comes with ABSOLUTELY NO WARRANTY.
	You are welcome to redistribute it under certain conditions.
	Type 'license()' or 'licence()' for distribution details.

	  Natural language support but running in an English locale

	R is a collaborative project with many contributors.
	Type 'contributors()' for more information and
	'citation()' on how to cite R or R packages in publications.

	Type 'demo()' for some demos, 'help()' for on-line help, or
	'help.start()' for an HTML browser interface to help.
	Type 'q()' to quit R.

	>
	>
	> tf <- tempfile()
	> download.file( "https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt" , tf $
	trying URL 'https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt'
	Content type 'text/plain; charset=UTF-8' length 2526887936 bytes (2409.8 MB)
	downloaded 2409.8 MB

	> sessionInfo()
	R version 3.4.1 (2017-06-30)
	Platform: x86_64-w64-mingw32/x64 (64-bit)
	Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

	Matrix products: default

	locale:
	[1] LC_COLLATE=English_United States.1252
	[2] LC_CTYPE=English_United States.1252
	[3] LC_MONETARY=English_United States.1252
	[4] LC_NUMERIC=C
	[5] LC_TIME=English_United States.1252

	attached base packages:
	[1] stats     graphics  grDevices utils     datasets  methods   base

	loaded via a namespace (and not attached):
	[1] compiler_3.4.1
	> x <- readLines( tf )

	C:\Users\AnthonyD>






==================================

	[damico@rocks010 ~]$ R

	R version 3.3.3 (2017-03-06) -- "Another Canoe"
	Copyright (C) 2017 The R Foundation for Statistical Computing
	Platform: x86_64-redhat-linux-gnu (64-bit)

	R is free software and comes with ABSOLUTELY NO WARRANTY.
	You are welcome to redistribute it under certain conditions.
	Type 'license()' or 'licence()' for distribution details.

	  Natural language support but running in an English locale

	R is a collaborative project with many contributors.
	Type 'contributors()' for more information and
	'citation()' on how to cite R or R packages in publications.

	Type 'demo()' for some demos, 'help()' for on-line help, or
	'help.start()' for an HTML browser interface to help.
	Type 'q()' to quit R.

	>
	>
	> tf <- tempfile()
	> download.file( "https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt" , tf , mode = 'wb' )
	trying URL 'https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt'
	Content type 'text/plain; charset=UTF-8' length 2526887936 bytes (2409.8 MB)
	==================================================
	downloaded 2409.8 MB

	> sessionInfo()
	R version 3.3.3 (2017-03-06)
	Platform: x86_64-redhat-linux-gnu (64-bit)
	Running under: Fedora 24 (Twenty Four)

	locale:
	 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
	 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
	 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
	 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
	 [9] LC_ADDRESS=C               LC_TELEPHONE=C
	[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

	attached base packages:
	[1] stats     graphics  grDevices utils     datasets  methods   base
	> x <- readLines( tf )

	 *** caught segfault ***
	address 0x7cffffff, cause 'memory not mapped'

	Traceback:
	 1: readLines(tf)

	Possible actions:
	1: abort (with core dump, if enabled)
	2: normal R exit
	3: exit R without saving workspace
	4: exit R saving workspace
	Selection:
	--------------------------------------------------------------------------------
Comment 2 Martin Maechler 2017-07-17 13:56:57 UTC
Thank you, Anthony.

I confirm the segmentation fault, also for me on (Fedora F24) Linux, 64-bit.
Calling R as  `R -d gdb`  gives the extra information that indeed the seg.fault is a buffer overflow in R's own do_readLines:

> x <- readLines( largeF )

Program received signal SIGSEGV, Segmentation fault.
0x000000000047c754 in do_readLines (call=<optimized out>, op=<optimized out>, 
    args=<optimized out>, env=<optimized out>)
    at ../../../R/src/main/connections.c:3664
3664		    if(c != '\n') buf[nbuf++] = (char) c; else break;
Missing separate debuginfos, .................

(gdb) p nbuf
$1 = 2097152000
(gdb) 
---  and from another R session:

> .Machine$integer.max - 2^31
[1] -1
> 2097152000 / .Machine$integer.max
[1] 0.9765625
> 2097152000 - .Machine$integer.max
[1] -50331647
> log2(50331647)
[1] 25.58496
> 
--------------
I *am* not yet about to fix it but rather looking forward to patch proposals for
R's  src/main/connections.c
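
To make the arithmetic above concrete: the value gdb reports for nbuf, 2097152000, equals 1000 * 2^21, and doubling it once more gives 4194304000, which is above .Machine$integer.max (2147483647). The C sketch below only illustrates that observation. The 1000-byte starting size, the doubling policy, and both function names are assumptions made for illustration; nothing in this thread states how the buffer in connections.c actually grows.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: count how many times a capacity can be doubled
 * while the result still fits in an int. *last receives the largest
 * capacity reached. The bound is checked BEFORE doubling, so the
 * multiplication itself never overflows. */
int doublings_fitting_int(int start, int *last)
{
    assert(start > 0 && last != NULL);
    int n = 0;
    int cap = start;
    while (cap <= INT_MAX / 2) {
        cap *= 2;
        n++;
    }
    *last = cap;
    return n;
}

/* Hypothetical helper: the same doubling step carried out in size_t,
 * which can hold values above INT_MAX. */
size_t double_capacity(size_t cap)
{
    return cap * 2;
}
```

Under these assumptions, `doublings_fitting_int(1000, &last)` stops with `last` equal to 2097152000, the same number gdb printed, while `double_capacity` can represent the next step because `size_t` is unsigned and at least as wide as needed for 4194304000 on the platforms discussed here.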
Comment 3 Hannes Mühleisen 2017-07-17 14:27:20 UTC
Created attachment 2274 [details]
Patch for readLines

Here is a patch that adds support for long lines in readLines
Comment 4 Martin Maechler 2017-07-18 08:34:22 UTC
(In reply to Hannes Mühleisen from comment #3)
> Created attachment 2274 [details]
> Patch for readLines
> 
> Here is a patch that adds support for long lines in readLines

Thank you, Hannes.
Indeed as you show, it is only that we need to use `size_t` instead of `int`
in a few places.

In addition to your patch, logically we should also change the _return_ type of `Rconn_getline` to size_t, not only the type of its `bufsize` argument.
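
As a rough illustration of the point about types, here is a minimal getline-style reader whose `bufsize` argument, internal counter, and return value are all `size_t`. It is a sketch only: `sketch_getline` and its `hit_eof` parameter are hypothetical names, it reads from a standard `FILE *` rather than an R connection, and it is not the code in the attached patch or in R's `Rconn_getline`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, NOT R's Rconn_getline: read bytes from fp up to a
 * newline, or until bufsize - 1 bytes are stored, then nul-terminate.
 * Returns the number of bytes stored. *hit_eof is set to true when the
 * stream ended before a newline was seen. */
size_t sketch_getline(FILE *fp, char *buf, size_t bufsize, bool *hit_eof)
{
    assert(fp != NULL && buf != NULL && hit_eof != NULL);

    size_t nbuf = 0;   /* size_t counter: cannot go negative */
    int c;             /* int, so that EOF stays distinguishable */

    *hit_eof = false;
    if (bufsize == 0)
        return 0;      /* no room even for the terminating nul */

    while (nbuf + 1 < bufsize) {   /* keep one byte for the nul */
        c = fgetc(fp);
        if (c == EOF) {
            *hit_eof = true;
            break;
        }
        if (c == '\n')
            break;
        buf[nbuf++] = (char) c;
    }
    buf[nbuf] = '\0';
    return nbuf;
}
```

A return value of `bufsize - 1` with `hit_eof` still false may mean the line was cut off because the buffer was full, so a caller that needs whole lines would have to grow the buffer and continue reading.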

Also, your patch proposes to slightly change nbuf counting... but after applying it, something is wrong and  tests/reg-tests-1a.R  fails.

After applying (your patch, or my version which keeps nbuf counting, and adds the above `size_t`), interestingly one does not get an error but 2 warnings
which I think _are_ helpful enough :

> x <- readLines( largeF )
Warning messages:
1: In readLines(largeF) : line 80937 appears to contain an embedded nul
2: In readLines(largeF) :
  incomplete final line found on '/userdata/maechler/R/r-bug-17311-crash.txt'
> str(x)
 chr [1:80937] "1000039779602009 22F3113404CARATINGA                                                                           "| __truncated__ ...
> ncx <- nchar(x,"bytes")
> summary(ncx)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    616     982     982     982     982     982 
> head(ncx)
[1] 982 982 982 982 982 982
> tail(ncx)
[1] 982 982 982 982 982 616
> 

--------
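
In the session output above, the last element has 616 bytes while every other line has 982, and the first warning mentions an embedded nul. One plausible reading, which is an assumption and not something stated in this thread, is that the final line is simply seen as ending at its first nul byte. The C sketch below shows how such a nul can be detected when the number of bytes read is known; both function names are hypothetical and this is not the check used in connections.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true when the first nbytes bytes of buf contain a
 * nul byte, i.e. when C string functions would see a shorter line than
 * was actually read. */
bool has_embedded_nul(const char *buf, size_t nbytes)
{
    assert(buf != NULL);
    return memchr(buf, '\0', nbytes) != NULL;
}

/* Hypothetical helper: length as seen by C string functions, which stop
 * at the first nul byte; equals nbytes when there is no nul. */
size_t visible_length(const char *buf, size_t nbytes)
{
    assert(buf != NULL);
    const char *nul = memchr(buf, '\0', nbytes);
    return nul ? (size_t) (nul - buf) : nbytes;
}
```

For a 5-byte line `{'a','b','\0','c','d'}`, `has_embedded_nul` reports true and `visible_length` gives 2, so the bytes after the nul are invisible to anything that treats the buffer as a C string.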

I will commit my patch only to R-devel initially.. just in case there is again code that somehow depends on non-API parts of R's internals..
--> svn rev 72925