Bug 16881 - readChar() on url() returns inconsistent number of characters
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: I/O
Version: R 3.3.*
Hardware: x86_64/x64/amd64 (64-bit) OS X Mavericks
Importance: P5 normal
Assignee: R-core
Reported: 2016-05-05 16:53 UTC by Erik Wright
Modified: 2016-05-19 13:50 UTC

Description Erik Wright 2016-05-05 16:53:07 UTC
When I use readChar() to read a UTF-8 file from a URL (ftp://) connection with useBytes = TRUE, it returns an inconsistent number of characters (and fewer than requested). The same read works fine if I first download the file and then use readChar() to read the local copy. It also works fine if I use useBytes = FALSE.

#######################

Commands I used:

ftp <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Yersinia_pestis_Antiqua_uid58607/NC_008150.fna"

# read the text file from url() using useBytes = TRUE
for (i in 1:10) {
	u <- url(ftp, open = "rb")
	r <- readChar(u, nchars = 1e5, useBytes = TRUE)
	print(nchar(r))
	close(u)
}

# read the text file from url() using useBytes = FALSE
u <- url(ftp, open = "rb")
r <- readChar(u, nchars = 1e7, useBytes = FALSE)
nchar(r)
close(u)

# read the text file locally using useBytes = TRUE
tf <- tempfile()
download.file(ftp, tf, quiet=TRUE)
r <- readChar(tf, nchars = 1e7, useBytes = TRUE)
unlink(tf)
nchar(r)

sessionInfo()

#######################

The output of running these commands is shown below:

> ftp <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Yersinia_pestis_Antiqua_uid58607/NC_008150.fna"
> 
> for (i in 1:10) {
+ u <- url(ftp, open = "rb")
+ r <- readChar(u, nchars = 1e5, useBytes = TRUE)
+ print(nchar(r))
+ close(u)
+ }
[1] 5472
[1] 4104
[1] 5472
[1] 5472
[1] 5472
[1] 4104
[1] 5472
[1] 5472
[1] 4104
[1] 2736
> 
> u <- url(ftp, open = "rb")
> r <- readChar(u, nchars = 1e7, useBytes = FALSE)
> nchar(r)
[1] 4769548
> close(u)
> 
> tf <- tempfile()
> download.file(ftp, tf, quiet=TRUE)
> r <- readChar(tf, nchars = 1e7, useBytes = TRUE)
> unlink(tf)
> nchar(r)
[1] 4769548
> 
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
Comment 1 Erik Wright 2016-05-19 10:33:57 UTC
Update on attempting to debug this problem:

Specifying method="libcurl" in url() enables the entire connection to be read by readChar(). So the issue probably has something to do with method="internal" (the default)?

ftp <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Yersinia_pestis_Antiqua_uid58607/NC_008150.fna"

u <- url(ftp, open = "rb", method="internal")
r <- readChar(u, nchars = 1e7, useBytes = TRUE)
print(nchar(r)) # incomplete
close(u)

u <- url(ftp, open = "rb", method="libcurl")
r <- readChar(u, nchars = 1e7, useBytes = TRUE)
print(nchar(r)) # complete
close(u)
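
A possible workaround for method="internal", sketched under the assumption that the short results are partial reads rather than a truncated transfer (I have not verified this against the FTP server above): keep reading until a read returns nothing. The helper name read_all_bytes and its chunk_size argument are hypothetical, not part of base R; the demonstration uses an in-memory rawConnection() instead of the network.

```r
# Hypothetical helper (not part of base R): drain a binary connection by
# reading fixed-size chunks until a read returns zero bytes.
# Assumption: a zero-length read means end of stream for this connection.
read_all_bytes <- function(con, chunk_size = 65536L) {
  chunks <- list(raw(0))  # seed so the result is always a raw vector
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    chunks[[length(chunks) + 1L]] <- chunk
  }
  do.call(c, chunks)
}

# Offline demonstration: the chunk size is deliberately smaller than the
# payload, so several reads are needed to recover all of it.
payload <- charToRaw(paste(rep("ACGT", 1000), collapse = ""))
con <- rawConnection(payload)
bytes <- read_all_bytes(con, chunk_size = 1368L)
close(con)
print(identical(bytes, payload))
print(nchar(rawToChar(bytes), type = "bytes"))
```

With the connection from this report, the same helper would be called as rawToChar(read_all_bytes(u)) after u <- url(ftp, open = "rb", method = "internal"), followed by close(u).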
Comment 2 Erik Wright 2016-05-19 13:50:30 UTC
Update: setting options(internet.info = 0) provides this additional information:

In url(ftp, open = "rb", method = "internal") : 
<<<
150 Opening BINARY mode data connection for /genomes/archive/old_refseq/Bacteria/Yersinia_pestis_Antiqua_uid58607/NC_008150.fna (4769548 bytes)

So it seems to be establishing the connection properly.
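
As a cross-check, the size announced in the 150 reply can be compared with the number of characters actually read. This is a minimal sketch, assuming the reply keeps the "(N bytes)" form shown above; the variable names and the is_complete helper are hypothetical.

```r
# The 150 reply line exactly as reported above.
reply <- "150 Opening BINARY mode data connection for /genomes/archive/old_refseq/Bacteria/Yersinia_pestis_Antiqua_uid58607/NC_008150.fna (4769548 bytes)"

# Extract the "(N bytes)" part, then keep only its digits.
size_field <- regmatches(reply, regexpr("\\([0-9]+ bytes\\)", reply))
expected_bytes <- as.numeric(gsub("[^0-9]", "", size_field))

# Hypothetical helper: TRUE when a read returned the whole file.
is_complete <- function(n_read) n_read == expected_bytes

print(expected_bytes)
print(is_complete(5472))     # one of the short reads from the loop above
print(is_complete(4769548))  # the count from the local and useBytes = FALSE reads
```

The server announces the same size that the local read and the useBytes = FALSE read return, which is consistent with the connection being set up correctly and the shortfall arising on the reading side.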