Bug 17227 - nlme::gnls seg.fault (reported in 2008)
Summary: nlme::gnls seg.fault (reported in 2008)
Status: NEW
Alias: None
Product: R
Classification: Unclassified
Component: Low-level (show other bugs)
Version: R-devel (trunk)
Hardware: Other Linux
: P5 enhancement
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2017-02-22 11:47 UTC by Martin Maechler
Modified: 2017-03-10 14:24 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Maechler 2017-02-22 11:47:31 UTC
When searching for (known) bugs in nlme::gnls, I stumbled accross an old bug report about a fatal behavior in gnls  with "random" seg.faults, which must be from memory corruption,
 https://stat.ethz.ch/pipermail/r-devel/2008-September/050816.html

which was concluded there as "something like a wishlist",  https://mailman.stat.ethz.ch/pipermail/r-devel/2008-October/050926.html

I think we should have it here in our bug repository, 
as something we would be happy if someone addressed it.

Here's my current version of the reproducible code example

library(nlme)
##
r <- vector("list", 200) # now (64bit, 2017, R 3.3.2) takes longer before seg.fault
## for(i in 1:20)# 3 to 8 times were sufficient in 2008 for a seg.fault
for(i in seq_along(r)) {
    cat(i)
    r[[i]] <- tryCatch(
        gnls.exp <- gnls(circumference ~ Asym/(1 + exp(-(age-xmid)/scal)) ,
                         data = Orange, correlation = corExp(form = ~1 | Tree),
                         start = c(Asym=150, xmid=750, scal=300))
       , error = conditionMessage)
    cat(". ")
    if(i %% 10 == 0) {
        cat("\n Unique error messages till now:\n ")
        print(unique(unlist(r)))
    }
}


Note that the crash happens "randomly" and using R in the debugger and looking at the backtrace has *NOT* been useful for me, because the stack shows only "base R" memory allocation routines,  and nothing from nlme  which must be the culprit....  

There, the "culprit" is most probably in   nlme/src/gnls.c  (but possibly in the code *that* calls).
Nlme is maintained using svn, specifically the above file is always current at
https://svn.r-project.org/R-packages/trunk/nlme/src/gnls.c
Comment 1 Mikko Korpela 2017-03-10 14:24:52 UTC
This is my short technical analysis of what seems to be happening. I don't know the code well enough to understand which part of it is wrong or should be changed. Line numbers are for revision 7332 of the nlme SVN repository.

The relevant lines from running the example under R -d valgrind are:

1==19916== Invalid read of size 8
==19916==    at 0x11AACEF6: mult_mat (matrix.c:90)
==19916==    by 0x11AAAA02: corStruct_recalc (corStruct.c:60)
==19916==    by 0x11AAC791: gnls_objective (gnls.c:97)
==19916==    by 0x11AAC944: gnls_iterate (gnls.c:132)
==19916==    by 0x11AAC944: fit_gnls (gnls.c:183)

and

==19916== Invalid write of size 8
==19916==    at 0x4C2F793: memcpy@@GLIBC_2.14 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==19916==    by 0x11AACDC5: memcpy (string3.h:51)
==19916==    by 0x11AACDC5: copy_mat (matrix.c:62)
==19916==    by 0x11AACF60: mult_mat (matrix.c:95)
==19916==    by 0x11AAAA02: corStruct_recalc (corStruct.c:60)
==19916==    by 0x11AAC791: gnls_objective (gnls.c:97)
==19916==    by 0x11AAC944: gnls_iterate (gnls.c:132)
==19916==    by 0x11AAC944: fit_gnls (gnls.c:183)

What is not shown on those lines is that valgrind also reports the same memory address for both the read and the write.

In this example, an array of 140 double precision numbers is allocated in a call to gnls_init() made from fit_gnls(). On line 71 of gnls.c, nResult = 140 and the memory is allocated on line 72:

  nResult = evaluate(ptheta, npar, model, gnls->result);
  gnls->result[0] = Calloc(nResult, double);

The number 140 comes from 35 observations * 4, where 4 is 1 + number of parameters (Asym, xmid, scal).

Then we follow the call chain fit_gnls() -> gnls_iterate() -> gnls_objective() -> corStruct_recalc() -> mult_mat(), where both reads and writes relative to a pointer to that 140-element array fail to stay within the limits of the array, which is probably wrong.

On lines 55-64 of corStruct.c we have corStruct_recalc():

void
corStruct_recalc(double *Xy, int *pdims, int *ZXcol, double *Factor)
{
    int N = pdims[0], M = pdims[1], *len = pdims + 4, *start = len + M, i;
    for(i = 0; i < M;  i++) {
	mult_mat(Xy + start[i], N, Factor, len[i], len[i], len[i],
		 Xy + start[i], N, *ZXcol);
	Factor += (len[i] * len[i]);
    }
}

There, in this example, M is 5, and start[i] has values {0, 21, 42, 63, 84}. Xy is a pointer to the 140-element array.

In matrix.c, lines 79-98, we have mult_mat(): 

double *
mult_mat(double *z, int ldz,
	 double *x, int ldx, int xrows, int xcols,
	 double *y, int ldy, int ycols)
{				/* z <- x %*% y */
  double *t, *tmp = Calloc((size_t)(xrows * ycols), double);
  int i, j;			/* use tmp so z can be either x or y */

  t = tmp;
  for (i = 0; i < ycols; i++) {
    for (j = 0; j < xcols; j++) {
      d_axpy(t, y[j], x + j * ldx, xrows);
    }
    t += xrows;
    y += ldy;
  }
  copy_mat(z, ldz, tmp, xrows, xrows, ycols);
  Free(tmp);
  return z;
}

The call from corStruct_recalc() to mult_mat() sets z and y to Xy + start[i], the largest value of which in this example is Xy + 84. The other arguments get the following values: x is a pointer to (a part of) a 245-element array; ldz and ldy are 35; ldx, xrows and xcols are 7, and ycols is 4.

In this example, the read from y[j] in mult_mat() at worst (i.e. the largest array index) refers to y[111], which is Xy[195] for start[4] == 84. Remember that the last element allocated for the array is Xy[139]. The same worst case (Xy[195]), this time for write access, occurs in the call to copy_mat() when z points to Xy + 84.