Bug 16587 - Incorrect prediction with original data in loess
Summary: Incorrect prediction with original data in loess
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Models (show other bugs)
Version: R-devel (trunk)
Hardware: All All
: P4 normal
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2015-11-03 12:41 UTC by Peter Saunders
Modified: 2015-12-14 13:44 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Peter Saunders 2015-11-03 12:41:57 UTC
When using loess with "direct" surface and weights the result of 'predict' is not the same as the fitted values.

This appears similar to Bug 611, which was fixed, but when using weights.

This only occurs for surface="direct", but for families both "symmetric" and "gaussian", statistics both "approximate" and "exact" and trace.hat both "exact" and "approximate".


# ----- Example Code ----- #
# Loess
cars.lo = loess(dist ~ speed, cars, control = loess.control(surface="direct"))
carsPred    = predict(cars.lo)
carsPredNew = predict(cars.lo, newdata=cars)

all.equal(carsPred, carsPredNew, check.attributes = FALSE)
#[1] TRUE
# This is fine, equivalent to the fix of bug 611

# Loess with weights
set.seed(123)
wt = runif(nrow(cars))

cars.wt = loess(dist ~ speed, cars, weights = wt, control = loess.control(surface="direct"))
carsPred    = predict(cars.wt)
carsPredNew = predict(cars.wt, newdata=cars)

all.equal(carsPred, carsPredNew, check.attributes = FALSE)
#[1] "Mean relative difference: 0.02900143"
# This is not right

# Look at mean square error
weighted.mean((cars$dist - predict(cars.wt))^2, wt)
weighted.mean((cars$dist - predict(cars.wt, newdata=cars))^2, wt)

#[1] 184.4413
#[1] 186.9075
# The error is (always) worse for the re-predict method,
# which suggests this is where the problem is
# ----- End of Example ----- #
Comment 1 Peter Saunders 2015-11-04 17:27:14 UTC
After a bit of testing we think we have figured out what the issue is.  The issue is with the scaling of the 'robust' variable in the simpleLoess and predLoess functions.

On line 133 of loess.R, in the function simpleLoess you find robust is scaled by the weights:
	robust <- weights * robust
	
This is then stored into the loess$pars list towards the end of the function so that the robust variable which is stored in the loess object is scaled by the weights.

On lines 289 and 310 of loess.R, in the function predLoess, the weights vector is multiplied by the robust vector in the call to C_loess_dfitse and C_loess_dfit respectively.  

Since the 'robust' value has already been multiplied by the weights we are calculating the new prediction passing the weights variable as the weights squared, which is wrong.

Here is an example to demonstrate this:

# ----- Example Code ----- #
# Loess with weights
set.seed(123)
wt = runif(nrow(cars))

cars.wt = loess(dist ~ speed, cars, weights = wt, control = loess.control(surface="direct"))
carsPred    = predict(cars.wt)
carsPredNew = predict(cars.wt, newdata=cars)

# Divide robust by weights to remove re-scaling effect
cars.wt.robustFixed = cars.wt
cars.wt.robustFixed$pars$robust = cars.wt$pars$robust / cars.wt$weight

carsPredNewFixed = predict(cars.wt.robustFixed, newdata=cars)

all.equal(carsPred, carsPredNew, check.attributes = FALSE)
# [1] "Mean relative difference: 0.02900143"
# This is not right

all.equal(carsPred, carsPredNewFixed, check.attributes = FALSE)
# [1] TRUE
# This is now good
# ----- End of Example ----- #

I think, for family="gaussian", that the weights argument (the 4th argument) in the calls to C_loess_dfitse and C_loess_dfit should not be multiplied by robust.

The family = "symmetric" case is even more complicated, as the iterations can be > 1 and I am less sure if this is correct.
Comment 2 Martin Maechler 2015-11-05 17:02:54 UTC
I see that this is clearly a bug.
You are right the robust case (family = "symmetric") needs special care... and so I will look at that particularly.
Comment 3 Martin Maechler 2015-11-10 16:32:24 UTC
(In reply to Martin Maechler from comment #2)
> I see that this is clearly a bug.
> You are right the robust case (family = "symmetric") needs special care...
> and so I will look at that particularly.

After lots of looking, I've also added a new feature: `iterTrace = TRUE` now allows to get some robustness iteration monitoring... and
I also have added (incomplete) checks for the surface x family x statistics x trace.hat combinations, and had to look at the C code to see where and when 
'robust' and 'weights' are used there.
{{and there may be more simplification possible,..}}

Commited to R-devel (svn rev  ) -- to be ported to R '3.2.2 patched' subsequently.