Welcome to Appraisers' Free Forum

Using R for statistical analysis

Using other software for appraisal work, such as PhotoShop, GIMP, Excel, OpenOffice, and R.

Postby Jim Plante on Tue Nov 18, 2008 10:15 am

R is available from http://www.r-project.org as a free download. It covers most of what the more expensive commercial statistics apps do, but it's free. Its main constraint from our viewpoint is that it uses a command-line interface, the same kind of thing DOS used. (Remember the C: prompt?)

This little blurb will help you get started. The rest is in the documentation.

First, download some data from your MLS. Let's say it's all residential sales over the last two years for a defined neighborhood. You tell your MLS to export the search results as an Excel file.

Open the resulting download in Excel, and look at the results. Your first row should be a header, with names for each of the columns, e.g., Map ID, Property Address, Sale Date, Sale Price, GLA, etc. Why open it in Excel first? Well, because R is very picky about its diet. It won't eat just anything. We're going to use Excel to clean it up a little.

First, look at the header. You'll be typing those names into R commands later on, so now's the time to change them to something easier to read or type. My data service, for example, insists on sending "Last_Sale_Date" as one of the headers. I change that to "sdate" for convenience. These things are case-sensitive, too.
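If you miss a header in Excel, the rename can also be done once the file has been loaded into R (loading is covered in a later post). This is only a minimal sketch: the two-row data frame is made up, and "Sale_Price" is an assumed column name used for illustration.

```r
# Hypothetical two-row data frame standing in for an MLS export
raw <- data.frame(Last_Sale_Date = c("11/9/07", "3/15/08"),
                  Sale_Price     = c(205000, 140000))

# Rename columns by matching on the old names (names are case-sensitive)
names(raw)[names(raw) == "Last_Sale_Date"] <- "sdate"
names(raw)[names(raw) == "Sale_Price"]     <- "sprice"

names(raw)
```

Matching on the old name, rather than on the column position, keeps the rename working even if the data service changes the column order.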

Next, make sure none of the cells are merged. Choose Edit->Select All and un-merge any cells. My service transmits data like this:
Code: Select all
1.-> 091-011.07 ->1194 Sulfur Springs Rd->11/9/070->$205,000->2194 <-
:  ->           ->Selmer, TN 38375    <-

I remove the merged cells, then (since that second line is the owner's city and address and not needed) I sort on Column A. This brings all the "second" lines to the top, and I delete them.

Next thing to do is to select the column of sale prices. My data service sends these in the form "$140,000". R treats a column that contains any non-numeric characters as ALL non-numeric: the whole column is read as text. (Older versions of R store such text as categories, called "factors" in R-speak; R 4.0 and later keep it as plain character strings. Either way, you can't do arithmetic on it.) So you want to change the number format to plain vanilla, thus: "140000". No commas, no dollar signs. Decimals are allowed, but are not practical. Make the same change to land values, improvement values, and assessed values if you downloaded those. **Note: To select the whole column, just click on the column letter at the top of the screen. E.g., clicking the "A" selects all rows in column A from Row 1 to the last row of the sheet (Row 65536 in older versions of Excel).
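If a price column slips through with its dollar signs and commas intact, it can also be repaired inside R instead of going back to Excel. A small sketch, using made-up prices and base R's gsub() and as.numeric():

```r
# Prices as they might arrive from a data service (made-up values)
raw_price <- c("$140,000", "$205,000", "$89,500")

# Remove every dollar sign and comma, then convert the text to numbers
sprice <- as.numeric(gsub("[$,]", "", raw_price))

sprice
mean(sprice)
```

The pattern "[$,]" matches either a dollar sign or a comma, and gsub() replaces every match with nothing. If the column was read in as a factor, wrapping it in as.character() before gsub() makes the intent explicit.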

Finally, look at the date columns and make sure they conform to "MM/DD/YY" format (month first, then day, then a two-digit year). You can use any format you want, but I'll be explaining this on the basis of the format shown.

Scroll through the data from top to bottom quickly. You're looking for artifacts, extra rows, formatting glitches, or anything else that isn't simple, straightforward ASCII numbers and letters.

Now, save that file as MyData.csv, and choose "," for a separator. You're finished with the first step. In the next post, we'll examine how to pull this file into R and run some simple analyses.
Jim Plante
Jim Plante
Certified General
 
Posts: 2343
Joined: Sat Aug 11, 2007 1:51 am
Location: Selmer, TN

Re: Using R for statistical analysis

Postby Jim Plante on Tue Nov 18, 2008 12:26 pm

I assume that you've now downloaded and installed R. You may have started the program and are wondering what in hell to do with it.
I've got some work to do now, and I'll post instructions later. For now, read the help file a little to get familiar with a few of the program's quirks.
If you run R, you'll see the command window on the screen. There's a ">" prompt at the bottom.
Click after that ">" prompt and type "demo("graphics")" like so:
Code: Select all
> demo ("graphics")
Follow the prompts at the bottom of the command window. (Mostly, it's just hitting "return" to see the next plot.) This demo will show you some of R's fantastic graphing capabilities. **Note: To cancel any command, use the "esc" key.

The next few posts will deal with loading your .csv file into R as a data frame; creating a few histograms; creating a scatterplot; and running a regression line through the plot. Later on we'll run a linear regression and examine the product of the linear model. If you want to preview this, the command is "lm". Type ?lm at the command prompt, and the help topic will display. You might also try ?plot. It's not going to be much help without an explanation, but you'll see how *some* of the stuff is created.

Re: Using R for statistical analysis

Postby Jim Plante on Wed Nov 19, 2008 11:22 am

In order to provide consistency, I'm attaching a data file from one of my neighborhoods. Instead of MyData.csv, its name is AllProperty.csv. (I probably ought to change that, but it'll do for now.) You can use your own .csv if you like, but I'll be explaining this process in terms of AllProperty.csv. It contains all parcels of property, regardless of type, value, or sale date, in one of my defined rural neighborhoods. Download the file, and figure out where it is stored on your file system. (It's probably in your "Downloads" directory if you have one; if not, figure out where it is using Explorer.)

If R is not running, start it. At the command prompt, type the following:
Code: Select all
prop=read.delim(file.choose(), header=TRUE, sep=",")
This stuff IS case-sensitive. "Header" is not the same as "header"; TRUE is a logical value, "true" is not.

What you just did was to create a data frame named "prop," whose contents are the same as your data file. The "header=TRUE" part tells R that the data file you chose has column names on the first line. The 'sep = "," ' part tells R that the data fields (columns) are separated by commas. (.csv = "comma-separated values"; the comma is the character which separates the values.)
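For what it's worth, base R also has read.csv(), which already uses header=TRUE and sep="," as its defaults. The sketch below writes a tiny made-up file to a temporary location (so it runs without a file chooser) and reads it both ways; the column values are invented for illustration.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("sprice,gla,ptype",
             "205000,2194,Residential",
             "140000,1500,Residential"), tmp)

# The command from the post, with a file path instead of file.choose()
a <- read.delim(tmp, header = TRUE, sep = ",")

# read.csv() has header = TRUE and sep = "," as its defaults
b <- read.csv(tmp)

# Compare the two results
identical(a, b)
```

Either form is fine; read.csv() just saves some typing.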

If you got a syntax error, you typed the command incorrectly. Just hit the "up-arrow" key at the command prompt to bring back the last command. Then you can use the left- and right-arrow keys and the backspace key to repair the command. Hit return again when you think you've got it. In the alternative, simply click the "Select All" link at the top of the code box above, copy the code, and paste it into R's command window.

Now, don't conclude that this is already too much trouble. You're going to save that command, and several others, into a script. In the future, you'll simply run the script, choose the file, and the data frame will be loaded automatically.
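As a rough preview of what such a script could contain, the load step can be wrapped in a small function. This is only a sketch: load_prop is a hypothetical name, not something built into R, and the temporary file is made up so the example can run without a file dialog.

```r
# load_prop() is a hypothetical helper name, not part of R.
# With no argument it opens the file chooser; with a path it reads that file.
load_prop <- function(path = file.choose()) {
  read.delim(path, header = TRUE, sep = ",")
}

# Interactive use would be:  prop <- load_prop()
# Demonstration with a made-up temporary file, so no dialog is needed:
tmp <- tempfile(fileext = ".csv")
writeLines(c("sprice,gla", "205000,2194"), tmp)
prop_demo <- load_prop(tmp)
prop_demo
```

Saved in a text file, a function like this can be brought into any session with source(), followed by a single call.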

Let's see what we've got so far. What are the column names that R recognized when it read your .csv file? The command to get that is "names(frame name)," and you use it like so:
Code: Select all
> names(prop)
[1] "X"                "Parcel.ID"        "Property.Address"
[4] "sprice"           "sdate"            "gla"             
[7] "acres"            "yrbuilt"          "appraisal"       
[10] "ptype"            "itype"            "use"             
[13] "stories"          "extwall"         
Here you see the column names that I applied to the spreadsheet before I saved it as a .csv file. The "X" means that there wasn't any name for that column. (It's the line number in the .csv file. I should have removed it before saving.)
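Two other base R commands are handy at this point, str() and head(), and a leftover column such as "X" can be dropped without going back to the spreadsheet. A minimal sketch on a made-up three-row frame:

```r
# Made-up frame with the same kind of unnamed first column
prop_demo <- data.frame(X = 1:3,
                        sprice = c(205000, 140000, 89500),
                        gla = c(2194, 1500, 1100))

str(prop_demo)    # one line per column: name, type, first few values
head(prop_demo)   # the first rows, to eyeball the data

# Drop the leftover line-number column by setting it to NULL
prop_demo$X <- NULL
names(prop_demo)
```

str() is the quickest way to spot a price column that came in as text instead of numbers.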

I've gotta run for now. Try these, one at a time:
Code: Select all
# prop$ tells R to look for the yrbuilt column inside the prop data frame
hist(prop$yrbuilt)
hist(prop$yrbuilt, col="red")
hist(prop$yrbuilt, breaks=20, col="blue")

We'll do some more exploring later.
Attachments
AllProperty.csv
(78.41 KiB) Downloaded 6 times

Re: Using R for statistical analysis

Postby Chris H on Wed Nov 19, 2008 4:47 pm

OK, got in and it looks "cool" :lol:

Thanks, it could be a great forum to share some of the stuff we have found to be useful and critique some of the things we've found worthless.

I will give the software a try in the next couple of days.
Wheresoever you go, go with all your heart. Confucius
Chris H
Certified Residential
 
Posts: 547
Joined: Mon Aug 13, 2007 10:03 am
Location: Utah

Re: Using R for statistical analysis

Postby Jim Plante on Wed Nov 19, 2008 9:47 pm

One of the advantages of R is that it's both interactive, meaning you can run one-off plots just to check things out, and scriptable, so that you don't have to do things over and over. You folks keep in mind that I can make this thing work, but I'm by no means an expert on it. The commands you type in are actually a programming language. R is a dialect of a language called S; some of its syntax looks like C, but it's a separate language, and it supports object-oriented programming. Bet you didn't know you were that smart, did you?

Anyway, I'm going to walk through the Neighborhood section of the URAR, establishing support for the land-use section; and the age-value (Hi, lo, predominant) section using R's graphics. We'll look at the price/GLA over time, and show how to handle dates. We'll use that one to do a scatterplot, and we'll overplot it with a regression line. We'll also examine the output of that linear regression to see what R can tell us. I'm hoping PC will dive in here and tell us a little more about that output, and how to interpret it.

When we've finished, we'll go back through the console log (that thing you're typing the commands in) and extract the statements that worked. We'll put those in a separate document, tweak it a little, and run the whole analysis on a different data file--but with only a single command and a file choice.

Business has picked up here, so I won't have unfettered time in which to do this. And you don't want it all at once, trust me. You can Google for "R tutorial" and find several good tutorials written by professional teachers. I, well, I have the power to cloud men's minds. Women's too, but for different purposes. I'm going to walk through elements of a market analysis using this thing, in order to show how to apply it to our uses.

There are three histogram commands to try shown at the end of the last post. Here's a better way:
Code: Select all
hist(2008-prop$yrbuilt[prop$ptype=="Residential"], col="blue", breaks=20, main="Age of Properties", xlab="Age (Years)")
Don't worry if it all won't fit on one line; just keep typing. The editor will wrap the text as necessary; do NOT insert a line break or carriage return to make it fit. Look at that histogram, and tell me what you'd report as the predominant age.

The command breaks down like this: hist is the command for a histogram plot. Type ?hist to see the full monty on what that command can do. It'll make your head hurt, so be careful. It's written by academics, for academics. The arguments (the part in the outer parentheses) tell hist what to plot. In this case, I want a plot of the age of the improvements, so I subtract the year built (yrbuilt) from 2008. But I want only the residential properties plotted, so I tell it to use the yrbuilt values only where the property type (ptype) is equal to (==) "Residential". Observe the way that condition is contained in brackets. The prop$ prefix tells R to look for each column inside the prop data frame. The "breaks=20" asks the hist function to break the data into about 20 sections, or bins; R treats the number as a suggestion and may adjust it to get tidy bin edges. The main="Age of Properties" tells the function to put the quoted stuff at top center as the title of the plot. The "xlab" option tells it to label the x-axis "Age (Years)". We could also tell it to label the y-axis by putting a comma after the xlab="Age (Years)", and adding ylab="Number of Properties", for example.
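Here is the same idea as a self-contained sketch, with the y-axis label added and a couple of numeric summaries to set beside the picture. The nine parcels are made up for illustration; with the real file you would use the prop data frame instead of the hypothetical parcels frame.

```r
# Made-up data: year built and property type for a handful of parcels
parcels <- data.frame(
  yrbuilt = c(1955, 1962, 1978, 1978, 1985, 1999, 2003, 1990, 1972),
  ptype   = c("Residential", "Residential", "Residential", "Farm",
              "Residential", "Residential", "Residential", "Commercial",
              "Residential"))

# Age as of 2008, residential parcels only
age <- 2008 - parcels$yrbuilt[parcels$ptype == "Residential"]

# Same style of histogram as above, with a y-axis label added
h <- hist(age, col = "blue", breaks = 20, main = "Age of Properties",
          xlab = "Age (Years)", ylab = "Number of Properties")

range(age)    # low and high ages
median(age)   # one candidate for a "typical" age
```

The median is only one way to put a number on "predominant"; the tallest bar of the histogram is another, and the two need not agree.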

Re: Using R for statistical analysis

Postby Jim Plante on Sat Nov 29, 2008 11:16 am

Well, Thanksgiving has come and gone; leftovers are being nibbled; and guests have returned home. Time to get back to work.

If any of you have tried to plot anything with the date of sale, you have not been successful. That's because R doesn't know a date from diddly until you tell it otherwise. It thinks the sdate column is a factor, or category. In other words, it thinks those dates are something like quality ratings: Low, Fair, Average, Good, Excellent. Even if you changed those quality ratings to 1, 2, 3, 4, and 5, you still could not logically do arithmetic on them, because they'd be ordinal numbers, and not a quantity. (N.B.: Fanning does exactly that in his book on market analysis; I strongly suspect that this practice is incorrect.)
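A short sketch of what a factor is under the hood may help here. The ratings and dates below are made up; factor() and as.integer() are base R.

```r
# Quality ratings stored as an ordered factor
quality <- factor(c("Good", "Low", "Average", "Excellent", "Fair"),
                  levels = c("Low", "Fair", "Average", "Good", "Excellent"),
                  ordered = TRUE)

quality[1] > quality[2]   # ordering comparisons are allowed
as.integer(quality)       # the underlying codes are ranks, not quantities

# Dates stored as a factor are just labels too
sdate <- factor(c("11/9/07", "3/15/08"))
class(sdate)
```

The integer codes only record which level each value belongs to, which is why averaging them, or plotting them against price as if they were measurements, is questionable.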

To remedy this, we'll need to convert the "sdate" column from whatever it is into a date vector. The function that does the conversion, as.Date(), is part of base R, so the conversion code below runs without installing anything. The package steps that follow are needed only if you want to try the lower-case is.date() and as.date() functions discussed at the end of this post. On the Mac, the Package Installer is found on the main menu bar under Packages & Data. If you're on Windows, your program should have a similar choice. Select the Package Installer, and tell it to update the list. Mine has a button that says "Get List".

Having done that, scroll down until you see a package labeled "survival." Select it by clicking on it, and then below the list, select the button to install it at system level (not at user level). Make sure the little box is checked to "Install dependencies". Then click the "Install Selected" button.

To use an installed package, you also have to load it. Choose "Packages & Data->Package Manager" from the main menu bar. You'll get a window that shows you all the packages available to you. Scroll down to the "survival" package, and click the "Loaded" box. With or without the package, you can do the conversion with the code below:

Code: Select all
d=as.Date(prop$sdate, "%m/%d/%y")
This tells R to take the "sdate" vector and convert everything it finds there into dates, if it can. The part in quotes "%m/%d/%y" tells it to expect to find the values in a format of month first, separated by a slash "/", then the day, slash, and two-digit year. If the year were four digits, you would use a capital-Y, like so: %m/%d/%Y. We tell R to put all this into a temporary container, which I have called "d".
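The difference between the two year codes is easiest to see side by side. A minimal sketch with made-up values, using only base R's as.Date():

```r
# Two-digit year: lower-case %y
d2 <- as.Date("11/9/07", "%m/%d/%y")
d2

# Four-digit year: capital %Y
d4 <- as.Date("11/09/2007", "%m/%d/%Y")
d4

# Text that does not match the format comes back as NA, not as an error
bad <- as.Date("Selmer, TN", "%m/%d/%y")
is.na(bad)
```

Because a failed conversion produces NA quietly, running sum(is.na(d)) after the real conversion is a cheap way to count how many values did not convert.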

Now, let's add that date vector to our original data frame:
Code: Select all
prop=transform(prop, SaleDate=d)
Here, we've told R that the data frame we named "prop" is to be transformed; it is to add another column named "SaleDate" to the data frame, and to take its values from "d".
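The same column can also be added with a plain assignment using the $ operator. This sketch uses a made-up two-row frame (prop_demo is a hypothetical name) and compares the date column produced each way.

```r
# Made-up frame and date vector
prop_demo <- data.frame(sprice = c(205000, 140000),
                        sdate  = c("11/9/07", "3/15/08"))
d <- as.Date(prop_demo$sdate, "%m/%d/%y")

# The approach from the post
via_transform <- transform(prop_demo, SaleDate = d)

# An equivalent assignment using the $ operator
via_dollar <- prop_demo
via_dollar$SaleDate <- d

# Both frames end up with the same new column
identical(via_transform$SaleDate, via_dollar$SaleDate)
```

Which one to use is mostly a matter of taste; transform() reads well when several columns are being added at once.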

After you've executed those statements, try
Code: Select all
prop$sdate
prop$SaleDate
See the difference?

Now try these:
Code: Select all
class(prop$sdate)
class(prop$SaleDate)
You'll find that class(prop$sdate) returns "factor" (or "character" in R 4.0 and later, which no longer turns text into factors when it reads a file), and class(prop$SaleDate) returns "Date". Classes are related to object-oriented programming, which is how R's functions operate. Now, this is an open source program, which means that anybody can work on it. The source code (what the programmers wrote) is available for download. Users can also write their own packages. One consequence is that two packages can use nearly identical names for different things. Here's a case that looks like a bug but isn't:
Code: Select all
x=as.Date("2008-01-08", "%Y-%m-%d")
x
# [1] "2008-01-08"
is.date(x)
# [1] FALSE
class(x)
# [1] "Date"

You might expect "is.date()" to return "TRUE". It doesn't, but R still knows x is a date, as evidenced by "class(x)" returning "Date". The explanation is that names in R are case-sensitive, and there are two different date classes in play. The as.Date function (capital "D") that you used in the conversion above is in the base package, and it creates objects of class "Date". The is.date and as.date functions (note the lower-case "d") are not in the base package; they come with the add-on package you loaded earlier, and they work with that package's own, older "date" class. (I believe newer releases have moved them out of survival into a separate package named "date".) So is.date(x) answers FALSE because x is a "Date" and not a "date". For our purposes, stick with as.Date() and leave the lower-case functions alone.
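A check that relies only on base R, with no add-on package involved, is inherits(). A minimal sketch:

```r
x <- as.Date("2008-01-08", "%Y-%m-%d")

class(x)                          # the class R attached to x
inherits(x, "Date")               # is x a base-R Date object?
inherits("2008-01-08", "Date")    # plain text is not a Date
```

inherits() simply asks whether the named class appears in class(x), so it behaves the same whether or not any packages are loaded.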

Re: Using R for statistical analysis

Postby Jim Plante on Sat Nov 29, 2008 11:39 am

Now that we've got ourselves a real date vector in that prop data frame, let's explore it a little. I'm getting tired of typing prop$ every time I want something out of the data frame, so let's fix that:
Code: Select all
attach(prop)
Now we can simply use the column names alone, like so:
Code: Select all
sdate
SaleDate
Each of those will crank out all 710 values found in their respective columns. Let's compare the date in row #701 with that in row #707:
Code: Select all
SaleDate[701]
# [1] "1994-09-06"
SaleDate[707]
# [1] "2005-10-25"
SaleDate[701] > SaleDate[707]
# [1] FALSE
SaleDate[701] < SaleDate[707]
# [1] TRUE
SaleDate[707] - SaleDate[701]
# Time difference of 4067 days
I've added hash marks to the output portion of this so you can copy and paste the code into R's command window. You can see that it will compare the dates accurately, and that it performs date arithmetic on them correctly. How about addition and subtraction? Try this:
Code: Select all
SaleDate[707]-90
#[1] "2005-07-27"
Row 707 has October 25, 2005 as a date. Subtracting 90 calendar days from it yields July 27, 2005--the right answer.
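The same arithmetic can be tried without the data file. The sketch below reuses the two dates shown above and adds one appraisal-flavored use: keeping only the sales within twelve months of an effective date. The three sale dates and the effective date are made up for illustration.

```r
first <- as.Date("1994-09-06")
last  <- as.Date("2005-10-25")

last - first                  # a "difftime" object, measured in days
as.numeric(last - first)      # the bare number of days
last - 90                     # 90 calendar days earlier

# Months are not a fixed number of days, so step by month with seq()
seq(last, by = "-3 months", length.out = 2)[2]

# Keep only the sales in the 12 months before an effective date (made-up data)
sales     <- as.Date(c("2004-06-01", "2005-01-15", "2005-09-30"))
effective <- as.Date("2005-10-25")
cutoff    <- seq(effective, by = "-1 year", length.out = 2)[2]
recent    <- sales[sales > cutoff & sales <= effective]
recent
```

Subtracting a plain number always counts calendar days, so "three months back" and "90 days back" usually land on different dates; seq() with a "by" string is the way to step by whole months or years.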

Re: Using R for statistical analysis

Postby Jim Plante on Sun Nov 30, 2008 10:52 am

Okay, my posts to this thread are going to be kind of sparse and irregular. Those of you who want to forge ahead can read the files attached below. The guys who wrote these are professional educators, and can communicate this material much more clearly than I can. If you run into something you don't understand, ask here.
Attachments
R-intro.pdf
Start with this one. It's 100 pages, and explains the basics of R's operations.
(660.63 KiB) Downloaded 4 times
Farnsworth-EconometricsInR.pdf
This one repeats some of the stuff in the R-intro file, but is well worthwhile. It covers some of the stuff we'll be doing. 75 pages.
(460.19 KiB) Downloaded 4 times
