Titanic: Machine Learning from Disaster
Getting Started With Python
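There are many kinds of languages, each with its own advantages and disadvantages. Here we will use Python version 2.7, an easy-to-use scripting language. If you have not installed Python yet, go to the Python web site (https://www.python.org/downloads/) and follow the instructions there. Alternatively, Anaconda (https://www.continuum.io/downloads) comes with libraries that are very useful for data science already bundled. (A further benefit of Anaconda is that it includes iPython (interactive Python), which provides an interface that is convenient for running code one line at a time.)
In either case: if you use Python version 3.x, the syntax differs from the 2.x series. For how to deal with the errors this causes in the tutorial, see the forum thread (https://www.kaggle.com/c/titanic/forums/t/4937/titanic-problem-with-getting-started-with-python).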
To begin, type python, or ipython, or ipython notebook.
One big advantage of Python is its packages. Of these packages, the most useful for Kaggle competitions are Numpy, Scipy, Pandas, matplotlib, and csv. To check whether you have them, type
import numpy
Python: Reading train.csv
Python has a good csv reader that reads a csv file into memory. You can read each row into a list and then quickly turn the result into an array, as shown below.
# The first thing to do is to import the relevant packages
# that I will need for my script,
# these include the Numpy (for maths and arrays)
# and csv for reading and writing csv files
# If I want to use something from this I need to call
# csv.[function] or np.[function] first
import csv as csv
import numpy as np

# Open up the csv file into a Python object
csv_file_object = csv.reader(open('../csv/train.csv', 'rb'))
header = csv_file_object.next()  # The next() command just skips the
                                 # first line which is a header
data=[]                          # Create a variable called 'data'.
for row in csv_file_object:      # Run through each row in the csv file,
    data.append(row)             # adding each row to the data variable
data = np.array(data)            # Then convert from a list to an array
                                 # Be aware that each item is currently
                                 # a string in this format
Typing
print data
displays the following:
[['1' '0' '3' ..., '7.25' '' 'S']
 ['2' '1' '1' ..., '71.2833' 'C85' 'C']
 ['3' '1' '3' ..., '7.925' '' 'S']
 ...,
 ['889' '0' '3' ..., '23.45' '' 'S']
 ['890' '1' '1' ..., '30' 'C148' 'C']
 ['891' '0' '3' ..., '7.75' '' 'Q']]
Looking at this, there is no header, only values. You can also see that each value is enclosed in quotes, which means each value is stored as a string. Unfortunately, part of the data is shown as "...". To see the whole first row, type
print data[0]
['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171' '7.25' '' 'S']
To see the last row, type
print data[-1]
['891' '0' '3' 'Dooley, Mr. Patrick' 'male' '32' '0' '0' '370376' '7.75' '' 'Q']
To see the first row, fourth column, type
print data[0,3]
Braund, Mr. Owen Harris
# The size() function counts how many elements are
# in the array and sum() (as you would expect) sums up
# the elements in the array.
number_passengers = np.size(data[0::,1].astype(np.float))
number_survived = np.sum(data[0::,1].astype(np.float))
proportion_survivors = number_survived / number_passengers
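Numpy has some handy functions. For example, we can search the gender column, find which elements equal "female" (or male, meaning those that do not equal female), and then determine how many women and how many men survived.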
women_only_stats = data[0::,4] == "female"  # This finds all the
                                            # elements in the gender
                                            # column that equal "female"
men_only_stats = data[0::,4] != "female"    # This finds all the
                                            # elements that do not equal
                                            # female (i.e. male)

# Using the index from above we select the females and males separately
women_onboard = data[women_only_stats,1].astype(np.float)
men_onboard = data[men_only_stats,1].astype(np.float)

# Then we find the proportions of them that survived
proportion_women_survived = \
                       np.sum(women_onboard) / np.size(women_onboard)
proportion_men_survived = \
                       np.sum(men_onboard) / np.size(men_onboard)

# and then print it out
print 'Proportion of women who survived is %s' % proportion_women_survived
print 'Proportion of men who survived is %s' % proportion_men_survived
test_file = open('../csv/test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()

prediction_file = open("genderbasedmodel.csv", "wb")
prediction_file_object = csv.writer(prediction_file)

prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:       # For each row in test.csv
    if row[3] == 'female':         # is it a female, if yes then
        prediction_file_object.writerow([row[0],'1'])    # predict 1
    else:                          # or else if male,
        prediction_file_object.writerow([row[0],'0'])    # predict 0
test_file.close()
prediction_file.close()
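On the Data page (translator's note: the place where you can download test.csv and the other files), you can download a file named 'gendermodel.py' that contains these steps. One advantage of using Python is that, if you obtain new training data for example, you can quickly run all of the steps so far again.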
Getting Started with Python: Kaggle's Titanic Competition
Recapping our work with Excel: we have been able to successfully download the
competition data and submit 2 models: one based on just the gender and
another on the gender, class, and price of ticket. This is good for an
initial submission; however, problems arise when we move to slightly more
complicated models, because formulating approaches in Excel takes longer
and longer. What do you do if you want to make a more complicated
model but don't have the time to manually find the proportion of
survivors for a given variable? We should make the computer do the hard
work for us!
I want to add more variables but it takes so much time!
Scripts are a great way to speed up your calculations and avoid the
arduous task of manually calculating the pivot table ratios. There are
many languages out there, each with its own advantages and
disadvantages. Here we are going to use Python version 2.7, which is an
easy-to-use scripting language. If you do not have this installed,
please visit the Python web site and follow the instructions there. Alternatively, you could install a specific distribution of Python called Anaconda, which
already bundles the most useful libraries for data science. (Another
advantage of Anaconda is that it includes iPython (Interactive Python),
which makes it easier to step through lines of
code one by one.)
In either case: if you use Python version 3.x, you may discover that some
Python syntax has changed in that version, which can cause errors
in this tutorial, as people point out in the forum.
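For readers on Python 3.x, the changes that most often break the code in this tutorial are that print became a function, reader.next() became next(reader), and csv files are opened in text mode with newline='' rather than with 'rb' or 'wb'. A minimal sketch, assuming Python 3 and a small in-memory sample (the sample string and its two data rows are placeholders for illustration, not the real train.csv):

```python
import csv
import io

# Placeholder data standing in for a csv file on disk (assumption for illustration).
sample = "PassengerId,Survived,Sex\n1,0,male\n2,1,female\n"

# With a real file on Python 3 you would write
#     open('../csv/train.csv', newline='')
# instead of open('../csv/train.csv', 'rb').
csv_file_object = csv.reader(io.StringIO(sample))

header = next(csv_file_object)   # Python 2: csv_file_object.next()
rows = [row for row in csv_file_object]

print(header)                    # Python 2: print header
print(rows)
```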
When you have things installed, to begin just type python, or ipython, or ipython notebook.
One of the great advantages of Python is its packages. Of these packages,
the most useful (for Kaggle competitions) are the Numpy, Scipy, Pandas,
matplotlib and csv packages. To check whether you have these,
just go to your Python command line and type
import numpy
(and so on). If an import fails, you will need to install that package. This tutorial is going to
guide you through making the same submissions as before, only this time
using Python.
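Instead of typing each import by hand, the check can be scripted. This is only a sketch, assuming Python 3.4 or later; the names packages and available are made up for this example.

```python
import importlib.util

# Packages named in the tutorial; csv ships with Python, the rest are third-party.
packages = ["numpy", "scipy", "pandas", "matplotlib", "csv"]

# find_spec returns None when a top-level package cannot be found.
available = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, found in available.items():
    print("%-12s %s" % (name, "installed" if found else "MISSING"))
```

Anything reported as MISSING needs to be installed before the rest of the tutorial will run.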
Python: Reading in your train.csv
Python has a nice csv reader, which reads each line of a file into memory. You
can read in each row and just append it to a list. From there, you can
quickly turn it into an array.
# The first thing to do is to import the relevant packages
# that I will need for my script,
# these include the Numpy (for maths and arrays)
# and csv for reading and writing csv files
# If I want to use something from this I need to call
# csv.[function] or np.[function] first
import csv as csv
import numpy as np

# Open up the csv file into a Python object
csv_file_object = csv.reader(open('../csv/train.csv', 'rb'))
header = csv_file_object.next()  # The next() command just skips the
                                 # first line which is a header
data=[]                          # Create a variable called 'data'.
for row in csv_file_object:      # Run through each row in the csv file,
    data.append(row)             # adding each row to the data variable
data = np.array(data)            # Then convert from a list to an array
                                 # Be aware that each item is currently
                                 # a string in this format
Although you've seen this data before in Excel, just to be sure let's look at how it is stored now in Python. Type
print data
and the output should be something like
[['1' '0' '3' ..., '7.25' '' 'S']
 ['2' '1' '1' ..., '71.2833' 'C85' 'C']
 ['3' '1' '3' ..., '7.925' '' 'S']
 ...,
 ['889' '0' '3' ..., '23.45' '' 'S']
 ['890' '1' '1' ..., '30' 'C148' 'C']
 ['891' '0' '3' ..., '7.75' '' 'Q']]
You can see this is an array with just values (no descriptive header). You
can also see that each value is shown in quotes, which means it is
stored as a string. Unfortunately, in the output above the full set of
columns is obscured by "...", so let's print the first row to
see it clearly. Type
print data[0]
['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171' '7.25' '' 'S']
and to see the last row, type
print data[-1]
['891' '0' '3' 'Dooley, Mr. Patrick' 'male' '32' '0' '0' '370376' '7.75' '' 'Q']
and to see the 1st row, 4th column, type
print data[0,3]
Braund, Mr. Owen Harris
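The same indexing can be tried without the full file. Here is a small sketch, assuming Python 3 with NumPy installed, that builds an array from only the first and last rows printed above:

```python
import numpy as np

# The first and last rows of train.csv, copied from the output shown above.
data = np.array([
    ['1', '0', '3', 'Braund, Mr. Owen Harris', 'male', '22', '1', '0',
     'A/5 21171', '7.25', '', 'S'],
    ['891', '0', '3', 'Dooley, Mr. Patrick', 'male', '32', '0', '0',
     '370376', '7.75', '', 'Q'],
])

print(data.shape)    # (number of rows, number of columns)
print(data[0])       # first row
print(data[-1])      # last row (negative indices count from the end)
print(data[0, 3])    # first row, fourth column (the name)
```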
I have my data, now I want to play with it
Now if you want to call a specific column of data, say, the gender column, you can just type
data[0::,4]
remembering that "0::" means all rows (from start to end), and that Python starts
indices from 0 (not 1). You should be aware that the csv reader works
by default with strings, so you will need to convert to floats in order
to do numerical calculations. For example, you can turn the Pclass
variable into floats by using
data[0::,2].astype(np.float)
Using this, we can calculate the proportion of survivors on the Titanic:
# The size() function counts how many elements are
# in the array and sum() (as you would expect) sums up
# the elements in the array.
number_passengers = np.size(data[0::,1].astype(np.float))
number_survived = np.sum(data[0::,1].astype(np.float))
proportion_survivors = number_survived / number_passengers
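One caveat for current installations: np.float was only an alias for the builtin float and has been removed from recent NumPy releases (1.24 and later), where .astype(float) is the equivalent call. The sketch below assumes Python 3 and uses just the first three columns of the six rows visible in the printed output above, so the result describes those six rows only, not the whole training set.

```python
import numpy as np

# PassengerId, Survived, Pclass for the six rows visible in the printed output.
data = np.array([
    ['1', '0', '3'],
    ['2', '1', '1'],
    ['3', '1', '3'],
    ['889', '0', '3'],
    ['890', '1', '1'],
    ['891', '0', '3'],
])

survived = data[0::, 1].astype(float)   # column 1, all rows, converted to floats

number_passengers = np.size(survived)
number_survived = np.sum(survived)
proportion_survivors = number_survived / number_passengers

print(proportion_survivors)
```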
Numpy has some lovely functions. For example, we can search the gender
column, find where any elements equal female (and for males, 'do not
equal female'), and then use this to determine the number of females and
males that survived:
women_only_stats = data[0::,4] == "female"  # This finds all the
                                            # elements in the gender
                                            # column that equal "female"
men_only_stats = data[0::,4] != "female"    # This finds all the
                                            # elements that do not equal
                                            # female (i.e. male)
We use these two new variables as a "mask" on our original train data,
so we can select only the women and only the men on board, then
calculate the proportion of each group who survived:
# Using the index from above we select the females and males separately
women_onboard = data[women_only_stats,1].astype(np.float)
men_onboard = data[men_only_stats,1].astype(np.float)

# Then we find the proportions of them that survived
proportion_women_survived = \
                       np.sum(women_onboard) / np.size(women_onboard)
proportion_men_survived = \
                       np.sum(men_onboard) / np.size(men_onboard)

# and then print it out
print 'Proportion of women who survived is %s' % proportion_women_survived
print 'Proportion of men who survived is %s' % proportion_men_survived
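To see what a mask actually contains, it helps to print one for a tiny array. The sketch below assumes Python 3 with NumPy. The four rows are made up purely for illustration and are not taken from train.csv, and the variable names are chosen for this example.

```python
import numpy as np

# Made-up rows (Survived, Sex) purely for illustration; not taken from train.csv.
toy = np.array([
    ['1', 'female'],
    ['0', 'male'],
    ['0', 'female'],
    ['0', 'male'],
])

women_mask = toy[:, 1] == "female"    # array of True/False, one entry per row
men_mask = ~women_mask                # logical NOT of the mask

# Indexing with a mask keeps only the rows where the mask is True.
women_survived = toy[women_mask, 0].astype(float)
men_survived = toy[men_mask, 0].astype(float)

print(women_mask)
print(women_survived.mean())          # mean of 0/1 values = proportion that survived
print(men_survived.mean())
```

Taking the mean of the 0/1 values gives the same number as dividing np.sum by np.size.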
Now that I have my indication that women were much more likely to survive, I am done with the training set.
Reading the test data and writing the gender model as a csv
As before, we need to read in the test file by opening a python object to read and another to write. First, we read in the test.csv file and skip the header line:
test_file = open('../csv/test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()
Now let's open a pointer to a new file so we can write to it (this file
does not exist yet). Call it something descriptive so that it is
recognizable when we upload it:
prediction_file = open("genderbasedmodel.csv", "wb")
prediction_file_object = csv.writer(prediction_file)
We now want to read in the test file row by row, see if it is female or male, and write our survival prediction to a new file.
prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:       # For each row in test.csv
    if row[3] == 'female':         # is it a female, if yes then
        prediction_file_object.writerow([row[0],'1'])    # predict 1
    else:                          # or else if male,
        prediction_file_object.writerow([row[0],'0'])    # predict 0
test_file.close()
prediction_file.close()
Now you have a file called 'genderbasedmodel.csv', which you can submit!
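For Python 3 users, the read-predict-write loop can be sketched as a small function that works on any pair of text streams. This is an illustration rather than the tutorial's own script: the function name write_gender_predictions is hypothetical, and the two input rows are made up so the example runs without test.csv.

```python
import csv
import io

# Made-up stand-in for test.csv (PassengerId, Pclass, Name, Sex); illustration only.
test_text = (
    "PassengerId,Pclass,Name,Sex\n"
    "1001,3,\"Doe, Mr. John\",male\n"
    "1002,1,\"Doe, Mrs. Jane\",female\n"
)

def write_gender_predictions(test_stream, prediction_stream):
    """Read test rows and write PassengerId,Survived using the gender rule."""
    reader = csv.reader(test_stream)
    next(reader)                                  # skip the header row
    writer = csv.writer(prediction_stream, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for row in reader:
        # Column 3 is Sex in this layout: predict 1 for female, 0 otherwise.
        writer.writerow([row[0], "1" if row[3] == "female" else "0"])

output = io.StringIO()
write_gender_predictions(io.StringIO(test_text), output)
print(output.getvalue())
```

With real files on Python 3, the two streams would come from open('../csv/test.csv', newline='') and open('genderbasedmodel.csv', 'w', newline=''), ideally inside a with statement so both files are closed automatically.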
On the Data page you will find all of the steps above in a single Python
script named 'gendermodel.py'. One advantage of Python is that you can
quickly run all of the steps again in the future, for example if you
receive a new training file.