Machine learning: how do you load the Titanic dataset with fetch_openml?


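The sklearn package does not ship a titanic.csv file of its own. However, the fetch_openml() function gives access to many more datasets, and the titanic dataset is one of them. On top of that, the OpenML website hosts several versions of this Titanic dataset. So how do you tell those versions apart when loading from OpenML? That is the question this article discusses.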

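Hello everyone, this is 苏南大叔's "Hackathon Capture-the-Flag" blog, which tells the story of 苏南大叔 and computer code. The subject of this article is the Titanic dataset. Test environment: python@3.6.8, pandas@1.1.5, scikit-learn@1.3.2.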

The Titanic dataset

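First, like the iris dataset, the Titanic dataset exists in several versions, and different websites provide different data sources. This article focuses on the two versions hosted on the OpenML website; they are its main experimental subjects. Note that the source format here is neither csv nor txt but .arff, so the usual csv loading approach cannot parse the data.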
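To make the ARFF layout concrete, here is a minimal sketch that reads a tiny hand-written sample using only the standard library. The sample text and the helper name parse_simple_arff are assumptions for illustration; the parser only handles dense data with simple attribute names and is not a substitute for the real loader inside fetch_openml().

```python
import csv
import io

# Hand-written sample in ARFF layout (not the real OpenML file).
SAMPLE_ARFF = """\
% comment lines start with a percent sign
@RELATION titanic_sample
@ATTRIBUTE Class NUMERIC
@ATTRIBUTE Age NUMERIC
@ATTRIBUTE Sex NUMERIC
@ATTRIBUTE class {-1,1}
@DATA
-1.87,-0.228,0.521,-1
-0.923,-0.228,-1.92,1
"""


def parse_simple_arff(text):
    """Split a dense ARFF string into (attribute names, rows of strings)."""
    names, data_lines, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            data_lines.append(line)
        elif line.lower().startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name
            names.append(line.split(None, 2)[1].strip("'\""))
        elif line.lower().startswith("@data"):
            in_data = True  # everything after @data is csv-like
    rows = list(csv.reader(io.StringIO("\n".join(data_lines))))
    return names, rows


names, rows = parse_simple_arff(SAMPLE_ARFF)
print(names)
print(rows[0])
```

The header block is the part a plain csv reader trips over: only the lines after @DATA look like csv.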

For the meaning of the Titanic dataset fields, see this article:

https://xctf.net/say/titanic.html

The parameters and return value of fetch_openml() are analogous to those described in:

https://xctf.net/say/sklearn-load_iris.html

Titanic dataset version 1, 14 fields

Documentation:

https://www.openml.org/search?type=data&sort=runs&id=40945&status=active

Download URL:

https://www.openml.org/data/download/16826755/phpMYEkMl

Data fields:

@attribute 'pclass' numeric
@attribute 'survived' {0,1}
@attribute 'name' string
@attribute 'sex' {'female','male'}
@attribute 'age' numeric
@attribute 'sibsp' numeric
@attribute 'parch' numeric
@attribute 'ticket' string
@attribute 'fare' numeric
@attribute 'cabin' string
@attribute 'embarked' {'C','Q','S'}
@attribute 'boat' string
@attribute 'body' numeric
@attribute 'home.dest' string
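As a quick way to see which of these 14 fields are numeric, nominal, or free text, the declarations above can be grouped by their ARFF type. This is only a sketch: the helper name group_by_type is made up for illustration, and it assumes every line has the simple three-part form shown above.

```python
from collections import defaultdict

# The 14 attribute declarations of version 1, copied from the listing above.
ATTRIBUTES = """\
@attribute 'pclass' numeric
@attribute 'survived' {0,1}
@attribute 'name' string
@attribute 'sex' {'female','male'}
@attribute 'age' numeric
@attribute 'sibsp' numeric
@attribute 'parch' numeric
@attribute 'ticket' string
@attribute 'fare' numeric
@attribute 'cabin' string
@attribute 'embarked' {'C','Q','S'}
@attribute 'boat' string
@attribute 'body' numeric
@attribute 'home.dest' string"""


def group_by_type(header):
    """Group attribute names by ARFF type (numeric / nominal / string)."""
    groups = defaultdict(list)
    for line in header.splitlines():
        _, name, arff_type = line.split(None, 2)
        # a brace-delimited value list marks a nominal (categorical) attribute
        kind = "nominal" if arff_type.startswith("{") else arff_type.lower()
        groups[kind].append(name.strip("'"))
    return dict(groups)


groups = group_by_type(ATTRIBUTES)
for kind, names in groups.items():
    print(kind, names)
```

The string fields (name, ticket, cabin, boat, home.dest) are the ones that typically need extra handling before modeling.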

Loading code:

from sklearn.datasets import fetch_openml
t = fetch_openml("titanic", parser="auto", version=1)

Titanic dataset version 2, 4 fields

Documentation:

https://www.openml.org/search?type=data&sort=runs&id=40704&status=active

Download URL:

https://www.openml.org/data/download/4965305/titanic.arff

Data fields:

@ATTRIBUTE Class NUMERIC
@ATTRIBUTE Age NUMERIC
@ATTRIBUTE Sex NUMERIC
@ATTRIBUTE class {-1,1}

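Mind the letter case here: the last field, lowercase class, is the target (it is the one reported under target_names in the output further down), so it plays the role of the survived field. The first field, uppercase Class, is an ordinary numeric feature.
Loading code: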

from sklearn.datasets import fetch_openml
t = fetch_openml("titanic", parser="auto", version=2)
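Besides name plus version, fetch_openml() also accepts a data_id argument, which pins a dataset by its numeric OpenML id. The sketch below maps the two versions to the ids that appear in the documentation URLs above (40945 and 40704); the dictionary and the helper name load_titanic are assumptions for illustration, and the actual call needs network access.

```python
# OpenML dataset ids, taken from the documentation URLs in this article.
TITANIC_DATA_IDS = {1: 40945, 2: 40704}


def load_titanic(version):
    """Load one Titanic version from OpenML by its numeric dataset id."""
    # Imported lazily so the mapping above can be used without scikit-learn.
    from sklearn.datasets import fetch_openml

    # data_id replaces name/version; do not pass them together.
    return fetch_openml(data_id=TITANIC_DATA_IDS[version], parser="auto")


# Example call (requires network access):
# t = load_titanic(2)
# print(t.details["version"])
```

Pinning by id avoids any ambiguity about which version a name resolves to.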

Problems you may run into (parameters)

Run this code:

from sklearn.datasets import fetch_openml
t1 = fetch_openml("titanic")

Warning message:

FutureWarning: The default value of `parser` will change from `'liac-arff'` to `'auto'` in 1.4. You can set `parser='auto'` to silence this warning. Therefore, an `ImportError` will be raised from 1.4 if the dataset is dense and pandas is not installed. Note that the pandas parser may return different data types. See the Notes Section in fetch_openml's API doc for details.

Solution: just add the extra parameter parser='auto'.

from sklearn.datasets import fetch_openml
t = fetch_openml("titanic", parser='auto')

Warning message:

UserWarning: Multiple active versions of the dataset matching the name titanic exist. Versions may be fundamentally different, returning version 2.

Solution: just add the extra parameter version=2.

from sklearn.datasets import fetch_openml
t = fetch_openml("titanic", parser="auto", version=2)
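If you would rather have an unpinned version fail loudly than scroll past a warning, Python's warnings module can promote a specific message to an error. The sketch below uses a hypothetical stand-in, fake_fetch, that only reproduces the warning text quoted above, so it runs without network access; the same filter should apply to a real fetch_openml() call, but that part is an assumption and was not run here.

```python
import warnings


def fake_fetch(name, version=None):
    """Hypothetical stand-in for fetch_openml: only reproduces the warning."""
    if version is None:
        warnings.warn(
            "Multiple active versions of the dataset matching the name "
            f"{name} exist. Versions may be fundamentally different, "
            "returning version 2.",
            UserWarning,
        )
        version = 2
    return {"name": name, "version": version}


with warnings.catch_warnings():
    # Turn only this particular warning into an exception.
    warnings.filterwarnings("error", message="Multiple active versions")
    try:
        fake_fetch("titanic")  # no version given: the warning becomes an error
        unpinned_failed = False
    except UserWarning:
        unpinned_failed = True
    pinned = fake_fetch("titanic", version=2)  # explicit version: no warning

print(unpinned_failed, pinned["version"])
```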

The return value is, once again, of the "Bunch" type.

<class 'sklearn.utils._bunch.Bunch'>

{'data':        Class    Age    Sex
0    -1.8700 -0.228  0.521
1    -0.9230 -0.228 -1.920
2    -0.9230 -0.228 -1.920
3     0.9650 -0.228  0.521
4     0.0214 -0.228  0.521
...      ...    ...    ...
2196  0.9650 -0.228  0.521
2197 -0.9230 -0.228  0.521
2198 -1.8700 -0.228  0.521
2199  0.9650 -0.228  0.521
2200 -0.9230 -0.228 -1.920

[2201 rows x 3 columns], 'target': 0       -1
1        1
2        1
3        1
4       -1
        ..
2196    -1
2197    -1
2198    -1
2199    -1
2200     1
Name: class, Length: 2201, dtype: category
Categories (2, object): ['-1', '1'], 'frame':        Class    Age    Sex class
0    -1.8700 -0.228  0.521    -1
1    -0.9230 -0.228 -1.920     1
2    -0.9230 -0.228 -1.920     1
3     0.9650 -0.228  0.521     1
4     0.0214 -0.228  0.521    -1
...      ...    ...    ...   ...
2196  0.9650 -0.228  0.521    -1
2197 -0.9230 -0.228  0.521    -1
2198 -1.8700 -0.228  0.521    -1
2199  0.9650 -0.228  0.521    -1
2200 -0.9230 -0.228 -1.920     1

[2201 rows x 4 columns], 'categories': None, 'feature_names': ['Class', 'Age', 'Sex'], 'target_names': ['class'], 'DESCR': 'PMLB version of the Titanic dataset, which only uses 3 features. See version 1 for the complete version: https://www.openml.org/d/40945\n\nDownloaded from openml.org.', 'details': {'id': '40704', 'name': 'Titanic', 'version': '2', 'description_version': '1', 'format': 'ARFF', 'upload_date': '2017-04-06T12:38:28', 'licence': 'public', 'url': 'https://api.openml.org/data/v1/download/4965305/Titanic.arff', 'parquet_url': 'https://openml1.win.tue.nl/datasets/0004/40704/dataset_40704.pq', 'file_id': '4965305', 'default_target_attribute': 'class', 'tag': ['Computer Systems', 'derived', 'Machine Learning'], 'visibility': 'public', 'minio_url': 'https://openml1.win.tue.nl/datasets/0004/40704/dataset_40704.pq', 'status': 'active', 'processing_date': '2018-10-04 07:15:38', 'md5_checksum': '08416114dd85d0ebd932fcb1d87650c1'}, 'url': 'https://www.openml.org/d/40704'}
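A Bunch is a dictionary whose keys can also be read as attributes, which is why both t["data"] and t.data work in the sections that follow. Here is a small offline sketch that builds a Bunch by hand with two of the keys from the dump above; it assumes scikit-learn is installed and does not download anything.

```python
from sklearn.utils import Bunch

# Hand-built Bunch holding two of the keys seen in the dump above.
b = Bunch(feature_names=["Class", "Age", "Sex"], target_names=["class"])

print(isinstance(b, dict))                    # Bunch subclasses dict
print(b["feature_names"] is b.feature_names)  # key and attribute access agree
print(sorted(b.keys()))
```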

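Breaking down the returned data

Following the pattern of the iris loader load_iris(), fetch_openml() also returns the values below:

data [the features used for computation]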

print(type(t["data"]),t["data"])

Output:

<class 'pandas.core.frame.DataFrame'>
      Class    Age    Sex
0    -1.8700 -0.228  0.521
1    -0.9230 -0.228 -1.920
2    -0.9230 -0.228 -1.920
3     0.9650 -0.228  0.521
4     0.0214 -0.228  0.521
...      ...    ...    ...
2196  0.9650 -0.228  0.521
2197 -0.9230 -0.228  0.521
2198 -1.8700 -0.228  0.521
2199  0.9650 -0.228  0.521
2200 -0.9230 -0.228 -1.920

[2201 rows x 3 columns]

target [the target to be predicted]

print(type(t["target"]),t["target"])

Output:

<class 'pandas.core.series.Series'> 
0       -1
1        1
2        1
3        1
4       -1
        ..
2196    -1
2197    -1
2198    -1
2199    -1
2200     1
Name: class, Length: 2201, dtype: category
Categories (2, object): ['-1', '1']
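Note that the categories are the strings '-1' and '1', not integers. If a downstream step wants integer labels, one possible conversion is sketched below on a five-row sample that mirrors the first rows above. The sample Series is hand-written, and the 0/1 mapping simply marks label 1 as the positive class; which label means "survived" is not stated in the output above, so treat that as an assumption to verify.

```python
import pandas as pd

# Hand-written sample mirroring the first five target values above.
y = pd.Series(["-1", "1", "1", "1", "-1"], dtype="category", name="class")

# Go through str first so the conversion does not depend on category internals.
labels = y.astype(str).astype(int)

# Optional 0/1 encoding, treating label 1 as the positive class.
binary = (labels == 1).astype(int)

print(labels.tolist())
print(binary.tolist())
```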

target_names

print(type(t["target_names"]),t["target_names"])

Output:

<class 'list'> ['class']

feature_names

print(type(t.feature_names),t.feature_names)

Output:

<class 'list'> ['Class', 'Age', 'Sex']

frame [data plus target]

print(type(t.frame),t.frame)

Output:

<class 'pandas.core.frame.DataFrame'>

       Class    Age    Sex class
0    -1.8700 -0.228  0.521    -1
1    -0.9230 -0.228 -1.920     1
2    -0.9230 -0.228 -1.920     1
3     0.9650 -0.228  0.521     1
4     0.0214 -0.228  0.521    -1
...      ...    ...    ...   ...
2196  0.9650 -0.228  0.521    -1
2197 -0.9230 -0.228  0.521    -1
2198 -1.8700 -0.228  0.521    -1
2199  0.9650 -0.228  0.521    -1
2200 -0.9230 -0.228 -1.920     1

DESCR

print(type(t.DESCR),t.DESCR)

Output:

<class 'str'>

PMLB version of the Titanic dataset, which only uses 3 features. See version 1 for the complete version: https://www.openml.org/d/40945
Downloaded from openml.org.

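The as_frame parameter, default 'auto' (behaves like True here)

as_frame controls the types of the returned values, although the container as a whole is still a Bunch.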

from sklearn.datasets import fetch_openml
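# Default as_frame: pandas objects. Comment out the next line to test this case.
t = fetch_openml("titanic", parser="auto", version=2)
# as_frame=False: NumPy arrays. This assignment overrides the one above.
t = fetch_openml("titanic", parser="auto", version=2, as_frame=False)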

# print(type(t), t)
print(type(t["data"]))
print(type(t["target"]))
print(type(t["target_names"]))
print(type(t["feature_names"]))
print(type(t["frame"]))

as_frame	data	target	target_names	feature_names	frame
default/True	DataFrame	Series	list	list	DataFrame
False	ndarray	ndarray	list	list	NoneType

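The return_X_y parameter, default False

A single return_X_y=True overturns all of the conclusions above: the function then returns an (X, y) tuple instead of a Bunch. Here X is the former data, y is the former target, and X plus y together make up the former frame.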

from sklearn.datasets import fetch_openml
t = fetch_openml("titanic", parser="auto", version=2)
print(type(t))

X,y = fetch_openml("titanic", parser="auto", version=2, return_X_y=True)
print(type(X), X)
print(type(y), y)

Output:

<class 'sklearn.utils._bunch.Bunch'>

<class 'pandas.core.frame.DataFrame'>
      Class    Age    Sex
0    -1.8700 -0.228  0.521
1    -0.9230 -0.228 -1.920
2    -0.9230 -0.228 -1.920
3     0.9650 -0.228  0.521
4     0.0214 -0.228  0.521
...      ...    ...    ...
2196  0.9650 -0.228  0.521
2197 -0.9230 -0.228  0.521
2198 -1.8700 -0.228  0.521
2199  0.9650 -0.228  0.521
2200 -0.9230 -0.228 -1.920

[2201 rows x 3 columns]

<class 'pandas.core.series.Series'> 
0       -1
1        1
2        1
3        1
4       -1
        ..
2196    -1
2197    -1
2198    -1
2199    -1
2200     1
Name: class, Length: 2201, dtype: category
Categories (2, object): ['-1', '1']
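The "X plus y is the former frame" relationship can be checked offline on a two-row sample. The values below are copied by hand from the first rows of the output above; pd.concat is just one way to glue the pieces together and is not claimed to be what fetch_openml() does internally.

```python
import pandas as pd

# Two hand-copied rows standing in for the real X and y.
X = pd.DataFrame(
    {"Class": [-1.87, -0.923], "Age": [-0.228, -0.228], "Sex": [0.521, -1.92]}
)
y = pd.Series(["-1", "1"], dtype="category", name="class")

# Column-wise concatenation rebuilds a frame-like table: features, then target.
frame = pd.concat([X, y], axis=1)
print(list(frame.columns))
print(frame.shape)

# Going the other way: split a frame back into features and target.
X_again = frame.drop(columns="class")
y_again = frame["class"]
print(X_again.equals(X))
```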

Conclusion

For more write-ups on machine learning, see 苏南大叔's blog posts:

https://xctf.net/tag/机器学习/
