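I'm using the DBSCAN clustering algorithm for an NLP task, to see whether there are any clusters among the words.
The dataset I have has the following format: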
dbscan_df1 = 0.218139 -0.187614 0.369169 -0.092737 -0.557485 0.081599 -0.128570 0.176931 0.017034 0.037340 ... 0.144367 -0.016442 -0.035220 -0.054845 -0.066294 0.478536 -0.009905 0.020157 'to'
0.337206 -0.198735 0.111028 -0.260347 -0.398793 -0.154748 0.000725 -0.066050 0.053052 0.049676 ... 0.472701 0.218401 -0.048758 -0.178615 0.202199 0.206250 0.120965 0.644875 'antigens'
.
.
.
0.361761 -0.415790 0.122305 0.254493 0.088273 0.322763 0.041017 -0.041002 -0.169154 -0.420434 ... -0.131940 -0.104786 -0.072195 -0.322311 0.065886 -0.074053 -0.090491 0.276367 'sailboat'
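So there are 65 columns, where the first 64 columns are the representation of each unique word and the last column holds the word itself. I then do the following to prepare the dataset for the DBSCAN algorithm: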
# Select all columns except the 'Unique Words' one
dbscan_df = dbscan_df1.loc[:, dbscan_df1.columns != 'Unique Words']
Finally, I apply the DBSCAN algorithm, following this tutorial to determine its parameters: https://www.reneshbedre.com/blog/dbscan-python.html
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Compute the parameters of the DBSCAN method
# For a multidimensional dataset, minPts should be 2 * number of dimensions (2 * 64 = 128 here)
# To determine the optimal ε parameter, I compute the kNN distances of the input dataset using the kNN method
# n_neighbors = 129 because kneighbors returns the distance of each point to itself (i.e. the first column will be zeros)
nbrs = NearestNeighbors(n_neighbors=129).fit(dbscan_df)
# Find the k-neighbors of a point
neigh_dist, neigh_ind = nbrs.kneighbors(dbscan_df)
# Sort the neighbor distances (lengths to points) in ascending order
# axis=0 sorts each column independently, i.e. down the rows
sort_neigh_dist = np.sort(neigh_dist, axis=0)
# Column 128 holds the distance to the 128th nearest neighbor (column 0 is the point itself)
k_dist = sort_neigh_dist[:, 128]
plt.plot(k_dist)
plt.axhline(y=5, linewidth=1, linestyle='dashed', color='k')
plt.ylabel("k-NN distance")
plt.xlabel("Sorted observations (128th NN)")
plt.show()
# Fit DBSCAN method
dbscan_clustering = DBSCAN(min_samples=129, eps=5, n_jobs=-1).fit(dbscan_df)
Counter(dbscan_clustering.labels_)
output: Counter({0: 1201, -1: 1})
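Question: Is there any way to see which words have label 0 and which words have label -1? For example, can I check whether the word antigens has label 0 or label -1?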
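One possible approach, as a minimal sketch: `dbscan_clustering.labels_` holds one label per row, in the same order as the rows passed to `fit`, so the labels can be paired with the words position by position. The helper names `words_by_label` and `label_of`, and the toy words and labels used here, are assumptions for illustration only; they are not part of scikit-learn or pandas, and the toy labels are not the real clustering result.

```python
from collections import defaultdict


def words_by_label(words, labels):
    """Group words by the cluster label found at the same position."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    groups = defaultdict(list)
    for word, label in zip(words, labels):
        # int() turns NumPy integers into plain Python ints for clean dict keys
        groups[int(label)].append(word)
    return dict(groups)


def label_of(word, words, labels):
    """Return the label of the first occurrence of `word`, or None if absent."""
    for candidate, label in zip(words, labels):
        if candidate == word:
            return int(label)
    return None


# Toy stand-in data (hypothetical labels, NOT the output of the real model)
toy_words = ["to", "antigens", "sailboat"]
toy_labels = [0, 0, -1]

groups = words_by_label(toy_words, toy_labels)
antigens_label = label_of("antigens", toy_words, toy_labels)
print(groups)
print(antigens_label)
```

With the objects from the question, the call would presumably look like `words_by_label(dbscan_df1['Unique Words'].tolist(), dbscan_clustering.labels_)`, assuming the rows of `dbscan_df` were not reordered or filtered between building it and calling `fit`. Label -1 is the one DBSCAN uses for noise points.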