文本分类从入门到精通
本文整理自笔者年前在知乎上的一个回答:

大数据舆情情感分析,如何提取情感并使用什么样的工具?(贴情感标签)
1、我将数据筛选预处理好,然后分好词。
2、是不是接下来应该与情感词汇库对照,生成结合词频和情感词库的情感关键词库。
3、将信息与情感关键词库进行比对,对信息加以情感标记。
4、我想问实现前三步,需要什么工具的什么功能呢?据说用spss和武汉大学的ROST WordParser。该如何使用呢?
https://www.zhihu.com/question/31471793/answer/542401478

情感分析说白了,就是一个文本(多)分类问题,笔者看到的一般情感分析都是2类(正负面)或者3类(正面、中性和负面)。其实,这种粒度是远远不够的。本着“Talk is cheap, show you my code”的原则,我不扯咸淡,直接上代码给出解决方案(而且是经过真实文本数据验证了的):我用一个19个分类的例子来讲讲各类文本分类模型,从传统的机器学习文本分类模型到现今流行的基于深度学习的文本分类模型,最后给出一个超NB的模型集成,效果最优。

在这篇文章中,笔者将讨论自然语言处理中文本分类的相关问题。笔者将使用一个复旦大学开源的文本分类语料库,对文本分类的一般流程和常用模型进行探讨。首先,笔者会创建一个非常基础的初始模型,然后使用不同的特征对它进行改进。接下来,笔者还将讨论如何使用深度神经网络来解决NLP问题,并在文章末尾给出一些关于模型集成的一般性想法。

本文覆盖的文本分类方法有:

TF-IDF
Count Features
Logistic Regression
Naive Bayes
SVM
Xgboost
Grid Search
Word Vectors
Dense Network
LSTM
GRU
Ensembling
NOTE: 笔者并不能保证你学习了本notebook之后就能在NLP相关比赛中获得非常高的分数。 但是,如果你正确地“吃透”它,并根据实际情况适时作出一些调整,你可以获得非常高的分数。
废话不多说,让我们先导入接下来要使用的重要 Python 模块。

#载入接下来分析用的库
import pandas as pd
import numpy as np
import xgboost as xgb
from tqdm import tqdm
from sklearn.svm import SVC
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
from nltk import word_tokenize
Using TensorFlow backend.
接下来是加载并检视数据集

data = pd.read_excel('/home/kesci/input/Chinese_NLP6474/复旦大学中文文本分类语料.xlsx', 'sheet1')
data.head()
分类 正文
0 艺术 【 文献号 】1-2432\n【原文出处】出版发行研究\n【原刊地名】京\n【原刊期号】1…
1 艺术 【 文献号 】1-2435\n【原文出处】扬州师院学报:社科版\n【原刊期号】199504…
2 艺术 【 文献号 】1-2785\n【原文出处】南通师专学报:社科版\n【原刊期号】199503…
3 艺术 【 文献号 】1-3021\n【原文出处】社会科学战线\n【原刊地名】长春\n【原刊期号】…
4 艺术 【 文献号 】1-3062\n【原文出处】上海文化\n【原刊期号】199505\n【原刊页…
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9249 entries, 0 to 9248
Data columns (total 2 columns):
分类 9249 non-null object
正文 9249 non-null object
dtypes: object(2)
memory usage: 144.6+ KB
data.分类.unique()
array(['艺术', '文学', '哲学', '通信', '能源', '历史', '矿藏', '空间', '教育', '交通', '计算机',
       '环境', '电子', '农业', '体育', '时政', '医疗', '经济', '法律'], dtype=object)
对文本数据的“正文”字段进行分词。这里是在 Linux 上运行的,可以开启 jieba 的并行分词模式,分词速度会比平常快很多倍,具体取决于 CPU 核心数。

import jieba
jieba.enable_parallel(64) #并行分词开启
data['文本分词'] = data['正文'].apply(lambda i: jieba.cut(i))
Building prefix dict from the default dictionary …
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.402 seconds.
Prefix dict has been built succesfully.
值得注意的是,分词是任何中文文本分类的起点,分词的质量会直接影响到后面的模型效果。在这里,作为演示,笔者有点偷懒,其实你还可以:

设置可靠的自定义词典,以便分词更精准;
采用分词效果更好的分词器,如pyltp、THULAC、Hanlp等;
编写预处理类,就像下面要谈到的数字特征归一化那样,去掉文本中的#@¥%……&等符号(这一点可参考本列表后面给出的简单示例)。
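针对上面第1点和第3点,这里给出一个极简的预处理示意(仅为思路演示:其中的自定义词典路径 user_dict.txt 与正则规则都是笔者假设的,需要按自己的语料调整):

import re
import jieba

jieba.load_userdict('user_dict.txt')   # 假设存在的自定义词典,每行一个词,可附带词频、词性

def clean_and_cut(text):
    # 去掉 #@¥%…& 之类的噪声符号,并把连续空白压缩成一个空格
    text = re.sub(r'[#@¥%…&*]+', ' ', str(text))
    text = re.sub(r'\s+', ' ', text)
    return ' '.join(jieba.lcut(text))

# 用法示意:data['文本分词'] = data['正文'].apply(clean_and_cut)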
data['文本分词'] = [' '.join(i) for i in data['文本分词']]
data.head()
分类 正文 文本分词
0 艺术 【 文献号 】1-2432\n【原文出处】出版发行研究\n【原刊地名】京\n【原刊期号】1…  【 文献号 】 1 - 2432 \n 【 原文 出处 】 出版发行 研究 \n…
1 艺术 【 文献号 】1-2435\n【原文出处】扬州师院学报:社科版\n【原刊期号】199504…  【 文献号 】 1 - 2435 \n 【 原文 出处 】 扬州 师院 学报 :…
2 艺术 【 文献号 】1-2785\n【原文出处】南通师专学报:社科版\n【原刊期号】199503…  【 文献号 】 1 - 2785 \n 【 原文 出处 】 南通 师专 学报 :…
3 艺术 【 文献号 】1-3021\n【原文出处】社会科学战线\n【原刊地名】长春\n【原刊期号】…  【 文献号 】 1 - 3021 \n 【 原文 出处 】 社会科学 战线 \n…
4 艺术 【 文献号 】1-3062\n【原文出处】上海文化\n【原刊期号】199505\n【原刊页…  【 文献号 】 1 - 3062 \n 【 原文 出处 】 上海 文化 \n 【…
这是一个典型的文本多分类问题,需要将文本划分到给定的19个类别上。
针对该问题,笔者采用了kaggle上通用的 Multi-Class Log-Loss 作为评测指标(Evaluation Metric)。

def multiclass_logloss(actual, predicted, eps=1e-15):
    """对数损失度量(Logarithmic Loss Metric)的多分类版本。
    :param actual: 包含actual target classes的数组
    :param predicted: 分类预测结果矩阵, 每个类别都有一个概率
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
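这个实现与 scikit-learn 自带的 log_loss 在多分类下是一致的,可以用下面这段小例子自行验证(其中的概率矩阵是随手构造的示例数据):

from sklearn.metrics import log_loss

y_true = np.array([0, 2, 1])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.2, 0.5, 0.3]])

print(multiclass_logloss(y_true, y_prob))           # 本文的实现
print(log_loss(y_true, y_prob, labels=[0, 1, 2]))   # sklearn 的实现,二者结果应一致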

接下来用scikit-learn中的LabelEncoder将文本标签(Text Label)转化为数字(Integer)

lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(data.分类.values)
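顺便一提,如果想查看各个中文类别与整数编码的对应关系,或者把模型输出的整数还原成类别名,可以这样做:

# 类别名 -> 整数编码 的映射(编码按类别名排序后自动分配)
print(dict(zip(lbl_enc.classes_, lbl_enc.transform(lbl_enc.classes_))))

# 把整数编码还原成原始的中文标签
print(lbl_enc.inverse_transform([0, 1, 2]))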
在进一步研究之前,我们必须将数据分成训练和验证集。 我们可以使用scikit-learn的model_selection模块中的train_test_split来完成它。

xtrain, xvalid, ytrain, yvalid = train_test_split(data.文本分词.values, y,
                                                  stratify=y,
                                                  random_state=42,
                                                  test_size=0.1, shuffle=True)
print (xtrain.shape)
print (xvalid.shape)
(8324,)
(925,)
构建基础模型(Basic Models)
让我们先创建一个非常基础的模型。

这个非常基础的模型(very first model)基于 TF-IDF (Term Frequency - Inverse Document Frequency) + 逻辑斯蒂回归(Logistic Regression)。

笔者将scikit-learn中的TfidfVectorizer类稍稍改写下,以便将文本中的数字特征统一表示成"#NUMBER",达到一定的降噪效果。

def number_normalizer(tokens):
    """将所有数字标记映射为一个占位符(Placeholder)。
    对于许多实际应用场景来说,以数字开头的tokens不是很有用,
    但这样tokens的存在也有一定相关性。通过将所有数字都表示成同一个符号,可以达到降维的目的。
    """
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)


class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super(NumberNormalizingVectorizer, self).build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))

利用刚才创建的NumberNormalizingVectorizer类来提取文本特征,注意其中各个参数的含义,具体可参考 sklearn 官方文档。

stwlist = [line.strip() for line in open('/home/kesci/input/stopwords7085/停用词汇总.txt',
                                         'r', encoding='utf-8').readlines()]
tfv = NumberNormalizingVectorizer(min_df=3,
                                  max_df=0.5,
                                  max_features=None,
                                  ngram_range=(1, 2),
                                  use_idf=True,
                                  smooth_idf=True,
                                  stop_words=stwlist)

使用TF-IDF在训练集和验证集的全部文本上进行拟合(只利用文本本身,不涉及标签)

tfv.fit(list(xtrain) + list(xvalid))
xtrain_tfv = tfv.transform(xtrain)
xvalid_tfv = tfv.transform(xvalid)
#利用提取的TFIDF特征来fit一个简单的Logistic Regression
clf = LogisticRegression(C=1.0, solver='lbfgs', multi_class='multinomial')
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))
#print(classification_report(predictions, yvalid))
logloss: 0.607
做完第一个基础模型后,得出的 multiclass logloss 是0.607.

但笔者“贪婪”,想要获得更好的分数。 基于相同模型采用不同的特征,再看看结果如何。

我们也可以使用词汇计数(Word Counts)作为特征,而不是使用TF-IDF。这可以使用scikit-learn中的CountVectorizer轻松完成。

ctv = CountVectorizer(min_df=3,
                      max_df=0.5,
                      ngram_range=(1, 2),
                      stop_words=stwlist)

使用Count Vectorizer在训练集和验证集的全部文本上进行拟合(同样只利用文本本身,不涉及标签)

ctv.fit(list(xtrain) + list(xvalid))
xtrain_ctv = ctv.transform(xtrain)
xvalid_ctv = ctv.transform(xvalid)
#利用提取的word counts特征来fit一个简单的Logistic Regression

clf = LogisticRegression(C=1.0, solver='lbfgs', multi_class='multinomial')
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

#print(classification_report(predictions, yvalid))
logloss: 0.732
貌似效果不佳,multiclass logloss达到了0.732!!!

接下来,让我们尝试一个非常简单的模型:朴素贝叶斯,它在早期的文本分类中非常有名。

让我们看看在这个数据集上使用朴素贝叶斯时会发生什么:

#利用提取的TFIDF特征来fit Naive Bayes
clf = MultinomialNB()
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

#print(classification_report(predictions, yvalid))
logloss: 0.841
朴素贝叶斯模型的表现也不咋地!让我们再基于词汇计数特征使用朴素贝叶斯模型,看看会发生什么?

#利用提取的word counts特征来fit Naive Bayes
clf = MultinomialNB()
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

#print(classification_report(predictions, yvalid))
logloss: 3.780
3.780,这次效果差到爆! 传统文本分类算法里还有一个名叫支持向量机(SVM)的模型。 SVM曾是很多机器学习爱好者的“最爱”。 因此,我们必须在此数据集上尝试SVM。

由于SVM需要花费大量时间,因此在应用SVM之前,我们将使用奇异值分解(Singular Value Decomposition )来减少TF-IDF中的特征数量。

同时,在使用SVM之前,我们还需要将数据标准化(Standardize Data )

#使用SVD进行降维,components设为120,对于SVM来说,SVD的components的合适调整区间一般为120~200
svd = decomposition.TruncatedSVD(n_components=120)
svd.fit(xtrain_tfv)
xtrain_svd = svd.transform(xtrain_tfv)
xvalid_svd = svd.transform(xvalid_tfv)

#对从SVD获得的数据进行缩放
scl = preprocessing.StandardScaler()
scl.fit(xtrain_svd)
xtrain_svd_scl = scl.transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)
现在是时候应用SVM模型进行文本分类了。 在运行以下单元格后,你可以去喝杯茶了—因为这将耗费大量的时间…

调用下SVM模型

clf = SVC(C=1.0, probability=True) # since we need probabilities
clf.fit(xtrain_svd_scl, ytrain)
predictions = clf.predict_proba(xvalid_svd_scl)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

#print(classification_report(predictions, yvalid))
logloss: 0.347
看起来,SVM在这些数据上表现还行。

在采用更高级的算法前,让我们再试试Kaggle上应用最流行的算法:xgboost!

基于tf-idf特征,使用xgboost

clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_tfv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_tfv.tocsc())

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

#print(classification_report(predictions, yvalid))
logloss: 0.182
效果不错,比SVM还牛呢!

基于word counts特征,使用xgboost

clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_ctv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_ctv.tocsc())

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

#print(classification_report(predictions, yvalid))
logloss: 0.154

基于tf-idf的svd特征,使用xgboost

clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_svd, ytrain)
predictions = clf.predict_proba(xvalid_svd)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

#print(classification_report(predictions, yvalid))
logloss: 0.394

再对经过数据标准化(Scaling)的tf-idf-svd特征使用xgboost

clf = xgb.XGBClassifier(nthread=10)
clf.fit(xtrain_svd_scl, ytrain)
predictions = clf.predict_proba(xvalid_svd_scl)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

#print(classification_report(predictions, yvalid))
logloss: 0.373
XGBoost的效果似乎挺棒的! 但我觉得还可以进一步优化,因为我还没有做过任何超参数优化。 我很懒,所以我只会告诉你该怎么做,具体的你可以自己尝试 :) 这将在下一节中讨论:

网格搜索(Grid Search)
网格搜索是一种超参数优化的技巧。 如果知道这个技巧,你可以通过获取最优的参数组合来产生良好的文本分类效果。

在本节中,我将讨论使用基于逻辑回归模型的网格搜索。

在开始网格搜索之前,我们需要创建一个评分函数,这可以通过scikit-learn的make_scorer函数完成。

mll_scorer = metrics.make_scorer(multiclass_logloss, greater_is_better=False, needs_proba=True)
接下来,我们需要一个pipeline。 为了演示,我将使用由SVD、特征缩放(Standard Scaler)和逻辑回归模型组成的pipeline。

#SVD初始化
svd = TruncatedSVD()

Standard Scaler初始化

scl = preprocessing.StandardScaler()

再一次使用Logistic Regression

lr_model = LogisticRegression()

创建pipeline

clf = pipeline.Pipeline([('svd', svd),
                         ('scl', scl),
                         ('lr', lr_model)])
接下来我们需要一个参数网格(A Grid of Parameters):

param_grid = {'svd__n_components': [120, 180],
              'lr__C': [0.1, 1.0, 10],
              'lr__penalty': ['l1', 'l2']}
因此,对于SVD,我们评估120和180个分量(Components);对于逻辑回归,我们评估三个不同的正则化强度参数C,以及l1和l2两种惩罚项。 现在,我们可以开始对这些参数进行网格搜索咯。

网格搜索模型(Grid Search Model)初始化

model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
verbose=10, n_jobs=-1, iid=True, refit=True, cv=2)

#fit网格搜索模型
model.fit(xtrain_tfv, ytrain) #为了减少计算量,这里我们仅使用xtrain
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Fitting 2 folds for each of 12 candidates, totalling 24 fits

最终得分跟我们之前的SVM的结果相近。 这种技术可用于对xgboost甚至多项式朴素贝叶斯进行超参数调优。 我们将在这里使用tfidf数据,如下所示:

nb_model = MultinomialNB()

创建pipeline

clf = pipeline.Pipeline([(‘nb’, nb_model)])

搜索参数设置

param_grid = {'nb__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

网格搜索模型(Grid Search Model)初始化

model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
verbose=10, n_jobs=-1, iid=True, refit=True, cv=2)

fit网格搜索模型

model.fit(xtrain_tfv, ytrain) # 为了减少计算量,这里我们仅使用xtrain
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
相比于之前的朴素贝叶斯,本次得分提高了8%!
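同样的网格搜索也可以套到 xgboost 上。下面是一个示意性的写法(参数网格只是笔者随手给的例子,实际的取值范围和计算开销需要自行权衡):

xgb_model = xgb.XGBClassifier(nthread=10)

param_grid = {'max_depth': [5, 7],
              'n_estimators': [100, 200],
              'learning_rate': [0.05, 0.1]}

model = GridSearchCV(estimator=xgb_model, param_grid=param_grid,
                     scoring=mll_scorer, verbose=10, n_jobs=1, refit=True, cv=2)

# model.fit(xtrain_tfv.tocsc(), ytrain)   # 计算量较大,按需运行
# print("Best score: %0.3f" % model.best_score_)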

自从2013年谷歌的Tomas Mikolov团队提出word2vec以后,word2vec就成为了处理NLP问题的标配。word2vec训练词向量空间模型的速度比以往的方法快得多,许多新的词嵌入方法都基于(浅层)神经网络,而不再依赖传统的n元语法统计模型。

接下来,让我们来深入研究一下如何使用word2vec来进行NLP文本分类。

基于word2vec的词嵌入
在不深入细节的情况下,笔者将解释如何创建语句向量(Sentence Vectors),以及如何基于它们创建机器学习模型。鄙人是GloVe向量、word2vec和fasttext的粉丝(但平时还是用word2vec较多)。在这篇文章中,笔者使用的文本分类模型基于Word2vec词向量(100维)。

训练word2vec词向量:

X = data['文本分词']
X = [i.split() for i in X]
X[:2]
[输出较长,此处从略:X[:2] 为前两篇文档分词后的 token 列表]

训练word2vec词向量:

import gensim

model = gensim.models.Word2Vec(X, min_count=5, window=8, size=100)  # X是经分词后的文本构成的list,也就是tokens的列表的列表
embeddings_index = dict(zip(model.wv.index2word, model.wv.vectors))

print('Found %s word vectors.' % len(embeddings_index))
Found 119775 word vectors.
X是经分词后的文本构成的list,也就是tokens的列表的列表。

注意,Word2Vec还有3个值得关注的参数:iter是模型训练时迭代的次数,假如参与训练的文本量较少,就需要把这个参数调大一些;sg是模型训练算法的类别,1 代表 skip-gram,0 代表 CBOW;window是窗口大小,指当前词和预测词之间的最大距离,如果设得较小,模型学习到的更多是词汇间的功能性特征(词性相异),如果设得较大,则更多是词汇之间的相似性特征(词性相同)。假如语料够多,笔者一般会把 window 设置得大一些,比如 8~10。
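把这几个参数显式写出来大致是下面这个样子(这里沿用旧版 gensim 的参数名 size/iter;变量名 w2v_sg 以及 sg=1、iter=10 都只是示例取值,应结合语料规模自行调整):

w2v_sg = gensim.models.Word2Vec(X,
                                size=100,     # 词向量维度
                                window=8,     # 上下文窗口:当前词与预测词的最大距离
                                min_count=5,  # 词频低于该值的词会被忽略
                                sg=1,         # 1 表示 skip-gram,0 表示 CBOW
                                iter=10)      # 迭代次数,语料较少时可适当调大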

model['汽车']
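除了直接取某个词的向量,也可以用 most_similar 粗略检查一下词向量的质量(返回的相似词完全取决于训练语料,下面仅演示用法):

# 查看与“汽车”最相似的10个词及其余弦相似度
print(model.wv.most_similar('汽车', topn=10))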
#该函数会将语句转化为一个标准化的向量(Normalized Vector)
#import nltk
#nltk.download('punkt')

def sent2vec(s):
    words = str(s).lower()
    #words = word_tokenize(words)
    words = jieba.lcut(words)
    words = [w for w in words if not w in stwlist]
    #words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            #M.append(embeddings_index[w])
            M.append(model[w])
        except:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        # 句中没有任何词命中词表时,返回与词向量同维度(100维)的全零向量
        return np.zeros(100)
    return v / np.sqrt((v ** 2).sum())
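写好 sent2vec 之后,可以先拿一句话检查输出的维度和模长是否符合预期(100 维、归一化后模长为 1;若句中的词全部被过滤或不在词表里,则会得到全零向量):

vec = sent2vec('这辆汽车的发动机性能很好')
print(vec.shape)            # 预期为 (100,)
print(np.linalg.norm(vec))  # 命中词表时应非常接近 1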

对训练集和验证集使用上述函数,进行文本向量化处理

xtrain_w2v = [sent2vec(x) for x in tqdm(xtrain)]
xvalid_w2v = [sent2vec(x) for x in tqdm(xvalid)]
xtrain_w2v = np.array(xtrain_w2v)
xvalid_w2v = np.array(xvalid_w2v)
让我们看看xgboost在Word2vec词向量特征的表现如何:

基于word2vec特征在一个简单的Xgboost模型上进行拟合

clf = xgb.XGBClassifier(nthread=10, silent=False)
clf.fit(xtrain_w2v, ytrain)
predictions = clf.predict_proba(xvalid_w2v)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

基于word2vec特征在一个简单的Xgboost模型上进行拟合

clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
subsample=0.8, nthread=10, learning_rate=0.1, silent=False)
clf.fit(xtrain_w2v, ytrain)
predictions = clf.predict_proba(xvalid_w2v)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))
print(classification_report(yvalid, np.argmax(predictions, axis=1)))
我们可以看到,简单地对参数进行微调,就能提高基于word2vec词向量特征的xgboost得分! 相信我,你还可以从中继续“压榨”出更优秀的表现!

深度学习(Deep Learning)
这是一个深度学习大行其道的时代! 文本分类问题在它的指引下得到了突飞猛进的发展! 在这里,我们将基于前面训练好的word2vec词向量特征来训练LSTM和简单的全连接网络(Dense Network)。
让我们先从全连接网络开始:

在使用神经网络前,对数据进行缩放

scl = preprocessing.StandardScaler()
xtrain_w2v_scl = scl.fit_transform(xtrain_w2v)
xvalid_w2v_scl = scl.transform(xvalid_w2v)

对标签进行binarize处理

ytrain_enc = np_utils.to_categorical(ytrain)
yvalid_enc = np_utils.to_categorical(yvalid)
#创建1个3层的序列神经网络(Sequential Neural Net)
model = Sequential()

model.add(Dense(300, input_dim=100, activation='relu'))  # 输入维度与句向量维度(100)一致
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(300, activation='relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization())

model.add(Dense(19))  # 输出层神经元数等于类别数(19)
model.add(Activation('softmax'))

模型编译

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(xtrain_w2v_scl, y=ytrain_enc, batch_size=64,
          epochs=5, verbose=1,
          validation_data=(xvalid_w2v_scl, yvalid_enc))
你需要不断地对神经网络的参数进行调优,添加更多层,增加Dropout以获得更好的结果(下面给出一个示意)。在这里,笔者只是简单地实现一下,追求速度而不是最终效果,但它已经比没有任何优化的xgboost取得了更好的结果 :)
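比如,可以在上面的全连接网络里再加一层隐藏层,并配合 EarlyStopping 让训练自动停在验证集损失不再下降的位置。下面只是一种可行写法的示意,层数、神经元个数、Dropout 比例都需要自己实验:

model = Sequential()
model.add(Dense(300, input_dim=100, activation='relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization())

model.add(Dense(300, activation='relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization())

model.add(Dense(300, activation='relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization())

model.add(Dense(19, activation='softmax'))   # 19 个类别
model.compile(loss='categorical_crossentropy', optimizer='adam')

earlystop = EarlyStopping(monitor='val_loss', patience=3)
model.fit(xtrain_w2v_scl, y=ytrain_enc, batch_size=64, epochs=100, verbose=1,
          validation_data=(xvalid_w2v_scl, yvalid_enc), callbacks=[earlystop])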

为了更进一步,笔者使用LSTM,我们需要对文本数据进行Tokenize:

使用 keras tokenizer

token = text.Tokenizer(num_words=None)
max_len = 70

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

#对文本序列进行zero填充
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index
#基于已有的数据集中的词汇创建一个词嵌入矩阵(Embedding Matrix)
embedding_matrix = np.zeros((len(word_index) + 1, 100))
for word, i in tqdm(word_index.items()):
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
embedding_matrix[i] = embedding_vector
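构建好词嵌入矩阵之后,可以顺手统计一下词表覆盖率,大致判断有多少词能用上前面训练好的word2vec向量(纯属检查性代码):

covered = sum(1 for w in word_index if w in embeddings_index)
print('词表大小: %d,其中 %d 个词有对应词向量,覆盖率约 %.2f%%'
      % (len(word_index), covered, 100.0 * covered / len(word_index)))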

基于前面训练的Word2vec词向量,使用1个LSTM层加2个全连接层的模型

model = Sequential()
model.add(Embedding(len(word_index) + 1,
                    100,
                    weights=[embedding_matrix],
                    input_length=max_len,
                    trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(19))  # 类别数为19
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, verbose=1, validation_data=(xvalid_pad, yvalid_enc))
现在,我们看到分数小于0.5。 我跑了很多个epochs也没有获得最优的结果,但我们可以使用early stopping,让训练停在验证集表现最好的迭代节点上。

那我们该如何使用early stopping?

好吧,其实很简单的。 让我们再次compile模型:

基于前面训练的Word2vec词向量,重新构建上面的LSTM模型(1个LSTM层加2个全连接层)

model = Sequential()
model.add(Embedding(len(word_index) + 1,
                    100,
                    weights=[embedding_matrix],
                    input_length=max_len,
                    trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(19))  # 类别数为19
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

#在模型拟合时,使用early stopping这个回调函数(Callback Function)
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100,
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])
一个可能的问题是:为什么我会使用这么多的dropout? 嗯,因为fit模型时,如果没有或只用很少的dropout,模型就会出现过拟合(Overfit)😃

让我们看看双向长短时记忆(Bi-Directional LSTM)是否可以给我们带来更好的结果。 对于Keras来说,使用Bilstm小菜一碟:)

基于前面训练的Word2vec词向量,构建1个Bidirectional LSTM层加2个全连接层的模型

model = Sequential()
model.add(Embedding(len(word_index) + 1,
                    100,
                    weights=[embedding_matrix],
                    input_length=max_len,
                    trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(Bidirectional(LSTM(100, dropout=0.3, recurrent_dropout=0.3)))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(19))  # 类别数为19
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

#在模型拟合时,使用early stopping这个回调函数(Callback Function)
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100,
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])
很接近最优结果了! 让我们尝试两层的GRU:

基于前面训练的Word2vec词向量,构建1个2层的GRU模型

model = Sequential()
model.add(Embedding(len(word_index) + 1,
                    100,
                    weights=[embedding_matrix],
                    input_length=max_len,
                    trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(GRU(100, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
model.add(GRU(100, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(19))  # 类别数为19
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

#在模型拟合时,使用early stopping这个回调函数(Callback Function)
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100,
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])
太好了! 比我们以前的模型好多了! 持续优化,模型的性能将不断提高。

在文本分类的比赛中,想要获得最高分,你通常需要一个集成(Ensemble)多个模型的方案。 让我们来看看吧!

模型集成(Model Ensembling)
集多个文本分类模型之长,合成一个很棒的分类融合模型。
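在看下面这个较为通用的 Ensembler 类之前,可以先感受一下最朴素的集成方式:把几个模型在同一验证集上的预测概率做加权平均。下面的 clf_lr、clf_xgb、clf_nb 是假设你分别保存了前文训练好的逻辑回归、xgboost 和朴素贝叶斯模型,权重也只是示例:

pred_lr = clf_lr.predict_proba(xvalid_tfv)            # 逻辑回归(TF-IDF 特征)
pred_xgb = clf_xgb.predict_proba(xvalid_tfv.tocsc())  # xgboost(TF-IDF 特征)
pred_nb = clf_nb.predict_proba(xvalid_tfv)            # 朴素贝叶斯(TF-IDF 特征)

# 简单加权平均,权重之和为 1
ensemble_pred = 0.4 * pred_lr + 0.4 * pred_xgb + 0.2 * pred_nb
print("logloss: %0.3f " % multiclass_logloss(yvalid, ensemble_pred))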

#创建一个Ensembling主类,具体使用方法见下一个cell
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, KFold
import pandas as pd
import os
import sys
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="[%(asctime)s] %(levelname)s %(message)s",
    datefmt="%H:%M:%S", stream=sys.stdout)
logger = logging.getLogger(__name__)

class Ensembler(object):
    def __init__(self, model_dict, num_folds=3, task_type='classification', optimize=roc_auc_score,
                 lower_is_better=False, save_path=None):
        """
        Ensembler init function
        :param model_dict: 模型字典
        :param num_folds: ensembling所用的fold数量
        :param task_type: 分类(classification) 还是回归(regression)
        :param optimize: 优化函数,比如 AUC, logloss, F1等,必须接受2个参数,即y_test 和 y_pred
        :param lower_is_better: 优化函数(Optimization Function)的值越低越好还是越高越好
        :param save_path: 模型保存路径
        """

        self.model_dict = model_dict
        self.levels = len(self.model_dict)
        self.num_folds = num_folds
        self.task_type = task_type
        self.optimize = optimize
        self.lower_is_better = lower_is_better
        self.save_path = save_path

        self.training_data = None
        self.test_data = None
        self.y = None
        self.lbl_enc = None
        self.y_enc = None
        self.train_prediction_dict = None
        self.test_prediction_dict = None
        self.num_classes = None

    def fit(self, training_data, y, lentrain):
        """
        :param training_data: 二维表格形式的训练数据
        :param y: 二分类、多分类或回归的目标值
        :return: 用于预测的模型链(Chain of Models)

        """

        self.training_data = training_data
        self.y = y

        if self.task_type == 'classification':
            self.num_classes = len(np.unique(self.y))
            logger.info("Found %d classes", self.num_classes)
            self.lbl_enc = LabelEncoder()
            self.y_enc = self.lbl_enc.fit_transform(self.y)
            kf = StratifiedKFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, self.num_classes)
        else:
            self.num_classes = -1
            self.y_enc = self.y
            kf = KFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, 1)

        self.train_prediction_dict = {}
        for level in range(self.levels):
            self.train_prediction_dict[level] = np.zeros((train_prediction_shape[0],
                                                          train_prediction_shape[1] * len(self.model_dict[level])))

        for level in range(self.levels):

            if level == 0:
                temp_train = self.training_data
            else:
                temp_train = self.train_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):
                validation_scores = []
                foldnum = 1
                for train_index, valid_index in kf.split(self.train_prediction_dict[0], self.y_enc):
                    logger.info("Training Level %d Fold # %d. Model # %d", level, foldnum, model_num)

                    if level != 0:
                        l_training_data = temp_train[train_index]
                        l_validation_data = temp_train[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])
                    else:
                        l0_training_data = temp_train[0][model_num]
                        if type(l0_training_data) == list:
                            l_training_data = [x[train_index] for x in l0_training_data]
                            l_validation_data = [x[valid_index] for x in l0_training_data]
                        else:
                            l_training_data = l0_training_data[train_index]
                            l_validation_data = l0_training_data[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])

                    logger.info("Predicting Level %d. Fold # %d. Model # %d", level, foldnum, model_num)

                    if self.task_type == 'classification':
                        temp_train_predictions = model.predict_proba(l_validation_data)
                        self.train_prediction_dict[level][valid_index,
                        (model_num * self.num_classes):(model_num * self.num_classes) +
                                                       self.num_classes] = temp_train_predictions

                    else:
                        temp_train_predictions = model.predict(l_validation_data)
                        self.train_prediction_dict[level][valid_index, model_num] = temp_train_predictions
                    validation_score = self.optimize(self.y_enc[valid_index], temp_train_predictions)
                    validation_scores.append(validation_score)
                    logger.info("Level %d. Fold # %d. Model # %d. Validation Score = %f", level, foldnum, model_num,
                                validation_score)
                    foldnum += 1
                avg_score = np.mean(validation_scores)
                std_score = np.std(validation_scores)
                logger.info("Level %d. Model # %d. Mean Score = %f. Std Dev = %f", level, model_num,
                            avg_score, std_score)

            logger.info("Saving predictions for level # %d", level)
            train_predictions_df = pd.DataFrame(self.train_prediction_dict[level])
            train_predictions_df.to_csv(os.path.join(self.save_path, "train_predictions_level_" + str(level) + ".csv"),
                                        index=False, header=None)

        return self.train_prediction_dict

    def predict(self, test_data, lentest):
        self.test_data = test_data
        if self.task_type == 'classification':
            test_prediction_shape = (lentest, self.num_classes)
        else:
            test_prediction_shape = (lentest, 1)

        self.test_prediction_dict = {}
        for level in range(self.levels):
            self.test_prediction_dict[level] = np.zeros((test_prediction_shape[0],
                                                         test_prediction_shape[1] * len(self.model_dict[level])))
        self.test_data = test_data
        for level in range(self.levels):
            if level == 0:
                temp_train = self.training_data
                temp_test = self.test_data
            else:
                temp_train = self.train_prediction_dict[level - 1]
                temp_test = self.test_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):

                logger.info("Training Fulldata Level %d. Model # %d", level, model_num)
                if level == 0:
                    model.fit(temp_train[0][model_num], self.y_enc)
                else:
                    model.fit(temp_train, self.y_enc)

                logger.info("Predicting Test Level %d. Model # %d", level, model_num)

                if self.task_type == 'classification':
                    if level == 0:
                        temp_test_predictions = model.predict_proba(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict_proba(temp_test)
                    self.test_prediction_dict[level][:, (model_num * self.num_classes): (model_num * self.num_classes) +
                                                                                        self.num_classes] = temp_test_predictions

                else:
                    if level == 0:
                        temp_test_predictions = model.predict(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict(temp_test)
                    self.test_prediction_dict[level][:, model_num] = temp_test_predictions

            test_predictions_df = pd.DataFrame(self.test_prediction_dict[level])
            test_predictions_df.to_csv(os.path.join(self.save_path, "test_predictions_level_" + str(level) + ".csv"),
                                       index=False, header=None)

        return self.test_prediction_dict

#为每个level的集成指定所用的数据(level 0 为TF-IDF与Count特征,与model_dict中的模型一一对应)
train_data_dict = {0: [xtrain_tfv, xtrain_ctv, xtrain_tfv, xtrain_ctv], 1: [xtrain_w2v]}
test_data_dict = {0: [xvalid_tfv, xvalid_ctv, xvalid_tfv, xvalid_ctv], 1: [xvalid_w2v]}

model_dict = {0: [LogisticRegression(), LogisticRegression(), MultinomialNB(alpha=0.1), MultinomialNB()],
              1: [xgb.XGBClassifier(silent=True, n_estimators=120, max_depth=7)]}

ens = Ensembler(model_dict=model_dict, num_folds=3, task_type='classification',
                optimize=multiclass_logloss, lower_is_better=True, save_path='')

ens.fit(train_data_dict, ytrain, lentrain=xtrain_w2v.shape[0])
preds = ens.predict(test_data_dict, lentest=xvalid_w2v.shape[0])

检视集成模型在验证集上的损失(logloss)与分类报告

multiclass_logloss(yvalid, preds[1])
print(classification_report(yvalid, np.argmax(preds[1], axis=1)))   #取每行概率最大的类别作为预测标签
因此,我们看到集成模型在很大程度上提高了分数!但要注意:只有在参与集成的各个模型表现相当、都不算差的情况下,集成才能取得良好的效果;否则表现差的模型会拖后腿,导致整体性能反而不如单个模型。

由于本文只是一个教程,很多技术细节没有继续深入。你可以利用空余时间自行优化,也可以尝试其他方法,比如:

基于CNN的文本分类:卷积核在词序列上滑窗,作用类似于提取N-gram特征,训练效率很高(见下方的示意代码)

基于attention机制的BiLSTM、Hierarchical LSTM等

基于ELMO、BERT等预训练模型来提取高质量的文本特征,再喂给分类器
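
针对上面第一条,这里给出一个TextCNN风格的最小示意。注意这不是原文代码,只是一种可能的写法,假设 word_index、embedding_matrix、max_len、xtrain_pad、xvalid_pad、ytrain_enc、yvalid_enc 等变量与前文一致:

#TextCNN最小示意:Conv1D卷积核在词序列上滑动,作用类似于提取N-gram特征(此处窗口为3,近似tri-gram)
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                    100,
                    weights=[embedding_matrix],
                    input_length=max_len,
                    trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(GlobalMaxPooling1D())          #对每个卷积通道取全局最大值,得到定长向量
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(ytrain_enc.shape[1], activation='softmax'))   #输出层单元数等于类别数
model.compile(loss='categorical_crossentropy', optimizer='adam')

model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=10,
          verbose=1, validation_data=(xvalid_pad, yvalid_enc))

由于没有循环结构,这类卷积模型通常比LSTM/GRU训练快得多,适合用来做快速基线。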

以上就是笔者的分享,希望大家喜欢,也希望大家踊跃留言,发表看法和意见,我会持续更新的。

Note:需要训练语料的朋友请关注我的公众号【Social Listening与文本挖掘】,在后台回复 “语料”即可得到训练语料的下载链接。

笔者在和鲸(科赛)上的notebook附加资料 :

基于attention的情感分析,https://www.kesci.com/home/project/5c2f055881e912002b833620
【NLP文本表示】如何科学的在Tensorflow里使用词嵌入 ,https://www.kesci.com/home/project/5b6acf1d9889570010c88af1
基于Position_Embedding和 Attention机制进行文本分类,https://www.kesci.com/home/project/5c0d2a65864a0d002b5428fa
【BERT-至今最强大的NLP大杀器!】基于BERT的文本分类,https://www.kesci.com/home/project/5bfaa482954d6e001067396d
【NLP分析利器】利用Foolnltk进行自然语言处理,https://www.kesci.com/home/project/5b863f1131902f000f64adce
【文本挖掘】基于DBSCAN的文本聚类,https://www.kesci.com/home/project/5c19f99de17d84002c658466

