The scope of this blog post is to show how to do binary text classification using standard tools such as the tidytext and caret packages. One of the most common binary text classification tasks, if not the most common, is spam detection (spam vs non-spam), which happens in most email services, but there are many other applications such as language identification (English vs non-English). In this post I’ll showcase 5 different classification methods to see how they compare on this data. The methods all land on the less complex side of the spectrum and thus do not include complex deep neural networks.
An expansion of this subject is multiclass text classification which I might write about in the future.
Packages
We load the packages we need for this project: tidyverse for general data science work, tidytext for text manipulation and caret for modeling.
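Loading them could look like the following. This is only a sketch, and it assumes all three packages are already installed from CRAN.

```r
library(tidyverse)  # dplyr, stringr, ggplot2 and friends
library(tidytext)   # tokenizing and tf-idf
library(caret)      # model training and evaluation
```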
Data
The data we will be using for this demonstration will be some English social media disaster tweets discussed in this article. It consists of a number of tweets regarding accidents mixed in with a selection of control tweets (not about accidents). We start by loading in the data.
For this exercise we will only look at the body of the text. A handful of the tweets weren’t classified and are marked 'Can't Decide', so we are removing those as well. Since we are working with tweet data, we have the constraint that most of the tweets don’t actually have that much information in them: they are limited in characters and some only contain a couple of words.
We will at this stage remove what appear to be URLs using a regex and str_replace_all, and we will select the columns id, disaster and text.
First we take a quick look at the distribution of classes to see whether they are balanced.
We see that the classes are fairly balanced, so we don’t have to worry about sampling this time.
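The cleaning and the class count described above can be sketched on a toy data set. Everything here is made up for illustration: the tibble raw, its label column and its values are hypothetical stand-ins for the real tweets, and the URL regex is a simple one rather than the one used for the post.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the raw tweets; the real data has other columns.
raw <- tibble(
  id = 1:5,
  label = c("Relevant", "Not Relevant", "Can't Decide", "Relevant", "Not Relevant"),
  text = c("Wildfire spreading fast http://t.co/abc123",
           "what a lovely day",
           "hmm",
           "Flooding downtown https://t.co/xyz789 stay safe",
           "new album is fire")
)

data_clean <- raw %>%
  filter(label != "Can't Decide") %>%          # drop unclassified tweets
  mutate(
    disaster = label == "Relevant",            # logical response
    text = str_replace_all(text, "https?://\\S+", ""),  # strip URLs
    text = str_squish(text)                    # tidy up leftover whitespace
  ) %>%
  select(id, disaster, text)

# Class distribution: one row per class with its count.
count(data_clean, disaster)
```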
The representation we will be using in this post will be the bag-of-words representation in which we just count how many times each word appears in each tweet disregarding grammar and even word order (mostly).
We will construct a tf-idf vector model in which each unique word is represented as a column and each document (tweet in our case) is a row of tf-idf values. This will create a very large matrix/data.frame (a column for each unique word in the total data set), which will overload a lot of the different models we can implement. Furthermore, a lot of the words (or features in ML slang) will not add much information. We have a trade-off between information and computational speed.
First we will remove all the stop words. This will ensure that common words that usually don’t carry meaning don’t take up space (and time) in our model. Next we will only look at words that appear in at least 10 different tweets. Lastly we will be looking at both unigrams and bigrams to hopefully get better information extraction.
We then right-join these words to our data.frame before we calculate the tf-idf and cast it to a document-term matrix.
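A rough sketch of this feature pipeline, using tidytext on a toy data set, might look like the following. The tibble tweets and the threshold min_docs are hypothetical (the post uses 10, lowered here so the toy data survives the filter), the sketch only covers unigrams, and it uses an inner join rather than the right join mentioned above. Note that cast_dtm also needs the tm package to be installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the cleaned tweets.
tweets <- tibble(
  id = 1:4,
  text = c("forest fire near the lake",
           "fire crews fight the forest fire",
           "I love the lake in summer",
           "summer traffic was terrible")
)

min_docs <- 2  # the post uses 10; lowered for the toy data

# One row per (tweet, word), with stop words removed.
tokens <- tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Words that occur in at least `min_docs` different tweets.
keep <- tokens %>%
  distinct(id, word) %>%
  count(word, name = "n_docs") %>%
  filter(n_docs >= min_docs)

# Count words per tweet, weight by tf-idf, cast to a document-term matrix.
dtm <- tokens %>%
  inner_join(keep, by = "word") %>%
  count(id, word) %>%
  bind_tf_idf(word, id, n) %>%
  cast_dtm(id, word, tf_idf)

dim(dtm)  # documents x features
```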
The filtering leaves us with 2993 features. We create a meta data.frame that acts as an intermediate to our first data set, since some tweets might have disappeared completely after the reduction.
We also create the index (based on the meta data.frame) to separate the data into a training and test set. Since a lot of the methods take data.frames as inputs, we will take the time to create these here:
Now each row in the data.frame is a document/tweet (yay tidy principles!!).
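One way to build such an index is caret's createDataPartition, which samples within each class so both sets keep roughly the same class balance. The sketch below uses a made-up meta data.frame, and the 80/20 split is an assumption, not necessarily the proportion used for the results in this post.

```r
library(caret)

set.seed(1234)  # arbitrary seed, only for reproducibility of the sketch

# Hypothetical meta data.frame: one row per tweet that survived the filtering.
meta <- data.frame(
  id = 1:100,
  disaster = rep(c(TRUE, FALSE), times = c(55, 45))
)

# Stratified sample of row numbers for the training set.
train_idx <- createDataPartition(factor(meta$disaster), p = 0.8, list = FALSE)

train_meta <- meta[train_idx, ]
test_meta  <- meta[-train_idx, ]

table(train_meta$disaster)
table(test_meta$disaster)
```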
Missing tweets
In the feature selection earlier we decided to turn our focus towards certain words and word pairs, and with that we also turned our focus AWAY from certain words. Since the tweets are fairly short, it wouldn’t be surprising if a handful of the tweets slipped completely out of focus, as we noted earlier. Let’s take a look at those tweets here.
We see that a lot of them appear to be parts of URLs that our regex didn’t detect. Furthermore, it appears that in those tweets the sole text was the URL, which wouldn’t have helped us in this case anyway.
Modeling
Now that we have the data all clean and tidy we will turn our heads towards modeling. We will be using the wonderful caret package to employ the following models: SVM, Naive-Bayes, LogitBoost, random forest and nnet. These were chosen because of their frequent use (see why SVM are good at text classification) or because they are common in the classification field. They were also chosen because they were able to work with data with this number of variables in a reasonable time.
First time around we will not use a resampling method.
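In caret, skipping resampling is done through the control object passed to train. A minimal sketch, where the object name trctrl is just an illustrative choice:

```r
library(caret)

# method = "none" tells train() to fit a single model on the whole training
# set without resampling; the tuning grid must then have exactly one row.
trctrl <- trainControl(method = "none")
```

The same control object can be reused for every model fitted this way.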
SVM
The first model will be the svmLinearWeights2 model from the LiblineaR package, where we specify default parameters. We predict on the test data set based on the fitted model.
Lastly we calculate the confusion matrix using the confusionMatrix function in the caret package, and we get an accuracy of 0.7461646.
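To show what confusionMatrix returns, here is a small sketch on made-up predictions. The vectors truth and pred are hypothetical and have nothing to do with the tweet data.

```r
library(caret)

lvls <- c("disaster", "other")

# Hypothetical true labels and predictions for 8 tweets.
truth <- factor(c("disaster", "disaster", "disaster", "disaster",
                  "other", "other", "other", "other"), levels = lvls)
pred  <- factor(c("disaster", "disaster", "disaster", "other",
                  "other", "other", "other", "disaster"), levels = lvls)

cm <- confusionMatrix(data = pred, reference = truth)

cm$table                       # predictions (rows) against truth (columns)
cm$overall[["Accuracy"]]       # share of correct predictions, here 6 of 8
cm$overall[["AccuracyNull"]]   # the "No Information Rate"
```

The accuracy values quoted for each model in this post are the Accuracy entry of this object, computed on the test set.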
Naive-Bayes
The second model will be the naive_bayes model from the naivebayes package, where we specify default parameters. We predict on the test data set based on the fitted model.
We calculate the confusion matrix and we get an accuracy of 0.5564854.
LogitBoost
The third model will be the LogitBoost model from the caTools package. We don’t have to specify any parameters. We predict on the test data set based on the fitted model.
We calculate the confusion matrix and we get an accuracy of 0.632729.
Random forest
The fourth model will be the ranger model from the ranger package, where we specify default parameters. We predict on the test data set based on the fitted model.
We calculate the confusion matrix and we get an accuracy of 0.7777778.
nnet
The fifth and final model will be the nnet model from the nnet package, where we specify default parameters. We will also specify MaxNWts = 5000 such that it will work. It needs to be more than the number of columns multiplied by the size. We predict on the test data set based on the fitted model.
We calculate the confusion matrix and we get an accuracy of 0.7173408.
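The MaxNWts requirement can be checked with a little arithmetic. The helper nnet_weights below is hypothetical (it is not part of the nnet package) and counts the weights of a single-hidden-layer network without skip-layer connections, assuming one output unit. Using size = 1 is an assumption, since the post only says default parameters were used.

```r
# Number of weights in a single-hidden-layer network:
# (inputs + bias) * hidden units + (hidden units + bias) * outputs
nnet_weights <- function(n_inputs, size, n_outputs = 1) {
  (n_inputs + 1) * size + (size + 1) * n_outputs
}

nnet_weights(2993, size = 1)  # 2996, fits under MaxNWts = 5000
nnet_weights(2993, size = 2)  # 5991, would need a larger MaxNWts
```

So with 2993 features, MaxNWts = 5000 leaves room for one hidden unit but not for two.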
Comparing models
To see how the different models stack up we combine the metrics together in a data.frame. We then visualize the accuracy for the different models, with the red line being the “No Information Rate”, that is, a model that just picks the most common class.
As you can see, all but one approach do better than the “No Information Rate” on the first try, before tuning the hyperparameters.
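A minimal base R sketch of that comparison table follows, using the accuracies reported above. The object names are illustrative, and the “No Information Rate” is computed on made-up labels, since the value for the real test set is not restated here.

```r
# Accuracies as reported in the sections above.
results <- data.frame(
  model = c("svmLinearWeights2", "naive_bayes", "LogitBoost", "ranger", "nnet"),
  accuracy = c(0.7461646, 0.5564854, 0.632729, 0.7777778, 0.7173408)
)

# Best model first.
results <- results[order(results$accuracy, decreasing = TRUE), ]
results

# "No Information Rate": share of the most common class (toy labels only).
toy_labels <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
nir <- as.numeric(max(prop.table(table(toy_labels))))
nir
```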
Tuning hyperparameters
After trying out the different models we saw quite a spread in performance. But it is important to remember that the results might be due to good or bad default hyperparameters. There are a few different ways to handle this problem. I’ll show one of them here, grid search, on the SVM model so you get the idea.
We will be using 10-fold cross-validation with 3 repeats, which will slow down the procedure but will help limit overfitting. We will be using a grid search approach to find optimal hyperparameters. For the sake of time we have fixed 2 of the hyperparameters and only let one vary. Remember that the time it takes to search through all combinations grows quickly as the number of hyperparameters increases.
We have decided to limit the search to values around the weight parameter’s default value of 1. Once it has finished running we can plot the train object to see which value gives the highest accuracy.
And we see that it appears to be just around 1. It is important to search multiple parameters at the SAME TIME, as it cannot be assumed that the parameters are independent of each other. The only reason I didn’t do that here was to save time.
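The resampling scheme and the grid for such a search could be set up as sketched below. caret lists cost, Loss and weight as the tuning parameters of svmLinearWeights2; the fixed values for cost and Loss and the exact range for weight are assumptions for illustration, not necessarily the ones behind the plot above.

```r
library(caret)

# 10-fold cross-validation, repeated 3 times.
fit_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Fix cost and Loss, vary only weight around its default of 1.
svm_grid <- expand.grid(
  cost   = 1,
  Loss   = "L2",
  weight = seq(0.5, 1.5, by = 0.1)
)

nrow(svm_grid)  # number of candidate settings to evaluate
```

Both objects would then be passed to train through its trControl and tuneGrid arguments, together with method = "svmLinearWeights2". Every extra value added to cost or Loss multiplies the number of rows in the grid, which is why a joint search gets expensive quickly.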
I will leave it to you, the reader, to find out which of the models has the highest accuracy after parameter tuning.
I hope you have enjoyed this overview of binary text classification.