Jekyll2017-01-23T23:24:37+00:00https://ka1ch2n.github.io//Kaichen’s BlogAlways like this.Notes on SVD2017-01-20T00:00:00+00:002017-01-20T00:00:00+00:00https://ka1ch2n.github.io/articles/Notes-on-SVD<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"></script>
<p>Singular Value Decompositions (<strong>SVD</strong>) have become very popular in the field of Collaborative Filtering. The winning entry for the famed Netflix Prize had a number of SVD models including SVD++ blended with Restricted Boltzmann Machines.</p>
<p>However, I’ve been confused by the implementation of SVD for a long time and decided to gain a deeper understanding of it. Here are my notes on SVD.</p>
<p>I think one of the most impressive things in linear algebra is the connection between a matrix, which is just an array of numbers, and the linear transformation of space that it represents.</p>
<h3 id="the-singular-value-decomposition">The singular value decomposition</h3>
<p>Suppose we have a 2 * 2 matrix M.</p>
<p>The geometric essence of the singular value decomposition for 2 * 2 matrices: for any 2 * 2 matrix, we may find an orthogonal grid that is transformed into another orthogonal grid.</p>
<p>We will express this fact using vectors: with an appropriate choice of orthogonal unit vectors v1 and v2, the vectors Mv1 and Mv2 are orthogonal.</p>
<p>We will use u1 and u2 to denote unit vectors in the direction of Mv1 and Mv2. The lengths of Mv1 and Mv2, denoted by σ1 and σ2, describe the amount that the grid is stretched in those particular directions. These numbers are called the singular values of M.</p>
<p>Therefore,</p>
<p>\(M v_1 = \sigma_1 u_1\)</p>
<p>\(M v_2 = \sigma_2 u_2\)</p>
<p>We may now give a simple description of how the matrix M treats a general vector x. Since v1 and v2 are orthogonal unit vectors, we can write \(x = (v_1 \cdot x) v_1 + (v_2 \cdot x) v_2\), so</p>
<p>\(Mx = (v_1 \cdot x) M v_1 + (v_2 \cdot x) M v_2 \)</p>
<p>\(Mx = (v_1 \cdot x) \sigma_1 u_1 + (v_2 \cdot x) \sigma_2 u_2 \)</p>
<p>Because the dot product \(v \cdot x\) equals \(v^T x\), this leads to</p>
<p>\(Mx = u_1 \sigma_1 v_1^Tx + u_2 \sigma_2 v_2^Tx \)</p>
<p>\(M = u_1 \sigma_1 v_1^T + u_2 \sigma_2 v_2^T \)</p>
<p>usually expressed by</p>
<p>\(M = U \Sigma V^T\)</p>
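<p>The relations above can be checked numerically. The following is a minimal sketch using NumPy's <code class="highlighter-rouge">np.linalg.svd</code>; the matrix <code class="highlighter-rouge">M</code> is a hypothetical example chosen only for illustration.</p>

```python
import numpy as np

# Hypothetical 2 x 2 matrix, chosen only for illustration.
M = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# NumPy returns U, the singular values, and V^T (not V).
U, s, Vt = np.linalg.svd(M)

# Each right singular vector v_i is mapped to sigma_i * u_i.
for i in range(2):
    v_i = Vt[i]      # the i-th row of V^T is v_i
    u_i = U[:, i]    # the i-th column of U is u_i
    assert np.allclose(M @ v_i, s[i] * u_i)

# v_1 and v_2 are orthogonal, and so are M v_1 and M v_2
# (both dot products are zero up to rounding error).
print(np.dot(Vt[0], Vt[1]))
print(np.dot(M @ Vt[0], M @ Vt[1]))

# Summing the rank-one pieces sigma_i * u_i * v_i^T rebuilds M.
M_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(2))
print(np.allclose(M, M_rebuilt))
```

<p>Note that NumPy hands back the transpose of V directly, so the rows of <code class="highlighter-rouge">Vt</code> are the vectors v1 and v2.</p>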
<h3 id="applications">Applications</h3>
<p><strong>Data compression</strong></p>
<p>Singular value decompositions can be used to represent data efficiently. Suppose, for instance, that we wish to transmit the following image, which consists of an array of 15 * 25 black or white pixels.</p>
<center><img src="/images/svd/1.gif" alt="image" /></center>
<p>We will represent the image as a 25 * 15 matrix (25 rows and 15 columns) in which each entry is either 0, representing a black pixel, or 1, representing a white pixel. As such, there are 375 entries in the matrix.</p>
<center><img src="/images/svd/2.gif" alt="image" /></center>
<p>If we perform a singular value decomposition on M, we find there are only three non-zero singular values.</p>
<p>σ1 = 14.72</p>
<p>σ2 = 5.22</p>
<p>σ3 = 3.31</p>
<p>Therefore, the matrix may be represented as</p>
<p>\(M = u_1 \sigma_1 v_1^T + u_2 \sigma_2 v_2^T + u_3 \sigma_3 v_3^T \)</p>
<p>This means that we have three vectors vi, each of which has 15 entries, three vectors ui, each of which has 25 entries, and three singular values σi. This implies that we may represent the matrix using only 123 numbers rather than the 375 that appear in the matrix. In this way, the singular value decomposition discovers the redundancy in the matrix and provides a format for eliminating it.</p>
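<p>The storage count can be reproduced with a short sketch. The image built here is a hypothetical stand-in (a white background, a black rectangle, and a white rectangle inside it), not necessarily the picture shown above; it is chosen so that it has only three distinct kinds of columns.</p>

```python
import numpy as np

# Hypothetical 25 x 15 black-and-white image (1 = white, 0 = black).
M = np.ones((25, 15))
M[5:20, 2:13] = 0.0   # black rectangle
M[8:17, 5:10] = 1.0   # white rectangle inside it

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Count the singular values that are not numerically zero.
k = int(np.sum(s > 1e-10))

# Keep only the first k terms sigma_i * u_i * v_i^T.
M_k = (U[:, :k] * s[:k]) @ Vt[:k]

# Storage: k vectors u_i (25 entries each), k vectors v_i (15 entries each),
# and k singular values.
stored = k * (M.shape[0] + M.shape[1] + 1)
print(k, stored, M.size)
print(np.allclose(M, M_k))
```

<p>For this stand-in image, three terms reproduce all 375 entries from 3 * (25 + 15 + 1) = 123 stored numbers.</p>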
<p><strong>Noise reduction</strong></p>
<p>Generally speaking, the large singular values point to where the interesting information is. For example, imagine we have used a scanner to enter this image into our computer, and the scanner introduces some imperfections in the image.</p>
<center><img src="/images/svd/3.gif" alt="image" /></center>
<p>We may proceed in the same way: represent the data using a 25 * 15 matrix and perform a singular value decomposition. We find the following singular values:</p>
<p>σ1 = 14.15</p>
<p>σ2 = 4.67</p>
<p>σ3 = 3.00</p>
<p>σ4 = 0.21</p>
<p>σ5 = 0.19</p>
<p>…</p>
<p>σ15 = 0.05</p>
<p>Clearly, the first three singular values are the most important, so we will assume that the others are due to noise in the image and make the approximation</p>
<p>\(M \approx u_1 \sigma_1 v_1^T + u_2 \sigma_2 v_2^T + u_3 \sigma_3 v_3^T \)</p>
<p>This leads to an improved image.</p>
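<p>Here is a rough sketch of the same idea in NumPy. Both the clean image and the noise model (small Gaussian perturbations on every pixel) are assumptions made for illustration; a real scanner will behave differently.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clean 25 x 15 image (1 = white, 0 = black).
clean = np.ones((25, 15))
clean[5:20, 2:13] = 0.0
clean[8:17, 5:10] = 1.0

# Assumed noise model: small Gaussian perturbations on every pixel.
noisy = clean + 0.05 * rng.standard_normal(clean.shape)

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)

# Keep the three dominant terms and treat the rest as noise.
k = 3
denoised = (U[:, :k] * s[:k]) @ Vt[:k]

# Frobenius-norm distance to the clean image, before and after truncation.
err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
print(s[:5])
print(err_noisy, err_denoised)
```

<p>Printing the first few singular values makes the gap between the dominant ones and the noise floor visible, which is how the cutoff <code class="highlighter-rouge">k</code> is chosen in practice.</p>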
<p><strong>References:</strong></p>
<p><strong>Dan Kalman</strong>, <em>A Singularly Valuable Decomposition: The SVD of a Matrix, The College Mathematics Journal 27 (1996), 2-23.</em></p>Looking into the geometryThe first step towards deep learning2016-10-17T00:00:00+00:002016-10-17T00:00:00+00:00https://ka1ch2n.github.io/articles/The-first-step-towards-deep-learning<p>Recently, since I plan to do some projects related to deep learning (specifically image classification, detection, etc.) in the near future, I spent some time digging into this field.</p>
<p>Building on my previous understanding of linear algebra, calculus, and statistics, plus a little knowledge of machine learning, I had a tiny bit of conceptual understanding of deep learning, though I was completely unable to turn any of that knowledge into code.</p>
<p>This is what I wanted to change.</p>
<h3 id="why-keras">Why Keras?</h3>
<p>The framework I started with is <a href="https://keras.io/">Keras</a>. Keras is an <strong>actual</strong> deep learning framework: a well-designed API that allows you to build deep learning models by clipping together high-level building blocks. And since Keras runs on top of <a href="https://www.tensorflow.org/">TensorFlow</a> or <a href="https://github.com/Theano">Theano</a>, there is no performance cost to using Keras compared to using one of these lower-level frameworks.</p>
<p>If you are familiar with <a href="http://www.numpy.org/">Numpy</a> and <a href="http://scikit-learn.org/">Scikit-Learn</a>, then a fair comparison would be to say that Theano and TensorFlow are closer to Numpy, while Keras is closer to Scikit-Learn.</p>
<p>However, the comparison isn’t perfect, since Keras is <strong><em>more flexible</em></strong> than Scikit-Learn: it allows you to define your own machine learning models, rather than just use pre-defined models.
As for <a href="http://torch.ch/">Torch</a>, Keras has a significantly larger community, and because it is Python-based we can benefit from the extensive <strong>Python ecosystem</strong> in our workflow.</p>
<h3 id="build-a-neural-network-with-keras">Build a neural network with Keras</h3>
<p>Load dataset.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="kn">from</span> <span class="nn">keras.datasets</span> <span class="kn">import</span> <span class="n">mnist</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>
<span class="kn">from</span> <span class="nn">keras.layers.core</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Dropout</span><span class="p">,</span> <span class="n">Activation</span>
<span class="kn">from</span> <span class="nn">keras.utils</span> <span class="kn">import</span> <span class="n">np_utils</span>
</code></pre>
</div>
<div class="highlighter-rouge"><pre class="highlight"><code>Using TensorFlow backend.
</code></pre>
</div>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">nb_classes</span> <span class="o">=</span> <span class="mi">10</span>
<span class="c"># the data, shuffled and split between train and test sets</span>
<span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="o">=</span> <span class="n">mnist</span><span class="o">.</span><span class="n">load_data</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="s">"X_train original shape"</span><span class="p">,</span> <span class="n">X_train</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"y_train original shape"</span><span class="p">,</span> <span class="n">y_train</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre>
</div>
<div class="highlighter-rouge"><pre class="highlight"><code>X_train original shape (60000, 28, 28)
y_train original shape (60000,)
</code></pre>
</div>
<p>Take a look at some examples of the training data.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">9</span><span class="p">):</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span> <span class="n">interpolation</span><span class="o">=</span><span class="s">'none'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">"Class {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre>
</div>
<p><img src="/images/Keras/output_2_0.png" alt="image" /></p>
<p>Format the data for training.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">X_train</span> <span class="o">=</span> <span class="n">X_train</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">60000</span><span class="p">,</span> <span class="mi">784</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">X_test</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">10000</span><span class="p">,</span> <span class="mi">784</span><span class="p">)</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">X_train</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">X_test</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">X_train</span> <span class="o">/=</span> <span class="mi">255</span>
<span class="n">X_test</span> <span class="o">/=</span> <span class="mi">255</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Training matrix shape"</span><span class="p">,</span> <span class="n">X_train</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Testing matrix shape"</span><span class="p">,</span> <span class="n">X_test</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre>
</div>
<div class="highlighter-rouge"><pre class="highlight"><code>Training matrix shape (60000, 784)
Testing matrix shape (10000, 784)
</code></pre>
</div>
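<p>To see what the reshaping and scaling do without downloading MNIST, here is a small sketch on random data. The array <code class="highlighter-rouge">images</code> is a hypothetical stand-in for <code class="highlighter-rouge">X_train</code>.</p>

```python
import numpy as np

# Hypothetical stand-in for MNIST: 5 grayscale images of 28 x 28 pixels, values 0..255.
images = np.random.randint(0, 256, size=(5, 28, 28)).astype('uint8')

# Flatten each image into a 784-vector; -1 lets NumPy infer the number of samples.
flat = images.reshape(-1, 28 * 28).astype('float32')

# Scale pixel intensities from [0, 255] to [0, 1].
flat /= 255

print(flat.shape, flat.dtype, flat.min(), flat.max())
```

<p>Passing <code class="highlighter-rouge">-1</code> as the first dimension avoids hard-coding the sample count (60000 or 10000), so the same line works for both the training and the test set.</p>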
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">Y_train</span> <span class="o">=</span> <span class="n">np_utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="n">nb_classes</span><span class="p">)</span>
<span class="n">Y_test</span> <span class="o">=</span> <span class="n">np_utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">nb_classes</span><span class="p">)</span>
</code></pre>
</div>
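<p>The call to <code class="highlighter-rouge">to_categorical</code> turns each integer label into a one-hot row vector. A minimal NumPy sketch of the same encoding, using a few hypothetical labels in place of <code class="highlighter-rouge">y_train</code>:</p>

```python
import numpy as np

nb_classes = 10
# Hypothetical labels standing in for y_train.
y = np.array([5, 0, 4, 1, 9])

# Row i of the identity matrix is the one-hot vector for class i.
Y = np.eye(nb_classes, dtype='float32')[y]

print(Y.shape)
print(Y[0])
# argmax recovers the original integer labels.
print(np.argmax(Y, axis=1))
```

<p>This is why the network ends in a 10-unit softmax layer: each output unit is compared against one position of the one-hot target.</p>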
<p>Build the neural network. Here we’ll use a simple 3-layer fully connected network.</p>
<p><img src="/images/Keras/network.png" alt="image" /></p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">784</span><span class="p">,)))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.2</span><span class="p">))</span> <span class="c"># Dropout helps protect the model from memorizing or "overfitting" the training data</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">512</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.2</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Activation</span><span class="p">(</span><span class="s">'softmax'</span><span class="p">))</span>
</code></pre>
</div>
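<p>It can help to count how many trainable parameters this network has. The sketch below does the arithmetic by hand for the three <code class="highlighter-rouge">Dense</code> layers; the <code class="highlighter-rouge">Activation</code> and <code class="highlighter-rouge">Dropout</code> layers add no parameters. The helper <code class="highlighter-rouge">dense_params</code> is a name introduced here for illustration.</p>

```python
# Layer widths of the network above: 784 inputs, two hidden layers of 512, 10 outputs.
layer_sizes = [784, 512, 512, 10]


def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


# One count per Dense layer, pairing each layer's input width with its output width.
counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

<p>Most of the parameters sit in the first two layers, which is typical for fully connected networks on flattened images.</p>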
<p>When compiling a model, Keras asks you to specify your loss function and your optimizer. The loss function we’ll use here is called categorical crossentropy, and is a loss function well-suited to comparing two probability distributions.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span><span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">"accuracy"</span><span class="p">],</span> <span class="n">optimizer</span><span class="o">=</span><span class="s">'adam'</span><span class="p">)</span>
</code></pre>
</div>
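<p>To make the loss concrete, here is a small NumPy sketch of categorical crossentropy. It is a simplified illustration of the formula, not the Keras implementation, and the example predictions are made up.</p>

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum_k y_true[k] * log(y_pred[k])."""
    # Clip to avoid log(0) for predictions that are exactly zero.
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))


# Hypothetical 3-class example: one-hot targets and two sets of predictions.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
confident = np.array([[0.9, 0.05, 0.05],
                      [0.1, 0.8, 0.1]])
unsure = np.array([[0.4, 0.3, 0.3],
                   [0.3, 0.4, 0.3]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

<p>Both sets of predictions pick the right class, but the loss is lower when more probability is placed on it. That is the signal the optimizer follows during training.</p>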
<p>Train the model.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">Y_train</span><span class="p">,</span>
          <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span> <span class="n">nb_epoch</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
          <span class="n">verbose</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
          <span class="n">validation_data</span><span class="o">=</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">Y_test</span><span class="p">))</span>
</code></pre>
</div>
<div class="highlighter-rouge"><pre class="highlight"><code>Train on 60000 samples, validate on 10000 samples
Epoch 1/4
60000/60000 [==============================] - 15s - loss: 0.2492 - acc: 0.9254 - val_loss: 0.1199 - val_acc: 0.9615
Epoch 2/4
60000/60000 [==============================] - 15s - loss: 0.0991 - acc: 0.9691 - val_loss: 0.0807 - val_acc: 0.9736
Epoch 3/4
60000/60000 [==============================] - 15s - loss: 0.0711 - acc: 0.9773 - val_loss: 0.0922 - val_acc: 0.9712
Epoch 4/4
60000/60000 [==============================] - 15s - loss: 0.0541 - acc: 0.9826 - val_loss: 0.0670 - val_acc: 0.9787
<keras.callbacks.History at 0x119233b70>
</code></pre>
</div>
<p>Finally, evaluate its performance.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">score</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">Y_test</span><span class="p">,</span><span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Test score:'</span><span class="p">,</span> <span class="n">score</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Test accuracy:'</span><span class="p">,</span> <span class="n">score</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
</code></pre>
</div>
<div class="highlighter-rouge"><pre class="highlight"><code>Test score: 0.0669518932859
Test accuracy: 0.9787
</code></pre>
</div>
<p>Inspect the output.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code>
<span class="n">predicted_classes</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict_classes</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="c"># Check which items we got right / wrong</span>
<span class="n">correct_indices</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">nonzero</span><span class="p">(</span><span class="n">predicted_classes</span> <span class="o">==</span> <span class="n">y_test</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">incorrect_indices</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">nonzero</span><span class="p">(</span><span class="n">predicted_classes</span> <span class="o">!=</span> <span class="n">y_test</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
</code></pre>
</div>
<div class="highlighter-rouge"><pre class="highlight"><code> 9952/10000 [============================>.] - ETA: 0s
</code></pre>
</div>
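<p>Under the hood, picking a class just means taking the index of the largest softmax output. The sketch below shows this with hypothetical probabilities. With a trained model, the same result could be obtained as <code class="highlighter-rouge">np.argmax(model.predict(X_test), axis=1)</code>, which may be needed on Keras releases that no longer provide <code class="highlighter-rouge">predict_classes</code>.</p>

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.6, 0.3, 0.1]])
y_true = np.array([0, 2, 1, 1])

# The predicted class is the index of the largest probability in each row.
predicted = np.argmax(probs, axis=1)

# Same bookkeeping as above: positions of right and wrong predictions.
correct_indices = np.nonzero(predicted == y_true)[0]
incorrect_indices = np.nonzero(predicted != y_true)[0]
accuracy = np.mean(predicted == y_true)
print(predicted, correct_indices, incorrect_indices, accuracy)
```

<p>The mean of the boolean comparison is the fraction of correct predictions, the same quantity that the evaluation step reports as accuracy.</p>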
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">correct</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">correct_indices</span><span class="p">[:</span><span class="mi">9</span><span class="p">]):</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">correct</span><span class="p">]</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">28</span><span class="p">,</span><span class="mi">28</span><span class="p">),</span> <span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span> <span class="n">interpolation</span><span class="o">=</span><span class="s">'none'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">"Predicted {}, Class {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">predicted_classes</span><span class="p">[</span><span class="n">correct</span><span class="p">],</span> <span class="n">y_test</span><span class="p">[</span><span class="n">correct</span><span class="p">]))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">incorrect</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">incorrect_indices</span><span class="p">[:</span><span class="mi">9</span><span class="p">]):</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">incorrect</span><span class="p">]</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">28</span><span class="p">,</span><span class="mi">28</span><span class="p">),</span> <span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span> <span class="n">interpolation</span><span class="o">=</span><span class="s">'none'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">"Predicted {}, Class {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">predicted_classes</span><span class="p">[</span><span class="n">incorrect</span><span class="p">],</span> <span class="n">y_test</span><span class="p">[</span><span class="n">incorrect</span><span class="p">]))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre>
</div>
<p><img src="/images/Keras/output_10_0.png" alt="image" /></p>
<p><img src="/images/Keras/output_10_1.png" alt="image" /></p>
<h4 id="references">References:</h4>
<p><a href="http://cs231n.github.io">CS231n</a></p>
<p><a href="https://keras.io/">Keras Documentation</a></p>Train a neural network on the MNIST.Replace data manipulation in Excel with Python2016-09-23T00:00:00+00:002016-09-23T00:00:00+00:00https://ka1ch2n.github.io/articles/Replace-data-manipulation%20in-Excel-with-Python<p>The <strong>pivot table</strong> is a powerful tool to summarize and present data in <strong>Excel</strong>. In <strong>Python</strong>, <a href="http://pandas.pydata.org/">Pandas</a> offers a function, <code class="highlighter-rouge">pivot_table</code>, which allows you to quickly convert a DataFrame to a pivot table.</p>
<p>This function is <strong>very useful</strong> but sometimes it can be <strong>tricky</strong> to remember how to use it to get the data formatted the way you need.</p>
<h1 id="read-in-the-data">Read in the data</h1>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
</code></pre>
</div>
<p>Read the sales funnel data into a <code class="highlighter-rouge">DataFrame</code>.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s">"../sales-funnel.xlsx"</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Account</th>
<th>Name</th>
<th>Rep</th>
<th>Manager</th>
<th>Product</th>
<th>Quantity</th>
<th>Price</th>
<th>Status</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>714466</td>
<td>Trantow-Barrows</td>
<td>Craig Booker</td>
<td>Debra Henley</td>
<td>CPU</td>
<td>1</td>
<td>30000</td>
<td>presented</td>
</tr>
<tr>
<th>1</th>
<td>714466</td>
<td>Trantow-Barrows</td>
<td>Craig Booker</td>
<td>Debra Henley</td>
<td>Software</td>
<td>1</td>
<td>10000</td>
<td>presented</td>
</tr>
<tr>
<th>2</th>
<td>714466</td>
<td>Trantow-Barrows</td>
<td>Craig Booker</td>
<td>Debra Henley</td>
<td>Maintenance</td>
<td>2</td>
<td>5000</td>
<td>pending</td>
</tr>
<tr>
<th>3</th>
<td>737550</td>
<td>Fritsch, Russel and Anderson</td>
<td>Craig Booker</td>
<td>Debra Henley</td>
<td>CPU</td>
<td>1</td>
<td>35000</td>
<td>declined</td>
</tr>
<tr>
<th>4</th>
<td>146832</td>
<td>Kiehn-Spinka</td>
<td>Daniel Hilton</td>
<td>Debra Henley</td>
<td>CPU</td>
<td>2</td>
<td>65000</td>
<td>won</td>
</tr>
</tbody>
</table>
</div>
<p>For convenience’s sake, I define the <code class="highlighter-rouge">Status</code> column as a <code class="highlighter-rouge">category</code> and set the order in which we’d like to view its values. This isn’t strictly required but helps us keep the order we want as we work through analyzing the data.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">df</span><span class="p">[</span><span class="s">"Status"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">"Status"</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">"category"</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">"Status"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">"Status"</span><span class="p">]</span><span class="o">.</span><span class="n">cat</span><span class="o">.</span><span class="n">set_categories</span><span class="p">([</span><span class="s">"won"</span><span class="p">,</span><span class="s">"pending"</span><span class="p">,</span><span class="s">"presented"</span><span class="p">,</span><span class="s">"declined"</span><span class="p">])</span>
</code></pre>
</div>
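<p>To see the effect of the category order, here is a small sketch on a hypothetical miniature of the <code class="highlighter-rouge">Status</code> column. It assigns the result of <code class="highlighter-rouge">set_categories</code> back to the column, which should work whether or not the installed pandas version accepts <code class="highlighter-rouge">inplace=True</code>.</p>

```python
import pandas as pd

# Hypothetical miniature of the Status column.
df = pd.DataFrame({"Status": ["presented", "won", "declined", "pending", "won"]})

df["Status"] = df["Status"].astype("category")
# Assign the result back instead of relying on inplace=True.
df["Status"] = df["Status"].cat.set_categories(["won", "pending", "presented", "declined"])

categories = list(df["Status"].cat.categories)
# Sorting follows the category order, not alphabetical order.
sorted_status = df.sort_values("Status")["Status"].tolist()
print(categories)
print(sorted_status)
```

<p>Without the explicit category order, the same sort would be alphabetical and put "declined" first.</p>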
<h1 id="pivot-the-data">Pivot the data</h1>
<p>As we build up the pivot table, I think it’s easiest to take it <strong>one step at a time</strong>. Add items one at a time and check each step to verify you are getting the results you expect.</p>
<p>At a minimum, a pivot table needs a DataFrame and an index.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Name"</span><span class="p">])</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Account</th>
<th>Price</th>
<th>Quantity</th>
</tr>
<tr>
<th>Name</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Barton LLC</th>
<td>740150.0</td>
<td>35000.0</td>
<td>1.000000</td>
</tr>
<tr>
<th>Fritsch, Russel and Anderson</th>
<td>737550.0</td>
<td>35000.0</td>
<td>1.000000</td>
</tr>
<tr>
<th>Herman LLC</th>
<td>141962.0</td>
<td>65000.0</td>
<td>2.000000</td>
</tr>
<tr>
<th>Jerde-Hilpert</th>
<td>412290.0</td>
<td>5000.0</td>
<td>2.000000</td>
</tr>
<tr>
<th>Kassulke, Ondricka and Metz</th>
<td>307599.0</td>
<td>7000.0</td>
<td>3.000000</td>
</tr>
<tr>
<th>Keeling LLC</th>
<td>688981.0</td>
<td>100000.0</td>
<td>5.000000</td>
</tr>
<tr>
<th>Kiehn-Spinka</th>
<td>146832.0</td>
<td>65000.0</td>
<td>2.000000</td>
</tr>
<tr>
<th>Koepp Ltd</th>
<td>729833.0</td>
<td>35000.0</td>
<td>2.000000</td>
</tr>
<tr>
<th>Kulas Inc</th>
<td>218895.0</td>
<td>25000.0</td>
<td>1.500000</td>
</tr>
<tr>
<th>Purdy-Kunde</th>
<td>163416.0</td>
<td>30000.0</td>
<td>1.000000</td>
</tr>
<tr>
<th>Stokes LLC</th>
<td>239344.0</td>
<td>7500.0</td>
<td>1.000000</td>
</tr>
<tr>
<th>Trantow-Barrows</th>
<td>714466.0</td>
<td>15000.0</td>
<td>1.333333</td>
</tr>
</tbody>
</table>
</div>
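<p>When no aggregation function is given, <code class="highlighter-rouge">pivot_table</code> averages each numeric column per index value, which is why fractional quantities such as 1.333333 appear above. The sketch below checks this against an explicit group-by on a few hypothetical rows shaped like the sales funnel data.</p>

```python
import pandas as pd

# Hypothetical rows shaped like the sales funnel data (numeric value columns only).
df = pd.DataFrame({
    "Name": ["Trantow-Barrows", "Trantow-Barrows", "Trantow-Barrows", "Kiehn-Spinka"],
    "Price": [30000, 10000, 5000, 65000],
    "Quantity": [1, 1, 2, 2],
})

# With no aggfunc given, pivot_table averages each numeric column per index value.
pivot = pd.pivot_table(df, index=["Name"])

# The same numbers via an explicit group-by.
grouped = df.groupby("Name")[["Price", "Quantity"]].mean()

print(pivot)
print(grouped)
```

<p>Thinking of a pivot table as a group-by followed by an aggregation makes it easier to predict what each additional argument will do.</p>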
<p>You can have multiple indexes as well. In fact, most of the pivot_table args can take multiple values via a list.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Name"</span><span class="p">,</span><span class="s">"Rep"</span><span class="p">,</span><span class="s">"Manager"</span><span class="p">])</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th></th>
<th>Account</th>
<th>Price</th>
<th>Quantity</th>
</tr>
<tr>
<th>Name</th>
<th>Rep</th>
<th>Manager</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Barton LLC</th>
<th>John Smith</th>
<th>Debra Henley</th>
<td>740150.0</td>
<td>35000.0</td>
<td>1.000000</td>
</tr>
<tr>
<th>Fritsch, Russel and Anderson</th>
<th>Craig Booker</th>
<th>Debra Henley</th>
<td>737550.0</td>
<td>35000.0</td>
<td>1.000000</td>
</tr>
<tr>
<th>Herman LLC</th>
<th>Cedric Moss</th>
<th>Fred Anderson</th>
<td>141962.0</td>
<td>65000.0</td>
<td>2.000000</td>
</tr>
<tr>
<th>Jerde-Hilpert</th>
<th>John Smith</th>
<th>Debra Henley</th>
<td>412290.0</td>
<td>5000.0</td>
<td>2.000000</td>
</tr>
<tr>
<th>Kassulke, Ondricka and Metz</th>
<th>Wendy Yule</th>
<th>Fred Anderson</th>
<td>307599.0</td>
<td>7000.0</td>
<td>3.000000</td>
</tr>
<tr>
<th>Keeling LLC</th>
<th>Wendy Yule</th>
<th>Fred Anderson</th>
<td>688981.0</td>
<td>100000.0</td>
<td>5.000000</td>
</tr>
<tr>
<th>Kiehn-Spinka</th>
<th>Daniel Hilton</th>
<th>Debra Henley</th>
<td>146832.0</td>
<td>65000.0</td>
<td>2.000000</td>
</tr>
<tr>
<th>Koepp Ltd</th>
<th>Wendy Yule</th>
<th>Fred Anderson</th>
<td>729833.0</td>
<td>35000.0</td>
<td>2.000000</td>
</tr>
<tr>
<th>Kulas Inc</th>
<th>Daniel Hilton</th>
<th>Debra Henley</th>
<td>218895.0</td>
<td>25000.0</td>
<td>1.500000</td>
</tr>
<tr>
<th>Purdy-Kunde</th>
<th>Cedric Moss</th>
<th>Fred Anderson</th>
<td>163416.0</td>
<td>30000.0</td>
<td>1.000000</td>
</tr>
<tr>
<th>Stokes LLC</th>
<th>Cedric Moss</th>
<th>Fred Anderson</th>
<td>239344.0</td>
<td>7500.0</td>
<td>1.000000</td>
</tr>
<tr>
<th>Trantow-Barrows</th>
<th>Craig Booker</th>
<th>Debra Henley</th>
<td>714466.0</td>
<td>15000.0</td>
<td>1.333333</td>
</tr>
</tbody>
</table>
</div>
<p>This is interesting but not particularly useful. What we probably want is to look at this by Manager and Rep.
It’s easy enough to do by changing the index.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Rep"</span><span class="p">])</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>Account</th>
<th>Price</th>
<th>Quantity</th>
</tr>
<tr>
<th>Manager</th>
<th>Rep</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="3" valign="top">Debra Henley</th>
<th>Craig Booker</th>
<td>720237.0</td>
<td>20000.000000</td>
<td>1.250000</td>
</tr>
<tr>
<th>Daniel Hilton</th>
<td>194874.0</td>
<td>38333.333333</td>
<td>1.666667</td>
</tr>
<tr>
<th>John Smith</th>
<td>576220.0</td>
<td>20000.000000</td>
<td>1.500000</td>
</tr>
<tr>
<th rowspan="2" valign="top">Fred Anderson</th>
<th>Cedric Moss</th>
<td>196016.5</td>
<td>27500.000000</td>
<td>1.250000</td>
</tr>
<tr>
<th>Wendy Yule</th>
<td>614061.5</td>
<td>44250.000000</td>
<td>3.000000</td>
</tr>
</tbody>
</table>
</div>
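<p>To see what the list passed to <code class="highlighter-rouge">index</code> actually does, here is a minimal sketch on a tiny made-up data set. The rows below are hypothetical (only the column names mirror the ones used above), and <code class="highlighter-rouge">values</code> is given explicitly so that only a numeric column gets averaged.</p>

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the article.
df = pd.DataFrame({
    "Manager": ["Debra Henley", "Debra Henley", "Debra Henley", "Fred Anderson"],
    "Rep": ["Craig Booker", "Craig Booker", "John Smith", "Wendy Yule"],
    "Price": [30000, 10000, 5000, 7000],
})

# Each name in `index` becomes one level of the resulting row MultiIndex.
table = pd.pivot_table(df, index=["Manager", "Rep"], values=["Price"])

print(list(table.index.names))
# One row per (Manager, Rep) pair; Price is averaged by default.
print(table.loc[("Debra Henley", "Craig Booker"), "Price"])
```

<p>The order of the names in the list decides the nesting, so swapping them would group reps first and managers second.</p>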
<p>Now we start to get a glimpse of what a pivot table can do for us.</p>
<p>For this purpose, the <code class="highlighter-rouge">Account</code> and <code class="highlighter-rouge">Quantity</code> columns aren’t really useful. Let’s remove them by explicitly listing the columns we care about in the <code class="highlighter-rouge">values</code> parameter.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Rep"</span><span class="p">],</span><span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"Price"</span><span class="p">])</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>Price</th>
</tr>
<tr>
<th>Manager</th>
<th>Rep</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="3" valign="top">Debra Henley</th>
<th>Craig Booker</th>
<td>20000</td>
</tr>
<tr>
<th>Daniel Hilton</th>
<td>38333</td>
</tr>
<tr>
<th>John Smith</th>
<td>20000</td>
</tr>
<tr>
<th rowspan="2" valign="top">Fred Anderson</th>
<th>Cedric Moss</th>
<td>27500</td>
</tr>
<tr>
<th>Wendy Yule</th>
<td>44250</td>
</tr>
</tbody>
</table>
</div>
<p>By default, <code class="highlighter-rouge">pivot_table</code> averages the <code class="highlighter-rouge">Price</code> column, but we can compute a count or a sum instead. Switching is simple using <code class="highlighter-rouge">aggfunc</code>.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Rep"</span><span class="p">],</span><span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"Price"</span><span class="p">],</span><span class="n">aggfunc</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">)</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>Price</th>
</tr>
<tr>
<th>Manager</th>
<th>Rep</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="3" valign="top">Debra Henley</th>
<th>Craig Booker</th>
<td>80000</td>
</tr>
<tr>
<th>Daniel Hilton</th>
<td>115000</td>
</tr>
<tr>
<th>John Smith</th>
<td>40000</td>
</tr>
<tr>
<th rowspan="2" valign="top">Fred Anderson</th>
<th>Cedric Moss</th>
<td>110000</td>
</tr>
<tr>
<th>Wendy Yule</th>
<td>177000</td>
</tr>
</tbody>
</table>
</div>
<p><code class="highlighter-rouge">aggfunc</code> can take a list of functions. Let’s try numpy’s <code class="highlighter-rouge">mean</code> together with <code class="highlighter-rouge">len</code> to get a count.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Rep"</span><span class="p">],</span><span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"Price"</span><span class="p">],</span><span class="n">aggfunc</span><span class="o">=</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">,</span><span class="nb">len</span><span class="p">])</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th></th>
<th>mean</th>
<th>len</th>
</tr>
<tr>
<th></th>
<th></th>
<th>Price</th>
<th>Price</th>
</tr>
<tr>
<th>Manager</th>
<th>Rep</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="3" valign="top">Debra Henley</th>
<th>Craig Booker</th>
<td>20000</td>
<td>4</td>
</tr>
<tr>
<th>Daniel Hilton</th>
<td>38333</td>
<td>3</td>
</tr>
<tr>
<th>John Smith</th>
<td>20000</td>
<td>2</td>
</tr>
<tr>
<th rowspan="2" valign="top">Fred Anderson</th>
<th>Cedric Moss</th>
<td>27500</td>
<td>4</td>
</tr>
<tr>
<th>Wendy Yule</th>
<td>44250</td>
<td>4</td>
</tr>
</tbody>
</table>
</div>
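<p>A small self-contained sketch of the same idea, again on hypothetical rows. It passes the aggregations as the strings <code class="highlighter-rouge">"mean"</code>, <code class="highlighter-rouge">"sum"</code> and <code class="highlighter-rouge">"count"</code> rather than numpy functions; that is my substitution, not what the calls above use, but pandas accepts both forms.</p>

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the article.
df = pd.DataFrame({
    "Manager": ["Debra Henley", "Debra Henley", "Debra Henley",
                "Fred Anderson", "Fred Anderson"],
    "Rep": ["Craig Booker", "Craig Booker", "John Smith",
            "Wendy Yule", "Wendy Yule"],
    "Price": [30000, 10000, 5000, 7000, 65000],
})

# A list of aggregations yields one top-level column group per function.
summary = pd.pivot_table(
    df,
    index=["Manager", "Rep"],
    values=["Price"],
    aggfunc=["mean", "sum", "count"],
)

print(summary)
# Columns are (function, value) pairs, so a single cell needs both labels.
print(summary.loc[("Debra Henley", "Craig Booker"), ("sum", "Price")])
```

<p>Because the columns become a two-level index, a single cell is addressed with a (row tuple, column tuple) pair.</p>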
<p>If we want to see sales broken down by <code class="highlighter-rouge">Product</code>, the <code class="highlighter-rouge">columns</code> parameter lets us define one or more columns.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Rep"</span><span class="p">],</span><span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"Price"</span><span class="p">],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Product"</span><span class="p">],</span><span class="n">aggfunc</span><span class="o">=</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">])</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th></th>
<th colspan="4" halign="left">sum</th>
</tr>
<tr>
<th></th>
<th></th>
<th colspan="4" halign="left">Price</th>
</tr>
<tr>
<th></th>
<th>Product</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
</tr>
<tr>
<th>Manager</th>
<th>Rep</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="3" valign="top">Debra Henley</th>
<th>Craig Booker</th>
<td>65000.0</td>
<td>5000.0</td>
<td>NaN</td>
<td>10000.0</td>
</tr>
<tr>
<th>Daniel Hilton</th>
<td>105000.0</td>
<td>NaN</td>
<td>NaN</td>
<td>10000.0</td>
</tr>
<tr>
<th>John Smith</th>
<td>35000.0</td>
<td>5000.0</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th rowspan="2" valign="top">Fred Anderson</th>
<th>Cedric Moss</th>
<td>95000.0</td>
<td>5000.0</td>
<td>NaN</td>
<td>10000.0</td>
</tr>
<tr>
<th>Wendy Yule</th>
<td>165000.0</td>
<td>7000.0</td>
<td>5000.0</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
<p>The NaNs are a bit distracting. If we want to remove them, we can use <code class="highlighter-rouge">fill_value</code> to set them to <code class="highlighter-rouge">0</code>.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Rep"</span><span class="p">],</span><span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"Price"</span><span class="p">],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Product"</span><span class="p">],</span><span class="n">aggfunc</span><span class="o">=</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">],</span><span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th></th>
<th colspan="4" halign="left">sum</th>
</tr>
<tr>
<th></th>
<th></th>
<th colspan="4" halign="left">Price</th>
</tr>
<tr>
<th></th>
<th>Product</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
</tr>
<tr>
<th>Manager</th>
<th>Rep</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="3" valign="top">Debra Henley</th>
<th>Craig Booker</th>
<td>65000</td>
<td>5000</td>
<td>0</td>
<td>10000</td>
</tr>
<tr>
<th>Daniel Hilton</th>
<td>105000</td>
<td>0</td>
<td>0</td>
<td>10000</td>
</tr>
<tr>
<th>John Smith</th>
<td>35000</td>
<td>5000</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th rowspan="2" valign="top">Fred Anderson</th>
<th>Cedric Moss</th>
<td>95000</td>
<td>5000</td>
<td>0</td>
<td>10000</td>
</tr>
<tr>
<th>Wendy Yule</th>
<td>165000</td>
<td>7000</td>
<td>5000</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
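<p>Where the NaNs come from is easier to see on a tiny example. In this sketch (hypothetical rows, string <code class="highlighter-rouge">"sum"</code> as the aggregation), a rep who never sold a product has no rows for that combination, so the cell is empty until <code class="highlighter-rouge">fill_value</code> replaces it.</p>

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the article.
df = pd.DataFrame({
    "Rep": ["Craig Booker", "Craig Booker", "John Smith", "John Smith"],
    "Product": ["CPU", "Software", "CPU", "Maintenance"],
    "Price": [30000, 10000, 35000, 5000],
})

args = dict(index=["Rep"], columns=["Product"], values=["Price"], aggfunc="sum")

# Without fill_value, unobserved (Rep, Product) pairs show up as NaN.
sparse = pd.pivot_table(df, **args)
# With fill_value=0, the same cells are reported as 0.
filled = pd.pivot_table(df, fill_value=0, **args)

print(sparse)
print(filled)
```

<p>Keep in mind that a filled <code class="highlighter-rouge">0</code> means “no rows for this combination”, which is not always the same thing as a recorded sale of zero.</p>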
<p>It would be useful to add the quantity as well. Add <code class="highlighter-rouge">Quantity</code> to the values list.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Rep"</span><span class="p">],</span><span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"Price"</span><span class="p">,</span><span class="s">"Quantity"</span><span class="p">],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Product"</span><span class="p">],</span><span class="n">aggfunc</span><span class="o">=</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">],</span><span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th></th>
<th colspan="8" halign="left">sum</th>
</tr>
<tr>
<th></th>
<th></th>
<th colspan="4" halign="left">Price</th>
<th colspan="4" halign="left">Quantity</th>
</tr>
<tr>
<th></th>
<th>Product</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
</tr>
<tr>
<th>Manager</th>
<th>Rep</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="3" valign="top">Debra Henley</th>
<th>Craig Booker</th>
<td>65000</td>
<td>5000</td>
<td>0</td>
<td>10000</td>
<td>2</td>
<td>2</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>Daniel Hilton</th>
<td>105000</td>
<td>0</td>
<td>0</td>
<td>10000</td>
<td>4</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>John Smith</th>
<td>35000</td>
<td>5000</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th rowspan="2" valign="top">Fred Anderson</th>
<th>Cedric Moss</th>
<td>95000</td>
<td>5000</td>
<td>0</td>
<td>10000</td>
<td>3</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>Wendy Yule</th>
<td>165000</td>
<td>7000</td>
<td>5000</td>
<td>0</td>
<td>7</td>
<td>3</td>
<td>2</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
<p>What’s interesting is that you can move items to the index to get a different visual representation. We can add <code class="highlighter-rouge">Product</code> to the index.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Rep"</span><span class="p">,</span><span class="s">"Product"</span><span class="p">],</span>
<span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"Price"</span><span class="p">,</span><span class="s">"Quantity"</span><span class="p">],</span><span class="n">aggfunc</span><span class="o">=</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">],</span><span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th colspan="2" halign="left">sum</th>
</tr>
<tr>
<th></th>
<th></th>
<th></th>
<th>Price</th>
<th>Quantity</th>
</tr>
<tr>
<th>Manager</th>
<th>Rep</th>
<th>Product</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="7" valign="top">Debra Henley</th>
<th rowspan="3" valign="top">Craig Booker</th>
<th>CPU</th>
<td>65000</td>
<td>2</td>
</tr>
<tr>
<th>Maintenance</th>
<td>5000</td>
<td>2</td>
</tr>
<tr>
<th>Software</th>
<td>10000</td>
<td>1</td>
</tr>
<tr>
<th rowspan="2" valign="top">Daniel Hilton</th>
<th>CPU</th>
<td>105000</td>
<td>4</td>
</tr>
<tr>
<th>Software</th>
<td>10000</td>
<td>1</td>
</tr>
<tr>
<th rowspan="2" valign="top">John Smith</th>
<th>CPU</th>
<td>35000</td>
<td>1</td>
</tr>
<tr>
<th>Maintenance</th>
<td>5000</td>
<td>2</td>
</tr>
<tr>
<th rowspan="6" valign="top">Fred Anderson</th>
<th rowspan="3" valign="top">Cedric Moss</th>
<th>CPU</th>
<td>95000</td>
<td>3</td>
</tr>
<tr>
<th>Maintenance</th>
<td>5000</td>
<td>1</td>
</tr>
<tr>
<th>Software</th>
<td>10000</td>
<td>1</td>
</tr>
<tr>
<th rowspan="3" valign="top">Wendy Yule</th>
<th>CPU</th>
<td>165000</td>
<td>7</td>
</tr>
<tr>
<th>Maintenance</th>
<td>7000</td>
<td>3</td>
</tr>
<tr>
<th>Monitor</th>
<td>5000</td>
<td>2</td>
</tr>
</tbody>
</table>
</div>
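<p>Both layouts hold the same numbers; only the shape differs. Here is a rough sketch on hypothetical rows comparing the wide form (<code class="highlighter-rouge">Product</code> in <code class="highlighter-rouge">columns</code>) with the tall form (<code class="highlighter-rouge">Product</code> in <code class="highlighter-rouge">index</code>). The names <code class="highlighter-rouge">wide</code> and <code class="highlighter-rouge">tall</code> are mine.</p>

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the article.
df = pd.DataFrame({
    "Rep": ["Craig Booker", "Craig Booker", "John Smith", "John Smith"],
    "Product": ["CPU", "Software", "CPU", "Maintenance"],
    "Price": [30000, 10000, 35000, 5000],
})

# Wide: one row per Rep, one column per Product.
wide = pd.pivot_table(df, index=["Rep"], columns=["Product"],
                      values=["Price"], aggfunc="sum", fill_value=0)

# Tall: one row per observed (Rep, Product) pair.
tall = pd.pivot_table(df, index=["Rep", "Product"],
                      values=["Price"], aggfunc="sum")

print(wide.shape, tall.shape)
print(wide.loc["Craig Booker", ("Price", "CPU")],
      tall.loc[("Craig Booker", "CPU"), "Price"])
```

<p>The tall form only lists combinations that actually occur in the data, which is why it has no zero-filled rows.</p>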
<p>For this data set, this representation makes more sense. Now, what if I want to see some totals? <code class="highlighter-rouge">margins=True</code> does that for us.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Rep"</span><span class="p">,</span><span class="s">"Product"</span><span class="p">],</span>
<span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"Price"</span><span class="p">,</span><span class="s">"Quantity"</span><span class="p">],</span>
<span class="n">aggfunc</span><span class="o">=</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">,</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">],</span><span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span><span class="n">margins</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th colspan="2" halign="left">sum</th>
<th colspan="2" halign="left">mean</th>
</tr>
<tr>
<th></th>
<th></th>
<th></th>
<th>Price</th>
<th>Quantity</th>
<th>Price</th>
<th>Quantity</th>
</tr>
<tr>
<th>Manager</th>
<th>Rep</th>
<th>Product</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="7" valign="top">Debra Henley</th>
<th rowspan="3" valign="top">Craig Booker</th>
<th>CPU</th>
<td>65000.0</td>
<td>2.0</td>
<td>32500.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th>Maintenance</th>
<td>5000.0</td>
<td>2.0</td>
<td>5000.000000</td>
<td>2.000000</td>
</tr>
<tr>
<th>Software</th>
<td>10000.0</td>
<td>1.0</td>
<td>10000.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th rowspan="2" valign="top">Daniel Hilton</th>
<th>CPU</th>
<td>105000.0</td>
<td>4.0</td>
<td>52500.000000</td>
<td>2.000000</td>
</tr>
<tr>
<th>Software</th>
<td>10000.0</td>
<td>1.0</td>
<td>10000.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th rowspan="2" valign="top">John Smith</th>
<th>CPU</th>
<td>35000.0</td>
<td>1.0</td>
<td>35000.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th>Maintenance</th>
<td>5000.0</td>
<td>2.0</td>
<td>5000.000000</td>
<td>2.000000</td>
</tr>
<tr>
<th rowspan="6" valign="top">Fred Anderson</th>
<th rowspan="3" valign="top">Cedric Moss</th>
<th>CPU</th>
<td>95000.0</td>
<td>3.0</td>
<td>47500.000000</td>
<td>1.500000</td>
</tr>
<tr>
<th>Maintenance</th>
<td>5000.0</td>
<td>1.0</td>
<td>5000.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th>Software</th>
<td>10000.0</td>
<td>1.0</td>
<td>10000.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th rowspan="3" valign="top">Wendy Yule</th>
<th>CPU</th>
<td>165000.0</td>
<td>7.0</td>
<td>82500.000000</td>
<td>3.500000</td>
</tr>
<tr>
<th>Maintenance</th>
<td>7000.0</td>
<td>3.0</td>
<td>7000.000000</td>
<td>3.000000</td>
</tr>
<tr>
<th>Monitor</th>
<td>5000.0</td>
<td>2.0</td>
<td>5000.000000</td>
<td>2.000000</td>
</tr>
<tr>
<th>All</th>
<th></th>
<th></th>
<td>522000.0</td>
<td>30.0</td>
<td>30705.882353</td>
<td>1.764706</td>
</tr>
</tbody>
</table>
</div>
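<p>One detail about the <code class="highlighter-rouge">All</code> row is worth checking on a small example: for <code class="highlighter-rouge">mean</code>, it is the mean over all underlying rows, not the mean of the group means. The sketch below uses hypothetical rows and string aggregation names.</p>

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the article.
df = pd.DataFrame({
    "Manager": ["Debra Henley", "Debra Henley", "Debra Henley",
                "Fred Anderson", "Fred Anderson"],
    "Price": [30000, 10000, 5000, 7000, 65000],
})

totals = pd.pivot_table(df, index=["Manager"], values=["Price"],
                        aggfunc=["sum", "mean"], margins=True)
print(totals)

# Grand total and overall mean, computed directly for comparison.
print(df["Price"].sum(), df["Price"].mean())

# The average of the per-manager means is a different number.
group_means = df.groupby("Manager")["Price"].mean()
print(group_means.mean())
```

<p>Here the overall mean is 23400 while the mean of the two manager means is 25500, and the <code class="highlighter-rouge">All</code> row reports the former.</p>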
<p>Let’s move the analysis up a level and look at our pipeline at the manager level. Notice how the <code class="highlighter-rouge">Status</code> is ordered based on our earlier category definition.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Status"</span><span class="p">],</span><span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"Price"</span><span class="p">],</span>
<span class="n">aggfunc</span><span class="o">=</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">],</span><span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span><span class="n">margins</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th></th>
<th>sum</th>
</tr>
<tr>
<th></th>
<th></th>
<th>Price</th>
</tr>
<tr>
<th>Manager</th>
<th>Status</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="4" valign="top">Debra Henley</th>
<th>won</th>
<td>65000.0</td>
</tr>
<tr>
<th>pending</th>
<td>50000.0</td>
</tr>
<tr>
<th>presented</th>
<td>50000.0</td>
</tr>
<tr>
<th>declined</th>
<td>70000.0</td>
</tr>
<tr>
<th rowspan="4" valign="top">Fred Anderson</th>
<th>won</th>
<td>172000.0</td>
</tr>
<tr>
<th>pending</th>
<td>5000.0</td>
</tr>
<tr>
<th>presented</th>
<td>45000.0</td>
</tr>
<tr>
<th>declined</th>
<td>65000.0</td>
</tr>
<tr>
<th>All</th>
<th></th>
<td>522000.0</td>
</tr>
</tbody>
</table>
</div>
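<p>The row order above comes from the category definition rather than from alphabetical sorting. As a sketch of the mechanism, assuming the column was turned into a categorical with the order won, pending, presented, declined (the exact earlier definition is not repeated here, and the rows are hypothetical):</p>

```python
import pandas as pd

# Hypothetical sample data, deliberately not in pipeline order.
df = pd.DataFrame({
    "Status": ["presented", "declined", "won", "pending", "won"],
    "Price": [30000, 65000, 7000, 5000, 35000],
})

# Assumed category order; alphabetical order would start with "declined".
order = ["won", "pending", "presented", "declined"]
df["Status"] = pd.Categorical(df["Status"], categories=order)

# observed=True is passed explicitly so only categories present in the
# data are shown, whatever the default is in the installed pandas version.
table = pd.pivot_table(df, index=["Status"], values=["Price"],
                       aggfunc="sum", observed=True)

print(list(table.index))
print(table)
```

<p>With a plain string column the same call would sort the statuses alphabetically.</p>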
<p>A really handy feature is the ability to pass a dictionary to <code class="highlighter-rouge">aggfunc</code> so you can perform different functions on each of the values you select.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Status"</span><span class="p">],</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Product"</span><span class="p">],</span><span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"Quantity"</span><span class="p">,</span><span class="s">"Price"</span><span class="p">],</span>
<span class="n">aggfunc</span><span class="o">=</span><span class="p">{</span><span class="s">"Quantity"</span><span class="p">:</span><span class="nb">len</span><span class="p">,</span><span class="s">"Price"</span><span class="p">:</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">},</span><span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th></th>
<th colspan="4" halign="left">Price</th>
<th colspan="4" halign="left">Quantity</th>
</tr>
<tr>
<th></th>
<th>Product</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
</tr>
<tr>
<th>Manager</th>
<th>Status</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="4" valign="top">Debra Henley</th>
<th>won</th>
<td>65000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>pending</th>
<td>40000</td>
<td>10000</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>presented</th>
<td>30000</td>
<td>0</td>
<td>0</td>
<td>20000</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<th>declined</th>
<td>70000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th rowspan="4" valign="top">Fred Anderson</th>
<th>won</th>
<td>165000</td>
<td>7000</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>pending</th>
<td>0</td>
<td>5000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>presented</th>
<td>30000</td>
<td>0</td>
<td>5000</td>
<td>10000</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<th>declined</th>
<td>65000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
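<p>Stripped down to the essentials, the dictionary form maps each value column to its own aggregation. A minimal sketch on hypothetical rows, with <code class="highlighter-rouge">"sum"</code> written as a string in place of the numpy function:</p>

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the article.
df = pd.DataFrame({
    "Manager": ["Debra Henley", "Debra Henley", "Debra Henley",
                "Fred Anderson", "Fred Anderson"],
    "Price": [30000, 10000, 5000, 7000, 65000],
    "Quantity": [1, 2, 2, 3, 1],
})

# Keys are value columns; each one gets its own aggregation.
table = pd.pivot_table(
    df,
    index=["Manager"],
    values=["Price", "Quantity"],
    aggfunc={"Price": "sum", "Quantity": len},
)

print(table)
```

<p>Note that <code class="highlighter-rouge">len</code> counts rows per group, so the <code class="highlighter-rouge">Quantity</code> column here is the number of deals, not the number of units sold.</p>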
<p>You can provide a list of aggregation functions to apply to each value too:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">table</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Manager"</span><span class="p">,</span><span class="s">"Status"</span><span class="p">],</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"Product"</span><span class="p">],</span><span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"Quantity"</span><span class="p">,</span><span class="s">"Price"</span><span class="p">],</span>
<span class="n">aggfunc</span><span class="o">=</span><span class="p">{</span><span class="s">"Quantity"</span><span class="p">:</span><span class="nb">len</span><span class="p">,</span><span class="s">"Price"</span><span class="p">:[</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">,</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">]},</span><span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">table</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th></th>
<th colspan="8" halign="left">Price</th>
<th colspan="4" halign="left">Quantity</th>
</tr>
<tr>
<th></th>
<th></th>
<th colspan="4" halign="left">mean</th>
<th colspan="4" halign="left">sum</th>
<th colspan="4" halign="left">len</th>
</tr>
<tr>
<th></th>
<th>Product</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
</tr>
<tr>
<th>Manager</th>
<th>Status</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="4" valign="top">Debra Henley</th>
<th>won</th>
<td>65000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>65000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>pending</th>
<td>40000</td>
<td>5000</td>
<td>0</td>
<td>0</td>
<td>40000</td>
<td>10000</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>presented</th>
<td>30000</td>
<td>0</td>
<td>0</td>
<td>10000</td>
<td>30000</td>
<td>0</td>
<td>0</td>
<td>20000</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<th>declined</th>
<td>35000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>70000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th rowspan="4" valign="top">Fred Anderson</th>
<th>won</th>
<td>82500</td>
<td>7000</td>
<td>0</td>
<td>0</td>
<td>165000</td>
<td>7000</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>pending</th>
<td>0</td>
<td>5000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>5000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>presented</th>
<td>30000</td>
<td>0</td>
<td>5000</td>
<td>10000</td>
<td>30000</td>
<td>0</td>
<td>5000</td>
<td>10000</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<th>declined</th>
<td>65000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>65000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
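<p>Once a dictionary entry holds a list, the result gets two-level columns of the form (value, function). The following sketch, on hypothetical rows and with string aggregation names, shows how I would pull pieces back out of such a table.</p>

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the article.
df = pd.DataFrame({
    "Manager": ["Debra Henley", "Debra Henley", "Debra Henley",
                "Fred Anderson", "Fred Anderson"],
    "Price": [30000, 10000, 5000, 7000, 65000],
    "Quantity": [1, 2, 2, 3, 1],
})

table = pd.pivot_table(
    df,
    index=["Manager"],
    values=["Price", "Quantity"],
    aggfunc={"Price": ["sum", "mean"], "Quantity": "sum"},
)

# Columns are (value, function) pairs.
print(table.columns.tolist())

# Selecting the top level returns just the Price aggregations.
price_only = table["Price"]
print(price_only)
```

<p>Selecting <code class="highlighter-rouge">table["Price"]</code> drops the outer level, which is handy when only one value column is needed downstream.</p>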
<p>It can look daunting to pull this all together at once, but as soon as you start playing with the data and slowly adding items, you get a clear understanding of how it works.</p>
<h1 id="pivot-table-filtering">Pivot Table Filtering</h1>
<p>Once you have generated your data, it is in a <code class="highlighter-rouge">DataFrame</code> so you can filter on it using your normal DataFrame functions.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">table</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="s">'Manager == ["Debra Henley"]'</span><span class="p">)</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th></th>
<th colspan="8" halign="left">Price</th>
<th colspan="4" halign="left">Quantity</th>
</tr>
<tr>
<th></th>
<th></th>
<th colspan="4" halign="left">mean</th>
<th colspan="4" halign="left">sum</th>
<th colspan="4" halign="left">len</th>
</tr>
<tr>
<th></th>
<th>Product</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
</tr>
<tr>
<th>Manager</th>
<th>Status</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="4" valign="top">Debra Henley</th>
<th>won</th>
<td>65000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>65000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>pending</th>
<td>40000</td>
<td>5000</td>
<td>0</td>
<td>0</td>
<td>40000</td>
<td>10000</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>presented</th>
<td>30000</td>
<td>0</td>
<td>0</td>
<td>10000</td>
<td>30000</td>
<td>0</td>
<td>0</td>
<td>20000</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<th>declined</th>
<td>35000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>70000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">table</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="s">'Status == ["pending","won"]'</span><span class="p">)</span>
</code></pre>
</div>
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th></th>
<th colspan="8" halign="left">Price</th>
<th colspan="4" halign="left">Quantity</th>
</tr>
<tr>
<th></th>
<th></th>
<th colspan="4" halign="left">mean</th>
<th colspan="4" halign="left">sum</th>
<th colspan="4" halign="left">len</th>
</tr>
<tr>
<th></th>
<th>Product</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
<th>CPU</th>
<th>Maintenance</th>
<th>Monitor</th>
<th>Software</th>
</tr>
<tr>
<th>Manager</th>
<th>Status</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="2" valign="top">Debra Henley</th>
<th>won</th>
<td>65000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>65000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>pending</th>
<td>40000</td>
<td>5000</td>
<td>0</td>
<td>0</td>
<td>40000</td>
<td>10000</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th rowspan="2" valign="top">Fred Anderson</th>
<th>won</th>
<td>82500</td>
<td>7000</td>
<td>0</td>
<td>0</td>
<td>165000</td>
<td>7000</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>pending</th>
<td>0</td>
<td>5000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>5000</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
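<p>The reason these queries work is that <code class="highlighter-rouge">query</code> can refer to index level names as well as columns. Here is a compact sketch on hypothetical rows, using single-level columns to keep the output small.</p>

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the article.
df = pd.DataFrame({
    "Manager": ["Debra Henley", "Debra Henley", "Debra Henley",
                "Fred Anderson", "Fred Anderson"],
    "Status": ["won", "pending", "declined", "won", "presented"],
    "Price": [65000, 40000, 35000, 7000, 30000],
})

table = pd.pivot_table(df, index=["Manager", "Status"],
                       values=["Price"], aggfunc="sum")

# Index level names can be used inside the query expression.
debra = table.query('Manager == "Debra Henley"')
open_or_won = table.query('Status in ["pending", "won"]')

print(debra)
print(open_or_won)
```

<p>Since the result is an ordinary DataFrame, the filtered tables can be chained into further steps such as sorting or exporting.</p>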
<p>This is a brief walk-through on how to use <strong>pivot tables</strong> on your data sets.</p>
Use Pivot table in Pandas
Use LaTeX on Mac in an elegant way
2016-09-10T00:00:00+00:00
https://ka1ch2n.github.io/articles/Use-LaTeX-on-Mac-in-an-elegant-way
<p>Recently I got sick of using <a href="https://products.office.com/en-us/mac/microsoft-office-for-mac">Word for Mac</a> (though the 2016 version is much better) to edit CVs and articles, so I quickly decided to switch to LaTeX, which is more flexible and logical.</p>
<p>I work most of the time on MacBook. The most popular LaTeX distribution for Mac is <a href="https://tug.org/mactex/">MacTeX</a> (for Windows: <a href="http://miktex.org/">MiKTeX</a> or <a href="https://www.tug.org/texlive/">TeXlive</a>).</p>
<p>Once I had this installed I needed an editor. <a href="https://www.tug.org/texworks/">TeXworks</a> is okay, but it is not a very decent environment and it is hard to manage documents with more than one <code class="highlighter-rouge">.tex</code> file either. So the idea of using <a href="https://www.sublimetext.com/">Sublime Text</a> (my favourite editor) just popped into my head. As it says on the website, Sublime Text is</p>
<blockquote>
<p>“The text editor you’ll fall in love with.”</p>
</blockquote>
<p>It has a neat environment with a huge number of plugins and, most of all, it is fast.</p>
<p><a href="http://skim-app.sourceforge.net/">Skim</a> is a PDF viewer that detects .pdf updates automatically. Moreover, <a href="http://skim-app.sourceforge.net/">Skim</a> can be integrated with <a href="https://www.sublimetext.com/">Sublime Text</a> so that it checks for updates every time you run a build in <a href="https://www.sublimetext.com/">Sublime Text</a>.</p>
<p>With the tools chosen, let’s get them working on the Mac.</p>
<h3 id="latex--sublimetext--skim-setup">LaTeX + SublimeText + Skim setup</h3>
<h4 id="step-0">Step 0</h4>
<p>Install a <strong>LaTeX</strong> distribution (for Mac OS X: MacTeX; for Windows: MiKTeX or TeX Live).</p>
<h4 id="step-1">Step 1</h4>
<p>Install <strong>SublimeText</strong></p>
<p>Optionally: install SublimeText Package Control (if you haven’t done that already), which makes it easier to install the LaTeXTools package.</p>
<p>Install the <a href="https://github.com/SublimeText/LaTeXTools">LaTeXTools</a> plugin. With <a href="https://packagecontrol.io/">SublimeText Package Control</a> installed: press <code class="highlighter-rouge">Command+SHIFT+P</code> on Mac, choose <code class="highlighter-rouge">Package Control: Install Package</code>, and search for LaTeXTools.</p>
<h4 id="step-2">Step 2</h4>
<p>Install <strong>Skim</strong></p>
<p>In Skim: go to Preferences->Sync and set ‘Preset’ to SublimeText.</p>
<p><img src="/images/skim.png" alt="image" /></p>
<p>After that, you just need to build the LaTeX document in SublimeText with Command+B (Mac). Open the generated <code class="highlighter-rouge">.pdf</code> in Skim, and it will be refreshed automatically every time you rebuild the document in SublimeText.</p>
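<p>To check that the setup works, a minimal document along these lines should be enough. The file name <code class="highlighter-rouge">test.tex</code> and its contents are only an illustration, not part of the original setup:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>% test.tex: a hypothetical minimal document for trying out the build
\documentclass{article}
\begin{document}
Hello, \LaTeX!
\end{document}
</code></pre></div>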
<p>If your document is split across multiple files, add <code class="highlighter-rouge">%!TEX root = &lt;master file name&gt;</code> at the beginning of each file.</p>
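<p>As a sketch, assume a master file called <code class="highlighter-rouge">main.tex</code> and one child file called <code class="highlighter-rouge">intro.tex</code> in the same folder (both names are hypothetical). The master file could look like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>% main.tex: hypothetical master file
\documentclass{article}
\begin{document}
\input{intro}  % pulls in intro.tex from the same folder
\end{document}
</code></pre></div>
<p>and the child file would then start with the root directive:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>%!TEX root = main.tex
% intro.tex: hypothetical child file, stored next to main.tex
\section{Introduction}
Some text.
</code></pre></div>
<p>With the directive in place, building while <code class="highlighter-rouge">intro.tex</code> is open should compile <code class="highlighter-rouge">main.tex</code> rather than the child file on its own.</p>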
<p><em>Enjoy your <strong>LaTeX</strong> on <strong>Mac</strong>!</em></p>
<p>PS: A <strong>brief</strong> but <strong>useful</strong> introduction to <strong>LaTeX</strong></p>
<p><a href="https://tobi.oetiker.ch/lshort/lshort.pdf">The Not So Short Introduction to LaTeX 2ε</a></p>Install MacTeX + Sublime Text + Skim on Mac.Before everything else2016-07-30T00:00:00+00:002016-07-30T00:00:00+00:00https://ka1ch2n.github.io/articles/Before-everything-else<p>I had always thought that I would write a blog someday.</p>
<p><a href="https://en.wikipedia.org/wiki/George_Orwell">George Orwell</a> once said,</p>
<blockquote>
<p>‘From a very early age, perhaps the age of five or six, I knew that when I grew up I should be a writer’</p>
</blockquote>
<blockquote>
<p>‘Writing against the time passing’</p>
</blockquote>
<p>as the famous <a href="http://wiki.china.org.cn/wiki/index.php/Feng_Tang">Feng Tang</a> put it.</p>
<blockquote>
<p>‘Climbing the hill because the hill is there to be conquered’</p>
</blockquote>
<p>as <a href="https://en.wikipedia.org/wiki/Wang_Xiaobo">Wang Xiaobo</a> said.</p>
<p>However, none of these is the reason I am starting a blog.</p>
<p>At the moment, I am writing a research paper about the academic work I’ve done. Writing properly, especially in English, is more difficult than I thought, and the idea of writing a technical blog in English popped into my mind. Here are my reasons for starting a blog.</p>
<ul>
<li><strong>To make me think more clearly</strong></li>
</ul>
<blockquote>
<p>‘I write because I don’t know what I think until I read what I say’.</p>
</blockquote>
<p>From time to time, my friends ask for my comments on the latest news or technical topics, and I can easily give some fragmentary and subjective opinions. Writing a journal paper or a thesis is totally different: logic is always the priority. Sometimes I thought I understood a topic quite well, yet I could hardly write a clear explanation of it.</p>
<p>So</p>
<blockquote>
<p>‘I wished to say everything in the smallest number of words in which it could be said clearly.’ (<a href="https://en.wikipedia.org/wiki/Bertrand_Russell">Bertrand Russell</a>)</p>
</blockquote>
<p>became my motto.</p>
<ul>
<li><strong>To help me write better English</strong></li>
</ul>
<p>I can easily write fluently, even humorously, in my mother tongue, but writing in English is a bit challenging.
As I need to write decent material, improving my English writing skills seems unavoidable.
I remember that in <a href="https://en.wikipedia.org/wiki/Stephen_King">Stephen King</a>’s epic book, On Writing, he describes how he once stopped writing for several weeks after an accident, and how his words did not flow well when he started again.
I think it is true that writing mastery comes with constant practice.</p>
<ul>
<li><strong>To learn and share</strong></li>
</ul>
<p>I tend to forget things I’ve learned, so sorting out that knowledge logically by blogging seems like a good idea. It is also an opportunity to share my ideas, interests and work. Basically, I am glad I have had my flight, but I would rather leave a trace of wings in the air as well.</p>About why I write blogs.